[Bug] tpvformer train long waiting no log #3003

zkailinzhang · 2024-07-04T08:34:59Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

q

Reproduces the problem - code sample

    type='LoadPointsFromFile',
    use_dim=3),
dict(
    seg_3d_dtype='np.uint8',
    type='LoadAnnotations3D',
    with_attr_label=False,
    with_bbox_3d=False,
    with_label_3d=False,
    with_seg_3d=True),
dict(type='SegLabelMapping'),
dict(
    keys=[
        'img',
        'points',
        'pts_semantic_mask',
    ],
    meta_keys=[
        'lidar2img',
    ],
    type='Pack3DDetInputs'),

]
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    name='visualizer',
    type='Det3DLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = './work_dirs/tpvformer_8xb1-2x_nus-seg'

/home/zkl/code/det3d_demo/mmdetection3d/projects/TPVFormer/tpvformer/tpvformer_layer.py:69: UserWarning: The arguments feedforward_channels in BaseTransformerLayer has been deprecated, now you should set feedforward_channels and other FFN related arguments to a dict named ffn_cfgs.
warnings.warn(
/home/zkl/code/det3d_demo/mmdetection3d/projects/TPVFormer/tpvformer/tpvformer_layer.py:69: UserWarning: The arguments ffn_dropout in BaseTransformerLayer has been deprecated, now you should set ffn_drop and other FFN related arguments to a dict named ffn_cfgs.
warnings.warn(
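
The two warnings above ask for the FFN-related arguments to be grouped under a dict named ffn_cfgs. A minimal sketch of that regrouping follows, assuming only the two renames the warnings themselves mention. The helper migrate_ffn_args, the layer type name, and the numeric values are hypothetical and are not part of mmcv or of the TPVFormer config. The warnings are most likely unrelated to the hang.

```python
# Hypothetical helper (not an mmcv API): nest deprecated FFN keyword
# arguments inside an ``ffn_cfgs`` dict, as the warnings request.
DEPRECATED_TO_FFN_KEY = {
    'feedforward_channels': 'feedforward_channels',  # same name, new location
    'ffn_dropout': 'ffn_drop',                       # renamed per the warning
}


def migrate_ffn_args(layer_cfg):
    """Return a copy of ``layer_cfg`` with deprecated FFN args moved into ``ffn_cfgs``."""
    cfg = dict(layer_cfg)
    ffn_cfgs = dict(cfg.pop('ffn_cfgs', {}))
    for old_key, new_key in DEPRECATED_TO_FFN_KEY.items():
        if old_key in cfg:
            ffn_cfgs[new_key] = cfg.pop(old_key)
    if ffn_cfgs:
        cfg['ffn_cfgs'] = ffn_cfgs
    return cfg


# Illustrative values only; the real config values are not shown in this issue.
old_style = dict(type='TPVFormerLayer', feedforward_channels=512, ffn_dropout=0.1)
new_style = migrate_ffn_args(old_style)
print(new_style)
```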

Reproduces the problem - command or script

bash tools/dist_train.sh projects/TPVFormer/configs/tpvformer_8xb1-2x_nus-seg.py 2

Reproduces the problem - error message

Same output as in the code sample section above.

Additional information

q

zkailinzhang · 2024-07-04T08:59:03Z

The output above is from multi-GPU training, which hangs.
Switching to single-GPU training also hangs.

07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv2.conv_offset.bias:lr=2e-05
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv2.conv_offset.bias:weight_decay=0.01
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv2.conv_offset.bias:lr_mult=0.1
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn2.weight is skipped since its requires_grad=False
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn2.bias is skipped since its requires_grad=False
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv3.weight:lr=2e-05
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv3.weight:weight_decay=0.01
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv3.weight:lr_mult=0.1
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn3.weight is skipped since its requires_grad=False
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn3.bias is skipped since its requires_grad=False
/home/zkl/code/det3d_demo/mmdetection3d/mmdet3d/evaluation/functional/kitti_utils/eval.py:10: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
def get_thresholds(scores: np.ndarray, num_gt, num_sample_pts=41):
07/04 16:52:57 - mmengine - WARNING - The prefix is not set in metric class SegMetric.
07/04 16:52:59 - mmengine - INFO - load backbone. in model from: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
Loads checkpoint by local backend from path: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
07/04 16:52:59 - mmengine - INFO - load neck. in model from: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
Loads checkpoint by local backend from path: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
07/04 16:52:59 - mmengine - WARNING - The model and loaded state dict do not match exactly

size mismatch for lateral_convs.0.conv.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 512, 1, 1]).
size mismatch for lateral_convs.0.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for lateral_convs.1.conv.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 1024, 1, 1]).
size mismatch for lateral_convs.1.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for lateral_convs.2.conv.weight: copying a param with shape torch.Size([256, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 2048, 1, 1]).
size mismatch for lateral_convs.2.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.0.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.0.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.1.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.1.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.2.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.2.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.3.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.3.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
unexpected key in source state_dict: fpn_convs.4.conv.weight, fpn_convs.4.conv.bias
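
The mismatch lines above all follow one pattern: the neck weights in the checkpoint have 256 output channels, while the current model expects 128, and the checkpoint carries an extra fpn_convs.4 layer. A small sketch of how such a report can be reproduced from two name-to-shape mappings is shown here. The function find_shape_mismatches is a hypothetical helper, not an mmengine API, and the shape given for the fpn_convs.4 entry is an assumption because the log lists only the key. With PyTorch, the mappings could be built from a state dict with something like {k: tuple(v.shape) for k, v in state_dict.items()}.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Compare two name -> shape mappings.

    Returns (mismatched, unexpected), where ``mismatched`` maps a parameter
    name to (checkpoint_shape, model_shape) and ``unexpected`` lists keys
    that exist only in the checkpoint.
    """
    mismatched = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        if name in model_shapes and model_shapes[name] != ckpt_shape:
            mismatched[name] = (ckpt_shape, model_shapes[name])
    unexpected = sorted(set(checkpoint_shapes) - set(model_shapes))
    return mismatched, unexpected


# Shapes copied from the log above, except where noted.
checkpoint_shapes = {
    'lateral_convs.0.conv.weight': (256, 512, 1, 1),
    'fpn_convs.0.conv.weight': (256, 256, 3, 3),
    'fpn_convs.4.conv.weight': (256, 256, 3, 3),  # assumed shape, not in the log
}
model_shapes = {
    'lateral_convs.0.conv.weight': (128, 512, 1, 1),
    'fpn_convs.0.conv.weight': (128, 128, 3, 3),
}

mismatched, unexpected = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, (ckpt_shape, model_shape) in mismatched.items():
    print(f'size mismatch for {name}: checkpoint {ckpt_shape}, model {model_shape}')
print('unexpected keys:', unexpected)
```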

07/04 16:52:59 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
07/04 16:52:59 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
07/04 16:52:59 - mmengine - INFO - Checkpoints will be saved to /home/zkl/code/det3d_demo/mmdetection3d/work_dirs/tpvformer_8xb1-2x_nus-seg.

zkailinzhang · 2024-07-04T09:00:06Z

However, GPU memory usage keeps changing.

zkailinzhang · 2024-07-04T09:14:07Z

The single-GPU training log has now appeared.

I'll let it run overnight and try multi-GPU training tomorrow.
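
For anyone hitting the same silence after the "Checkpoints will be saved to" line: one possible explanation, not confirmed in this thread, is that training is running but the first log line is only printed after a full logging interval of slow iterations. A sketch of a config override that logs every iteration is given here, assuming the config uses mmengine's LoggerHook under default_hooks. The helper seconds_until_first_log and the numbers passed to it are hypothetical and only illustrate the arithmetic.

```python
# Config fragment (sketch): print a training log line on every iteration,
# so a slow but running job is distinguishable from a real hang.
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=1),
)


def seconds_until_first_log(log_interval, seconds_per_iter):
    """Rough wait before the first training log line appears."""
    return log_interval * seconds_per_iter


# Illustrative numbers only: 50 iterations at 30 s each is 25 minutes of silence.
wait = seconds_until_first_log(log_interval=50, seconds_per_iter=30.0)
print(f'first log after about {wait / 60:.0f} minutes')
```

If a line appears on every iteration after this change, the job was slow rather than stuck. If nothing appears at all, the hang is somewhere else, such as data loading or distributed initialization.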
