loss becomes nan #49

Open
erzhu222 opened this issue Aug 23, 2022 · 4 comments

Comments

@erzhu222

lib.utils.logging INFO: [Step 10470/182650] [Epoch 2/50] [multi]
loss: nan, time: 5.862533, eta: 11 days, 16:23:31
meanstd-tanh_auxiloss: nan, meanstd-tanh_loss: nan, msg_normal_loss: nan, pairwise-normal-regress-edge_loss: nan, pairwise-normal-regress-plane_loss: nan, ranking-edge_auxiloss: nan, ranking-edge_loss: nan, abs_rel: 0.211080, whdr: 0.087764,
group0_lr: 0.001000, group1_lr: 0.001000,
Hello, when I train with these four datasets (taskonomy, DiverseDepth, HRWSI, Holopix50k), the loss becomes nan. Did you run into this problem during training? If so, how should it be solved? Thank you! Below are the arguments I used:
--backbone resnext101
--dataset_list taskonomy DiverseDepth HRWSI Holopix50k
--batchsize 16
--base_lr 0.001
--use_tfboard
--thread 8
--loss_mode ranking-edge_pairwise-normal-regress-edge_msgil-normal_meanstd-tanh_pairwise-normal-regress-plane_ranking-edge-auxi_meanstd-tanh-auxi
--epoch 50
--lr_scheduler_multiepochs 10 25 40
--val_step 5000
--snapshot_iters 5000
--log_interval 10

@YvanYin
Contributor

YvanYin commented Aug 23, 2022

I didn't face this issue. You can clip your gradients to avoid it.
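
For reference, a minimal, self-contained sketch of gradient clipping in PyTorch; the tiny model, optimizer, and dummy data are placeholders for illustration, not the repo's actual training loop:

```python
import torch

# Placeholder model/optimizer/data, just to make the snippet runnable.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
x, y = torch.randn(16, 4), torch.randn(16, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm before the optimizer step so a single
# bad batch cannot blow up the weights and drive the loss to nan.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```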

@erzhu222
Author

Thanks very much, I will try! However, I didn't change the code (I'm using the latest version) and only changed the batchsize and thread settings, training on 8 NVIDIA V100 GPUs. What batchsize and thread values did you use when training?

@guangkaixu
Collaborator

Changing the batchsize will not cause the loss to become nan. I once faced the "loss nan" problem due to the crop operation: if the depth image becomes entirely invalid (all zeros) after cropping, the loss will be nan. I will try to debug and fix it, but it may take time since it requires 8 NVIDIA V100 GPUs.

How many iterations had you trained before the loss became nan? You can try clipping the gradients to avoid it, or wait for my fix. Thank you!
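
A minimal sketch of one possible guard against all-invalid crops, assuming the dataloader does random cropping; `random_crop`, the 0-means-invalid convention, and `MIN_VALID_RATIO` are illustrative assumptions, not the released code:

```python
import numpy as np

MIN_VALID_RATIO = 0.01  # require at least 1% valid depth pixels

def crop_with_valid_depth(rgb, depth, crop_size, random_crop, max_tries=10):
    """Retry the random crop until the cropped depth contains enough
    valid (non-zero) pixels, so the losses are not computed on an
    all-invalid target."""
    for _ in range(max_tries):
        rgb_c, depth_c = random_crop(rgb, depth, crop_size)
        valid_ratio = np.count_nonzero(depth_c > 0) / depth_c.size
        if valid_ratio >= MIN_VALID_RATIO:
            break
    # If no valid crop was found, the caller should mask or skip the
    # sample instead of letting the loss turn into nan.
    return rgb_c, depth_c
```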

@erzhu222
Copy link
Author

Thanks for your reply! The loss became nan after about 12000 iterations (the 3rd epoch). I see the released code already contains gradient clipping, but it does not seem to help.
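
One way to narrow down where the nan first appears, as a hedged diagnostic sketch (`loss_dict` is a hypothetical name for the per-term losses, not necessarily what the released code uses):

```python
import torch

# Report the op that produced nan/inf gradients during backward().
# This is slow, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def losses_are_finite(loss_dict, step):
    """Log and signal a skip when any individual loss term is
    non-finite, so the offending batch can be inspected."""
    bad = {k: v.item() for k, v in loss_dict.items() if not torch.isfinite(v)}
    if bad:
        print(f"step {step}: non-finite loss terms: {bad}")
        return False  # caller should skip optimizer.step() for this batch
    return True
```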
