loss becomes nan #49

Open
erzhu222 opened this issue Aug 23, 2022 · 4 comments

Comments

@erzhu222

lib.utils.logging INFO: [Step 10470/182650] [Epoch 2/50] [multi]
loss: nan, time: 5.862533, eta: 11 days, 16:23:31
meanstd-tanh_auxiloss: nan, meanstd-tanh_loss: nan, msg_normal_loss: nan, pairwise-normal-regress-edge_loss: nan, pairwise-normal-regress-plane_loss: nan, ranking-edge_auxiloss: nan, ranking-edge_loss: nan, abs_rel: 0.211080, whdr: 0.087764,
group0_lr: 0.001000, group1_lr: 0.001000,
Hello, when I train with these four datasets (taskonomy, DiverseDepth, HRWSI, Holopix50k), the loss becomes nan. Did you run into this problem during training? If so, how should it be solved? Thank you! Below are the arguments I used:
--backbone resnext101
--dataset_list taskonomy DiverseDepth HRWSI Holopix50k
--batchsize 16
--base_lr 0.001
--use_tfboard
--thread 8
--loss_mode ranking-edge_pairwise-normal-regress-edge_msgil-normal_meanstd-tanh_pairwise-normal-regress-plane_ranking-edge-auxi_meanstd-tanh-auxi
--epoch 50
--lr_scheduler_multiepochs 10 25 40
--val_step 5000
--snapshot_iters 5000
--log_interval 10

@YvanYin
Contributor

YvanYin commented Aug 23, 2022

I didn't face this issue. You can clip your gradients to avoid it.
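
For reference, a minimal, self-contained sketch of gradient clipping in PyTorch; the tiny model, optimizer, and dummy data are placeholders for illustration, not the repo's actual training loop:

```python
import torch

# Placeholder model/optimizer/data, just to make the snippet runnable.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
x, y = torch.randn(16, 4), torch.randn(16, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm before the optimizer step so a single
# bad batch cannot blow up the weights and drive the loss to nan.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```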

@erzhu222
Author

Thanks very much, I will try! However, I didn't change the code (I'm using the latest version) and only changed the batchsize and thread settings, training on 8 NVIDIA V100 GPUs. What batchsize and thread values did you use when training?

@guangkaixu
Collaborator

Changing the batchsize will not cause the loss to become nan. I once faced the "loss nan" problem due to the crop operation: if the depth image becomes entirely invalid (all zeros) after cropping, the loss will be nan. I will try to debug and fix it, but it may take time since it requires 8 NVIDIA V100 GPUs.

How many iterations had you trained before the loss became nan? You can try clipping the gradients to avoid it, or wait for my fix. Thank you!
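
A minimal sketch of one possible guard against all-invalid crops, assuming the dataloader does random cropping; `random_crop`, the 0-means-invalid convention, and `MIN_VALID_RATIO` are illustrative assumptions, not the released code:

```python
import numpy as np

MIN_VALID_RATIO = 0.01  # require at least 1% valid depth pixels

def crop_with_valid_depth(rgb, depth, crop_size, random_crop, max_tries=10):
    """Retry the random crop until the cropped depth contains enough
    valid (non-zero) pixels, so the losses are not computed on an
    all-invalid target."""
    for _ in range(max_tries):
        rgb_c, depth_c = random_crop(rgb, depth, crop_size)
        valid_ratio = np.count_nonzero(depth_c > 0) / depth_c.size
        if valid_ratio >= MIN_VALID_RATIO:
            break
    # If no valid crop was found, the caller should mask or skip the
    # sample instead of letting the loss turn into nan.
    return rgb_c, depth_c
```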

@erzhu222
Copy link
Author

Thanks for your reply! The loss became nan after about 12000 iterations (the 3rd epoch). I see the released code already contains gradient clipping, but it does not seem to help.
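
One way to narrow down where the nan first appears, as a hedged diagnostic sketch (`loss_dict` is a hypothetical name for the per-term losses, not necessarily what the released code uses):

```python
import torch

# Report the op that produced nan/inf gradients during backward().
# This is slow, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def losses_are_finite(loss_dict, step):
    """Log and signal a skip when any individual loss term is
    non-finite, so the offending batch can be inspected."""
    bad = {k: v.item() for k, v in loss_dict.items() if not torch.isfinite(v)}
    if bad:
        print(f"step {step}: non-finite loss terms: {bad}")
        return False  # caller should skip optimizer.step() for this batch
    return True
```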
