
[PyTorch/WaveGlow] NaN loss after 80 epochs #694

Open
Charlottecuc opened this issue Sep 21, 2020 · 4 comments
Labels: bug (Something isn't working)

Charlottecuc commented Sep 21, 2020

Related to Model/Framework(s)
PyTorch/Tacotron2/WaveGlow

Describe the bug
Hi. I tried to train the WaveGlow model and ran into a "loss is NaN" error after about 80 epochs. Before epoch 82, the loss hovered between -5 and -4.

To Reproduce
Steps to reproduce the behavior:
python3 -m multiproc train.py -m WaveGlow -o ./output/ -lr 1e-4 --epochs 1501 -bs 8 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --amp
I have changed audio parameters such as the sampling rate, filter-length, hop-length, win-length, mel-fmin and mel-fmax in all places. I also changed stft.py according to NVIDIA/tacotron2#125 (comment) because of a CPU memory issue.

(I made sure that every training audio file is longer than the segment length, so the problem was not caused by zero-padding.)
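
For reference, a minimal sketch of that length check (not code from this repo): it assumes the soundfile package is installed, that --segment-length is counted in samples at the training sampling rate, and that the filelist uses the usual pipe-separated "path|text" rows.

import soundfile as sf  # assumption: the soundfile package is available

SEGMENT_LENGTH = 8000  # matches --segment-length above, in samples
FILELIST = "filelists/audio_text_train_filelist.txt"

with open(FILELIST) as f:
    for line in f:
        wav_path = line.split("|")[0].strip()  # assumes "path|text" rows
        n_frames = sf.info(wav_path).frames
        if n_frames <= SEGMENT_LENGTH:
            print(f"shorter than segment length ({n_frames} samples): {wav_path}")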

DLL 2020-09-21 20:06:34.624410 - PARAMETER output : ./output/
DLL 2020-09-21 20:06:34.624515 - PARAMETER dataset_path : ./
DLL 2020-09-21 20:06:34.624560 - PARAMETER model_name : WaveGlow
DLL 2020-09-21 20:06:34.624597 - PARAMETER log_file : nvlog4.json
DLL 2020-09-21 20:06:34.624633 - PARAMETER anneal_steps : None
DLL 2020-09-21 20:06:34.624671 - PARAMETER anneal_factor : 0.1
DLL 2020-09-21 20:06:34.624707 - PARAMETER epochs : 1501
DLL 2020-09-21 20:06:34.624742 - PARAMETER epochs_per_checkpoint : 20
DLL 2020-09-21 20:06:34.624776 - PARAMETER checkpoint_path : ./output/checkpoint_WaveGlow_80.pt
DLL 2020-09-21 20:06:34.624813 - PARAMETER resume_from_last : False
DLL 2020-09-21 20:06:34.624849 - PARAMETER dynamic_loss_scaling : True
DLL 2020-09-21 20:06:34.624882 - PARAMETER amp : True
DLL 2020-09-21 20:06:34.624915 - PARAMETER cudnn_enabled : True
DLL 2020-09-21 20:06:34.624949 - PARAMETER cudnn_benchmark : True
DLL 2020-09-21 20:06:34.624982 - PARAMETER disable_uniform_initialize_bn_weight : False
DLL 2020-09-21 20:06:34.625014 - PARAMETER use_saved_learning_rate : False
DLL 2020-09-21 20:06:34.625045 - PARAMETER learning_rate : 0.0001
DLL 2020-09-21 20:06:34.625077 - PARAMETER weight_decay : 0.0
DLL 2020-09-21 20:06:34.625109 - PARAMETER grad_clip_thresh : 3.4028234663852886e+38
DLL 2020-09-21 20:06:34.625147 - PARAMETER batch_size : 8
DLL 2020-09-21 20:06:34.625179 - PARAMETER grad_clip : 5.0
DLL 2020-09-21 20:06:34.625211 - PARAMETER load_mel_from_disk : False
DLL 2020-09-21 20:06:34.625248 - PARAMETER training_files : filelists/audio_text_train_filelist.txt
DLL 2020-09-21 20:06:34.625281 - PARAMETER validation_files : filelists/audio_text_val_filelist.txt
DLL 2020-09-21 20:06:34.625318 - PARAMETER text_cleaners : ['transliteration_cleaners']
DLL 2020-09-21 20:06:34.625355 - PARAMETER max_wav_value : 32768.0
DLL 2020-09-21 20:06:34.625388 - PARAMETER sampling_rate : 24000
DLL 2020-09-21 20:06:34.625420 - PARAMETER filter_length : 2048
DLL 2020-09-21 20:06:34.625461 - PARAMETER hop_length : 300
DLL 2020-09-21 20:06:34.625493 - PARAMETER win_length : 1200
DLL 2020-09-21 20:06:34.625524 - PARAMETER mel_fmin : 80.0
DLL 2020-09-21 20:06:34.625557 - PARAMETER mel_fmax : 7600.0
DLL 2020-09-21 20:06:34.625589 - PARAMETER rank : 0
DLL 2020-09-21 20:06:34.625620 - PARAMETER world_size : 4
DLL 2020-09-21 20:06:34.625652 - PARAMETER dist_url : tcp://localhost:23456
DLL 2020-09-21 20:06:34.625683 - PARAMETER group_name : group_name
DLL 2020-09-21 20:06:34.625717 - PARAMETER dist_backend : nccl
DLL 2020-09-21 20:06:34.625749 - PARAMETER bench_class :
DLL 2020-09-21 20:06:34.625780 - PARAMETER model_name : Tacotron2_PyT

DLLL {"timestamp": "1600683496.071985", "datetime": "2020-09-21 18:18:16.071985", "elapsedtime": "1953.546889", "type": "LOG", "step": [83, 56], "data": {"train_items_per_sec": 107229.49071595793}}
DLLL {"timestamp": "1600683496.072122", "datetime": "2020-09-21 18:18:16.072122", "elapsedtime": "1953.547026", "type": "LOG", "step": [83, 56], "data": {"train_iter_time": 2.3874029270373285}}
DLLL {"timestamp": "1600683497.381073", "datetime": "2020-09-21 18:18:17.381073", "elapsedtime": "1954.855977", "type": "LOG", "step": [83, 57], "data": {"glob_iter/iters_per_epoch": "497/220"}}
DLLL {"timestamp": "1600683497.954723", "datetime": "2020-09-21 18:18:17.954723", "elapsedtime": "1955.429627", "type": "LOG", "step": [83, 57], "data": {"train_loss": -4.422453880310059}}
DLLL {"timestamp": "1600683499.6586", "datetime": "2020-09-21 18:18:19.658600", "elapsedtime": "1957.133504", "type": "LOG", "step": [83, 57], "data": {"train_items_per_sec": 112401.72626839254}}
DLLL {"timestamp": "1600683499.65873", "datetime": "2020-09-21 18:18:19.658730", "elapsedtime": "1957.133634", "type": "LOG", "step": [83, 57], "data": {"train_iter_time": 2.2775450920453295}}
DLLL {"timestamp": "1600683500.788385", "datetime": "2020-09-21 18:18:20.788385", "elapsedtime": "1958.263289", "type": "LOG", "step": [83, 58], "data": {"glob_iter/iters_per_epoch": "498/220"}}
DLLL {"timestamp": "1600683501.361106", "datetime": "2020-09-21 18:18:21.361106", "elapsedtime": "1958.83601", "type": "LOG", "step": [83, 58], "data": {"train_loss": -1.9046612977981567}}
DLLL {"timestamp": "1600683502.975477", "datetime": "2020-09-21 18:18:22.975477", "elapsedtime": "1960.450381", "type": "LOG", "step": [83, 58], "data": {"train_items_per_sec": 117049.5258397939}}
DLLL {"timestamp": "1600683502.975609", "datetime": "2020-09-21 18:18:22.975609", "elapsedtime": "1960.450513", "type": "LOG", "step": [83, 58], "data": {"train_iter_time": 2.1871083899168298}}
DLLL {"timestamp": "1600683504.202287", "datetime": "2020-09-21 18:18:24.202287", "elapsedtime": "1961.677191", "type": "LOG", "step": [83, 59], "data": {"glob_iter/iters_per_epoch": "499/220"}}
DLLL {"timestamp": "1600683504.849494", "datetime": "2020-09-21 18:18:24.849494", "elapsedtime": "1962.324398", "type": "LOG", "step": [83, 59], "data": {"train_loss": -1.9907182455062866}}
DLLL {"timestamp": "1600683506.443915", "datetime": "2020-09-21 18:18:26.443915", "elapsedtime": "1963.918819", "type": "LOG", "step": [83, 59], "data": {"train_items_per_sec": 114201.97130917234}}
DLLL {"timestamp": "1600683506.444077", "datetime": "2020-09-21 18:18:26.444077", "elapsedtime": "1963.918981", "type": "LOG", "step": [83, 59], "data": {"train_iter_time": 2.241642565932125}}
DLLL {"timestamp": "1600683507.685249", "datetime": "2020-09-21 18:18:27.685249", "elapsedtime": "1965.160153", "type": "LOG", "step": [83, 60], "data": {"glob_iter/iters_per_epoch": "500/220"}}

Expected behavior
I have already tried the method offered in #236, but the loss kept increasing after epoch 80 and reached +7 at epoch 82, which is abnormal.
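
For reference, --grad-clip-thresh 3.4028234663852886e+38 in the command above is float32 max, so gradient clipping is effectively disabled. A minimal sketch of clipping to a finite norm in a generic PyTorch training step follows; the function and variable names are illustrative and may not match train.py.

import torch

def training_step(model, optimizer, criterion, batch, grad_clip_thresh=1.0):
    # Generic sketch only; WaveGlow's actual forward/loss call differs in detail.
    optimizer.zero_grad()
    loss = criterion(model(batch))
    loss.backward()
    # Clip to a finite norm instead of float32 max (which disables clipping).
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    optimizer.step()
    return loss.item(), float(grad_norm)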

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3):
    No container was used.
    PyTorch version: 1.2.0

  • GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB):
    4x Tesla T4.
    CUDA compilation tools, release 10.0, V10.0.130

Thank you very much!

Charlottecuc added the bug label Sep 21, 2020

ghost commented Oct 23, 2020

Hi @Charlottecuc , could you please update your local repo and check if the issue still occurs? We've updated the repo recently. Even if that doesn't help, we'll at least have a common base for further investigation.


ioannist commented Oct 30, 2020

> Hi @Charlottecuc , could you please update your local repo and check if the issue still occurs? We've updated the repo recently. Even if that doesn't help, we'll at least have a common base for further investigation.

Just pulled the repo and got the same error at epoch 67:

DLL 2020-10-30 14:07:59.656565 - (67, 36) glob_iter/iters_per_epoch : 12498/186 
Traceback (most recent call last):
  File "train.py", line 554, in <module>
    main()
  File "train.py", line 488, in main
    raise Exception("loss is NaN")
Exception: loss is NaN

Things I tried that did not work:

  • Normalizing files with pydub:
    normalized_sound = match_target_amplitude(sound, -20.0)
    (see the sketch after this list)
  • Adding more training wav files (the NaN was just thrown at a different time)
  • Setting dynamic-loss-scaling to False
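
For reference, match_target_amplitude is not a pydub built-in; the usual recipe looks roughly like this (a sketch assuming pydub's AudioSegment API, with an illustrative file path):

from pydub import AudioSegment

def match_target_amplitude(sound, target_dBFS):
    # Shift the clip's gain so its average loudness reaches the target dBFS.
    return sound.apply_gain(target_dBFS - sound.dBFS)

sound = AudioSegment.from_wav("example.wav")  # illustrative path
normalized_sound = match_target_amplitude(sound, -20.0)
normalized_sound.export("example_normalized.wav", format="wav")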

I added print(f"grad_norm: {grad_norm}") inside the else branch at line 505, and it seems that grad_norm becomes inf just before the NaN error:

DLL 2020-10-31 17:57:38.359318 - (20, 114) train_iter_time : 0.7604210680001415 
DLL 2020-10-31 17:57:38.361658 - (20, 115) glob_iter/iters_per_epoch : 6235/306 
DLL 2020-10-31 17:57:38.573887 - (20, 115) train_loss : -3.8453807830810547 
grad_norm: 13.877706527709961
DLL 2020-10-31 17:57:39.117177 - (20, 115) train_items_per_sec : 21177.333947350646 
DLL 2020-10-31 17:57:39.117272 - (20, 115) train_iter_time : 0.7555247530108318 
DLL 2020-10-31 17:57:39.119611 - (20, 116) glob_iter/iters_per_epoch : 6236/306 
DLL 2020-10-31 17:57:39.332340 - (20, 116) train_loss : -4.56260347366333 
grad_norm: inf
DLL 2020-10-31 17:57:39.878253 - (20, 116) train_items_per_sec : 21090.173669932265 
DLL 2020-10-31 17:57:39.878334 - (20, 116) train_iter_time : 0.7586471429967787 
DLL 2020-10-31 17:57:39.880512 - (20, 117) glob_iter/iters_per_epoch : 6237/306 
Traceback (most recent call last):
  File "train.py", line 557, in <module>
    main()
  File "train.py", line 489, in main
    raise Exception("loss is NaN")
Exception: loss is NaN

The line that throws the NaN exception is this one:
reduced_loss = loss.item()
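
Given that grad_norm goes to inf one iteration before the NaN, one possible guard is to skip the optimizer step whenever the clipped gradient norm is not finite. This is only a sketch, not code from this repo, and the names are illustrative:

import math
import torch

def clipped_step(model, optimizer, grad_clip_thresh):
    # Clip first, then only apply the update if the total norm is finite.
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh))
    if math.isfinite(grad_norm):
        optimizer.step()
    else:
        # Skipping the step avoids pushing inf/NaN gradients into the weights,
        # but it does not address whatever made the norm blow up.
        print(f"skipping optimizer step, grad_norm: {grad_norm}")
    return grad_norm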

@ioannist
Copy link

Started training again with --amp activated and all wav files normalized. Got a division-by-zero error at epoch 38:

DLL 2020-10-30 22:47:30.112884 - (38, 87) train_iter_time : 0.7559060430066893 
DLL 2020-10-30 22:47:30.114633 - (38, 88) glob_iter/iters_per_epoch : 11410/306 
DLL 2020-10-30 22:47:30.377361 - (38, 88) train_loss : -3.687452554702759 
Traceback (most recent call last):
  File "train.py", line 555, in <module>
    main()
  File "train.py", line 500, in main
    scaled_loss.backward()
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,   # 1./scale,
ZeroDivisionError: float division by zero
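
Since the traceback ends inside apex's loss scaler, one workaround sketch (assuming PyTorch >= 1.6, not the apex path this repo ships) is native torch.cuda.amp, whose GradScaler detects inf/NaN gradients, skips that optimizer step, and lowers the loss scale:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def amp_step(model, optimizer, criterion, batch):
    # Generic sketch; the actual WaveGlow forward/loss call differs in detail.
    optimizer.zero_grad()
    with autocast():
        loss = criterion(model(batch))
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # step is skipped internally if grads are inf/NaN
    scaler.update()         # the loss scale is reduced after a skipped step
    return loss.item()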


ioannist commented Oct 31, 2020

Turns out the previous issue is a separate one. The NaN is also thrown with --amp activated (it just happened that, in the previous run, the zero-division error interrupted the training first). I will open a new ticket for the zero-division error.

In my dataset, I can get "loss is NaN" to be thrown in the first epoch if I set the learning rate to 1e-3. You can get the data and project settings from #732; they are identical for this problem.
