
[PyTorch/WaveGlow] NaN loss after 80 epochs #694

Open
Charlottecuc opened this issue Sep 21, 2020 · 4 comments
Labels: bug (Something isn't working)

Charlottecuc commented Sep 21, 2020

Related to Model/Framework(s)
PyTorch/Tacotron2/WaveGlow

Describe the bug
Hi. I tried to train the WaveGlow model and ran into a "loss is NaN" error after about 80 epochs. Before epoch 82, the loss hovered between -5 and -4.

To Reproduce
Steps to reproduce the behavior:
python3 -m multiproc train.py -m WaveGlow -o ./output/ -lr 1e-4 --epochs 1501 -bs 8 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --amp
I have changed audio parameters such as the sampling rate, filter-length, hop-length, win-length, mel-fmin and mel-fmax in all places. I also changed stft.py according to NVIDIA/tacotron2#125 (comment) because of a CPU memory issue.

(I made sure that every training audio file is longer than the segment length, so the problem was not caused by zero-padding.)
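
For reference, a minimal sketch of that length check (not code from this repo): it assumes the soundfile package is installed, that --segment-length is counted in samples at the training sampling rate, and that the filelist uses the usual pipe-separated "path|text" rows.

import soundfile as sf  # assumption: the soundfile package is available

SEGMENT_LENGTH = 8000  # matches --segment-length above, in samples
FILELIST = "filelists/audio_text_train_filelist.txt"

with open(FILELIST) as f:
    for line in f:
        wav_path = line.split("|")[0].strip()  # assumes "path|text" rows
        n_frames = sf.info(wav_path).frames
        if n_frames <= SEGMENT_LENGTH:
            print(f"shorter than segment length ({n_frames} samples): {wav_path}")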

DLL 2020-09-21 20:06:34.624410 - PARAMETER output : ./output/
DLL 2020-09-21 20:06:34.624515 - PARAMETER dataset_path : ./
DLL 2020-09-21 20:06:34.624560 - PARAMETER model_name : WaveGlow
DLL 2020-09-21 20:06:34.624597 - PARAMETER log_file : nvlog4.json
DLL 2020-09-21 20:06:34.624633 - PARAMETER anneal_steps : None
DLL 2020-09-21 20:06:34.624671 - PARAMETER anneal_factor : 0.1
DLL 2020-09-21 20:06:34.624707 - PARAMETER epochs : 1501
DLL 2020-09-21 20:06:34.624742 - PARAMETER epochs_per_checkpoint : 20
DLL 2020-09-21 20:06:34.624776 - PARAMETER checkpoint_path : ./output/checkpoint_WaveGlow_80.pt
DLL 2020-09-21 20:06:34.624813 - PARAMETER resume_from_last : False
DLL 2020-09-21 20:06:34.624849 - PARAMETER dynamic_loss_scaling : True
DLL 2020-09-21 20:06:34.624882 - PARAMETER amp : True
DLL 2020-09-21 20:06:34.624915 - PARAMETER cudnn_enabled : True
DLL 2020-09-21 20:06:34.624949 - PARAMETER cudnn_benchmark : True
DLL 2020-09-21 20:06:34.624982 - PARAMETER disable_uniform_initialize_bn_weight : False
DLL 2020-09-21 20:06:34.625014 - PARAMETER use_saved_learning_rate : False
DLL 2020-09-21 20:06:34.625045 - PARAMETER learning_rate : 0.0001
DLL 2020-09-21 20:06:34.625077 - PARAMETER weight_decay : 0.0
DLL 2020-09-21 20:06:34.625109 - PARAMETER grad_clip_thresh : 3.4028234663852886e+38
DLL 2020-09-21 20:06:34.625147 - PARAMETER batch_size : 8
DLL 2020-09-21 20:06:34.625179 - PARAMETER grad_clip : 5.0
DLL 2020-09-21 20:06:34.625211 - PARAMETER load_mel_from_disk : False
DLL 2020-09-21 20:06:34.625248 - PARAMETER training_files : filelists/audio_text_train_filelist.txt
DLL 2020-09-21 20:06:34.625281 - PARAMETER validation_files : filelists/audio_text_val_filelist.txt
DLL 2020-09-21 20:06:34.625318 - PARAMETER text_cleaners : ['transliteration_cleaners']
DLL 2020-09-21 20:06:34.625355 - PARAMETER max_wav_value : 32768.0
DLL 2020-09-21 20:06:34.625388 - PARAMETER sampling_rate : 24000
DLL 2020-09-21 20:06:34.625420 - PARAMETER filter_length : 2048
DLL 2020-09-21 20:06:34.625461 - PARAMETER hop_length : 300
DLL 2020-09-21 20:06:34.625493 - PARAMETER win_length : 1200
DLL 2020-09-21 20:06:34.625524 - PARAMETER mel_fmin : 80.0
DLL 2020-09-21 20:06:34.625557 - PARAMETER mel_fmax : 7600.0
DLL 2020-09-21 20:06:34.625589 - PARAMETER rank : 0
DLL 2020-09-21 20:06:34.625620 - PARAMETER world_size : 4
DLL 2020-09-21 20:06:34.625652 - PARAMETER dist_url : tcp://localhost:23456
DLL 2020-09-21 20:06:34.625683 - PARAMETER group_name : group_name
DLL 2020-09-21 20:06:34.625717 - PARAMETER dist_backend : nccl
DLL 2020-09-21 20:06:34.625749 - PARAMETER bench_class :
DLL 2020-09-21 20:06:34.625780 - PARAMETER model_name : Tacotron2_PyT

DLLL {"timestamp": "1600683496.071985", "datetime": "2020-09-21 18:18:16.071985", "elapsedtime": "1953.546889", "type": "LOG", "step": [83, 56], "data": {"train_items_per_sec": 107229.49071595793}}
DLLL {"timestamp": "1600683496.072122", "datetime": "2020-09-21 18:18:16.072122", "elapsedtime": "1953.547026", "type": "LOG", "step": [83, 56], "data": {"train_iter_time": 2.3874029270373285}}
DLLL {"timestamp": "1600683497.381073", "datetime": "2020-09-21 18:18:17.381073", "elapsedtime": "1954.855977", "type": "LOG", "step": [83, 57], "data": {"glob_iter/iters_per_epoch": "497/220"}}
DLLL {"timestamp": "1600683497.954723", "datetime": "2020-09-21 18:18:17.954723", "elapsedtime": "1955.429627", "type": "LOG", "step": [83, 57], "data": {"train_loss": -4.422453880310059}}
DLLL {"timestamp": "1600683499.6586", "datetime": "2020-09-21 18:18:19.658600", "elapsedtime": "1957.133504", "type": "LOG", "step": [83, 57], "data": {"train_items_per_sec": 112401.72626839254}}
DLLL {"timestamp": "1600683499.65873", "datetime": "2020-09-21 18:18:19.658730", "elapsedtime": "1957.133634", "type": "LOG", "step": [83, 57], "data": {"train_iter_time": 2.2775450920453295}}
DLLL {"timestamp": "1600683500.788385", "datetime": "2020-09-21 18:18:20.788385", "elapsedtime": "1958.263289", "type": "LOG", "step": [83, 58], "data": {"glob_iter/iters_per_epoch": "498/220"}}
DLLL {"timestamp": "1600683501.361106", "datetime": "2020-09-21 18:18:21.361106", "elapsedtime": "1958.83601", "type": "LOG", "step": [83, 58], "data": {"train_loss": -1.9046612977981567}}
DLLL {"timestamp": "1600683502.975477", "datetime": "2020-09-21 18:18:22.975477", "elapsedtime": "1960.450381", "type": "LOG", "step": [83, 58], "data": {"train_items_per_sec": 117049.5258397939}}
DLLL {"timestamp": "1600683502.975609", "datetime": "2020-09-21 18:18:22.975609", "elapsedtime": "1960.450513", "type": "LOG", "step": [83, 58], "data": {"train_iter_time": 2.1871083899168298}}
DLLL {"timestamp": "1600683504.202287", "datetime": "2020-09-21 18:18:24.202287", "elapsedtime": "1961.677191", "type": "LOG", "step": [83, 59], "data": {"glob_iter/iters_per_epoch": "499/220"}}
DLLL {"timestamp": "1600683504.849494", "datetime": "2020-09-21 18:18:24.849494", "elapsedtime": "1962.324398", "type": "LOG", "step": [83, 59], "data": {"train_loss": -1.9907182455062866}}
DLLL {"timestamp": "1600683506.443915", "datetime": "2020-09-21 18:18:26.443915", "elapsedtime": "1963.918819", "type": "LOG", "step": [83, 59], "data": {"train_items_per_sec": 114201.97130917234}}
DLLL {"timestamp": "1600683506.444077", "datetime": "2020-09-21 18:18:26.444077", "elapsedtime": "1963.918981", "type": "LOG", "step": [83, 59], "data": {"train_iter_time": 2.241642565932125}}
DLLL {"timestamp": "1600683507.685249", "datetime": "2020-09-21 18:18:27.685249", "elapsedtime": "1965.160153", "type": "LOG", "step": [83, 60], "data": {"glob_iter/iters_per_epoch": "500/220"}}

Expected behavior
I have already tried the method offered in #236, but the loss kept increasing after epoch 80 and reached +7 at epoch 82, which is abnormal.
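
For reference, --grad-clip-thresh 3.4028234663852886e+38 in the command above is float32 max, so gradient clipping is effectively disabled. A minimal sketch of clipping to a finite norm in a generic PyTorch training step follows; the function and variable names are illustrative and may not match train.py.

import torch

def training_step(model, optimizer, criterion, batch, grad_clip_thresh=1.0):
    # Generic sketch only; WaveGlow's actual forward/loss call differs in detail.
    optimizer.zero_grad()
    loss = criterion(model(batch))
    loss.backward()
    # Clip to a finite norm instead of float32 max (which disables clipping).
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    optimizer.step()
    return loss.item(), float(grad_norm)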

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3):
    No container was used.
    PyTorch version: 1.2.0

  • GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB):
    4x Tesla T4.
    CUDA compilation tools, release 10.0, V10.0.130

Thank you very much!

Charlottecuc added the bug label Sep 21, 2020

ghost commented Oct 23, 2020

Hi @Charlottecuc , could you please update your local repo and check if the issue still occurs? We've updated the repo recently. Even if that doesn't help, we'll at least have a common base for further investigation.


ioannist commented Oct 30, 2020

> Hi @Charlottecuc , could you please update your local repo and check if the issue still occurs? We've updated the repo recently. Even if that doesn't help, we'll at least have a common base for further investigation.

Just pulled the repo and got the same error at epoch 67:

DLL 2020-10-30 14:07:59.656565 - (67, 36) glob_iter/iters_per_epoch : 12498/186 
Traceback (most recent call last):
  File "train.py", line 554, in <module>
    main()
  File "train.py", line 488, in main
    raise Exception("loss is NaN")
Exception: loss is NaN

Things I tried that did not work:

  • Normalizing files with pydub:
    normalized_sound = match_target_amplitude(sound, -20.0)
    (see the sketch after this list)
  • Adding more training wav files (the NaN was just thrown at a different time)
  • Setting dynamic-loss-scaling to False
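
For reference, match_target_amplitude is not a pydub built-in; the usual recipe looks roughly like this (a sketch assuming pydub's AudioSegment API, with an illustrative file path):

from pydub import AudioSegment

def match_target_amplitude(sound, target_dBFS):
    # Shift the clip's gain so its average loudness reaches the target dBFS.
    return sound.apply_gain(target_dBFS - sound.dBFS)

sound = AudioSegment.from_wav("example.wav")  # illustrative path
normalized_sound = match_target_amplitude(sound, -20.0)
normalized_sound.export("example_normalized.wav", format="wav")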

I added print(f"grad_norm: {grad_norm}") inside the else branch at line 505, and it seems that grad_norm becomes inf just before the NaN error:

DLL 2020-10-31 17:57:38.359318 - (20, 114) train_iter_time : 0.7604210680001415 
DLL 2020-10-31 17:57:38.361658 - (20, 115) glob_iter/iters_per_epoch : 6235/306 
DLL 2020-10-31 17:57:38.573887 - (20, 115) train_loss : -3.8453807830810547 
grad_norm: 13.877706527709961
DLL 2020-10-31 17:57:39.117177 - (20, 115) train_items_per_sec : 21177.333947350646 
DLL 2020-10-31 17:57:39.117272 - (20, 115) train_iter_time : 0.7555247530108318 
DLL 2020-10-31 17:57:39.119611 - (20, 116) glob_iter/iters_per_epoch : 6236/306 
DLL 2020-10-31 17:57:39.332340 - (20, 116) train_loss : -4.56260347366333 
grad_norm: inf
DLL 2020-10-31 17:57:39.878253 - (20, 116) train_items_per_sec : 21090.173669932265 
DLL 2020-10-31 17:57:39.878334 - (20, 116) train_iter_time : 0.7586471429967787 
DLL 2020-10-31 17:57:39.880512 - (20, 117) glob_iter/iters_per_epoch : 6237/306 
Traceback (most recent call last):
  File "train.py", line 557, in <module>
    main()
  File "train.py", line 489, in main
    raise Exception("loss is NaN")
Exception: loss is NaN

The line that throws the NaN exception is this one:
reduced_loss = loss.item()
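
Given that grad_norm goes to inf one iteration before the NaN, one possible guard is to skip the optimizer step whenever the clipped gradient norm is not finite. This is only a sketch, not code from this repo, and the names are illustrative:

import math
import torch

def clipped_step(model, optimizer, grad_clip_thresh):
    # Clip first, then only apply the update if the total norm is finite.
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh))
    if math.isfinite(grad_norm):
        optimizer.step()
    else:
        # Skipping the step avoids pushing inf/NaN gradients into the weights,
        # but it does not address whatever made the norm blow up.
        print(f"skipping optimizer step, grad_norm: {grad_norm}")
    return grad_norm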

@ioannist
Copy link

Started training again with --amp activated and all wav files normalized. Got a division-by-zero error at epoch 38:

DLL 2020-10-30 22:47:30.112884 - (38, 87) train_iter_time : 0.7559060430066893 
DLL 2020-10-30 22:47:30.114633 - (38, 88) glob_iter/iters_per_epoch : 11410/306 
DLL 2020-10-30 22:47:30.377361 - (38, 88) train_loss : -3.687452554702759 
Traceback (most recent call last):
  File "train.py", line 555, in <module>
    main()
  File "train.py", line 500, in main
    scaled_loss.backward()
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,   # 1./scale,
ZeroDivisionError: float division by zero
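
Since the traceback ends inside apex's loss scaler, one workaround sketch (assuming PyTorch >= 1.6, not the apex path this repo ships) is native torch.cuda.amp, whose GradScaler detects inf/NaN gradients, skips that optimizer step, and lowers the loss scale:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def amp_step(model, optimizer, criterion, batch):
    # Generic sketch; the actual WaveGlow forward/loss call differs in detail.
    optimizer.zero_grad()
    with autocast():
        loss = criterion(model(batch))
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # step is skipped internally if grads are inf/NaN
    scaler.update()         # the loss scale is reduced after a skipped step
    return loss.item()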


ioannist commented Oct 31, 2020

Turns out the previous issue is a separate one. The NaN is also thrown with --amp activated (it just happened that, in the previous run, the zero-division error interrupted the training first). I will open a new ticket for the zero-division error.

In my dataset, I can get "loss is NaN" to be thrown in the first epoch if I set the learning rate to 1e-3. You can get the data and project settings from #732; they are identical for this problem.
