[PyTorch/WaveGlow] NaN loss after 80 epochs #694
Comments
Hi @Charlottecuc, could you please update your local repo and check if the issue still occurs? We've updated the repo recently. Even if that doesn't help, we'll at least have a common base for further investigation.
Just pulled the repo and got the same error at epoch 67.
Things I tried that did not work:
- I added print(f"grad_norm: {grad_norm}") inside the else branch at line 505, and it seems that grad_norm becomes inf just before the NaN error.
- The line that throws the NaN exception is this one.
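For context, here is a minimal sketch of the kind of diagnostic described above, assuming the loop clips gradients with torch.nn.utils.clip_grad_norm_; the helper name and variables are hypothetical and not the exact code at line 505 of train.py:

```python
import math
import torch

def clip_and_step(model, optimizer, grad_clip_thresh):
    """Clip gradients, report the norm, and skip the update if it is not finite.

    Hypothetical debugging helper, not the actual code in train.py.
    """
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    print(f"grad_norm: {grad_norm}")
    if not math.isfinite(float(grad_norm)):
        # An inf/NaN total norm here means the following optimizer.step() would
        # corrupt the weights, which matches the loss turning NaN one step later.
        optimizer.zero_grad()
        return False
    optimizer.step()
    optimizer.zero_grad()
    return True
```

Note that with --grad-clip-thresh set to 3.4028234663852886e+38 (float32 max, as in the training command further down), clip_grad_norm_ effectively never clips and only reports the norm.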
Started training again with --amp activated and all wav files normalized. Got a division by zero error at epoch 38.
It turns out that the previous issue is a separate one. The NaN is also thrown with --amp activated (it just happened that in the previous run the division-by-zero error interrupted the training first). I will open a new ticket for the zero-division error. With my dataset, I can get "Loss is NaN" to be thrown in the first epoch if I set the learning rate to 1e-3. You can get the dataset and project settings from #732; they are identical for this problem.
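For what it's worth, here is a minimal sketch of the AMP path, assuming the training script uses NVIDIA Apex (apex.amp), as the amp/dynamic_loss_scaling parameters in the log suggest; the wrapper below is hypothetical, not the repo's actual training step:

```python
import math
import torch
from apex import amp  # assumes NVIDIA Apex is installed

def amp_backward_and_step(optimizer, loss, grad_clip_thresh):
    """Hypothetical AMP-aware backward/step with a finiteness check on the grad norm."""
    # Apex scales the loss so fp16 gradients do not underflow; with dynamic loss
    # scaling, an overly large scale can still produce inf gradients for a few steps.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Clip (and measure) the norm on the fp32 master parameters, not model.parameters().
    grad_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), grad_clip_thresh)
    if math.isfinite(float(grad_norm)):
        optimizer.step()
    optimizer.zero_grad()
    return grad_norm
```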
Related to Model/Framework(s)
PyTorch/Tacotron2/WaveGlow
Describe the bug
Hi. I tried to train the WaveGlow model and encountered a "Loss is NaN" error after 80 epochs. Before epoch 82, the loss was around -5 to -4.
To Reproduce
Steps to reproduce the behavior:
python3 -m multiproc train.py -m WaveGlow -o ./output/ -lr 1e-4 --epochs 1501 -bs 8 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --amp
I have changed audio parameters such as the sampling rate, filter-length, hop-length, win-length, mel-fmin, and mel-fmax in all places. I also changed stft.py according to NVIDIA/tacotron2#125 (comment) because of a CPU memory issue.
(I made sure that every training audio file is longer than the segment length, so the problem was not caused by zero-padding.)
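As a quick sanity check for the zero-padding point above, something like the following can verify that no training wav is shorter than --segment-length (a sketch using only the standard library; the filelist path comes from the parameters below, and the wav_path|transcript line layout is an assumption):

```python
import wave

SEGMENT_LENGTH = 8000  # --segment-length from the training command
FILELIST = "filelists/audio_text_train_filelist.txt"  # training_files from the parameters

# Assumes each filelist line is "wav_path|transcript" and the wavs are plain PCM WAV;
# other formats would need soundfile/librosa instead of the wave module.
with open(FILELIST, encoding="utf-8") as f:
    for line in f:
        wav_path = line.split("|", 1)[0].strip()
        if not wav_path:
            continue
        with wave.open(wav_path, "rb") as w:
            n_samples = w.getnframes()
        if n_samples < SEGMENT_LENGTH:
            print(f"{wav_path}: only {n_samples} samples (< {SEGMENT_LENGTH})")
```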
DLL 2020-09-21 20:06:34.624410 - PARAMETER output : ./output/
DLL 2020-09-21 20:06:34.624515 - PARAMETER dataset_path : ./
DLL 2020-09-21 20:06:34.624560 - PARAMETER model_name : WaveGlow
DLL 2020-09-21 20:06:34.624597 - PARAMETER log_file : nvlog4.json
DLL 2020-09-21 20:06:34.624633 - PARAMETER anneal_steps : None
DLL 2020-09-21 20:06:34.624671 - PARAMETER anneal_factor : 0.1
DLL 2020-09-21 20:06:34.624707 - PARAMETER epochs : 1501
DLL 2020-09-21 20:06:34.624742 - PARAMETER epochs_per_checkpoint : 20
DLL 2020-09-21 20:06:34.624776 - PARAMETER checkpoint_path : ./output/checkpoint_WaveGlow_80.pt
DLL 2020-09-21 20:06:34.624813 - PARAMETER resume_from_last : False
DLL 2020-09-21 20:06:34.624849 - PARAMETER dynamic_loss_scaling : True
DLL 2020-09-21 20:06:34.624882 - PARAMETER amp : True
DLL 2020-09-21 20:06:34.624915 - PARAMETER cudnn_enabled : True
DLL 2020-09-21 20:06:34.624949 - PARAMETER cudnn_benchmark : True
DLL 2020-09-21 20:06:34.624982 - PARAMETER disable_uniform_initialize_bn_weight : False
DLL 2020-09-21 20:06:34.625014 - PARAMETER use_saved_learning_rate : False
DLL 2020-09-21 20:06:34.625045 - PARAMETER learning_rate : 0.0001
DLL 2020-09-21 20:06:34.625077 - PARAMETER weight_decay : 0.0
DLL 2020-09-21 20:06:34.625109 - PARAMETER grad_clip_thresh : 3.4028234663852886e+38
DLL 2020-09-21 20:06:34.625147 - PARAMETER batch_size : 8
DLL 2020-09-21 20:06:34.625179 - PARAMETER grad_clip : 5.0
DLL 2020-09-21 20:06:34.625211 - PARAMETER load_mel_from_disk : False
DLL 2020-09-21 20:06:34.625248 - PARAMETER training_files : filelists/audio_text_train_filelist.txt
DLL 2020-09-21 20:06:34.625281 - PARAMETER validation_files : filelists/audio_text_val_filelist.txt
DLL 2020-09-21 20:06:34.625318 - PARAMETER text_cleaners : ['transliteration_cleaners']
DLL 2020-09-21 20:06:34.625355 - PARAMETER max_wav_value : 32768.0
DLL 2020-09-21 20:06:34.625388 - PARAMETER sampling_rate : 24000
DLL 2020-09-21 20:06:34.625420 - PARAMETER filter_length : 2048
DLL 2020-09-21 20:06:34.625461 - PARAMETER hop_length : 300
DLL 2020-09-21 20:06:34.625493 - PARAMETER win_length : 1200
DLL 2020-09-21 20:06:34.625524 - PARAMETER mel_fmin : 80.0
DLL 2020-09-21 20:06:34.625557 - PARAMETER mel_fmax : 7600.0
DLL 2020-09-21 20:06:34.625589 - PARAMETER rank : 0
DLL 2020-09-21 20:06:34.625620 - PARAMETER world_size : 4
DLL 2020-09-21 20:06:34.625652 - PARAMETER dist_url : tcp://localhost:23456
DLL 2020-09-21 20:06:34.625683 - PARAMETER group_name : group_name
DLL 2020-09-21 20:06:34.625717 - PARAMETER dist_backend : nccl
DLL 2020-09-21 20:06:34.625749 - PARAMETER bench_class :
DLL 2020-09-21 20:06:34.625780 - PARAMETER model_name : Tacotron2_PyT
DLLL {"timestamp": "1600683496.071985", "datetime": "2020-09-21 18:18:16.071985", "elapsedtime": "1953.546889", "type": "LOG", "step": [83, 56], "data": {"train_items_per_sec": 107229.49071595793}}
DLLL {"timestamp": "1600683496.072122", "datetime": "2020-09-21 18:18:16.072122", "elapsedtime": "1953.547026", "type": "LOG", "step": [83, 56], "data": {"train_iter_time": 2.3874029270373285}}
DLLL {"timestamp": "1600683497.381073", "datetime": "2020-09-21 18:18:17.381073", "elapsedtime": "1954.855977", "type": "LOG", "step": [83, 57], "data": {"glob_iter/iters_per_epoch": "497/220"}}
DLLL {"timestamp": "1600683497.954723", "datetime": "2020-09-21 18:18:17.954723", "elapsedtime": "1955.429627", "type": "LOG", "step": [83, 57], "data": {"train_loss": -4.422453880310059}}
DLLL {"timestamp": "1600683499.6586", "datetime": "2020-09-21 18:18:19.658600", "elapsedtime": "1957.133504", "type": "LOG", "step": [83, 57], "data": {"train_items_per_sec": 112401.72626839254}}
DLLL {"timestamp": "1600683499.65873", "datetime": "2020-09-21 18:18:19.658730", "elapsedtime": "1957.133634", "type": "LOG", "step": [83, 57], "data": {"train_iter_time": 2.2775450920453295}}
DLLL {"timestamp": "1600683500.788385", "datetime": "2020-09-21 18:18:20.788385", "elapsedtime": "1958.263289", "type": "LOG", "step": [83, 58], "data": {"glob_iter/iters_per_epoch": "498/220"}}
DLLL {"timestamp": "1600683501.361106", "datetime": "2020-09-21 18:18:21.361106", "elapsedtime": "1958.83601", "type": "LOG", "step": [83, 58], "data": {"train_loss": -1.9046612977981567}}
DLLL {"timestamp": "1600683502.975477", "datetime": "2020-09-21 18:18:22.975477", "elapsedtime": "1960.450381", "type": "LOG", "step": [83, 58], "data": {"train_items_per_sec": 117049.5258397939}}
DLLL {"timestamp": "1600683502.975609", "datetime": "2020-09-21 18:18:22.975609", "elapsedtime": "1960.450513", "type": "LOG", "step": [83, 58], "data": {"train_iter_time": 2.1871083899168298}}
DLLL {"timestamp": "1600683504.202287", "datetime": "2020-09-21 18:18:24.202287", "elapsedtime": "1961.677191", "type": "LOG", "step": [83, 59], "data": {"glob_iter/iters_per_epoch": "499/220"}}
DLLL {"timestamp": "1600683504.849494", "datetime": "2020-09-21 18:18:24.849494", "elapsedtime": "1962.324398", "type": "LOG", "step": [83, 59], "data": {"train_loss": -1.9907182455062866}}
DLLL {"timestamp": "1600683506.443915", "datetime": "2020-09-21 18:18:26.443915", "elapsedtime": "1963.918819", "type": "LOG", "step": [83, 59], "data": {"train_items_per_sec": 114201.97130917234}}
DLLL {"timestamp": "1600683506.444077", "datetime": "2020-09-21 18:18:26.444077", "elapsedtime": "1963.918981", "type": "LOG", "step": [83, 59], "data": {"train_iter_time": 2.241642565932125}}
DLLL {"timestamp": "1600683507.685249", "datetime": "2020-09-21 18:18:27.685249", "elapsedtime": "1965.160153", "type": "LOG", "step": [83, 60], "data": {"glob_iter/iters_per_epoch": "500/220"}}
Expected behavior
I have already tried the method suggested in #236, but the loss was still increasing after epoch 80 and reached +7 at epoch 82, which is abnormal.
Environment
Container version (e.g. pytorch:19.05-py3): no container was used
PyTorch version: 1.2.0
GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB): 4x Tesla T4
CUDA: Cuda compilation tools, release 10.0, V10.0.130
Thank you very much!