
[PyTorch/WaveGlow] ZeroDivisionError: float division by zero #732

Open
ioannist opened this issue Oct 31, 2020 · 0 comments

Labels
bug Something isn't working
Related to Model/Framework(s)
PyTorch/Tacotron2/WaveGlow

Describe the bug

DLL 2020-10-30 22:47:30.112884 - (38, 87) train_iter_time : 0.7559060430066893 
DLL 2020-10-30 22:47:30.114633 - (38, 88) glob_iter/iters_per_epoch : 11410/306 
DLL 2020-10-30 22:47:30.377361 - (38, 88) train_loss : -3.687452554702759 
Traceback (most recent call last):
  File "train.py", line 555, in <module>
    main()
  File "train.py", line 500, in main
    scaled_loss.backward()
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,   # 1./scale,
ZeroDivisionError: float division by zero
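The traceback ends in apex's scaler computing `out_scale/grads_have_scale` (i.e. `1./scale`) with a scale of zero. One plausible mechanism, offered here as an assumption rather than a confirmed diagnosis: apex's dynamic loss scaler halves the scale on every iteration whose gradients overflow, so if every step overflows, the scale can underflow all the way to exactly `0.0`, after which the division raises. A minimal sketch of that failure mode (the variable names are illustrative, not apex's actual API):

```python
# Sketch: repeated halving of a dynamic loss scale underflows to exactly
# 0.0 in IEEE-754 doubles, reproducing the ZeroDivisionError seen above.
# This models apex's backoff behavior (halve the scale on overflow).

scale = 2.0 ** 15          # a typical initial loss scale
steps = 0
while scale > 0.0:         # pretend every iteration overflows
    scale /= 2.0           # backoff: halve the scale
    steps += 1

# scale is now exactly 0.0 (the smallest subnormal halved once more)
try:
    inv = 1.0 / scale      # what scaler.py effectively computes
except ZeroDivisionError as exc:
    print(f"after {steps} halvings: {exc}")
```

If this is the cause, flooring the scale via `amp.initialize(..., min_loss_scale=...)` may avoid the crash, though a loss scale that collapses like this usually means the gradients themselves contain NaN/Inf every step, i.e. training has diverged.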

To Reproduce
Steps to reproduce the behavior:

  1. Install SpeechSynthesis/Tacotron2 dependencies from requirements.txt + PyTorch 1.6

  2. Download training data from here and add to home Tacotron2 dir
    https://s3.amazonaws.com/skinnybottle.com/downloads/tacotron-data.rar

  3. Run and wait a few dozen epochs
    python train.py -d wavs --model-name WaveGlow --training-files metadata-training-files.csv --validation-files metadata-validation-files.csv -o trumpbot-output-amp --epochs 1001 --learning-rate 1e-4 --batch-size 4 --cudnn-enabled --epochs-per-checkpoint 10 --resume-from-last --amp

Expected behavior
Should train all the way to epoch 1001

Environment
Please provide at least:

  • all dependencies installed with pip inside a PyTorch 1.6 conda env
  • GPU: GTX 1080 Ti
  • Driver Version: 440.33.01, CUDA Version: 10.2

I should note that when I don't run into this error, I run into #694 instead. Not sure if they are related.

@ioannist ioannist added the bug Something isn't working label Oct 31, 2020
@nvpstr nvpstr assigned ghost Feb 16, 2021