
[PyTorch/WaveGlow] ZeroDivisionError: float division by zero #732

Open
ioannist opened this issue Oct 31, 2020 · 0 comments

Labels
bug Something isn't working
Related to Model/Framework(s)
PyTorch/Tacotron2/WaveGlow

Describe the bug

DLL 2020-10-30 22:47:30.112884 - (38, 87) train_iter_time : 0.7559060430066893 
DLL 2020-10-30 22:47:30.114633 - (38, 88) glob_iter/iters_per_epoch : 11410/306 
DLL 2020-10-30 22:47:30.377361 - (38, 88) train_loss : -3.687452554702759 
Traceback (most recent call last):
  File "train.py", line 555, in <module>
    main()
  File "train.py", line 500, in main
    scaled_loss.backward()
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/ioannis/anaconda3/envs/waveglow/lib/python3.7/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,   # 1./scale,
ZeroDivisionError: float division by zero
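The traceback ends in apex's scaler computing `out_scale/grads_have_scale` (i.e. `1./scale`) with a scale of zero. One plausible mechanism, offered here as an assumption rather than a confirmed diagnosis: apex's dynamic loss scaler halves the scale on every iteration whose gradients overflow, so if every step overflows, the scale can underflow all the way to exactly `0.0`, after which the division raises. A minimal sketch of that failure mode (the variable names are illustrative, not apex's actual API):

```python
# Sketch: repeated halving of a dynamic loss scale underflows to exactly
# 0.0 in IEEE-754 doubles, reproducing the ZeroDivisionError seen above.
# This models apex's backoff behavior (halve the scale on overflow).

scale = 2.0 ** 15          # a typical initial loss scale
steps = 0
while scale > 0.0:         # pretend every iteration overflows
    scale /= 2.0           # backoff: halve the scale
    steps += 1

# scale is now exactly 0.0 (the smallest subnormal halved once more)
try:
    inv = 1.0 / scale      # what scaler.py effectively computes
except ZeroDivisionError as exc:
    print(f"after {steps} halvings: {exc}")
```

If this is the cause, flooring the scale via `amp.initialize(..., min_loss_scale=...)` may avoid the crash, though a loss scale that collapses like this usually means the gradients themselves contain NaN/Inf every step, i.e. training has diverged.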

To Reproduce
Steps to reproduce the behavior:

  1. Install SpeechSynthesis/Tacotron2 dependencies from requirements.txt + PyTorch 1.6

  2. Download training data from here and add to home Tacotron2 dir
    https://s3.amazonaws.com/skinnybottle.com/downloads/tacotron-data.rar

  3. Run and wait a few dozen epochs
    python train.py -d wavs --model-name WaveGlow --training-files metadata-training-files.csv --validation-files metadata-validation-files.csv -o trumpbot-output-amp --epochs 1001 --learning-rate 1e-4 --batch-size 4 --cudnn-enabled --epochs-per-checkpoint 10 --resume-from-last --amp

Expected behavior
Should train all the way to epoch 1001

Environment
Please provide at least:

  • all dependencies installed with pip inside a PyTorch 1.6 conda env
  • GPU: GTX 1080 Ti
  • Driver Version: 440.33.01, CUDA Version: 10.2

I should note that when I don't run into this error, I run into #694 instead. Not sure if they are related.

@ioannist ioannist added the bug Something isn't working label Oct 31, 2020
@nvpstr nvpstr assigned ghost Feb 16, 2021