[WaveGlow/Pytorch] Cannot start training from last checkpoint #809

Open
IvanRubanov opened this issue Jan 12, 2021 · 2 comments
Labels
bug Something isn't working

Comments

IvanRubanov commented Jan 12, 2021

Related to WaveGlow/Pytorch

Trying to continue training from a checkpoint crashes the Python script.
I start training with the following command:

! bash scripts/train_waveglow.sh

I get the following output:

DLL 2021-01-12 21:51:52.587361 - PARAMETER output : /home/user/checkpoints/ 
DLL 2021-01-12 21:51:52.587441 - PARAMETER dataset_path : /home/user/datasets/finnish-single-speaker-speech-dataset/data 
DLL 2021-01-12 21:51:52.587476 - PARAMETER model_name : WaveGlow 
DLL 2021-01-12 21:51:52.587517 - PARAMETER log_file : /home/user/logs/nvlog.json 
DLL 2021-01-12 21:51:52.587567 - PARAMETER anneal_steps : None 
DLL 2021-01-12 21:51:52.587619 - PARAMETER anneal_factor : 0.1 
DLL 2021-01-12 21:51:52.587669 - PARAMETER config_file : None 
DLL 2021-01-12 21:51:52.587722 - PARAMETER epochs : 1501 
DLL 2021-01-12 21:51:52.587786 - PARAMETER epochs_per_checkpoint : 1 
DLL 2021-01-12 21:51:52.587838 - PARAMETER checkpoint_path :  
DLL 2021-01-12 21:51:52.587888 - PARAMETER resume_from_last : True 
DLL 2021-01-12 21:51:52.587940 - PARAMETER dynamic_loss_scaling : True 
DLL 2021-01-12 21:51:52.587990 - PARAMETER amp : False 
DLL 2021-01-12 21:51:52.588041 - PARAMETER cudnn_enabled : True 
DLL 2021-01-12 21:51:52.588091 - PARAMETER cudnn_benchmark : True 
DLL 2021-01-12 21:51:52.588143 - PARAMETER disable_uniform_initialize_bn_weight : False 
DLL 2021-01-12 21:51:52.588199 - PARAMETER use_saved_learning_rate : False 
DLL 2021-01-12 21:51:52.588250 - PARAMETER learning_rate : 0.0001 
DLL 2021-01-12 21:51:52.588300 - PARAMETER weight_decay : 0.0 
DLL 2021-01-12 21:51:52.588351 - PARAMETER grad_clip_thresh : 3.4028234663852886e+38 
DLL 2021-01-12 21:51:52.588403 - PARAMETER batch_size : 4 
DLL 2021-01-12 21:51:52.588455 - PARAMETER grad_clip : 5.0 
DLL 2021-01-12 21:51:52.588506 - PARAMETER load_mel_from_disk : False 
DLL 2021-01-12 21:51:52.588557 - PARAMETER training_files : /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/train.csv 
DLL 2021-01-12 21:51:52.588607 - PARAMETER validation_files : /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/validate.csv 
DLL 2021-01-12 21:51:52.588657 - PARAMETER text_cleaners : ['basic_cleaners'] 
DLL 2021-01-12 21:51:52.588726 - PARAMETER max_wav_value : 32768.0 
DLL 2021-01-12 21:51:52.588791 - PARAMETER sampling_rate : 22050 
DLL 2021-01-12 21:51:52.588842 - PARAMETER filter_length : 1024 
DLL 2021-01-12 21:51:52.588891 - PARAMETER hop_length : 256 
DLL 2021-01-12 21:51:52.588941 - PARAMETER win_length : 1024 
DLL 2021-01-12 21:51:52.588991 - PARAMETER mel_fmin : 0.0 
DLL 2021-01-12 21:51:52.589043 - PARAMETER mel_fmax : 8000.0 
DLL 2021-01-12 21:51:52.589097 - PARAMETER rank : 0 
DLL 2021-01-12 21:51:52.589147 - PARAMETER world_size : 1 
DLL 2021-01-12 21:51:52.589197 - PARAMETER dist_url : tcp://localhost:23456 
DLL 2021-01-12 21:51:52.589247 - PARAMETER group_name : group_name 
DLL 2021-01-12 21:51:52.589298 - PARAMETER dist_backend : nccl 
DLL 2021-01-12 21:51:52.589348 - PARAMETER bench_class :  
DLL 2021-01-12 21:51:52.589397 - PARAMETER model_name : Tacotron2_PyT 
Loading checkpoint from symlink /home/user/checkpoints/checkpoint_WaveGlow_last.pt
scripts/train_waveglow.sh: line 2:   759 Killed                  python train.py -m WaveGlow -o /home/user/checkpoints/ -lr 1e-4 --epochs 1501 -bs 4 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file /home/user/logs/nvlog.json --epochs-per-checkpoint 1 --dataset-path /home/user/datasets/finnish-single-speaker-speech-dataset/data --training-files /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/train.csv --validation-files /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/validate.csv --resume-from-last --amp

Environment

  • OS: Win 10/WSL2
  • Container version: nvidia/cuda:11.0-devel-ubuntu20.04
  • GPUs in the system: RTX 3070
  • CUDA driver version: 465.21

I added print statements to train.py. The script crashes on the following line:

optimizer.load_state_dict(checkpoint['optimizer'])

Unfortunately, there is no stack trace or any other crash log. Could anyone suggest how I can debug the issue? What could be the reason?
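
A note on the "Killed" message in the output above: the shell prints it when the process received SIGKILL, which Python cannot catch, so no traceback is produced. On Linux this is often (though not always) the kernel's out-of-memory killer, and `dmesg` run right after the crash may confirm it. One way to test that hypothesis is to log memory right before the suspect line. The following is a minimal sketch using only the standard library; the helper names `read_meminfo_kb` and `log_memory` are hypothetical and not part of train.py, and it assumes a Linux environment (including WSL2) where `/proc/meminfo` is readable.

```python
import resource


def read_meminfo_kb(field="MemAvailable", path="/proc/meminfo"):
    """Return one field of /proc/meminfo in kB, or None if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


def log_memory(tag):
    """Print this process's peak RSS and the system's available memory."""
    # ru_maxrss is reported in kB on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    avail_kb = read_meminfo_kb()
    # flush=True so the line is written even if the process is killed right after.
    print(f"[mem] {tag}: peak_rss={peak_kb} kB, mem_available={avail_kb} kB",
          flush=True)
    return peak_kb, avail_kb
```

Calling `log_memory("before optimizer.load_state_dict")` immediately before the crashing line would show whether available memory is already close to zero at that point.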

@IvanRubanov IvanRubanov added the bug Something isn't working label Jan 12, 2021
@ghost ghost self-assigned this Feb 11, 2021
ghost commented Feb 23, 2021

Hi @IvanRubanov, without a stack trace it won't be easy, but let's try. Just to be sure, is /home/user/checkpoints/checkpoint_WaveGlow_last.pt a symlink or an actual file? Is it a checkpoint you trained yourself or was it downloaded?

@IvanRubanov
Author

Hello, thanks for the reply. The file is a symlink. It is from my own attempt to train WaveGlow. What should I do to provide a stack trace?
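
On the stack-trace question: a process terminated with SIGKILL cannot write a traceback at the moment it dies, but Python's standard `faulthandler` module can dump tracebacks periodically, so the last dump in the log shows roughly where the script was. The following is a sketch under that assumption. The wrapper names `start_traceback_watchdog` and `stop_traceback_watchdog` are hypothetical; only the `faulthandler` calls are real standard-library API.

```python
import faulthandler
import sys


def start_traceback_watchdog(interval_s=5.0, file=sys.stderr):
    """Dump tracebacks on catchable fatal signals and every interval_s seconds."""
    # Covers signals such as SIGSEGV or SIGABRT. SIGKILL cannot be handled.
    faulthandler.enable(file=file, all_threads=True)
    # Periodic dump of all threads; the last one written before the process
    # is killed indicates the code that was running at the time.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def stop_traceback_watchdog():
    """Cancel the periodic dump started by start_traceback_watchdog."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `start_traceback_watchdog()` near the top of train.py and attaching the resulting stderr output to the issue would give the maintainers something close to a stack trace.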
