loss nan in VAE training #144

Open

shwj114514 opened this issue Sep 19, 2024 · 8 comments
@shwj114514

shwj114514 commented Sep 19, 2024

Thank you for your excellent work and the well-designed open-source code.

When I use your training code to train from scratch, I frequently encounter a situation where the loss becomes NaN after a certain number of training steps. Is this behavior expected?

This issue occurs when training on both 44100 Hz mono and stereo audio. I have to restart training multiple times before the loss stays stable.

I am using the stable audio 2.0 config.

@apply74

apply74 commented Sep 23, 2024

I also encountered this problem: when I increased the model's parameter count, training became unstable. Is a fix planned?

@shwj114514 (Author)

I also encountered this problem: when I increased the model's parameter count, training became unstable. Is a fix planned?

I solved this problem by reducing the learning rates of both the generator and discriminator to 1/10 of their original values, and the training became stable.
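
For anyone else hitting this, here is a minimal sketch of what that change could look like, assuming the optimizer layout used by the stable-audio-tools autoencoder configs (the "optimizer_configs" → "autoencoder"/"discriminator" → "optimizer" → "config" → "lr" keys are an assumption; check them against your own stable audio 2.0 VAE config before relying on this):

```python
# Hypothetical helper: scale the generator and discriminator learning rates
# in a stable-audio-tools-style model config by 0.1 before training.
# The nested key layout below is assumed, not confirmed by the repo docs.
import json

def scale_learning_rates(config_path: str, out_path: str, factor: float = 0.1) -> None:
    with open(config_path) as f:
        model_config = json.load(f)

    opt_cfgs = model_config["training"]["optimizer_configs"]
    for key in ("autoencoder", "discriminator"):
        opt = opt_cfgs[key]["optimizer"]["config"]
        opt["lr"] = opt["lr"] * factor  # e.g. 1e-4 -> 1e-5

    with open(out_path, "w") as f:
        json.dump(model_config, f, indent=4)

scale_learning_rates("stable_audio_2_0_vae_config.json", "vae_config_low_lr.json")
```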

@apply74

apply74 commented Sep 23, 2024

I also tried reducing the learning rate. Training becomes stable, but the reconstruction quality ends up very poor.

@fletcherist

Same thing here.

@apply74

apply74 commented Sep 26, 2024

Same thing here.

I have solved the problem by increasing the batch_size from 1 to 5.

@fletcherist

fletcherist commented Sep 26, 2024

I have solved the problem by increasing the batch_size from 1 to 5.

@apply74 Oh really? Let me try it, but I think that batch size won't fit on my GPU. I'll report back after I try it. Thanks for your help, much appreciated.

@fletcherist

reducing the learning rates of both the generator and discriminator to 1/10 of their original values

This works.

@nateraw

nateraw commented Sep 28, 2024

You have to tune the learning rates. Higher batch size helps keep things stable.

Another tip: if you can't fit a large enough batch size, reducing the sample size should free up enough memory to bump the batch size back up.

Hope this helps ❤️
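
A rough sketch of that trade-off: activation memory scales roughly with batch_size × sample_size, so shortening the clips lets you raise the batch size within the same GPU budget. The top-level "sample_size" key in the model config and the --batch-size flag on train.py are taken from the stable-audio-tools examples; verify both against your checkout.

```python
# Illustrative only: shrink sample_size in the model config and suggest a
# proportionally larger batch size to pass to train.py via --batch-size.
import json

def rebalance(config_path: str, out_path: str, old_batch: int = 1, shrink: int = 4) -> int:
    """Shrink sample_size by `shrink`x and return a proportionally larger batch size."""
    with open(config_path) as f:
        cfg = json.load(f)

    cfg["sample_size"] = cfg["sample_size"] // shrink  # shorter training clips
    with open(out_path, "w") as f:
        json.dump(cfg, f, indent=4)

    return old_batch * shrink

new_batch = rebalance("stable_audio_2_0_vae_config.json", "vae_config_short_clips.json")
print(f"Use --batch-size {new_batch} with the rewritten config.")
```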
