
[GNMT v2/Tensorflow] Loss not decreasing when training custom dataset #603

abbyDC opened this issue Jul 14, 2020 · 4 comments

abbyDC commented Jul 14, 2020

I just wanted to ask the following to help me train a custom model that translates <src_lang> to English. I have an issue where the loss ranges from 17 to 200 within a single epoch, swinging up and down drastically, and I'm not sure what else I need to tweak.

Steps I've done:

  1. Edited wmt16_en_de.sh to preprocess my custom data (a sketch of what this amounts to follows the list)
  2. Edited nmt.py to reflect the src and tgt files.
  3. FP32 training on 1 GPU
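
For reference, the preprocessing boils down to something like the sketch below. It is a simplified Python stand-in, not the actual script: the real wmt16_en_de.sh applies Moses tokenization and BPE, and the file names here are placeholders.

```python
# Simplified stand-in for the edited preprocessing step (illustrative only):
# produce aligned train files and a vocabulary for nmt.py to consume.
# The real pipeline applies Moses tokenization and BPE; plain whitespace
# tokenization keeps this sketch self-contained.
from collections import Counter

def preprocess(src_in, tgt_in, src_out, tgt_out, vocab_out, max_len=50):
    vocab = Counter()
    with open(src_in) as fs, open(tgt_in) as ft, \
         open(src_out, "w") as fso, open(tgt_out, "w") as fto:
        for s, t in zip(fs, ft):
            s_tok, t_tok = s.strip().split(), t.strip().split()
            # Drop empty and over-long pairs; silently misaligned pairs are a
            # common cause of wildly fluctuating loss.
            if not s_tok or not t_tok or len(s_tok) > max_len or len(t_tok) > max_len:
                continue
            vocab.update(s_tok)
            vocab.update(t_tok)
            fso.write(" ".join(s_tok) + "\n")
            fto.write(" ".join(t_tok) + "\n")
    with open(vocab_out, "w") as fv:
        for tok, _ in vocab.most_common():
            fv.write(tok + "\n")

# Placeholder file names for the custom <src_lang>-English corpus.
preprocess("raw.src", "raw.en", "train.src", "train.en", "vocab.txt")
```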

Questions:

  1. What other files/scripts do I need to change for training?
  2. Are there other ways to evaluate besides sacrebleu? The evaluation uses WMT test files, which do not include the language I'm trying to translate. (Sketch below.)
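
As far as I understand, sacrebleu itself does not require WMT files; its Python API can score arbitrary hypothesis/reference pairs. A minimal sketch with made-up sentences:

```python
# Minimal sketch: scoring a custom test set with sacrebleu's Python API,
# so no WMT test files are needed. The sentences here are made up.
import sacrebleu

# Hypotheses: one translated sentence per line of the test set.
hyps = ["the cat sat on the mat", "hello world"]
# References: a list of reference streams, each parallel to the hypotheses.
refs = [["the cat is on the mat", "hello world"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score)
```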
@abbyDC abbyDC changed the title [Tensorflow GNMT v2] Loss not decreasing when training custom dataset [GNMT v2/Tensorflow] Loss not decreasing when training custom dataset Jul 14, 2020
mwawrzos (Contributor) commented

Hello!

There are plenty of potential reasons: the learning rate may be too high, there may be a problem with the data preprocessing, and so on.

I suggest looking for an article explaining how to deal with such problems. The following one seems fine to me: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607

Many such articles begin by reducing the problem to the simplest possible example: for instance, shrinking the dataset to just a few examples and checking whether the model can overfit them. If the simplest example works, the other elements can be verified one by one.
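
For illustration, here is a toy TensorFlow sketch of that check. It uses a tiny stand-in model and random token data, not the GNMT code itself; the point is only that the loss on a handful of fixed examples should be drivable close to zero:

```python
# Toy overfit check (illustrative; not the GNMT model): if a model cannot
# drive the loss toward zero on a few fixed examples, the problem is usually
# in the data or the optimization settings, not in model capacity.
import numpy as np
import tensorflow as tf

vocab_size, seq_len, n_examples = 100, 10, 8
rng = np.random.default_rng(0)
src = rng.integers(1, vocab_size, size=(n_examples, seq_len))
tgt = rng.integers(1, vocab_size, size=(n_examples, seq_len))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-2),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
history = model.fit(src, tgt, epochs=200, verbose=0)
print(history.history["loss"][::50])  # should approach zero
```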

Can you try to follow this guide (or any other you find helpful)? If you get stuck at some step, it will be easier to help once it is known what already works.

abbyDC commented Jul 27, 2020

Thanks for the tips! I have already preprocessed the data and tweaked the learning rate as well as other hyperparameters, but there's not much difference. I haven't changed anything in the core architecture of the model, so I assumed it would work with other datasets as well.

mwawrzos commented Aug 4, 2020

@abbyDC Have you tried to follow one of the guides, as I suggested before? Has it helped you find any issue? Can you check the preprocessed data, for example, whether all of the original data still exists in the created dataset?
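
A quick sketch of such a check (the file names are placeholders for your preprocessed files):

```python
# Quick integrity check on a preprocessed parallel corpus (paths are
# placeholders): line counts must match and no line should be empty.
def check_parallel(src_path, tgt_path):
    with open(src_path) as fs, open(tgt_path) as ft:
        src_lines, tgt_lines = fs.readlines(), ft.readlines()
    assert len(src_lines) == len(tgt_lines), (
        f"line-count mismatch: {len(src_lines)} vs {len(tgt_lines)}")
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        assert s.strip() and t.strip(), f"empty line in pair {i}"
    print(f"OK: {len(src_lines)} aligned pairs")

check_parallel("train.tok.bpe.src", "train.tok.bpe.en")
```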

abbyDC commented Aug 6, 2020

@mwawrzos Yup, I double-checked them and there seems to be no problem with the dataset itself. I also tried batch inference after training and got okay results despite the loss behaving like that.
