
[GNMT v2/Tensorflow] Loss not decreasing when training custom dataset #603

abbyDC opened this issue Jul 14, 2020 · 4 comments

abbyDC commented Jul 14, 2020

I just wanted to ask the following to help me train a custom model that translates <src_lang> to English. I have an issue where the loss ranges from 17 to 200 within a single epoch, swinging up and down drastically, and I'm not sure what else I need to tweak.

Steps I've done:

  1. Edited wmt16_en_de.sh to preprocess my custom data (a sketch of what this amounts to follows the list)
  2. Edited nmt.py to reflect the src and tgt files.
  3. FP32 training on 1 GPU
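
For reference, the preprocessing boils down to something like the sketch below. It is a simplified Python stand-in, not the actual script: the real wmt16_en_de.sh applies Moses tokenization and BPE, and the file names here are placeholders.

```python
# Simplified stand-in for the edited preprocessing step (illustrative only):
# produce aligned train files and a vocabulary for nmt.py to consume.
# The real pipeline applies Moses tokenization and BPE; plain whitespace
# tokenization keeps this sketch self-contained.
from collections import Counter

def preprocess(src_in, tgt_in, src_out, tgt_out, vocab_out, max_len=50):
    vocab = Counter()
    with open(src_in) as fs, open(tgt_in) as ft, \
         open(src_out, "w") as fso, open(tgt_out, "w") as fto:
        for s, t in zip(fs, ft):
            s_tok, t_tok = s.strip().split(), t.strip().split()
            # Drop empty and over-long pairs; silently misaligned pairs are a
            # common cause of wildly fluctuating loss.
            if not s_tok or not t_tok or len(s_tok) > max_len or len(t_tok) > max_len:
                continue
            vocab.update(s_tok)
            vocab.update(t_tok)
            fso.write(" ".join(s_tok) + "\n")
            fto.write(" ".join(t_tok) + "\n")
    with open(vocab_out, "w") as fv:
        for tok, _ in vocab.most_common():
            fv.write(tok + "\n")

# Placeholder file names for the custom <src_lang>-English corpus.
preprocess("raw.src", "raw.en", "train.src", "train.en", "vocab.txt")
```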

Questions:

  1. What other files/scripts do I need to change for training?
  2. Are there other ways to evaluate besides sacrebleu? The evaluation uses WMT test files, which do not include the language I'm trying to translate. (Sketch below.)
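
As far as I understand, sacrebleu itself does not require WMT files; its Python API can score arbitrary hypothesis/reference pairs. A minimal sketch with made-up sentences:

```python
# Minimal sketch: scoring a custom test set with sacrebleu's Python API,
# so no WMT test files are needed. The sentences here are made up.
import sacrebleu

# Hypotheses: one translated sentence per line of the test set.
hyps = ["the cat sat on the mat", "hello world"]
# References: a list of reference streams, each parallel to the hypotheses.
refs = [["the cat is on the mat", "hello world"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score)
```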
@abbyDC abbyDC changed the title [Tensorflow GNMT v2] Loss not decreasing when training custom dataset [GNMT v2/Tensorflow] Loss not decreasing when training custom dataset Jul 14, 2020
mwawrzos (Contributor) commented

Hello!

There are plenty of potential reasons: the learning rate may be too high, there may be a problem with the data preprocessing, and so on.

I suggest looking for an article explaining how to deal with such problems. The following one seems fine to me: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607

Many such articles begin by reducing the problem to the simplest possible example: for instance, shrinking the dataset to just a few examples and checking whether the model can overfit them. If the simplest example works, the other elements can be verified one by one.
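
For illustration, here is a toy TensorFlow sketch of that check. It uses a tiny stand-in model and random token data, not the GNMT code itself; the point is only that the loss on a handful of fixed examples should be drivable close to zero:

```python
# Toy overfit check (illustrative; not the GNMT model): if a model cannot
# drive the loss toward zero on a few fixed examples, the problem is usually
# in the data or the optimization settings, not in model capacity.
import numpy as np
import tensorflow as tf

vocab_size, seq_len, n_examples = 100, 10, 8
rng = np.random.default_rng(0)
src = rng.integers(1, vocab_size, size=(n_examples, seq_len))
tgt = rng.integers(1, vocab_size, size=(n_examples, seq_len))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-2),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
history = model.fit(src, tgt, epochs=200, verbose=0)
print(history.history["loss"][::50])  # should approach zero
```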

Can you try to follow this guide (or any other you find helpful)? If you get stuck at some step, it will be easier to help once it is known what already works.

abbyDC commented Jul 27, 2020

Thanks for the tips! I have already preprocessed the data and tweaked the learning rate as well as other hyperparameters, but there's not much difference. I haven't changed anything in the core architecture of the model, so I assumed it would work with other datasets as well.

mwawrzos commented Aug 4, 2020

@abbyDC Have you tried to follow one of the guides, as I suggested before? Has it helped you find any issue? Can you check the preprocessed data, for example, whether all of the original data still exists in the created dataset?
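
A quick sketch of such a check (the file names are placeholders for your preprocessed files):

```python
# Quick integrity check on a preprocessed parallel corpus (paths are
# placeholders): line counts must match and no line should be empty.
def check_parallel(src_path, tgt_path):
    with open(src_path) as fs, open(tgt_path) as ft:
        src_lines, tgt_lines = fs.readlines(), ft.readlines()
    assert len(src_lines) == len(tgt_lines), (
        f"line-count mismatch: {len(src_lines)} vs {len(tgt_lines)}")
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        assert s.strip() and t.strip(), f"empty line in pair {i}"
    print(f"OK: {len(src_lines)} aligned pairs")

check_parallel("train.tok.bpe.src", "train.tok.bpe.en")
```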

abbyDC commented Aug 6, 2020

@mwawrzos Yup, I double-checked them and there seems to be no problem with the dataset itself. I also tried batch inference after training and got okay results despite the loss behaving like that.
