
Cannot overfit net on a small training set #165

Open · sptom opened this issue Apr 23, 2020 · 35 comments

@sptom commented Apr 23, 2020

Hi there,
First of all, thank you for this code!
My first run with the network was unsuccessful, so I tried to make a quick sanity check.
It is my understanding that a "healthy" neural net should easily overfit the training data when few examples are given. It should quickly learn to classify them with 100% accuracy by simply "memorising" the images.
So I tried to overfit the net on 38 images (35, actually, because 3 are used as a validation set) from the Pascal VOC database.
[image]

I use binary black=background white=object masks.
[image]

Running training for 100 epochs, I still get very poor results: the training loss fluctuates rather heavily around 0.5, as can be seen in the TensorBoard plot and the console output:
[image: TensorBoard loss plot]

[image: console output]

I'm quite certain that this behaviour is irregular and that I'm missing something. Do you have an idea what that might be?

The results on the very same training set don't make any sense:
[image]

@milesial (Owner)

Thank you for the detailed explanation of your problem. From the loss value and the loss plot it seems that your learning rate is way too high; have you tried lowering it? Maybe divide it by 10 first.

Also, for a sanity check, you can try with even fewer images, like 1 or 2; the training will be faster.
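
(A minimal sketch of such a single-image sanity check, independent of the repo's train.py; `net` is assumed to be a UNet instance and `image`/`mask` one preprocessed pair already on the right device — names here are illustrative.)

```python
import torch
import torch.nn as nn

def overfit_one_image(net, image, mask, steps=200, lr=1e-4):
    """Overfit a single (image, mask) pair; the loss should head toward ~0."""
    net.train()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()      # expects per-pixel class indices in `mask`
    for step in range(steps):
        optimizer.zero_grad()
        logits = net(image)                # shape: 1 x n_classes x H x W
        loss = criterion(logits, mask)     # mask shape: 1 x H x W, dtype long
        loss.backward()
        optimizer.step()
        if step % 20 == 0:
            print(f'step {step:4d}  loss {loss.item():.4f}')
    return loss.item()
```

If the loss does not fall well below its starting value within a few hundred steps on a single image, something other than the data is likely wrong.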

@juliagong

Hi @milesial and @sptom! I'm having this issue of failing to overfit as well, and I'm using 1e-4 as my learning rate and a training set of just a single image. I can't seem to find anything irregular in the training code, but if more than one of us is having this issue, I wonder if there's something we're overlooking? Any insight would be appreciated!

@milesial (Owner)

I think your learning rate is too high. For the full Carvana dataset I used 2e-6 as the LR.

@juliagong

Thanks; unfortunately, I have also tried learning rates on the order of 1e-6 and got the same results. I also disabled all augmentation, normalization, and other regularization so that the steps are exactly the ones you used. For some reason, the model isn't overfitting the one-image dataset and either gives nonsensical segmentations or converges to all-zero weights. Have you had this issue? Do you have any insights? Thanks!

@sptom (Author) commented Apr 30, 2020

Thanks for the reply, Alex.
I also tried running the code with the learning rate lowered by various orders of magnitude, and I also tried different optimizers, but I have to agree with Julia here: it did not help.
The phenomenon I see is that the lower I set the LR, the quicker the loss converges to 0.7 and stays there, for some reason.
[image]

In some of the runs I got somewhat indicative results for the masks, but it's hardly something that could be considered 'overfitting'.
[image]

Could it be that the loss function is unstable? Or perhaps you introduced some change in a recent update?

Thanks a lot,
Tom

@milesial (Owner) commented Apr 30, 2020

There were a lot of small tweaks in recent commits, but nothing that should affect convergence, I think. Are you using transposed convolutions or the bilinear route (the default)?
The loss is just cross entropy, so it should be pretty stable.

Do you both work on the same dataset? Have you tried with an image from the Carvana dataset?

This problem is very strange. If you feed it 100 images, does it learn something?

@juliagong

I'm using the bilinear route. I don't think we're using the same dataset, but we have the same issue. I've tried feeding it 50-100 images in training and it doesn't learn properly; it stays at around 0.7 loss.

@juliagong commented May 1, 2020

Update: I also tried transposed convolutions and they are not working either. It's such a strange issue; I don't think I've ever encountered something like this before.

I wonder if the problem is not with the model but somehow with the training procedure. I have tried both models from this repo as well as a pretrained model from a different project, all on the same image. None of them learns.

@juliagong

@milesial I spent today debugging once again by rewriting the entire training pipeline from scratch and testing on incrementally more meaningful sets of data, and ended up finding the problem! It was very sneaky.

Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)

Thanks for your help and quick response on this problem. I hope this fixes @sptom's issue as well.
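
(For reference, a minimal sketch of the fix being described — switching back to train mode right after each evaluation round. The names loosely follow the repo's train.py, but this is an illustration under that assumption, not the actual diff.)

```python
for batch in train_loader:
    net.train()                        # make sure every training batch runs in train mode
    imgs = batch['image'].to(device)
    true_masks = batch['mask'].to(device)

    masks_pred = net(imgs)
    loss = criterion(masks_pred, true_masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    global_step += 1

    if global_step % eval_interval == 0:
        val_score = eval_net(net, val_loader, device)  # eval.py switches to net.eval() internally
        net.train()                    # switch back so subsequent batches train normally
```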

@sptom (Author) commented May 3, 2020

Oh wow, awesome @juliagong! That sounds really sneaky! I didn't notice that net.train() was placed in the epoch loop and not in the batch loop...
However, I tried to solve this, first by putting net.train() into the batch loop, which unfortunately didn't help. I then also tried wrapping the code in eval.py in with torch.no_grad(): instead of using net.eval(), but I didn't notice any significant effect from that either.
Could you say how you resolved the issue, and also which parameters you used afterwards?
Was the overfitting accurate? How many epochs did it take to converge to effectively zero loss?

@phper5 commented May 3, 2020

> @milesial I spent today debugging once again by rewriting the entire training pipeline from scratch and testing on incrementally more meaningful sets of data, and ended up finding the problem! It was very sneaky.
>
> Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)
>
> Thanks for your help and quick response on this problem. I hope this fixes @sptom's issue as well.

I think the actual training code is masks_pred = net(imgs), then computing the loss and calling loss.backward(), so the net is trained every batch even though net.train() was called only at the beginning of each epoch. Am I wrong? Thanks.

@sptom (Author) commented May 3, 2020

@phper5, you are right about masks_pred = net(imgs),
but the script also evaluates accuracy on the validation set several times per epoch using eval.py.

@milesial (Owner) commented May 3, 2020

@juliagong Thanks for your investigation! It is indeed a big mistake that has been there for a long time and needs a fix. But do the train and eval methods of the net module affect anything other than the BatchNorms here?
When you say there is no meaningful training, I'm not so sure: there are still gradient updates on the other layers; it's just the BatchNorms that are broken (?)

Thanks to all of you for participating in this.
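
(For context: in PyTorch, train()/eval() only toggle layers such as BatchNorm and Dropout; gradients still flow in eval mode. But with a BatchNorm after every convolution, training in eval mode means the forward pass normalizes with barely-initialized running statistics that never get updated, which can hurt convergence on its own. A tiny illustration:)

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = 10 + 5 * torch.randn(8, 3, 16, 16)   # batch statistics far from the (0, 1) init of the running stats

bn.train()
y_train = bn(x)   # normalized with this batch's mean/var; running stats are updated

bn.eval()
y_eval = bn(x)    # normalized with running_mean/running_var instead

print(y_train.mean().item(), y_train.std().item())  # roughly 0 and 1
print(y_eval.mean().item(), y_eval.std().item())    # far from 0 and 1 early in training
```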

@phper5 commented May 3, 2020

> @phper5, you are right about masks_pred = net(imgs),
> but the script also evaluates accuracy on the validation set several times per epoch using eval.py.

Yes, you are right. Sorry, I didn't look carefully.

@sptom (Author) commented May 4, 2020

> Oh wow, awesome @juliagong! That sounds really sneaky! I didn't notice that net.train() was placed in the epoch loop and not in the batch loop...
> However, I tried to solve this, first by putting net.train() into the batch loop, which unfortunately didn't help. I then also tried wrapping the code in eval.py in with torch.no_grad(): instead of using net.eval(), but I didn't notice any significant effect from that either.
> Could you say how you resolved the issue, and also which parameters you used afterwards?
> Was the overfitting accurate? How many epochs did it take to converge to effectively zero loss?

@juliagong, could you please share how you resolved the issue? I did the above and it doesn't seem to help; the loss still converges to about 0.6.

@gboy2019 commented May 5, 2020

> Oh wow, awesome @juliagong! That sounds really sneaky! I didn't notice that net.train() was placed in the epoch loop and not in the batch loop...
> However, I tried to solve this, first by putting net.train() into the batch loop, which unfortunately didn't help. I then also tried wrapping the code in eval.py in with torch.no_grad(): instead of using net.eval(), but I didn't notice any significant effect from that either.
> Could you say how you resolved the issue, and also which parameters you used afterwards?
> Was the overfitting accurate? How many epochs did it take to converge to effectively zero loss?
>
> @juliagong, could you please share how you resolved the issue? I did the above and it doesn't seem to help; the loss still converges to about 0.6.

Me too; sometimes 0.7, sometimes 0.6, sometimes 0.8! So how can this issue be fixed?

@ProfessorHuang

> @milesial I spent today debugging once again by rewriting the entire training pipeline from scratch and testing on incrementally more meaningful sets of data, and ended up finding the problem! It was very sneaky.
>
> Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)
>
> Thanks for your help and quick response on this problem. I hope this fixes @sptom's issue as well.

Invaluable finding! Thank you very much. milesial's code is clean, and I think I understand it, but I couldn't figure out why it didn't work on my dataset. Today, when I changed the code as you described, everything started working.

@milesial (Owner) commented May 5, 2020

Hi all, I modified the code to switch back to train mode in 773ef21.

@sptom (Author) commented May 5, 2020

Thanks, @milesial, for your update.
However, I'm sorry to say that this did not resolve the issue.
The loss is still stuck around 0.6.
I tried reducing the dataset to 11 copies of a single image, and the loss is now stuck at 0.3:

[image]

[image]
The above is the result after running 15 epochs on 11 copies of the same image!

Were you able to overfit a set of 10 different images? If so, how many epochs did it take, and with which parameters?

@ProfessorHuang, what exactly did you change in the code?

@gboy2019 commented May 6, 2020

> Thanks, @milesial, for your update.
> However, I'm sorry to say that this did not resolve the issue.
> The loss is still stuck around 0.6.
> I tried reducing the dataset to 11 copies of a single image, and the loss is now stuck at 0.3:
>
> [image]
>
> [image]
> The above is the result after running 15 epochs on 11 copies of the same image!
>
> Were you able to overfit a set of 10 different images? If so, how many epochs did it take, and with which parameters?
>
> @ProfessorHuang, what exactly did you change in the code?

Could you share your code? My loss is sometimes 1e+3, which is terrible, so could you show your code?

@shilei2403

> @milesial I spent today debugging once again by rewriting the entire training pipeline from scratch and testing on incrementally more meaningful sets of data, and ended up finding the problem! It was very sneaky.
> Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)
> Thanks for your help and quick response on this problem. I hope this fixes @sptom's issue as well.
>
> Invaluable finding! Thank you very much. milesial's code is clean, and I think I understand it, but I couldn't figure out why it didn't work on my dataset. Today, when I changed the code as you described, everything started working.

How did you change the code? Could you show the details?

@karlita101

Hi there, I am also wondering if there have been any updates, or if anyone is willing to share how they overcame this issue.

Very much appreciated!

@juliagong @ProfessorHuang

@Li-Wei-NCKU

> Hi all, I modified the code to switch back to train mode in 773ef21.

The modification makes sense to me logically.
However, the loss is still stuck and the model couldn't overfit a small dataset of only 20 images.
Are there any suggestions? @milesial @juliagong @ProfessorHuang

@AJSVB commented Jul 31, 2021

I might be wrong, but I had a similar issue, and reducing the weight decay and momentum helped me overfit.
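
(If the training script uses RMSprop with non-zero weight decay and momentum, as the repo's train.py appears to, an overfitting sanity check can zero both out; the values below are illustrative, not a recommendation for real training.)

```python
import torch.optim as optim

# `net` and the learning rate come from the surrounding training script.
# Weight decay pulls weights toward zero, and momentum can overshoot on a
# tiny, repetitive dataset; both can get in the way when the goal is simply
# to memorize a handful of images.
optimizer = optim.RMSprop(net.parameters(),
                          lr=1e-5,
                          weight_decay=0.0,
                          momentum=0.0)
```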

@rgkannan676 commented Sep 6, 2021

Hi,

For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming very small after only a few steps.

Example:

```python
# Evaluation round
if global_step % (n_train // (0.25 * batch_size)) == 0:
```
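
(Why this helps, under the assumption that the evaluation round also steps a ReduceLROnPlateau scheduler on the validation score, as later versions of the repo's train.py appear to: every evaluation is a chance for the scheduler to cut the learning rate, so evaluating many times per epoch on a tiny dataset drives the LR down very quickly. A slightly more defensive version of the same condition — illustrative, not the repo's exact code:)

```python
# With the 0.25 factor, evaluation happens roughly once every 4 epochs
# instead of several times per epoch; the max(1, ...) guard avoids a
# modulo-by-zero when n_train is smaller than 0.25 * batch_size.
division_step = max(1, int(n_train / (0.25 * batch_size)))
if global_step % division_step == 0:
    val_score = eval_net(net, val_loader, device)  # assumed evaluation helper
    scheduler.step(val_score)                      # ReduceLROnPlateau can only lower the LR, never raise it
```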

@Flyingdog-Huang

> Hi,
>
> For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming very small after only a few steps.
>
> Example:
>
> ```python
> # Evaluation round
> if global_step % (n_train // (0.25 * batch_size)) == 0:
> ```

@rgkannan676 Thanks a lot, this approach makes the loss value normal on my small dataset. I would like to know why it avoids the LR becoming small after a few steps, and now I will think about how to reduce the loss oscillating like this:

[image]

Hope I can receive your reply; thanks again @rgkannan676

@Flyingdog-Huang

Maybe I found the reason:
[image]

@k-nayak commented Sep 15, 2021

> Hi,
>
> For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming very small after only a few steps.
>
> Example:
>
> ```python
> # Evaluation round
> if global_step % (n_train // (0.25 * batch_size)) == 0:
> ```

I have a dataset of 260 images and the 0.25 factor did help significantly to reduce the loss, but the Dice coefficient has remained stagnant at 0.36. Is there any way to improve the Dice score? The model is unable to generalize when given a new image.

@Flyingdog-Huang

> Hi,
> For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming very small after only a few steps.
> Example:
>
> ```python
> # Evaluation round
> if global_step % (n_train // (0.25 * batch_size)) == 0:
> ```
>
> I have a dataset of 260 images and the 0.25 factor did help significantly to reduce the loss, but the Dice coefficient has remained stagnant at 0.36. Is there any way to improve the Dice score? The model is unable to generalize when given a new image.

I have this problem too, and I am thinking about a way to solve it.

@k-nayak commented Sep 16, 2021

Same here. I will update in case I find a fix for a better Dice score. Please do update if you find any fix.

Thanks in advance.

@Flyingdog-Huang

> Same here. I will update in case I find a fix for a better Dice score. Please do update if you find any fix.
>
> Thanks in advance.

Hi, what about your project? Two classes or more?

@k-nayak commented Sep 24, 2021

> Same here. I will update in case I find a fix for a better Dice score. Please do update if you find any fix.
> Thanks in advance.
>
> Hi, what about your project? Two classes or more?

Mine is binary segmentation.

@Flyingdog-Huang

> Same here. I will update in case I find a fix for a better Dice score. Please do update if you find any fix.
> Thanks in advance.
>
> Hi, what about your project? Two classes or more?
>
> Mine is binary segmentation.

Oh, that is strange; mine is multi-class segmentation.

@Flyingdog-Huang

> Same here. I will update in case I find a fix for a better Dice score. Please do update if you find any fix.
> Thanks in advance.
>
> Hi, what about your project? Two classes or more?
>
> Mine is binary segmentation.

I would like to know: is the target smaller than the background in your dataset?
My situation is this:
my dataset is small, and the target is smaller than the background.
In the training part, I compute the Dice loss including the background channel, and the loss behaves well.
In the evaluation part, I compute the Dice score without the background channel, and the score is bad.
[image]

I also found that the target is about the same size as the background in this project's data, so I guess the reason the Dice score is not good enough is that we do not include the large background channel.
Next I will analyze the relationship between Dice and a large background mathematically.
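
(A minimal sketch of the distinction being described — soft Dice averaged over channels, with a switch to include or exclude the background channel. The function name and the channel-0-is-background convention are assumptions for illustration, not the repo's dice_loss.)

```python
import torch

def soft_dice(pred, target, include_background=True, eps=1e-6):
    """Soft Dice averaged over channels.

    pred:   N x C x H x W probabilities (after softmax)
    target: N x C x H x W one-hot masks; channel 0 is assumed to be the background
    """
    if not include_background:
        pred, target = pred[:, 1:], target[:, 1:]
    dims = (0, 2, 3)
    intersection = (pred * target).sum(dims)
    cardinality = pred.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (cardinality + eps)
    return dice.mean()

# Training could minimize 1 - soft_dice(..., include_background=True), while
# evaluation reports soft_dice(..., include_background=False); with a large,
# easy background and a small target, the two numbers can differ a lot.
```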

@k-nayak commented Sep 29, 2021

Very interesting approach, @Flyingdog-Huang. My dataset is small as well, and the target is small compared to the whole image. The targets are water droplets, which are difficult to distinguish at times, and making masks for them is also difficult. Depending on the lighting conditions, the data tends to be better or worse. The model is sometimes unable to tell whether a droplet is present, since droplets are transparent and usually lack robust edges. I believe the problem is the dataset itself. I am using the model for real-time segmentation in a video, and the results are not very good. Attention U-Net performed a little better than U-Net, and I am checking if residual Attention U-Net can perform better. My Dice score is around 0.73 with a loss of 0.20.
