Divergent loss with SimCLR #1633

Open
antho214 opened this issue Aug 15, 2024 · 5 comments
@antho214

Sometimes, when training with the SimCLR method, the loss diverges (see attached screenshot). I wonder if anyone has experienced this kind of issue when training with SimCLR. It has happened to me on several occasions with ResNet-18 and ResNet-50 models.
I don't think this is an issue with the code, but if anyone has seen this kind of problem before, I would be grateful for your input.

[Screenshot 2024-08-15 at 07 45 03: divergent training loss curve]

Here's some information about the training hyper-parameters:

  • multi-GPU training with a batch_size of 256 per GPU
  • effective batch_size: 1024
  • criterion: NTXentLoss with a temperature of 0.1 and gather_distributed=True (see the sketch below)
  • optimizer: LARS with a base learning rate of 0.3 and otherwise default parameters
  • scheduler: CosineWarmupScheduler with 10k warmup steps
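
For reference, a minimal sketch of the criterion setup (the LARS and scheduler wiring are omitted since their exact implementations aren't shown here):

```python
from lightly.loss import NTXentLoss

# Criterion as listed above; gather_distributed=True gathers negatives across
# all GPUs so the loss sees the full effective batch of 1024.
criterion = NTXentLoss(temperature=0.1, gather_distributed=True)
```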
@guarin
Contributor

guarin commented Aug 16, 2024

This looks interesting, I haven't encountered this before. What type of data are you using? And do you use sync batchnorm?

@antho214
Author

I am using microscopy data (large images, typically 2000x2000 pixels) from which I randomly crop 224x224-pixel images. Using grid-like sampling, I can generate >750k crops for training.

I did set sync_batchnorm=True in the trainer.

Something else I realised is that the loss becomes constant at 7.624 (mean, min, and max; I tracked these values as well). This value roughly corresponds to the loss I get from two random vectors of size 1024x128 in the NTXentLoss.
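
For what it's worth, 7.624 is essentially ln(2·1024 − 1), the NT-Xent value when all logits are (near-)equal. A minimal sketch to check this, assuming lightly's NTXentLoss and the batch/embedding sizes above:

```python
import math

import torch
from lightly.loss import NTXentLoss

criterion = NTXentLoss(temperature=0.1)

# Random, uncorrelated embeddings (batch size 1024, dim 128) land roughly at chance level.
z0, z1 = torch.randn(1024, 128), torch.randn(1024, 128)
print(criterion(z0, z1).item())

# Fully collapsed embeddings give exactly the chance-level loss ln(2N - 1),
# because every sample is scored uniformly against 2N - 1 candidates.
z = torch.randn(1, 128).repeat(1024, 1)
print(criterion(z, z).item())      # ≈ 7.624
print(math.log(2 * 1024 - 1))      # ln(2047) ≈ 7.624
```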

@IgorSusmelj
Contributor

IgorSusmelj commented Aug 21, 2024

I could imagine that you are facing some numerical instabilities:

  • Do you use fp16 or bf16? (If you use fp16, try switching to bf16, as it tends to be more stable.)
  • You could plot histograms of the weights in TensorBoard to see if some values are getting very large; if so, you could try additional gradient clipping, weight clipping, or a stronger weight decay (see the sketch at the end of this comment).
  • (More of a shot in the dark:) maybe also look at some of the augmented images. Perhaps the representations all collapse to the same values because the samples look too similar to each other. You could play with the augmentations to fix this.

We have used SimCLR on all kinds of data, including medical and microscopy images, and haven't had issues.
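
A minimal sketch of what the precision and gradient-clipping suggestions could look like with a PyTorch Lightning Trainer (argument names follow recent Lightning versions; the values are placeholders, not tuned recommendations):

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=4,                       # 4 GPUs x 256 samples = effective batch size 1024
    strategy="ddp",
    sync_batchnorm=True,
    precision="bf16-mixed",          # prefer bf16 over fp16 if mixed precision is used
    gradient_clip_val=1.0,           # clip the gradient norm to dampen loss spikes
    gradient_clip_algorithm="norm",
)
```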

@antho214
Author

Thank you for taking the time to answer.

  • I do not use 16-bit precision when training. I am using the Trainer's default, which is 32-true according to the documentation.
  • I will investigate whether the weights are getting large using TensorBoard (a logging sketch is at the end of this comment).
  • I am tracking the augmentations over time, and they do look different from each other.

Again, thank you for the feedback. I will post an update if I find a fix/solution.
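
A rough sketch of how the weight histograms can be logged (assuming a TensorBoard SummaryWriter; in a LightningModule with the TensorBoardLogger, the writer is available as self.logger.experiment):

```python
import torch
from torch.utils.tensorboard import SummaryWriter


def log_weight_histograms(model: torch.nn.Module, writer: SummaryWriter, step: int) -> None:
    """Log one histogram per parameter so exploding weights show up in TensorBoard."""
    for name, param in model.named_parameters():
        writer.add_histogram(name, param.detach().cpu(), global_step=step)


# Example usage from a Lightning hook such as on_train_epoch_end:
#   log_weight_histograms(self.backbone, self.logger.experiment, self.current_epoch)
```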

@antho214
Author

I have tracked some of the weights during training. One thing I notice is that the weights of the batchnorm layers are relatively large (see screenshot). The weights of the convolution layers, on the other hand, all seem to behave normally: they are typically centered at 0 and roughly normally distributed, with values in approximately [-1, 1].

[Screenshot 2024-08-28 at 10 16 15: batchnorm weight histograms]
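
A small sketch of how the batchnorm scale parameters can be inspected directly (using a plain torchvision ResNet-50 as a stand-in for the actual trained backbone):

```python
import torch
from torchvision.models import resnet50

model = resnet50()  # stand-in; in practice load the trained backbone checkpoint
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
        gamma = module.weight.detach()
        print(f"{name}: max |gamma| = {gamma.abs().max().item():.3f}")
```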
