
batch size #13

Open
sushi31415926 opened this issue Nov 22, 2023 · 16 comments
@sushi31415926

Hello,
I noticed that when I use different batch sizes, the certainty values change.
Do you have any idea why?

@Parskatt
Owner

Which batch sizes? During training or testing?

@sushi31415926
Author

sushi31415926 commented Nov 22, 2023

Thank you for the fast response!
The issue happens during testing when I use the match method.
I load the images using PIL and use torch.stack to create the batch.
When I use different batch sizes, the certainty values change between runs on the same images.
For example:

roma_model.match(torch.stack([im1] * 16), torch.stack([im2] * 16))
roma_model.match(torch.stack([im1] * 8), torch.stack([im2] * 8))

The certainty values of these two runs are different.
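Roughly how I build the inputs (the paths here are just placeholders; the rest mirrors what I actually run):

import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.ToTensor()
im1 = to_tensor(Image.open("im_A.jpg").convert("RGB"))  # placeholder path
im2 = to_tensor(Image.open("im_B.jpg").convert("RGB"))  # placeholder path

# same image pair, only the batch size differs
warp_16, certainty_16 = roma_model.match(torch.stack([im1] * 16), torch.stack([im2] * 16))
warp_8, certainty_8 = roma_model.match(torch.stack([im1] * 8), torch.stack([im2] * 8))
# certainty_16 and certainty_8 come out different even though the inputs are identical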
thanks!

@Parskatt
Owner

Parskatt commented Nov 22, 2023 via email

@sushi31415926
Author

sushi31415926 commented Nov 23, 2023

I think I found the issue.
In the GP class (matcher.py), in the forward method, when I do the matrix inversion in double precision instead of float, the result becomes stable across batch sizes.
For example:

K_yy_inv = torch.linalg.inv((K_yy + sigma_noise).double())
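A self-contained sketch of the idea, in case it helps (the shapes and the .float() cast back are just my illustration; only the K_yy / sigma_noise names follow matcher.py):

import torch

B, N = 2, 64
K = torch.rand(B, N, N)
K_yy = K @ K.transpose(-1, -2)                      # symmetric PSD Gram matrix, like the GP kernel
sigma_noise = 0.1 * torch.eye(N).expand(B, N, N)    # noise term added before inversion

# invert in float64 for numerical stability, then cast back for the rest of the float32 pipeline
K_yy_inv = torch.linalg.inv((K_yy + sigma_noise).double()).float()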

@paolovic

paolovic commented Jul 1, 2024

Hi @Parskatt,

With my 11 GB of GPU memory, I run out of memory when trying batching. Does that make sense, or am I doing something wrong?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   36C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |

I load my image batches onto the device, use the ImageNet mean and std, and resize the images to 560x560 pixels.

import torch
from PIL import Image
from torchvision import transforms

# imagenet_mean, imagenet_std, and device are defined elsewhere in my script
def load_and_preprocess_images(image_paths, target_size):
    preprocess = transforms.Compose([
        transforms.Resize(target_size, interpolation=Image.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize(mean=imagenet_mean, std=imagenet_std)
    ])
    images = [preprocess(Image.open(path).convert('RGB')).to(device) for path in image_paths]
    return torch.stack(images)

This is how I call it:

batch = {"im_A": query_images, "im_B": ref_batch_images}
corresps = roma_model.forward(batch, batched=True)

Currently, even batch_size=2 is too much...

In matcher.py, in RegressionMatcher.forward(self, batch, batched = True, upsample = False, scale_factor = 1), after calling feature_pyramid = self.extract_backbone_features(batch, batched=batched, upsample = upsample), my CUDA memory consumption rises to 5213MiB / 11178MiB, and when calling

corresps = self.decoder(f_q_pyramid,
                        f_s_pyramid,
                        upsample = upsample,
                        **(batch["corresps"] if "corresps" in batch else {}),
                        scale_factor=scale_factor)

It runs out of memory...

I tried calling torch.cuda.empty_cache() between the encoder and decoder but it didn't help.

Thank you in advance
Best regards

@Parskatt
Owner

Parskatt commented Jul 1, 2024

@paolovic

Hiya! Did you forget to wrap forward in inference_mode/no_grad?

For reference, during training a batch of 8 at res 560 fills up about 40 GB, so it makes sense that a batch of 2 can OOM 11 GB, since the weights also take some space.

If you want to finetune, I'd suggest freezing batchnorm and using batch size 1.
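For reference, freezing batchnorm could look roughly like this (a generic PyTorch sketch, not the exact RoMa training code):

import torch.nn as nn

def freeze_batchnorm(model):
    # Keep BN layers in eval mode (frozen running stats) and stop updating their affine params.
    # Note: model.train() flips BN back to training mode, so re-apply this after every train() call.
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.eval()
            if m.affine:
                m.weight.requires_grad_(False)
                m.bias.requires_grad_(False)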

@paolovic

paolovic commented Jul 2, 2024

@Parskatt Thanks for the very fast reply!
Lemme check, I'll update my answer later.

@Parskatt

Uh nice, yes with

batch = {"im_A": query_images, "im_B": ref_batch_images}
with torch.inference_mode():
    corresps = roma_model.forward(batch, batched=True)

I am able to reduce the memory footprint (makes sense, I forgot it) and can process batches of 6 images.

Thank you very much!

@paolovic

paolovic commented Jul 3, 2024

Somehow I don't see the recommendation to apply model.eval() here anymore, @Parskatt, but thank you; I changed my implementation to

batch = {"im_A": query_images, "im_B": ref_batch_images}
roma_model.eval()
with torch.inference_mode():
    corresps = roma_model.forward(batch, batched=True)

@Parskatt
Owner

Parskatt commented Jul 3, 2024

Yeah github bugged for me and showed my comment as duplicated, removed one and both disappeared...

@paolovic

paolovic commented Jul 3, 2024

Yeah github bugged for me and showed my comment as duplicated, removed one and both disappeared...

alright, in any case thank you very much!

@nfyfamr

nfyfamr commented Sep 3, 2024

i think i found the issue. in the class GP(matcher.py) in the method forward when i change the tensors to double inside float the result became stable between batch size. for example: K_yy_inv = torch.linalg.inv((K_yy + sigma_noise).double())

@sushi31415926, thanks for reporting this! I had the same issue, and your solution resolved it. By the way, could you explain how this change solves the problem?

@Zhimin00

Hi @Parskatt, does the training batch size significantly affect the final performance of the model? I only have 24 GB GPUs, and the maximum batch size I can fit is 2, rather than the 8 you used as described in the paper.

@paolovic

Hiya! Did you forget to wrap forward in inference_mode/no_grad?

For reference, during training a batch of 8 at res 560 fills up about 40 GB, so it makes sense that a batch of 2 can OOM 11 GB, since the weights also take some space.

If you want to finetune, I'd suggest freezing batchnorm and using batch size 1.

@Parskatt
Owner

Parskatt commented Nov 27, 2024

In general, yes, a lower batch size reduces results. I would not go below 8.

It's difficult to give advice, but you can decrease the resolution for lower memory.

You can also reduce memory by writing a better local correlation kernel. I have done this, but it's part of a new project which I can't share yet.

You could also try things like sync batchnorm and gradient accumulation. Batchnorm is scary though.
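As a rough illustration of gradient accumulation (generic PyTorch, not RoMa's actual training loop; loader, model, optimizer, and loss_fn are placeholders):

accum_steps = 4  # virtual batch = per-step batch * accum_steps

optimizer.zero_grad()
for step, (im_A, im_B) in enumerate(loader):
    corresps = model({"im_A": im_A, "im_B": im_B}, batched=True)
    loss = loss_fn(corresps)
    (loss / accum_steps).backward()        # scale so gradients average over the virtual batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Note that batchnorm statistics still only see the small per-step batch, which is part of why batchnorm is scary here.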

@Zhimin00

Thank you for your fast response. Do you mean a batch size of 8 in total (global batch size) or 8 per GPU (local batch size)?

@Parskatt
Owner

Thank you for your fast response. Do you mean a batch size of 8 in total (global batch size) or 8 per GPU (local batch size)?

I have verified that a batch size of 8 on 1 GPU works (global batch size). A local batch size of 4 on 2 GPUs with batchnorm synchronization should also give very similar results.
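For reference, that setup could be wired up roughly like this (assuming the distributed process group is already initialized and local_rank is the per-process GPU index):

import torch.nn as nn

# replace regular BN layers with synchronized BN, then wrap the model for 2-GPU DDP training
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = nn.parallel.DistributedDataParallel(model.cuda(local_rank), device_ids=[local_rank])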

I really would like to throw out BN, but it's just too good for the refiners to switch easily (at least in my older experiments this was the case).

@Zhimin00

I see. BN is quite annoying!
