batch size #13
Which batch sizes? During training or testing? |
thank you for that fast response!
this issue happens during testing, when I use the match method.
I load the images using PIL and use torch.stack to create a batch.
When I use a different batch size, the certainty value changes between runs on the same images:
roma_model.match(torch.stack([im1] * 16), torch.stack([im2] * 16))
roma_model.match(torch.stack([im1] * 8), torch.stack([im2] * 8))
the certainty values of these two runs are different.
thanks! |
Aha. There might be a bug in the match method when not using image paths. I think the best approach is to simply use the model forward for now. Make sure that you use the same data preprocessing as us (load and resize with PIL bicubic, normalize with ImageNet mean and std).
I'll try to clean up the code in the coming weeks.
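For illustration, here is a minimal sketch of the preprocessing described above (PIL bicubic resize, ImageNet mean/std normalization) before passing tensors to the model. The 560x560 resolution is taken from later in this thread, and the file names are placeholders; this is not the exact RoMa API.

```python
# Hedged sketch: load, resize with PIL bicubic, normalize with ImageNet stats.
# Resolution and file names are assumptions for illustration only.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((560, 560), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                         std=[0.229, 0.224, 0.225]),   # ImageNet std
])

im_A = preprocess(Image.open("im1.jpg").convert("RGB"))
im_B = preprocess(Image.open("im2.jpg").convert("RGB"))
batch_A = im_A.unsqueeze(0)  # add a batch dimension: (1, 3, 560, 560)
batch_B = im_B.unsqueeze(0)
```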
|
i think i found the issue. |
Hi @Parskatt, with my 11 GB of GPU memory I run out of memory when trying batching. Does that make sense, or am I doing something wrong?
I am loading my image batches to the device, using the ImageNet mean and std, and resizing them to 560x560 pixels.
This is how I call it:
Currently, even batch_size=2 is too much... In the
It runs out of memory... I tried calling
Thank you in advance |
Hiya! Did you forget to wrap forward in inference_mode/no_grad? For reference, during training a batch of 8 at res 560 fills up about 40 GB, so it would make sense that a batch of 2 can OOM 11 GB, since the weights also take some space. If you want to finetune I'd suggest freezing batchnorm and using batch size 1. |
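A hedged sketch of the two suggestions above: wrapping the forward pass in inference mode, and freezing batchnorm when finetuning with batch size 1. The `roma_model(batch)` call is a placeholder, not the exact RoMa forward signature.

```python
import torch

# Inference: no autograd graph (and no saved activations) is kept, so memory use
# stays close to what the weights plus a single forward pass need.
roma_model.eval()
with torch.inference_mode():
    out = roma_model(batch)  # placeholder forward call for illustration

# Finetuning with batch size 1: freeze batchnorm so its running statistics and
# affine parameters are not updated from single-sample batches.
for m in roma_model.modules():
    if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
        m.eval()
        if m.weight is not None:
            m.weight.requires_grad_(False)
        if m.bias is not None:
            m.bias.requires_grad_(False)
```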
@Parskatt Thanks for the very fast reply! Uh nice, yes with that
I am able to reduce the memory footprint (makes sense, I forgot it), and can process batches of 6 images. Thank you very much! |
Somehow I don't see the recommendation to apply model.eval here @Parskatt, but thank you, I changed my implementation to
|
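The thread does not show the exact change the commenter made, but a plausible sketch is switching the model into evaluation mode before matching, so batchnorm uses its stored running statistics instead of per-batch statistics; that would make the certainty independent of batch size. The (warp, certainty) return values are inferred from the thread's usage, and this is an assumption about the fix, not a confirmed RoMa recommendation.

```python
import torch

roma_model.eval()  # batchnorm now uses running statistics, so the output for a
                   # given pair no longer depends on which batch it is part of
with torch.no_grad():
    warp_16, cert_16 = roma_model.match(torch.stack([im1] * 16), torch.stack([im2] * 16))
    warp_8, cert_8 = roma_model.match(torch.stack([im1] * 8), torch.stack([im2] * 8))
# In eval mode, the certainty for the same image pair should agree across batch sizes.
```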
Yeah, GitHub bugged for me and showed my comment as duplicated; I removed one and both disappeared... |
alright, in any case thank you very much! |
@sushi31415926, thanks for reporting this! I had the same issue and your solution resolved it for me. By the way, could you let me know how this solution solves the problem? |
Hi @Parskatt, does the training batch size significantly affect the final performance of the model? I only have 24 GB GPUs, and the maximum batch size I can fit is 2 rather than the 8 you used as described in the paper.
|
In general, yes, lower batch size reduces results. I would not go below 8. It's difficult to give advice, but you can decrease the resolution for lower memory use. You can also reduce memory by writing a better local correlation kernel; I have done this, but it's part of a new project which I can't share yet. You could also try things like sync batchnorm and gradient accumulation. Batchnorm is scary though. |
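A hedged sketch of the two workarounds mentioned above (synchronized batchnorm and gradient accumulation). The function names, optimizer, loader, and loss function are illustrative placeholders, not RoMa's training code.

```python
import torch

def convert_to_sync_bn(model: torch.nn.Module) -> torch.nn.Module:
    # Synchronized batchnorm: share batch statistics across GPUs under DDP, so
    # e.g. 2 GPUs x local batch 4 behave like a global batch of 8 for BN layers.
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

def train_with_grad_accumulation(model, optimizer, loader, loss_fn, accum: int = 4):
    # Gradient accumulation: step the optimizer only every `accum` micro-batches,
    # simulating a larger effective batch for the gradients. Note this does NOT
    # help batchnorm, which still sees the small per-step batch.
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        (loss / accum).backward()
        if (step + 1) % accum == 0:
            optimizer.step()
            optimizer.zero_grad()
```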
Thank you for your fast response. Do you mean a batch size of 8 in total (global batch size) or 8 per GPU (local batch size)? |
I have verified that a batch size of 8 on 1 GPU works (global batch size). A local batch size of 4 on 2 GPUs with batchnorm synchronization should also give very similar results. I really would like to throw out bn, but it's just too good for the refiners to switch easily (at least in my older experiments this was the case). |
I see. bn is quite annoying! |
hello,
I noticed that when I use different batch sizes the certainty value changes.
Do you have any idea why?