batch_size_per_gpu * grad_accumulation_steps == global steps ? #632

Open
lumpidu opened this issue Dec 15, 2024 · 6 comments
Labels
question Further information is requested

Comments

lumpidu commented Dec 15, 2024

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Question details

To my understanding, if one wants to reproduce your model on slightly different hardware, the combination of batch_size_per_gpu and grad_accumulation_steps should be chosen so that it yields the same global batch size and the same gradient-accumulation behaviour.

As an example:

You used 8x A100 80GB GPUs for model training, so your settings are:

batch_size_per_gpu: 38400
grad_accumulation_steps: 1

If I want to use 8x A100 40GB GPUs to reproduce your results, I'd use:

batch_size_per_gpu: 19200
grad_accumulation_steps: 2

Then the overall number of steps should be the same, and training time should also be roughly the same (same GPU, just half the memory).
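To make the arithmetic explicit, here is a minimal sketch; the helper name and the interpretation of batch_size_per_gpu as the per-GPU batch per step are mine, not from the repo:

```python
# Effective (global) batch per optimizer update for both configurations.
def global_batch(batch_size_per_gpu: int, num_gpus: int, grad_accumulation_steps: int) -> int:
    return batch_size_per_gpu * num_gpus * grad_accumulation_steps

print(global_batch(38400, 8, 1))  # 307200 on 8x A100 80GB
print(global_batch(19200, 8, 2))  # 307200 on 8x A100 40GB -> same effective batch per update
```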

But currently, the reported overall step count does not depend on grad_accumulation_steps at all. The variable global_step in trainer.py is also incremented independently of the value of grad_accumulation_steps.

This is not the case in, e.g., diffusers (see the sketch at the end of this comment).

The consequence is that the overall number of steps in the above example is doubled, checkpoints at a given step count are therefore not comparable, and training appears to progress twice as fast, since 2 steps are now computed in the time the 80GB A100 setup needs for 1 step.

This looks like a bug to me, but I am unsure because of your answer in #630.
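For reference, here is a toy sketch of the counting behaviour I would have expected, where the global step advances only once per optimizer update; all names are placeholders, and this is not the actual trainer.py or diffusers code:

```python
# Toy loop: count micro-batch steps vs. optimizer updates.
grad_accumulation_steps = 2
micro_batches_per_epoch = 8

step = 0         # advances on every forward/backward pass (what trainer.py counts now)
global_step = 0  # advances only when the optimizer actually updates the weights

for i in range(micro_batches_per_epoch):
    step += 1
    # loss.backward() would accumulate gradients here
    if (i + 1) % grad_accumulation_steps == 0:
        # optimizer.step(); optimizer.zero_grad()
        global_step += 1

print(step, global_step)  # 8 4 -> steps = updates * grad_accumulation_steps
```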

lumpidu added the question label Dec 15, 2024
lumpidu changed the title from "global_step * grad_accumulation_steps == global steps ?" to "batch_size_per_gpu * grad_accumulation_steps == global steps ?" Dec 15, 2024
SWivid (Owner) commented Dec 15, 2024

steps = updates * grad_accumulation_steps

It is updates (as in our paper) that should be compared, not steps.

You can find the note in train.py (previous version), or in the new hydra-based config:

grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps

To be clear: if you train with grad_accum=2, you can compare your 2,400,000-step ckpt with our released 1,200,000-step ckpt; both are 1,200,000-update ckpts.
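Spelled out as a tiny helper (the function is just for illustration, not part of the repo):

```python
def steps_to_updates(steps: int, grad_accumulation_steps: int) -> int:
    return steps // grad_accumulation_steps

print(steps_to_updates(1_200_000, 1))  # 1,200,000 updates (released ckpt, grad_accum=1)
print(steps_to_updates(2_400_000, 2))  # 1,200,000 updates -> comparable checkpoint
```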

lumpidu (Author) commented Dec 16, 2024

Thanks for the clarifications.

It's confusing, since in some projects updates == steps, while here they differ. Also, the number of epochs does not take the number of updates into account, only steps. Could it happen that training ends (e.g. after 100 epochs) without the last update being applied?

IMHO, the number of updates is the most significant number for training, because it's the one number that makes results of different training runs comparable, yet it is not shown anywhere. Judging by the fine-tuning discussions, people are not aware of the difference.

Maybe it would be worth adding an explicit note to the training README, not just a side comment in the configuration? I can draft a PR if you like.

SWivid (Owner) commented Dec 17, 2024

Sure, a PR is welcome.

An epoch is one full pass through the dataset, so its concept is naturally tied to steps; but yes, one needs to multiply it by grad_accum accordingly to maintain the same number of updates.
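As a rough sketch of that adjustment (the numbers below are made up for illustration; steps per epoch depends on your dataset size and batch_size_per_gpu):

```python
# updates per epoch = steps per epoch / grad_accumulation_steps
def epochs_for_same_updates(target_updates: int, steps_per_epoch: int, grad_accum: int) -> float:
    return target_updates / (steps_per_epoch / grad_accum)

print(epochs_for_same_updates(1_200_000, 12_000, 1))  # 100.0 epochs
print(epochs_for_same_updates(1_200_000, 12_000, 2))  # 200.0 epochs with the same batch_size_per_gpu
```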

yachuntsaikk commented:
I have a similar question.

If I use a single A100 80GB GPU with batch_size_per_gpu: 38400, how can I achieve comparable results to training on 8 x A100 80GB GPUs?

  • Should I set grad_accumulation_steps to 8 to simulate the same global batch size?
  • Do I also need to scale the training steps (global steps) by 8?
  • Or does the number of GPUs only affect training speed without influencing the global batch size?

Thanks!

SWivid (Owner) commented Dec 17, 2024

> If I use a single A100 80GB GPU with batch_size_per_gpu: 38400, how can I achieve comparable results to training on 8 x A100 80GB GPUs?

Yes, set grad_accum to 8 and train to 1.2M * 8 = 9.6M steps (effectively 1.2M updates).
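The arithmetic behind that, written out (assuming batch_size_per_gpu stays at the 38400 from the config):

```python
# 8x A100 80GB with grad_accum=1 vs. 1x A100 80GB with grad_accum=8: same batch per update
assert 38400 * 8 * 1 == 38400 * 1 * 8

released_updates = 1_200_000        # released ckpt, trained with grad_accum=1
single_gpu_grad_accum = 8
print(released_updates * single_gpu_grad_accum)  # 9,600,000 steps == 1.2M updates
```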

ndhuynh02 commented:
Does this also mean that if I use a single A100 80GB, I can use a batch size of 19200 with 16 grad_accumulation_steps to achieve comparable results?
