batch_size_per_gpu * grad_accumulation_steps == global steps ? #632
Comments
It is the same number of updates (as in our paper) to compare, not steps; you can find the notes in train.py (previous version).
To be clear, if you train with grad_accum=2, you can use your 2,400,000-step ckpt to compare with our released 1,200,000-step ckpt; they are both 1,200,000-update ckpts.
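In other words, the number to compare is steps divided by grad_accum. A minimal sketch of that arithmetic (plain Python, not the project's trainer code):

```python
# Convert the trainer's step counter into optimizer updates (a sketch, not trainer.py code).
grad_accumulation_steps = 2        # micro-batches accumulated per optimizer update
global_step = 2_400_000            # steps shown by a run trained with grad_accum=2

optimizer_updates = global_step // grad_accumulation_steps
print(optimizer_updates)           # 1_200_000 -> comparable to the released 1.2M-step ckpt
```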
Thanks for the clarifications. It's confusing: in some projects, IMHO, the number of updates is the most significant number for training, because it's the one number that lets you compare results of different runs, but here it is not shown anywhere. Judging from the discussions about fine-tuning, people are not aware of the difference. Maybe it would be worth having an extra note inside the training README, not just a side comment in the configuration? I can draft a PR, if you like.
Sure, a PR is welcome. An epoch is the number of passes through the dataset, so the concept is naturally tied to steps, but one does need to multiply it by grad_accum to maintain the same update count.
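To spell that multiplication out, a small back-of-the-envelope sketch (steps_per_epoch is an assumed value for illustration, not a project constant):

```python
# Illustrative arithmetic only; steps_per_epoch is an assumed number.
steps_per_epoch = 10_000
target_updates = 1_200_000

# Updates per epoch shrink with grad_accum, so epochs must grow by the same factor.
for grad_accum in (1, 2, 4):
    epochs_needed = target_updates * grad_accum // steps_per_epoch
    print(grad_accum, epochs_needed)   # 1 -> 120, 2 -> 240, 4 -> 480 epochs
```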
I have a similar question. If I use a single A100 80GB GPU with batch_size_per_gpu: 38400, how can I achieve comparable results to training on 8 x A100 80GB GPUs?
Thanks!
Yes, set grad_accum to 8 and train for 1.2M * 8 = 9.6M steps (effectively 1.2M updates).
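A hedged sanity check of that equivalence (whether the released 8-GPU run used grad_accum=1 with the same batch_size_per_gpu is an assumption here, not something stated in this thread):

```python
# Sketch only; the released 8-GPU settings (grad_accum=1, same batch_size_per_gpu) are assumptions.
released = dict(gpus=8, batch_size_per_gpu=38_400, grad_accumulation_steps=1)
single   = dict(gpus=1, batch_size_per_gpu=38_400, grad_accumulation_steps=8)

def frames_per_update(cfg):
    # effective batch contributing to one optimizer update
    return cfg["gpus"] * cfg["batch_size_per_gpu"] * cfg["grad_accumulation_steps"]

assert frames_per_update(released) == frames_per_update(single)   # 307_200 in both cases
print(1_200_000 * single["grad_accumulation_steps"])               # 9_600_000 steps for 1.2M updates
```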
Does this also mean that if I use a single A100 with 80GB, I can use a batch size of 19200 with 16 grad_accumulation_steps to achieve comparable results?
Question details
To my understanding, if one wants to reproduce your model on slightly different hardware, the combination of batch_size_per_gpu and grad_accumulation_steps should result in the same global batch size and gradient accumulation behaviour. As an example:
You used 8xA100 80GB GPUs for model training, accordingly your settings are:
If I want to use 8xA100 40GB GPUs for model training to reproduce your results, I'd halve batch_size_per_gpu and double grad_accumulation_steps (a hedged sketch of both settings follows below).
Then the overall steps should be the same and also training time is roughly the same (same GPU, just half the memory).
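For illustration, a hedged example of the pairing I have in mind (the concrete numbers are assumptions based on the batch_size_per_gpu: 38400 and 19200 values mentioned elsewhere in this thread, not the repo's shipped config):

```python
# Hedged illustration, not the repo's shipped config values.
a100_80gb = dict(batch_size_per_gpu=38_400, grad_accumulation_steps=1)  # assumed original 8x80GB run
a100_40gb = dict(batch_size_per_gpu=19_200, grad_accumulation_steps=2)  # half the memory, double the accumulation

# Both should contribute the same effective batch to each optimizer update on 8 GPUs:
for cfg in (a100_80gb, a100_40gb):
    print(8 * cfg["batch_size_per_gpu"] * cfg["grad_accumulation_steps"])  # 307_200 either way
```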
But currently, the displayed overall steps do not depend on grad_accumulation_steps at all. Also, the variable global_step in trainer.py is always incremented independently of the value of grad_accumulation_steps. This is not the case in, e.g., diffusers.
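As a sketch of the two counting conventions (simplified pseudocode, not the actual trainer.py training loop):

```python
# Simplified sketch of the two counting conventions; not the actual trainer.py code.
grad_accumulation_steps = 2

step_per_batch = 0    # current behaviour here: +1 for every micro-batch from the dataloader
step_per_update = 0   # diffusers-style: +1 only after the optimizer actually steps

for batch_idx in range(10):                      # 10 micro-batches
    # ... forward / backward on the micro-batch ...
    step_per_batch += 1
    if (batch_idx + 1) % grad_accumulation_steps == 0:
        # ... optimizer.step(); optimizer.zero_grad() ...
        step_per_update += 1

print(step_per_batch, step_per_update)           # 10 vs 5 -> the counters diverge by grad_accum
```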
The consequence is that the overall number of steps in the above example is doubled, so checkpoints at a given step count are not comparable, and training also appears to progress twice as fast, since 2 steps are now computed in the time the 80GB A100 version needed for 1 step.
It seems to me like a bug, but I am unsure because of your answer in #630.