batch_size_per_gpu * grad_accumulation_steps == global steps ? #632

Open
lumpidu opened this issue Dec 15, 2024 · 6 comments
Labels
question Further information is requested

Comments

lumpidu commented Dec 15, 2024

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Question details

To my understanding, if one wants to reproduce your model on slightly different hardware, the combination of batch_size_per_gpu and grad_accumulation_steps should be chosen so that it yields the same global batch size and the same gradient-accumulation behaviour.

As an example:

You used 8x A100 80GB GPUs for model training, so your settings are:

batch_size_per_gpu: 38400
grad_accumulation_steps: 1

If I want to use 8x A100 40GB GPUs to reproduce your results, I'd use:

batch_size_per_gpu: 19200
grad_accumulation_steps: 2

Then the overall number of steps should be the same, and training time should also be roughly the same (same GPU, just half the memory).
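To make the arithmetic explicit, here is a minimal sketch; the helper name and the interpretation of batch_size_per_gpu as the per-GPU batch per step are mine, not from the repo:

```python
# Effective (global) batch per optimizer update for both configurations.
def global_batch(batch_size_per_gpu: int, num_gpus: int, grad_accumulation_steps: int) -> int:
    return batch_size_per_gpu * num_gpus * grad_accumulation_steps

print(global_batch(38400, 8, 1))  # 307200 on 8x A100 80GB
print(global_batch(19200, 8, 2))  # 307200 on 8x A100 40GB -> same effective batch per update
```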

But currently, the reported overall step count does not depend on grad_accumulation_steps at all. The variable global_step in trainer.py is also incremented independently of the value of grad_accumulation_steps.

This is not the case in, e.g., diffusers (see the sketch at the end of this comment).

The consequence is that the overall number of steps in the above example is doubled, checkpoints at a given step count are therefore not comparable, and training appears to progress twice as fast, since 2 steps are now computed in the time the 80GB A100 setup needs for 1 step.

This looks like a bug to me, but I am unsure because of your answer in #630.
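For reference, here is a toy sketch of the counting behaviour I would have expected, where the global step advances only once per optimizer update; all names are placeholders, and this is not the actual trainer.py or diffusers code:

```python
# Toy loop: count micro-batch steps vs. optimizer updates.
grad_accumulation_steps = 2
micro_batches_per_epoch = 8

step = 0         # advances on every forward/backward pass (what trainer.py counts now)
global_step = 0  # advances only when the optimizer actually updates the weights

for i in range(micro_batches_per_epoch):
    step += 1
    # loss.backward() would accumulate gradients here
    if (i + 1) % grad_accumulation_steps == 0:
        # optimizer.step(); optimizer.zero_grad()
        global_step += 1

print(step, global_step)  # 8 4 -> steps = updates * grad_accumulation_steps
```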

lumpidu added the question label Dec 15, 2024
lumpidu changed the title from "global_step * grad_accumulation_steps == global steps ?" to "batch_size_per_gpu * grad_accumulation_steps == global steps ?" Dec 15, 2024
SWivid (Owner) commented Dec 15, 2024

steps = updates * grad_accumulation_steps

It is updates (as in our paper) that should be compared, not steps.

You can find the note in train.py (previous version), or in the new hydra-based config:

grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps

To be clear: if you train with grad_accum=2, you can compare your 2,400,000-step ckpt with our released 1,200,000-step ckpt; both are 1,200,000-update ckpts.
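Spelled out as a tiny helper (the function is just for illustration, not part of the repo):

```python
def steps_to_updates(steps: int, grad_accumulation_steps: int) -> int:
    return steps // grad_accumulation_steps

print(steps_to_updates(1_200_000, 1))  # 1,200,000 updates (released ckpt, grad_accum=1)
print(steps_to_updates(2_400_000, 2))  # 1,200,000 updates -> comparable checkpoint
```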

lumpidu (Author) commented Dec 16, 2024

Thanks for the clarifications.

It's confusing, since in some projects updates == steps, while here they differ. Also, the number of epochs does not take the number of updates into account, only steps. Could it happen that training ends (e.g. after 100 epochs) without the last update being applied?

IMHO, the number of updates is the most significant number for training, because it's the one number that makes results of different training runs comparable, yet it is not shown anywhere. Judging by the fine-tuning discussions, people are not aware of the difference.

Maybe it would be worth adding an explicit note to the training README, not just a side comment in the configuration? I can draft a PR if you like.

SWivid (Owner) commented Dec 17, 2024

Sure, a PR is welcome.

An epoch is one full pass through the dataset, so its concept is naturally tied to steps; but yes, one needs to multiply it by grad_accum accordingly to maintain the same number of updates.
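As a rough sketch of that adjustment (the numbers below are made up for illustration; steps per epoch depends on your dataset size and batch_size_per_gpu):

```python
# updates per epoch = steps per epoch / grad_accumulation_steps
def epochs_for_same_updates(target_updates: int, steps_per_epoch: int, grad_accum: int) -> float:
    return target_updates / (steps_per_epoch / grad_accum)

print(epochs_for_same_updates(1_200_000, 12_000, 1))  # 100.0 epochs
print(epochs_for_same_updates(1_200_000, 12_000, 2))  # 200.0 epochs with the same batch_size_per_gpu
```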

yachuntsaikk commented:
I have a similar question.

If I use a single A100 80GB GPU with batch_size_per_gpu: 38400, how can I achieve comparable results to training on 8 x A100 80GB GPUs?

  • Should I set grad_accumulation_steps to 8 to simulate the same global batch size?
  • Do I also need to scale the training steps (global steps) by 8?
  • Or does the number of GPUs only affect training speed without influencing the global batch size?

Thanks!

SWivid (Owner) commented Dec 17, 2024

> If I use a single A100 80GB GPU with batch_size_per_gpu: 38400, how can I achieve comparable results to training on 8 x A100 80GB GPUs?

Yes, set grad_accum to 8 and train to 1.2M * 8 = 9.6M steps (effectively 1.2M updates).
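The arithmetic behind that, written out (assuming batch_size_per_gpu stays at the 38400 from the config):

```python
# 8x A100 80GB with grad_accum=1 vs. 1x A100 80GB with grad_accum=8: same batch per update
assert 38400 * 8 * 1 == 38400 * 1 * 8

released_updates = 1_200_000        # released ckpt, trained with grad_accum=1
single_gpu_grad_accum = 8
print(released_updates * single_gpu_grad_accum)  # 9,600,000 steps == 1.2M updates
```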

ndhuynh02 commented:
Does this also mean that if I use a single A100 80GB, I can use a batch size of 19200 with 16 grad_accumulation_steps to achieve comparable results?
