Does NeMo-Aligner support gradient accumulation (accumulate_grad_batches)? #451
Replies: 2 comments
-
Yes, it's supported. It is used implicitly whenever global_batch_size > micro_batch_size * data_parallel_size.
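In case it helps, here is a minimal sketch of how the accumulation factor falls out of those settings. The function name and signature are illustrative only, not NeMo-Aligner's actual API:

```python
# Illustrative sketch: these names mirror the config options but are not
# NeMo-Aligner internals. Accumulation kicks in whenever the result is > 1.
def accumulate_grad_batches(global_batch_size: int,
                            micro_batch_size: int,
                            data_parallel_size: int) -> int:
    per_optimizer_step = micro_batch_size * data_parallel_size
    assert global_batch_size % per_optimizer_step == 0, "global batch must divide evenly"
    return global_batch_size // per_optimizer_step

print(accumulate_grad_batches(global_batch_size=256, micro_batch_size=2, data_parallel_size=8))  # 16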
-
Thanks for the quick response! In my case, global_batch_size = 128 and micro_batch_size = 1. I didn't explicitly set data_parallel_size, but it should be 1 (according to world_size // (tensor_model_parallel_size * pipeline_model_parallel_size)). So will NeMo Aligner automatically use accumulate_grad_batches = 128 in my case? A related question is about the dataloader: we use this build_dataloader (https://github.com/NVIDIA/NeMo-Aligner/blob/main/nemo_aligner/data/nlp/builders.py#L462C1-L462C22) with
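For the batch-size question above, the arithmetic would work out as follows. This is only a worked example assuming world_size = 1 and TP = PP = 1, so that data_parallel_size = 1 as described:

```python
# Assumed values for illustration; only global_batch_size and micro_batch_size
# are taken from the question above.
world_size = 1
tensor_model_parallel_size = 1
pipeline_model_parallel_size = 1
data_parallel_size = world_size // (tensor_model_parallel_size * pipeline_model_parallel_size)  # 1

global_batch_size = 128
micro_batch_size = 1
grad_accumulation_steps = global_batch_size // (micro_batch_size * data_parallel_size)
print(grad_accumulation_steps)  # 128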
-
My understanding of gradient accumulation is that we can logically use a larger global batch size while only loading a micro batch of training data at each step.
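To make that concrete, here is a generic gradient-accumulation loop in plain PyTorch. It is not NeMo-Aligner's actual training loop, just the idea: each iteration loads only a micro batch, gradients are summed across micro batches, and the optimizer steps once per global batch.

```python
# Generic illustration of gradient accumulation (not NeMo-Aligner's loop).
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

global_batch_size, micro_batch_size = 128, 1
accumulation_steps = global_batch_size // micro_batch_size  # 128, assuming data_parallel_size = 1

optimizer.zero_grad()
for step in range(accumulation_steps):
    x = torch.randn(micro_batch_size, 16)  # stand-in for one micro batch from the dataloader
    y = torch.randn(micro_batch_size, 1)
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the sum averages over the global batch
    loss.backward()                                   # gradients accumulate in .grad
optimizer.step()                                      # one weight update per global batch
optimizer.zero_grad()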
From NeMo's docs, it does support that: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/batching.html
Does NeMo Aligner also support it? If so, how do I enable it?
Thanks!