
Potentially incorrect calculation of total_updates on >=4.46.0 since #34198 affecting multi gpu training #35387

chiragjn opened this issue Dec 21, 2024 · 1 comment · May be fixed by #35399

chiragjn commented Dec 21, 2024

System Info

Okay, I have been pulling my hair out for a few hours, and it turns out this bug only happens when average_tokens_across_devices is True and epochs > 1.

Simplest case to reproduce (a sketch of the matching TrainingArguments follows the list):

  • DDP
  • world size 2
  • dataset length = 4
  • epochs = 2
  • micro batch size = 1 (a.k.a. per-GPU batch size)
  • gradient accumulation = 1
  • average_tokens_across_devices = True
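
A rough sketch of this configuration as TrainingArguments, for reference (output_dir is a placeholder and the model/dataset don't matter; only the flags below are relevant to the bug, and the script is launched with 2 DDP processes):

from transformers import TrainingArguments

# Hypothetical arguments mirroring the list above; run under DDP with world size 2.
args = TrainingArguments(
    output_dir="out",                    # placeholder
    num_train_epochs=2,                  # epochs = 2
    per_device_train_batch_size=1,       # micro batch size = 1
    gradient_accumulation_steps=1,       # gradient accumulation = 1
    average_tokens_across_devices=True,  # the flag that triggers the crash
)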

So every epoch runs 2 steps on each device (4 examples / 2 ranks / micro batch size 1), but as the first epoch finishes, we get:

[rank1]:   File "/home/jovyan/axolotl/src/axolotl/train.py", line 191, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank1]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank1]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 5142, in get_batch_samples
[rank1]:     num_items_in_batch = self.accelerator.gather(num_items_in_batch).sum().item()
[rank1]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/accelerator.py", line 2459, in gather
[rank1]:     return gather(tensor)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 376, in wrapper
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 437, in gather
[rank1]:     return _gpu_gather(tensor)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank1]:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
[rank1]:     raise TypeError(
[rank1]: TypeError: Unsupported types (<class 'NoneType'>) passed to `_gpu_gather_one`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.

The main culprit here is this block in _inner_training_loop:

total_updates = steps_in_epoch // args.gradient_accumulation_steps + 1
for _ in range(total_updates):
    update_step += 1
    num_batches = args.gradient_accumulation_steps if update_step != (total_updates - 1) else remainder
    batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
    for i, inputs in enumerate(batch_samples):

steps_in_epoch per rank is correctly calculated as 2, but total_updates comes out as 2 // 1 + 1 = 3.
Normally that is harmless: on the extra iteration the dataloader is already exhausted, batch_samples comes back empty, and the for i, inputs in enumerate(batch_samples) loop is never entered.
However, with the recently added average_tokens_across_devices option, get_batch_samples still tries to gather the number of items in the batch across all ranks, and gather does not accept None:

def get_batch_samples(self, epoch_iterator, num_batches):
    batch_samples = []
    num_items_in_batch = None
    for _ in range(num_batches):
        try:
            batch_samples += [next(epoch_iterator)]
        except StopIteration:
            break
    if len(batch_samples) > 0 and "labels" in batch_samples[0]:
        # For now we don't support object detection
        try:
            num_items_in_batch = sum([(batch["labels"].ne(-100)).sum() for batch in batch_samples])
        except (TypeError, AttributeError):
            pass
    if self.args.average_tokens_across_devices:
        num_items_in_batch = self.accelerator.gather(num_items_in_batch).sum().item()

This problem does not surface with a single GPU, because average_tokens_across_devices is automatically set to False there, nor with epochs = 1, because DefaultFlowCallback stops training once the global step reaches the expected max steps.
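
To make the failure mode concrete, here is a minimal standalone sketch (plain Python, not Trainer code) of what the extra update step does: the per-rank iterator is already exhausted, so num_items_in_batch never gets set, and a None would then reach gather:

# Standalone sketch; assumes 2 batches per rank and gradient accumulation = 1.
steps_in_epoch = 2
gradient_accumulation_steps = 1
total_updates = steps_in_epoch // gradient_accumulation_steps + 1  # = 3, one too many

epoch_iterator = iter([{"labels": [1, 2]}, {"labels": [3, 4]}])  # the 2 per-rank batches

for update_step in range(total_updates):
    batch_samples = []
    num_items_in_batch = None
    try:
        batch_samples.append(next(epoch_iterator))
    except StopIteration:
        pass
    if batch_samples:
        num_items_in_batch = sum(len(b["labels"]) for b in batch_samples)
    print(update_step, num_items_in_batch)  # second value: 2, 2, then None
    # On the last iteration num_items_in_batch is still None; with
    # average_tokens_across_devices=True the real code would now call
    # accelerator.gather(None) and raise the TypeError shown above.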

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Train any LM on more than one GPU, with at least average_tokens_across_devices = True and epochs > 1.

Expected behavior

Either we fix the total_updates count, or we handle None before the gather.
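
For illustration, a sketch of the first option in terms of the variables from the snippet above (just an illustration of the idea, not the change proposed in #35399):

# Only schedule the extra update when the epoch length is not an exact
# multiple of the accumulation steps (sketch, not the merged fix).
leftover = steps_in_epoch % args.gradient_accumulation_steps
total_updates = steps_in_epoch // args.gradient_accumulation_steps + int(leftover > 0)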


SunMarc commented Dec 23, 2024

Thanks for the detailed report @chiragjn! I think the simplest solution would be to set num_items_in_batch to a tensor of 0 when the value is None, since that means batch_samples doesn't have any elements. Can you check if this solves the issue?

if self.args.average_tokens_across_devices:
    if num_items_in_batch is None:
        num_items_in_batch = torch.tensor(0)
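
For context, a sketch of how that guard could slot into the tail of get_batch_samples (assuming the placeholder tensor may need to live on the training device so that gather sees matching tensors on every rank; self.args.device stands for that device here):

if self.args.average_tokens_across_devices:
    # Ranks whose iterator is already exhausted contribute 0 instead of None,
    # so accelerator.gather() receives a tensor on every rank (sketch only).
    if num_items_in_batch is None:
        num_items_in_batch = torch.tensor(0, device=self.args.device)
    num_items_in_batch = self.accelerator.gather(num_items_in_batch).sum().item()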
