[Bug]: transformers[4.46.2] error in multi-gpu when training. #1070

tanypullin opened this issue Nov 11, 2024 · 2 comments

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7B

What is the scenario where the problem happened?

transformers

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu
Python: 3.10
GPUs: 2x NVIDIA 4090

Log output

File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/peft/peft_model.py", line 1644, in forward
    return self.base_model(
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
    loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
    loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 28, in fixed_cross_entropy
    loss = loss / num_items_in_batch
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Description

Steps to reproduce

This happens with Qwen2.5-7B-Instruct.
The problem can be reproduced with the following steps:

  1. Run PEFT training on two GPUs (a minimal sketch, under assumed settings, follows below).
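
For context, below is a minimal sketch of the kind of multi-GPU PEFT run that triggers this; the LoRA config, dataset, and device_map="auto" sharding are illustrative assumptions, not details taken from the issue.

    # Minimal multi-GPU PEFT training sketch (assumed setup, for illustration only):
    # LoRA via peft, the HF Trainer, and device_map="auto" sharding across two GPUs.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # shards the layers across cuda:0 and cuda:1
    )
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

    # Any small text dataset is enough to reach the loss computation.
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:200]")
    dataset = dataset.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
        remove_columns=dataset.column_names,
    ).filter(lambda ex: len(ex["input_ids"]) > 1)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                               num_train_epochs=1, bf16=True),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()  # with transformers 4.46.2 this fails with the RuntimeError in the log above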

Expected results

Training is expected to run without errors.

Attempts to fix

Anything else helpful for investigation

Downgrading transformers to 4.45.0 works.

jklj077 (Collaborator) commented Nov 14, 2024

Downgrading transformers to 4.45.0 works.

This looks like an issue with transformers after the loss functions were reworked in 4.46.

For a hotfix, could you try editing this line

  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 28, in fixed_cross_entropy
    loss = loss / num_items_in_batch

to

    loss = loss / torch.tensor(num_items_in_batch, device=loss.device)

Or stay on transformers<4.46.0 until a proper fix is released.
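
If editing the installed file is inconvenient, the same idea can be applied as a runtime monkey-patch before training starts. This is only a sketch under the assumption that, in transformers 4.46.x, the helper lives in transformers.loss.loss_utils as in the traceback above; the parameter names below mirror that version and may differ in other releases.

    import torch
    import transformers.loss.loss_utils as loss_utils

    _original_fixed_cross_entropy = loss_utils.fixed_cross_entropy

    def _patched_fixed_cross_entropy(source, target, num_items_in_batch=None,
                                     ignore_index=-100, **kwargs):
        # If the divisor arrives as a tensor on another GPU, move it onto the
        # device of the logits before the original function divides by it.
        if isinstance(num_items_in_batch, torch.Tensor):
            num_items_in_batch = num_items_in_batch.to(source.device)
        return _original_fixed_cross_entropy(
            source, target, num_items_in_batch, ignore_index, **kwargs
        )

    # ForCausalLMLoss looks the helper up in this module at call time, so
    # replacing the attribute is enough; apply this before trainer.train().
    loss_utils.fixed_cross_entropy = _patched_fixed_cross_entropy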


This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
