[Bug]: transformers[4.46.2] error in multi-gpu when training. #1070

tanypullin opened this issue Nov 11, 2024 · 2 comments

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7B

What is the scenario where the problem happened?

transformers

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu
Python: 3.10
GPUs: 2x NVIDIA 4090

Log output

File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/peft/peft_model.py", line 1644, in forward
    return self.base_model(
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
    loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
    loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 28, in fixed_cross_entropy
    loss = loss / num_items_in_batch
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Description

Steps to reproduce

This happens with Qwen2.5-7B-Instruct.
The problem can be reproduced with the following steps:

  1. Run PEFT training on two GPUs (a minimal sketch, under assumed settings, follows below).
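
For context, below is a minimal sketch of the kind of multi-GPU PEFT run that triggers this; the LoRA config, dataset, and device_map="auto" sharding are illustrative assumptions, not details taken from the issue.

    # Minimal multi-GPU PEFT training sketch (assumed setup, for illustration only):
    # LoRA via peft, the HF Trainer, and device_map="auto" sharding across two GPUs.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # shards the layers across cuda:0 and cuda:1
    )
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

    # Any small text dataset is enough to reach the loss computation.
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:200]")
    dataset = dataset.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
        remove_columns=dataset.column_names,
    ).filter(lambda ex: len(ex["input_ids"]) > 1)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                               num_train_epochs=1, bf16=True),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()  # with transformers 4.46.2 this fails with the RuntimeError in the log above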

Expected results

Training is expected to run without errors.

Attempts to fix

Anything else helpful for investigation

Downgrading transformers to 4.45.0 works.

jklj077 (Collaborator) commented Nov 14, 2024

Downgrading transformers to 4.45.0 works.

This looks like an issue with transformers after the loss functions were reworked in 4.46.

For a hotfix, could you try editing this line

  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 28, in fixed_cross_entropy
    loss = loss / num_items_in_batch

to

    loss = loss / torch.tensor(num_items_in_batch, device=loss.device)

Or stay on transformers<4.46.0 until a proper fix is released.
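
If editing the installed file is inconvenient, the same idea can be applied as a runtime monkey-patch before training starts. This is only a sketch under the assumption that, in transformers 4.46.x, the helper lives in transformers.loss.loss_utils as in the traceback above; the parameter names below mirror that version and may differ in other releases.

    import torch
    import transformers.loss.loss_utils as loss_utils

    _original_fixed_cross_entropy = loss_utils.fixed_cross_entropy

    def _patched_fixed_cross_entropy(source, target, num_items_in_batch=None,
                                     ignore_index=-100, **kwargs):
        # If the divisor arrives as a tensor on another GPU, move it onto the
        # device of the logits before the original function divides by it.
        if isinstance(num_items_in_batch, torch.Tensor):
            num_items_in_batch = num_items_in_batch.to(source.device)
        return _original_fixed_cross_entropy(
            source, target, num_items_in_batch, ignore_index, **kwargs
        )

    # ForCausalLMLoss looks the helper up in this module at call time, so
    # replacing the attribute is enough; apply this before trainer.train().
    loss_utils.fixed_cross_entropy = _patched_fixed_cross_entropy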


This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
