I have been testing https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py
The language models I used are llama-3-70B and llama-3.1-70B.
With llama-3-70B, FSDP QLoRA works very well.
However, it does not work well with llama-3.1-70B.
I suspect a library dependency issue; I have been testing dependency combinations for the last three days, but I cannot find one that makes it work, so I am asking here.
Torch versions experimented with:
torch==2.2.0
torch==2.2.2
torch==2.3.0
Library versions experimented with (a snippet for printing the exact installed set is sketched after this list):
transformers==4.40.0 up to the latest version
datasets==2.18.0 up to the latest version
accelerate==0.29.3 up to the latest version
evaluate==0.4.1 up to the latest version
bitsandbytes==0.43.1 up to the latest version
huggingface_hub==0.22.2 up to the latest version
trl==0.8.6 up to the latest version
peft==0.10.0 up to the latest version
Error traceback:
trainable params: 671,088,640 || all params: 8,701,349,888 || trainable%: 7.7125
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/user/iitp-Data/./script/fsdp_qlora.py", line 179, in <module>
[rank0]: training_function(script_args, training_args)
[rank0]: File "/data/user/iitp-Data/./script/fsdp_qlora.py", line 158, in training_function
[rank0]: trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
[rank0]: output = super().train(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
[rank0]: self.model = self.accelerator.prepare(self.model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
[rank0]: result = tuple(
[rank0]: ^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
[rank0]: model = FSDP(model, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank0]: _init_param_handle_from_module(
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank0]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank0]: handle = FlatParamHandle(
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank0]: self._init_flat_param_and_metadata(
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank0]: ) = self._validate_tensors_to_flatten(params)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
[rank0]: raise ValueError(
[rank0]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/user/iitp-Data/./script/fsdp_qlora.py", line 179, in <module>
[rank1]: training_function(script_args, training_args)
[rank1]: File "/data/user/iitp-Data/./script/fsdp_qlora.py", line 158, in training_function
[rank1]: trainer.train(resume_from_checkpoint=checkpoint)
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
[rank1]: output = super().train(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
[rank1]: self.model = self.accelerator.prepare(self.model)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
[rank1]: result = tuple(
[rank1]: ^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
[rank1]: model = FSDP(model, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank1]: _init_param_handle_from_module(
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank1]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank1]: handle = FlatParamHandle(
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank1]: self._init_flat_param_and_metadata(
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank1]: ) = self._validate_tensors_to_flatten(params)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/commpany/anaconda3/envs/fsdp/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
[rank1]: raise ValueError(
[rank1]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
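The ValueError itself comes from FSDP's parameter flattening: every tensor inside an FSDP unit must share one dtype, but here some parameters are stored in bfloat16 and others in float32. Below is a minimal sketch (an assumption on my part, not a verified fix for llama-3.1-70B) of loading the model with a 4-bit NF4 config whose `bnb_4bit_quant_storage` matches `torch_dtype`, so that all parameter storage is uniform before FSDP wraps the model; the model id is a placeholder.

```python
# Minimal sketch (assumption, not a verified fix): load the quantized model so
# that every parameter is stored in a single dtype before FSDP flattens it.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B"  # placeholder model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Storage dtype of the quantized weights; matching it to torch_dtype keeps
    # all parameters in bfloat16, which FSDP's uniform-dtype check requires.
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Diagnostic: list any parameters that are NOT bfloat16; these are the tensors
# that would trigger the ValueError during FSDP flattening.
for name, param in model.named_parameters():
    if param.dtype != torch.bfloat16:
        print(name, param.dtype)
```

Note also that llama-3.1's new `rope_scaling` format is only recognized by fairly recent transformers releases (4.43 and later, as far as I know), so the older end of the version range above may not load the model correctly in the first place.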