
Model loading OOM when using FSDP + QLoRA #31721

Open · 2 of 4 tasks

Neo9061 opened this issue Jul 1, 2024 · 0 comments

Neo9061 commented Jul 1, 2024

System Info

Baseline: on a single p4de.24xlarge instance (640 GB GPU memory, 1,000 GB CPU memory), I am able to use Q(4-bit)LoRA to train a large model with a size close to 300B parameters. device_map is set to "auto", with code as below.

import torch
from transformers import AutoModelForCausalLM

# bnb_config is a 4-bit BitsAndBytesConfig defined earlier in the script
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

However, when I use FSDP + QLoRA with 2 p4de.24xlarge instances, model loading goes OOM on the CPU.
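For context, here is a minimal sketch of the kind of 4-bit config typically used with FSDP + QLoRA. The actual bnb_config I use is in the attached script; the values below are illustrative assumptions only.

import torch
from transformers import BitsAndBytesConfig

# Illustrative values; the real config is in the attached script.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    # Storage dtype should match torch_dtype so FSDP can wrap the 4-bit weights.
    bnb_4bit_quant_storage=torch.bfloat16,
)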

Can anyone please share some insights? I have been looking at the from_pretrained method's code here and here. Can I get clarification on the following questions? Many thanks.

  1. For FSDP + QLoRA, during model loading, please comment on whether my understanding is correct:
  • If the model is quantized, it is first loaded on GPU and then cast to CPU, because of the is_quantized check in this line and this comment.
  • If the model is not quantized, it is loaded directly onto CPU.
  2. The OOM happens on CPU, as I didn't see any "not enough CUDA memory" error. So, for a quantized model, when it is cast to CPU, is only rank 0 doing that, or is every rank casting into CPU, causing CPU memory to explode? The same question applies to loading a non-quantized model. (See the sketch after this list.)
  3. For a quantized model, if it is first loaded onto GPU, do all GPUs load the model, or does only rank 0 load it?
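For what it's worth, my reading is that accelerate's RAM-efficient loading path is meant to address exactly this case. Below is a minimal sketch, assuming the documented ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING environment variables, of the rank-0-only condition I believe from_pretrained checks; the helper name is hypothetical.

import os

def is_rank0_ram_efficient_fsdp() -> bool:
    # Hypothetical helper mirroring the condition I believe from_pretrained
    # uses: with FSDP enabled and RAM-efficient loading on, only global rank 0
    # materializes real weights; other ranks build the model on the meta device.
    return (
        os.environ.get("ACCELERATE_USE_FSDP", "false").lower() == "true"
        and os.environ.get("FSDP_CPU_RAM_EFFICIENT_LOADING", "false").lower() == "true"
        and int(os.environ.get("RANK", "0")) == 0
    )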

Who can help?

@SunMarc @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is my code to reproduce the issue: Distributed-finetuning.zip
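To make the failure easier to see, here is a small diagnostic sketch (assuming psutil is installed) that logs per-rank CPU RSS around model loading, to check whether every rank is materializing the weights in host memory:

import os
import psutil

def log_cpu_rss(tag: str) -> None:
    # Print this process's resident set size, labeled by distributed rank.
    rank = int(os.environ.get("RANK", "0"))
    rss_gb = psutil.Process().memory_info().rss / 1024**3
    print(f"[rank {rank}] {tag}: CPU RSS = {rss_gb:.1f} GiB", flush=True)

log_cpu_rss("before from_pretrained")
# model = AutoModelForCausalLM.from_pretrained(...)
log_cpu_rss("after from_pretrained")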

Expected behavior

Model loading should complete without OOM errors.
