
Add example Multi-GPU training script using pyreft #143

Open · wants to merge 11 commits into main
Conversation

ramvenkat98

Context

Create a version of the example Alpaca training script that supports training on multiple GPUs.

Implementation

The main changes from the original script are the distributed training setup, the data sampler, logging, and saving/loading. I also referred to an existing multi-GPU training script while writing this.

Distributed training is implemented with DDP (DistributedDataParallel), and the script is launched with torchrun.
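
For illustration, here is a minimal sketch of the pieces such a setup involves: process-group initialization, a DistributedSampler, a DDP-wrapped model, and rank-0-only logging and saving. It uses a toy model and dataset rather than the actual Llama/Alpaca code, and all names are placeholders, not this PR's implementation.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    is_main = dist.get_rank() == 0

    # Toy model and data standing in for the Llama model and Alpaca dataset.
    model = nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))

    # DistributedSampler shards the data so each rank sees a disjoint subset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()        # DDP all-reduces gradients here
            optimizer.step()
        if is_main:
            print(f"epoch {epoch} done, last loss {loss.item():.4f}")  # log on rank 0 only

    if is_main:
        # Save from rank 0 only; model.module is the unwrapped model.
        torch.save(model.module.state_dict(), "checkpoint.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A sketch like this would be launched the same way as the command below, e.g. `torchrun --nproc_per_node 4 ddp_sketch.py` (the filename is hypothetical).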

Training

Check that the existing single-GPU training script still works after the changes
Command:

python train.py --model_name_or_path yahma/llama-7b-hf --data_path ./alpaca_data.json --output_dir ./test_single_gpu_v1/ --layers "8;19" --rank 4 --position "f1+l1" --num_train_epochs 2 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "no" --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --max_n_train_example 10000

Output:
ram1998-job-3144778.txt
Wandb Report:
https://api.wandb.ai/links/ramvenkat98/5vqqkau8

Loss curves look reasonable, and training completes successfully.

Check that the new multi-GPU script works as expected
Command:

torchrun --nproc_per_node 4 train_multigpu.py --model_name_or_path yahma/llama-7b-hf --data_path ./alpaca_data.json --output_dir ./test_multi_gpu_v1/ --layers "8;19" --rank 4 --position "f1+l1" --num_train_epochs 2 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "no" --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --max_n_train_example 10000

Output:
ram1998-job-4376567.txt
Wandb Report:
https://api.wandb.ai/links/ramvenkat98/tgsh7ru8

Testing

Compared the outputs of the original model, the single-GPU REFT-trained model, and the multi-GPU REFT-trained model. Both trained models give reasonable answers (note: the attachment is a Bento notebook exported as an HTML file, though the extension is .txt):
local_test_output.txt
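
For reference, a comparison along these lines could look roughly like the sketch below. It follows the loading and generation pattern from the pyreft README (`ReftModel.load`, `set_device`, and `generate` with explicit `unit_locations`); the checkpoint path, prompt, and single-position unit location are illustrative assumptions rather than this PR's notebook. The Alpaca script intervenes at the first and last prompt positions across two layers, so the real unit locations would differ.

```python
# Rough sketch: compare the base model with a saved REFT model.
# The pyreft calls follow the README pattern; paths and the single-position
# unit location are assumptions, not the PR's actual test code.
import torch
import transformers
import pyreft

device = "cuda"
model_name = "yahma/llama-7b-hf"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map=device
)

prompt = tokenizer("Give three tips for staying healthy.", return_tensors="pt").to(device)

# 1) Original (untrained) model output.
base_out = model.generate(**prompt, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(base_out[0], skip_special_tokens=True))

# 2) REFT-trained model output, loading a saved intervention directory (assumed path).
reft_model = pyreft.ReftModel.load("./test_multi_gpu_v1", model)
reft_model.set_device(device)
last_pos = prompt["input_ids"].shape[-1] - 1  # intervene on the last prompt token (simplified)
_, reft_out = reft_model.generate(
    prompt,
    unit_locations={"sources->base": (None, [[[last_pos]]])},
    intervene_on_prompt=True,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(reft_out[0], skip_special_tokens=True))
```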
