Add example Multi-GPU training script using pyreft #143
Context
Create a version of the example Alpaca training script that supports training on multiple GPUs.
Implementation
The main changes from the original script are the distributed training set-up, the data sampler, logging, and saving/loading. This multi-GPU script was also used as a reference.
Distributed training uses DDP (DistributedDataParallel), and the script is launched with torchrun.
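For illustration, here is a minimal, generic sketch of those four kinds of changes (distributed set-up, sharded data sampler, rank-guarded logging, rank-guarded saving). It is not the PR's script: the model, dataset, file names, and the torchrun invocation in the comment are placeholders.

```python
# Generic DDP sketch of the changes described above; not the actual Alpaca/REFT script.
# Launched with torchrun, e.g.: torchrun --nproc_per_node=2 train_ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # 1. Distributed set-up: torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    is_main = dist.get_rank() == 0

    # Placeholder model and synthetic data standing in for the Alpaca/REFT set-up.
    model = torch.nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))

    # 2. Data sampler change: each rank sees a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # 3. Logging change: only rank 0 reports (e.g. to wandb) to avoid duplicate logs.
        if is_main:
            print(f"epoch {epoch} loss {loss.item():.4f}")

    # 4. Saving change: only rank 0 writes the checkpoint; unwrap the DDP module first.
    if is_main:
        torch.save(model.module.state_dict(), "checkpoint.pt")
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```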
Training
Check that the existing training script works after the changes
Command:
Output:
ram1998-job-3144778.txt
Wandb Report:
https://api.wandb.ai/links/ramvenkat98/5vqqkau8
Loss curves look reasonable, and training completes successfully.
Check that the new script works as expected
Command:
Output:
ram1998-job-4376567.txt
Wandb Report:
https://api.wandb.ai/links/ramvenkat98/tgsh7ru8
Testing
Compared the outputs of the original model, the single-GPU REFT-trained model, and the multi-GPU REFT-trained model. The two trained models give reasonable answers (note: this is a Bento notebook exported and saved as an HTML file, though the extension is .txt):
local_test_output.txt
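For reference, a rough sketch of how such a comparison could be reproduced locally: generate from the base model and from a saved REFT model on the same prompt. The base model name, paths, prompt, and the `unit_locations` pattern are assumptions (the REFT loading/generation calls follow the pyreft README's single last-position example), not values taken from this PR.

```python
# Hypothetical comparison of base vs. REFT-trained model outputs; paths are placeholders.
import torch, transformers, pyreft

device = "cuda"
model_name_or_path = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.bfloat16, device_map=device)

prompt = tokenizer("Give three tips for staying healthy.", return_tensors="pt").to(device)

# Output of the original (untrained) model.
base_out = model.generate(**prompt, max_new_tokens=256, do_sample=False)
print("base:", tokenizer.decode(base_out[0], skip_special_tokens=True))

# Output of a saved REFT model; "./reft_multigpu" is a placeholder directory.
reft_model = pyreft.ReftModel.load("./reft_multigpu", model)
reft_model.set_device(device)
last_pos = prompt["input_ids"].shape[-1] - 1  # intervene on the last prompt position
_, reft_out = reft_model.generate(
    prompt,
    unit_locations={"sources->base": (None, [[[last_pos]]])},
    intervene_on_prompt=True,
    max_new_tokens=256,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
)
print("reft:", tokenizer.decode(reft_out[0], skip_special_tokens=True))
```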