
[doc][c10d] fixup fsdp tutorial #1297

Merged: 1 commit merged into main from chirag/fix-fsdp-tutorial on Nov 8, 2024
Conversation

c-p-i-o (Contributor) commented Oct 31, 2024

Summary:
Fix up the FSDP tutorial to get it functional again.

  1. Add the missing import for `load_dataset`.
  2. Use `checkpoint` instead of `_shard.checkpoint` to get rid of a warning.
  3. Add `nlp` to `requirements.txt`.
  4. Get rid of `load_metric`, as this function no longer exists in the new `datasets` module.
  5. Add `legacy=False` to get rid of tokenizer warnings.
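Since `datasets.load_metric` no longer exists (item 4), a metric has to come from elsewhere, e.g. the separate `evaluate` package or a hand-rolled helper. A minimal stand-in for an accuracy metric, for illustration only (this helper is not part of the PR):

```python
# Hand-rolled replacement for the removed datasets.load_metric("accuracy").
# Illustrative only; the separate `evaluate` package is another option.
def accuracy(predictions, references):
    """Return the fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```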

Test Plan:
Ran the tutorial as follows and ensured that it ran successfully:

```
torchrun --nnodes=1 --nproc_per_node=2 T5_training.py
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
*****************************************
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] Setting
OMP_NUM_THREADS environment variable for each process to be 1 in
default, to avoid your system being overloaded, please further tune the
variable for optimal performance in your application as needed.
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
*****************************************
dict_keys(['train', 'validation', 'test'])
Size of train dataset:  (157252, 3)
Size of Validation dataset:  (5599, 3)
dict_keys(['train', 'validation', 'test'])
Size of train dataset:  (157252, 3)
Size of Validation dataset:  (5599, 3)
bFloat16 enabled for mixed precision - using bfSixteen policy
```

netlify bot commented Oct 31, 2024

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: cb00288
Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/672a8d974a588a00083764e1

fduwjj (Contributor) commented Oct 31, 2024

looks like running python example failed?

c-p-i-o (Contributor, Author) commented Oct 31, 2024

> looks like running python example failed?

Unrelated to my change - but I fixed it anyway. Needed to update to a newer Python version in CI.

See the additional diff I made to .github/workflows/main_python.yml. Let me know if you want me to split this out into a separate change.

@c-p-i-o c-p-i-o force-pushed the chirag/fix-fsdp-tutorial branch 2 times, most recently from 752efec to 0834097 Compare November 1, 2024 00:18
```diff
@@ -65,7 +65,7 @@ def load_model_sharded(model, rank, cfg, verbose=True):
         if rank == 0:
             ck = checkpoint.keys()
             print(f" checkpoint key len = {len(ck)} and \n keys = {ck}")

         dist_cp.load_state_dict(
```
Contributor (review comment on the diff):

The DCP usage is pretty outdated. Should we also update them?

c-p-i-o (Contributor, Author) replied:

I will update this in a subsequent change - if that's ok with you?
This change is already too large as I am fixing up the python tests that broke.
The breakage is unrelated to this change.

Contributor replied:

Sure

c-p-i-o (Contributor, Author) commented Nov 1, 2024

@fduwjj - CI is green now.
As mentioned, CI broke because of upstream dependency changes, and I had to do three things to fix Run Python Examples:

  1. Use a newer Python in `.github/workflows`.
  2. Pin `numpy` below version 2.
  3. Pin `torchvision`.
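The `numpy` pin in item 2 amounts to a requirements-file constraint like the following (the exact `torchvision` pin is in the PR diff, so it is not reproduced here):

```
numpy<2  # NumPy 2.x broke the examples' CI; stay on 1.x
```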

c-p-i-o (Contributor, Author) commented Nov 5, 2024

This change will be rebased on #1299 to fix the failing Python Examples.

c-p-i-o merged commit 1bef748 into main on Nov 8, 2024
8 checks passed