FSDP sample fails model validation #460

Open
shimomut opened this issue Oct 17, 2024 · 0 comments
@shimomut
Collaborator

When running the FSDP sample app, it fails during model evaluation with the error message below.
Because --validation_freq=500 is specified, the failure occurs at the 500th batch. Lowering this value makes the error appear sooner.

This error was reproduced at least on a HyperPod Slurm cluster with p5 x 4.

3: [rank3]: Traceback (most recent call last):
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 284, in <module>
3: [rank3]:     main(args)
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 269, in main
3: [rank3]:     train(model, 
3: [rank3]:     ^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 115, in train
3: [rank3]:     val_loss, val_ppl = eval_model(
3: [rank3]:                         ^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 54, in eval_model
3: [rank3]:     for batch_idx, input_data in enumerate(dataloader):
3: [rank3]:                                  ^^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
3: [rank3]:     return self._get_iterator()
3: [rank3]:            ^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
3: [rank3]:     return _MultiProcessingDataLoaderIter(self)
3: [rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
3: [rank3]:     w.start()
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/process.py", line 121, in start
3: [rank3]:     self._popen = self._Popen(self)
3: [rank3]:                   ^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
3: [rank3]:     return _default_context.get_context().Process._Popen(process_obj)
3: [rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/context.py", line 281, in _Popen
3: [rank3]:     return Popen(process_obj)
3: [rank3]:            ^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
3: [rank3]:     self._launch(process_obj)
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/popen_fork.py", line 66, in _launch
3: [rank3]:     self.pid = os.fork()
3: [rank3]:                ^^^^^^^^^
3: [rank3]: OSError: [Errno 12] Cannot allocate memory
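
The traceback ends in `os.fork()`, which the multiprocessing `fork` start method calls each time a DataLoader worker process is started; errno 12 corresponds to `ENOMEM`. As a rough way to check whether a node can fork at all from a given process, the following standard-library sketch may help. It is not part of `train.py`: the names `try_start_worker` and `_noop` are hypothetical, and a small probe process can succeed where a large training process fails, so a passing result does not rule out the problem.

```python
import errno
import multiprocessing as mp


def _noop():
    """Hypothetical stand-in for a DataLoader worker loop; does nothing."""


def try_start_worker(start_method="fork"):
    """Start one child process and report whether the OS allowed it.

    Returns None on success, or the errno name (for example "ENOMEM")
    if starting the process raised OSError.
    """
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=_noop)
    try:
        # With the "fork" start method, os.fork() is called inside start().
        proc.start()
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    proc.join()
    return None


if __name__ == "__main__":
    print("default start method:", mp.get_start_method())
    print("fork probe result:", try_start_worker("fork"))
```

Running the probe from inside the training process just before evaluation (an assumption about where it would be most informative) would exercise the same `fork` path that fails in the traceback.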