When running the FSDP sample app, training fails at the model evaluation step with the following error message.
Because --validation_freq=500 is specified, it fails at the 500th batch. Lowering this value makes the error appear sooner.
This error was reproduced at least on a HyperPod Slurm cluster with p5 x 4.
3: [rank3]: Traceback (most recent call last):
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 284, in <module>
3: [rank3]: main(args)
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 269, in main
3: [rank3]: train(model,
3: [rank3]: ^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 115, in train
3: [rank3]: val_loss, val_ppl = eval_model(
3: [rank3]: ^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 54, in eval_model
3: [rank3]: for batch_idx, input_data in enumerate(dataloader):
3: [rank3]: ^^^^^^^^^^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
3: [rank3]: return self._get_iterator()
3: [rank3]: ^^^^^^^^^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
3: [rank3]: return _MultiProcessingDataLoaderIter(self)
3: [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
3: [rank3]: w.start()
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/process.py", line 121, in start
3: [rank3]: self._popen = self._Popen(self)
3: [rank3]: ^^^^^^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
3: [rank3]: return _default_context.get_context().Process._Popen(process_obj)
3: [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/context.py", line 281, in _Popen
3: [rank3]: return Popen(process_obj)
3: [rank3]: ^^^^^^^^^^^^^^^^^^
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
3: [rank3]: self._launch(process_obj)
3: [rank3]: File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/popen_fork.py", line 66, in _launch
3: [rank3]: self.pid = os.fork()
3: [rank3]: ^^^^^^^^^
3: [rank3]: OSError: [Errno 12] Cannot allocate memory
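The traceback ends in `os.fork()` raising `[Errno 12]` while the evaluation `DataLoader` starts its worker processes, so a snapshot of host memory taken just before evaluation may help narrow down the cause. The following is a minimal diagnostic sketch, not part of the sample app: the helper names `parse_meminfo`, `memory_snapshot`, and `is_fork_out_of_memory` are hypothetical, and it assumes a Linux host that exposes `/proc/meminfo`.

```python
import errno


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_snapshot(path="/proc/meminfo"):
    """Return selected host memory counters, or {} if the file is unavailable."""
    try:
        with open(path) as f:
            info = parse_meminfo(f.read())
    except OSError:
        return {}
    keys = ("MemTotal", "MemAvailable", "CommitLimit", "Committed_AS")
    return {k: info[k] for k in keys if k in info}


def is_fork_out_of_memory(exc):
    """True if exc matches the error in the traceback (OSError with ENOMEM)."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOMEM


if __name__ == "__main__":
    # Hypothetical usage: log this on each rank right before evaluation starts.
    print(memory_snapshot())
```

Printing `memory_snapshot()` on every rank immediately before the evaluation loop, and again inside an `except OSError` handler guarded by `is_fork_out_of_memory`, would show whether available memory is already low at the point where the worker processes are forked.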