I am having an issue using multiple GPUs for inference. I have 4x NVIDIA A10G GPUs with 23 GB of VRAM each, but loading the model still fails with the logs below:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
The command I am using is as follows:
torchrun --nproc_per_node=4 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --use-cpu-offload \
    --ulysses-degree 4 \
    --ring-degree 1 \
    --save-path ./results
I have tried all combinations of ulysses-degree x ring-degree.
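For reference, the degree combinations mentioned above can be enumerated with a small helper. This is only a sketch: it assumes the product of `--ulysses-degree` and `--ring-degree` must equal the number of processes started by `torchrun` (4 here), which this report does not state explicitly. `valid_degree_pairs` is a hypothetical name, not part of `sample_video.py`.

```python
def valid_degree_pairs(num_gpus: int) -> list[tuple[int, int]]:
    """Return (ulysses_degree, ring_degree) pairs whose product is num_gpus.

    Assumption: the two degrees must multiply to the process count.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be a positive integer")
    return [
        (ulysses, num_gpus // ulysses)
        for ulysses in range(1, num_gpus + 1)
        if num_gpus % ulysses == 0
    ]


if __name__ == "__main__":
    # Print the flag pairs to try with --nproc_per_node=4.
    for ulysses, ring in valid_degree_pairs(4):
        print(f"--ulysses-degree {ulysses} --ring-degree {ring}")
```

Under that assumption, four GPUs allow three pairs, so "all combinations" here would mean three separate runs.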
dibyajitquilt changed the title from "Memory issue on GPU with 4 NVIDIA A10G with 23 GB VRAM using xDiT" to "Memory issue when running multi-GPU inference on 4x NVIDIA A10G with 23 GB VRAM using xDiT" on Dec 11, 2024
Full failure report from torchrun:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
sample_video.py FAILED
Failures:
[1]:
time : 2024-12-11_10:10:29
host : ip-172-31-2-204.us-east-2.compute.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 26222)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-12-11_10:10:29
host : ip-172-31-2-204.us-east-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 26223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-12-11_10:10:29
host : ip-172-31-2-204.us-east-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 26221)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
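The report above shows `error_file: <N/A>` and no traceback for any rank, so the actual exception is not visible. The PyTorch page linked in the log describes decorating the script's entry point with `record` so that the worker's exception is captured in the torchrun summary. Below is a minimal sketch of that pattern. The `main` function is a hypothetical stand-in for the body of `sample_video.py`, and whether that script already uses `record` is not known from this report.

```python
import sys

try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback so the sketch still runs where PyTorch is not installed.
    def record(fn):
        return fn


@record
def main(argv=None):
    """Hypothetical entry point standing in for sample_video.py."""
    args = list(sys.argv[1:] if argv is None else argv)
    # Model loading and sampling would go here. An uncaught exception raised
    # inside this function is what @record captures for the error report.
    return args


if __name__ == "__main__":
    main()
```

With the entry point wrapped this way, rerunning the same `torchrun` command should replace the `traceback` placeholder for the first failing rank with the real error, which would confirm whether the failure is an out-of-memory error.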