
Memory issue when running multi-GPU inference on 4× NVIDIA A10G with 23 GB VRAM using xDiT #116

Open
dibyajitquilt opened this issue Dec 11, 2024 · 3 comments

Comments

@dibyajitquilt

I am having an issue using multiple GPUs for inference. I have 4× NVIDIA A10G with 23 GB VRAM each, but loading the model still fails with the logs below:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video.py FAILED

Failures:
[1]:
time : 2024-12-11_10:10:29
host : ip-172-31-2-204.us-east-2.compute.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 26222)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-12-11_10:10:29
host : ip-172-31-2-204.us-east-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 26223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-11_10:10:29
host : ip-172-31-2-204.us-east-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 26221)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The command I am using is as follows:
torchrun --nproc_per_node=4 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --use-cpu-offload \
    --ulysses-degree 4 \
    --ring-degree 1 \
    --save-path ./results

I have tried all combinations of ulysses-degree × ring-degree.
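For context on why 23 GB per card can still run out of memory at this setting, here is a rough back-of-envelope token count for a 1280×720, 129-frame generation. All constants (8× spatial / 4× temporal VAE compression, 2×2 patchify in the DiT) are assumptions typical of HunyuanVideo-style models, not values confirmed anywhere in this thread:

```python
def latent_seq_len(width, height, frames,
                   spatial_ds=8, temporal_ds=4, patch=2):
    """Estimate the DiT token count for one video (assumed constants)."""
    lat_w = width // spatial_ds        # 1280 -> 160 latent pixels
    lat_h = height // spatial_ds       # 720  -> 90 latent pixels
    lat_t = frames // temporal_ds + 1  # 129  -> 33 latent frames
    # 2x2 spatial patchify turns latent pixels into transformer tokens
    return (lat_w // patch) * (lat_h // patch) * lat_t

tokens = latent_seq_len(1280, 720, 129)
print(tokens)  # 80 * 45 * 33 = 118800 tokens
```

Even with `--ulysses-degree 4` splitting the sequence four ways, each GPU still processes on the order of ~30k tokens through every transformer layer, so activations plus weights can exceed 23 GB; reducing `--video-size` or `--video-length` shrinks this count quadratically / linearly respectively.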
@dibyajitquilt dibyajitquilt changed the title Memory issue on GPU with 4 NVIDIA A10G with 23 GB VRAM using xDiT Memory issue when running multi-GPU inference on 4× NVIDIA A10G with 23 GB VRAM using xDiT Dec 11, 2024
@os-sos

os-sos commented Dec 14, 2024

I'm interested in doing something similar. Are you using NVLink between these cards? Do you know if shared memory can work?


@ftaibi

ftaibi commented Dec 14, 2024

Same issue.
