Hello,
I am trying to evaluate LLaVA OneVision 72B, but I'm finding I need tensor parallelism to fit it in memory. However, when I do, evaluating on datasets (e.g., MLVU) takes 90+ hours on 4 A100s.
Can this be sped up with a multi-node setup, using `torchrun --nproc_per_node=1 --nnodes=64`, so that I can split the data between 64 nodes (each with 2-4 A100s), with tensor parallelism within each node and data parallelism across nodes?
Best,
Orr
Right now, I am seeing it take 45 hours on H100s and 90 hours on A100s... way too long for a 72B model, no?
Yeah, and I think most of the time goes to video reading rather than the actual inference. You can confirm this by watching GPU utilization: it stays low for much of the run.
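If video decoding is the bottleneck, one way to hide it is to decode in background worker processes so the GPU stays busy. A minimal sketch with a PyTorch DataLoader, where `video_paths`, `load_video_frames`, and `run_inference` are placeholders (not lmms-eval APIs):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class VideoDataset(Dataset):
    def __init__(self, video_paths):
        self.video_paths = video_paths

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        # Hypothetical decode helper; in practice this would be decord/av/opencv.
        frames = load_video_frames(self.video_paths[idx])  # (T, C, H, W) tensor
        return frames

loader = DataLoader(
    VideoDataset(video_paths),
    batch_size=1,
    num_workers=8,       # decode videos in background processes
    prefetch_factor=2,   # keep a small queue of decoded clips ready
    pin_memory=True,
)

with torch.inference_mode():
    for frames in loader:
        # Placeholder for the actual per-sample eval step.
        outputs = run_inference(frames.cuda(non_blocking=True))
```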
Couldn't we figure out how to do DDP between nodes and TP inside each node?
I think it could be possible using sglang srt, but we haven't really tested it; we haven't even tested the multi-node case for a stable release.
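For reference, a rough sketch of the "split the model inside each node, split the data across nodes" idea, assuming one torchrun process per node (`torchrun --nproc_per_node=1 --nnodes=64 ...`). `load_llava_onevision_72b` and `all_samples` are placeholders; this is not how lmms-eval is wired up today:

```python
import torch.distributed as dist

# One process per node; gloo is enough since we only exchange Python objects.
dist.init_process_group(backend="gloo")
node_rank = dist.get_rank()
num_nodes = dist.get_world_size()

# Data parallelism across nodes: node i takes every num_nodes-th sample.
my_samples = all_samples[node_rank::num_nodes]

# Inside the node, split the 72B weights over the local GPUs.
# device_map="auto" (accelerate/transformers) is layer-wise model parallelism,
# not true tensor parallelism; TP would come from sglang/vLLM instead.
model = load_llava_onevision_72b(device_map="auto")

results = [model.generate(s) for s in my_samples]  # placeholder eval loop

# Gather per-node results on rank 0 and merge them for final scoring.
gathered = [None] * num_nodes
dist.gather_object(results, gathered if node_rank == 0 else None, dst=0)
```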