Context parallelism understanding #723
@fegin Thanks for your feedback. I'm not sure why there is no such trace on AMD GPUs. Thanks.
I see. My best guess is that, because of the hardware, SDPA doesn't dispatch to the right kernel, and therefore not to the right CP implementation. We currently only support the Flash Attention and Memory-Efficient Attention kernels; the Math attention kernel is not supported. https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
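As a quick way to check which backend SDPA actually runs on a given GPU, here is a minimal sketch (shapes, dtype, and device are placeholders, not taken from this thread). Restricting SDPA to the flash and memory-efficient backends makes it error out instead of silently falling back to the math kernel:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Placeholder tensors: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Allow only the backends that CP supports (flash / memory-efficient).
# If neither can run on this hardware, SDPA raises instead of quietly
# dispatching to the unsupported math kernel.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```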
Hi,
We have recently been testing the CP parallelism strategy in a 2D configuration: FSDP + CP.
As we understand it, CP shards along the sequence dimension, and since the attention kernel needs to compute attention over the whole sequence, each GPU has to gather the sharded K/V from the other ranks using some collective communication kernels.
However, we didn't see any such kernels; we only found the all-gather for parameters in the pre-forward phase.
Is there anything we misunderstood? Please add your comments for better understanding; see the toy sketch below for the kind of communication we expected to find in the trace.
Thanks.
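For intuition only, here is a toy, non-optimized sketch of the communication pattern CP-style attention implies, assuming an already-initialized process group. The names are illustrative and this is not torchtitan's actual implementation (which uses ring-style exchange inside the SDPA kernels); the point is that the K/V collectives are issued inside the attention forward, so in a profiler trace they appear interleaved with the attention kernels rather than as a separate pre-forward all-gather like FSDP's parameter all-gather:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cp_attention_allgather(q_local, k_local, v_local, cp_group):
    """Toy context-parallel attention: each rank holds a slice of the
    sequence; K/V are gathered from all ranks, then the local queries
    attend over the full sequence."""
    world = dist.get_world_size(cp_group)

    # These collectives run inside the attention forward, which is why
    # they show up next to the SDPA kernels in a trace.
    k_shards = [torch.empty_like(k_local) for _ in range(world)]
    v_shards = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_shards, k_local, group=cp_group)
    dist.all_gather(v_shards, v_local, group=cp_group)

    k_full = torch.cat(k_shards, dim=-2)  # concatenate along the sequence dim
    v_full = torch.cat(v_shards, dim=-2)

    # Local queries attend over the gathered keys/values.
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
```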