
Context parallelism understanding #723

Open
jinsong-mao opened this issue Dec 9, 2024 · 4 comments
Labels: question (Further information is requested)

@jinsong-mao commented:
Hi

We have recently been testing the context parallelism (CP) strategy in a 2D configuration: FSDP + CP.
From what we understand, CP shards the sequence dimension, while the attention kernel still needs to compute attention over the whole sequence, which means each GPU has to gather the sharded K/V from the other CP ranks using some collective communication kernels.
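
To make the expectation concrete, here is a rough sketch of the all-gather-based CP attention we had in mind (our own illustration, not torchtitan's actual implementation; causal masking and load balancing are omitted):

```python
# Conceptual sketch only: every CP rank holds a slice of the sequence and
# all-gathers the K/V shards before computing attention for its local queries.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cp_attention(q_local, k_local, v_local, cp_group):
    # Each tensor is [batch, heads, seq_len // cp_degree, head_dim].
    cp_degree = dist.get_world_size(cp_group)
    k_shards = [torch.empty_like(k_local) for _ in range(cp_degree)]
    v_shards = [torch.empty_like(v_local) for _ in range(cp_degree)]
    # These are the CP-specific collectives we expected to see in the profile.
    dist.all_gather(k_shards, k_local, group=cp_group)
    dist.all_gather(v_shards, v_local, group=cp_group)
    k_full = torch.cat(k_shards, dim=2)
    v_full = torch.cat(v_shards, dim=2)
    # Each rank attends its local queries over the full-sequence K/V.
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
```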

However, we didn't see any such kernels in the trace; we only found the all-gather for parameters in the pre-forward phase.
[screenshot: profiler trace]

Is there anything we misunderstood? Please share your comments to help us understand better.

Thanks.

@tianyu-l added the "question" (Further information is requested) label on Dec 9, 2024
@tianyu-l (Contributor) commented Dec 9, 2024

cc: @XilunWu @fegin

@fegin (Contributor) commented Dec 9, 2024

[screenshot: profiler trace with the CP-issued all-gather highlighted]

This is the trace with

CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.data_parallel_replicate_degree=1 --training.data_parallel_shard_degree=2 --experimental.context_parallel_degree=4

The selected all-gather is the one issued by CP, which is not the same as the one issued by FSDP. I'm wondering what command you used?

@jinsong-mao (Author) commented:

@fegin Thanks for your feedback.
I was running the experiments on an AMD platform; I suppose you were using NVIDIA GPUs.
My command is almost the same as yours, but I can only find the all-gather stream issued by FSDP, and no communication streams issued by CP.
Because of the hardware difference, maybe the kernel issued by CP was not executed or was not captured by the profiler.

I'm not sure why there is no such trace on the AMD GPU.

Thanks.

@fegin (Contributor) commented Dec 10, 2024

I see. My best guess is that, because of the hardware, SDPA doesn't dispatch to the kernel that traps into the corresponding CP implementation. We currently only support the Flash attention and memory-efficient attention kernels; the math attention kernel is not supported. https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
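
One way to check which SDPA backends are actually usable on your build (a rough diagnostic sketch, assuming PyTorch 2.3+ where the torch.nn.attention context manager is available) is to force each backend explicitly:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes/dtype are arbitrary; adjust to something representative of your model.
q = torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION):
    try:
        # Restrict SDPA to a single backend; it errors out if no kernel qualifies.
        with sdpa_kernel(backend):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        print(f"{backend}: available")
    except RuntimeError as err:
        print(f"{backend}: NOT available ({err})")
```

If neither backend is usable and SDPA falls back to the math kernel, that would explain why the CP path is never taken and no CP collectives show up in your trace.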
