Replies: 1 comment
I've discovered an additional issue: even if you run with the `TORCH_BLAS_PREFER_HIPBLASLT=0` workaround, it still fails with the same errors when using FP8 quantization. Here, 3/8 of the threads fail to load. Presumably the FP8 quantization kernels require hipBLASLt and so can't run with the current bug? BTW, surprisingly when testing at …
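For context, an FP8 run of this sort might look like the following. This is only a sketch assuming the standard `vllm serve` entrypoint; the model name and flag values are placeholders, not the exact command from the report:

```bash
# Sketch only (placeholder model/flags): the combination described above,
# i.e. hipBLASLt disabled via the workaround, FP8 quantization, tensor parallel 8.
TORCH_BLAS_PREFER_HIPBLASLT=0 \
    vllm serve <model> \
    --quantization fp8 \
    --tensor-parallel-size 8
```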
FYI, I've filed a curious bug I've encountered with PyTorch; I at least want to mention it here (without creating a duplicate issue yet, unless they triage it as not their problem): pytorch/pytorch#137695
Basically, running the latest vLLM (HEAD) and the PyTorch nightly it depends on, hipBLASLt is used by default and works for `-tp 1` through `-tp 4`, but at `-tp 8` it consistently starts to report errors loading `TensileLibrary_lazy_gfx942.dat`. The workaround is to set `TORCH_BLAS_PREFER_HIPBLASLT=0`, and at least for tp 1-4 this is slightly faster anyway in my vLLM `benchmark_throughput` testing.

Leaving this here to potentially save some people some hair-pulling, as it took me a while to debug (since it works at lower tp values).
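A minimal sketch of the workaround, assuming the standard `vllm serve` entrypoint (the model and any other flags are placeholders, not the exact command used here):

```bash
# Workaround sketch: setting TORCH_BLAS_PREFER_HIPBLASLT=0 makes PyTorch prefer
# rocBLAS over hipBLASLt, avoiding the TensileLibrary_lazy_gfx942.dat load
# failure seen at -tp 8. Model and flags are placeholders.
TORCH_BLAS_PREFER_HIPBLASLT=0 \
    vllm serve <model> \
    --tensor-parallel-size 8
```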