
Attention regression in ToM compared to MLPerf branch #107

MaheshRavishankar opened this issue Oct 15, 2024 · 1 comment
MaheshRavishankar commented Oct 15, 2024

For reproduction:

Input Model:
https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet.mlir

Input data:
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.0.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.1.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.2.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.3.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.4.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/inference_input.5.bin
wget https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-punet/punet_weights.irpa

I built IREE on main and used the TD script in https://github.com/nod-ai/sdxl-scripts/blob/shared/sdxl_on_main/int8-model/specs/attention_and_matmul_spec.mlir
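The commands below reference a few shell variables; a minimal setup sketch (paths are illustrative, adjust to wherever you downloaded the artifacts):

export PUNET_MODEL=$PWD/punet.mlir                    # model from the URL above
export TD_SPEC=$PWD/attention_and_matmul_spec.mlir    # TD script from the URL above
export VMFB=$PWD/punet.vmfb                           # compiled output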

Compilation command for IREE on main:

iree-compile \
    --iree-execution-model=async-external \
    --iree-hal-target-backends=rocm \
    --iree-hip-target=gfx942 \
    --iree-hip-waves-per-eu=2 \
    --iree-codegen-gpu-native-math-precision=true \
    --iree-codegen-llvmgpu-use-vector-distribution \
    --iree-codegen-transform-dialect-library=${TD_SPEC} \
    --iree-dispatch-creation-enable-aggressive-fusion=true \
    --iree-global-opt-propagate-transposes=true \
    --iree-llvmgpu-enable-prefetch=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-opt-const-eval=false \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-data-tiling=false \
    --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline,  iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
    --iree-vm-target-truncate-unsupported-floats \
    ${PUNET_MODEL} \
    -o ${VMFB}

Run command:

iree-benchmark-module \
    --device=hip:0 \
    --device_allocator=caching \
    --function=main \
    --hip_allow_inline_execution=true \
    --hip_use_stream=true \
    --input=1x4x128x128xf16=@inference_input.0.bin \
    --input=1xf16=@inference_input.1.bin \
    --input=2x64x2048xf16=@inference_input.2.bin \
    --input=2x1280xf16=@inference_input.3.bin \
    --input=2x6xf16=@inference_input.4.bin \
    --input=1xf16=@inference_input.5.bin \
    --module=${VMFB} \
    --parameters=model=punet_weights.irpa 

For compilation on the MLPerf branch I used the same inputs/weights, but with:
IREE commit: https://github.com/iree-org/iree/tree/mlperf_v4.1_20240726
TD script: https://github.com/nod-ai/sdxl-scripts/blob/mlperf_v4.1_20240726/int8-model/specs/attention_and_matmul_spec.mlir
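If you are reproducing this, one way to fetch that spec locally and point ${TD_SPEC} at it (raw URL derived from the blob link above):

wget https://raw.githubusercontent.com/nod-ai/sdxl-scripts/mlperf_v4.1_20240726/int8-model/specs/attention_and_matmul_spec.mlir
export TD_SPEC=$PWD/attention_and_matmul_spec.mlir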

iree-compile \
    --iree-execution-model=async-external \
    --iree-hal-target-backends=rocm \
    --iree-rocm-target-chip=gfx942 \
    --iree-rocm-waves-per-eu=2 \
    --iree-codegen-gpu-native-math-precision=true \
    --iree-codegen-llvmgpu-use-vector-distribution \
    --iree-codegen-transform-dialect-library=${TD_SPEC} \
    --iree-flow-enable-aggressive-fusion=true \
    --iree-global-opt-propagate-transposes=true \
    --iree-llvmgpu-enable-prefetch=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-opt-const-eval=false \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-data-tiling=false \
    --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics), util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
    --iree-vm-target-truncate-unsupported-floats \
    ${PUNET_MODEL} \
    -o ${VMFB}

and the same run command as above.

The following dispatches regress (MLPerf branch -> ToM):

attention_48_*: 41 ms -> 53 ms
attention_146_*: 48 ms -> 56 ms
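For anyone reproducing the per-dispatch timings: a sketch of one possible workflow (an assumption on my part, not necessarily how the numbers above were gathered) is to have iree-compile dump standalone per-dispatch benchmarks, then compile and run the attention ones individually:

# Add to either compile command above; dumps one benchmark .mlir per dispatch.
iree-compile <all flags above> --iree-hal-dump-executable-benchmarks-to=benchmarks/ ${PUNET_MODEL} -o ${VMFB}
# Compile and benchmark a single dumped dispatch (exact file name depends on the dump):
iree-compile benchmarks/<dumped_attention_48_benchmark>.mlir -o attn48.vmfb
iree-benchmark-module --device=hip:0 --module=attn48.vmfb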

Below are the IR dumps from the MLPerf branch and ToM for the two attention dispatches.

sdxl_mlperf_attention_48.dump.mlir.txt
sdxl_mlperf_attention_146.dump.mlir.txt
sdxl_tom_attention_48.dump.mlir.txt
sdxl_tom_attention_146.dump.mlir.txt

raikonenfnu commented Oct 22, 2024

Putting this here for better visibility: these are the last commits we were working on for FP8:
IREE: https://github.com/iree-org/iree/commits/shared/sdxl_fp8_model
SDXL: https://github.com/nod-ai/sdxl-scripts/commits/shared/sdxl_fp8_model
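A minimal checkout sketch for those branches (standard git workflow, nothing repo-specific assumed):

git clone https://github.com/iree-org/iree && git -C iree checkout shared/sdxl_fp8_model
git -C iree submodule update --init
git clone https://github.com/nod-ai/sdxl-scripts && git -C sdxl-scripts checkout shared/sdxl_fp8_model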
