
GEMM failure of TensorRT 10.X when running Segment Anything Model on GPU 3090 #3969

Open
summelon opened this issue Jun 26, 2024 · 4 comments

Comments

@summelon

Description

I observed a significant difference in GEMM output between ONNX (opset 18 + ORT 1.18.0 + CPU) and TensorRT (10.0.1) results.
[screenshot: polygraphy comparison output]

This only happens when the batch size of image_embeddings == 1 and the TRT version is >= 10.

With either (TRT 8.6.3 and any batch size) or (TRT >= 10 and batch size > 1), the issue does not occur.
[screenshot: polygraphy comparison output]

I found that the TensorRT 10.1.0 release notes mention a known issue: "There is a known accuracy issue when the network contains two consecutive GEMV operations (that is, MatrixMultiply with gemmM or gemmN == 1). To workaround this issue, try padding the MatrixMultiply input to have dimensions greater than 1."
So I guess the fusion strategy differs among:

  • trt ver == 8.6.3, bs == any (acceptable diff.)
  • trt ver == 10.0.1, bs == 1 (significant diff.)
  • trt ver == 10.0.1, bs > 1 (acceptable diff.)
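The padding workaround quoted from the release notes can be sketched in NumPy (a toy illustration, not TensorRT itself; the shapes and weights here are made up and unrelated to SAM): pad the M == 1 side before the two consecutive matmuls so neither degenerates to a GEMV, then slice the padding back off.

```python
import numpy as np

def padded_gemv_chain(x, w1, w2):
    """Two consecutive matmuls where x has M == 1 (a GEMV chain).

    Workaround sketch: pad the M dimension to 2 so the compiler would
    see regular GEMMs, then slice the padding row off afterwards.
    """
    x_padded = np.concatenate([x, np.zeros_like(x)], axis=0)  # (2, K)
    y = (x_padded @ w1) @ w2                                  # (2, N)
    return y[:1]                                              # back to (1, N)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 256)).astype(np.float32)
w1 = rng.standard_normal((256, 128)).astype(np.float32)
w2 = rng.standard_normal((128, 64)).astype(np.float32)

ref = (x @ w1) @ w2
out = padded_gemv_chain(x, w1, w2)
assert out.shape == (1, 64)
assert np.allclose(ref, out)
```

The zero padding row cannot affect the first output row, so the sliced result is numerically identical to the unpadded chain; in an ONNX graph the equivalent would be a Concat/Pad before the first MatMul and a Slice after the second.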

I used trex to visualize each converted engine:
[screenshot: trex engine graphs]
It seems that the Myelin compiler applies different optimizations in the aforementioned situations:

  • trt ver == 8.6.3, bs == any: a fused Myelin node not seen in the other cases
  • trt ver == 10.0.1, bs == 1: the two consecutive GEMMs are fused into one kgen node
  • trt ver == 10.0.1, bs > 1: a separate kgen node for each GEMM

My questions are:

  1. What's the recommended TRT version to avoid this issue for now: 8.6.3, or something higher? (I'm not sure why the NGC Docker image jumped directly from 8.6.3 to 10.0.1.)
  2. When will this issue be fixed within TRT major version 10?
  3. Possibly unrelated: what's the difference between a kgen node and a Myelin node?

Thanks in advance.

Environment

PyTorch docker image 24.05 from NGC

TensorRT Version: TensorRT 10.0.1.6

NVIDIA GPU: NVIDIA GeForce RTX 3090

NVIDIA Driver Version: 555.42.02

CUDA Version: 12.4.1

CUDNN Version: 9.1.0.70

Operating System: Ubuntu 22.04.4 LTS

Python Version: 3.10.12

Tensorflow Version: N/A

PyTorch Version: 2.4.0a0+07cecf4168.nv24.05

Baremetal or Container: nvcr.io/nvidia/pytorch:24.05-py3

Relevant Files

Model link:
I think you can reproduce the issue with any SAM decoder.
The ONNX exported from here may work: SAM ONNX from AnyLabeling

Steps To Reproduce

Commands or scripts:

# Export model to trt engine
trtexec \
    --onnx="${onnx_decoder_path}" \
    --minShapes=point_coords:1x1x1x2,point_labels:1x1x1x1,image_embeddings:1x256x64x64 \
    --optShapes=point_coords:1x1x3x2,point_labels:1x1x3x1,image_embeddings:1x256x64x64 \
    --maxShapes=point_coords:1x1x3x2,point_labels:1x1x3x1,image_embeddings:1x256x64x64 \
    --saveEngine="${trt_decoder_path}" \
    --exportProfile="${model_root}/${model}_decoder_profile.json" \
    --exportLayerInfo="${model_root}/${model}_decoder_graph.json" \
    --profilingVerbosity=detailed

# Compare
polygraphy run ${model_root}/${model}_decoder.onnx --trt --onnxrt --input-shapes image_embeddings:[1,256,64,64]

# Visualize engine
trex draw ${model_root}/${model}_decoder_graph.json
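The comparison that polygraphy performs in the step above boils down to an elementwise tolerance check. A hand-rolled sketch (the arrays `a` and `b` stand in for the ONNX Runtime and TensorRT outputs; the tolerance values are illustrative, not polygraphy's exact defaults):

```python
import numpy as np

def compare_outputs(out_a, out_b, atol=1e-3, rtol=1e-3):
    """Report max absolute/relative error and a pass flag, similar in
    spirit to polygraphy's default output comparison."""
    abs_err = np.abs(out_a - out_b)
    rel_err = abs_err / (np.abs(out_b) + 1e-12)  # guard divide-by-zero
    passed = bool(np.all(abs_err <= atol + rtol * np.abs(out_b)))
    return abs_err.max(), rel_err.max(), passed

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([1.0, 2.0, 3.5], dtype=np.float32)
max_abs, max_rel, ok = compare_outputs(a, b)
# the 0.5 absolute error on the last element fails the tolerances
```

Dumping both runners' raw outputs (e.g. with polygraphy's `--save-outputs`) and running a check like this makes it easy to see whether the mismatch is concentrated in a few elements or spread across the whole tensor.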

Have you tried the latest release?:
No, as this is mentioned as a known issue in the latest release notes:
[screenshot: release notes excerpt]

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes

@lix19937

I think TRT 9.x was a transition version, mainly for LLMs.

A kgen node is also part of the Myelin compilation result.

@summelon
Author

summelon commented Jul 1, 2024

Thanks for your reply!
So version 8.6.3 is the latest stable version for vision models before major version 10.
Do you know which pull request addresses the GEMM error in version 10?

@lix19937

lix19937 commented Jul 1, 2024

Try adding --builderOptimizationLevel=5 for both versions, @summelon.

@summelon
Author

summelon commented Jul 1, 2024

Hi @lix19937, I tried polygraphy run decoder.onnx --trt --onnxrt --input-shapes image_embeddings:[1,256,64,64] with and without --builder-optimization-level 5. The difference did not change and is still significant on 10.0.1.
