
How to enable FP8 convolution in TensorRT 10.2 #3987

Closed
junstar92 opened this issue Jul 7, 2024 · 12 comments

@junstar92 commented Jul 7, 2024

Hello,

I am using TensorRT 10.2 and noticed that normal FP8 convolution support has been updated.
However, when I try a simple QDQ + Conv model in ONNX, the FP8 convolution is not selected; FP8 tactics are not even timed.

Here is the model I used. It was quantized with TensorRT-Model-Optimizer, and I ran on an H100 device.
[image: the quantized QDQ + Conv ONNX model]

$ trtexec --onnx=simple_conv_fp8.onnx --fp16 --fp8 --profilingVerbosity=detailed --verbose --exportLayerInfo=layerinfo.json
junstar92 changed the title from "[QST] How to enable FP8 convolution in TensorRT 10.2" to "How to enable FP8 convolution in TensorRT 10.2" on Jul 8, 2024
@lix19937 commented Jul 8, 2024

How was this file (simple_conv_fp8.onnx) generated?

@yuanyao-nv (Collaborator)

@junstar92 You might have to add the --stronglyTyped flag as well.
cc: @nvpohanh

@nvpohanh (Collaborator)

Sorry, this is a bug in TRT 10.2. Please enable --stronglyTyped for now.

We will try to fix this issue in TRT 10.3.
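
For example, the earlier build command with the flag added:

$ trtexec --onnx=simple_conv_fp8.onnx --fp16 --fp8 --stronglyTyped --profilingVerbosity=detailed --verbose --exportLayerInfo=layerinfo.json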

@junstar92 (Author)

@nvpohanh Thank you for checking this issue.
With the --stronglyTyped flag, the FP8 tactics are enabled.

But I have another question about FP8 convolution.
I tried to build ResNet18 and ResNet50, but TensorRT cannot find any implementation for the first conv operation of ResNet.
It seems there are no convolution tactics for a conv with 3 input channels.
Does TensorRT 10.2 support full ResNet18 or ResNet50?

Here is the error log.

[07/12/2024-05:24:18] [V] [TRT] =============== Computing costs for {ForeignNode[/fake_quantizer_7c435f9c02917de57484db91f86bbbaf/QuantizeLinear.../fake_quantizer_5d22154b57438817000ba0a1ea6159ca/DequantizeLinear]}
[07/12/2024-05:24:18] [V] [TRT] *************** Autotuning format combination: Float(150528,50176,224,1) -> Float(802816,12544,112,1) ***************
[07/12/2024-05:24:18] [V] [TRT] --------------- Timing Runner: {ForeignNode[/fake_quantizer_7c435f9c02917de57484db91f86bbbaf/QuantizeLinear.../fake_quantizer_5d22154b57438817000ba0a1ea6159ca/DequantizeLinear]} (Myelin[0x80000023])
[07/12/2024-05:24:18] [V] [TRT] [MemUsageChange] Subgraph create: CPU +0, GPU +0, now: CPU 2390, GPU 844 (MiB)
[07/12/2024-05:24:18] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:2061] Autotuner: no tactics to implement operation:
  131: corrltn: /L__self___conv1/Conv_output_0'_before_bias.1-(f32[16,64,112,112][]so[], mem_prop=0) | /fake_quantizer_7c435f9c02917de57484db91f86bbbaf/QuantizeLinear_output_0'.1-(fp8[16,3,224,224][]so[], mem_prop=0), __mye150_dconst-{-2.75, -1.625, -0.46875, 20, 15, 4.5, -3.25, 3, ...}(fp8[64,3,7,7][147,49,7,1]so[3,2,1,0], mem_prop=0)<entry>, __mye126-1.10786e-05F:(f32[][]so[], mem_prop=0)<entry>, __mye91/L__self___conv1/Conv_beta-0F:(f32[][]so[], mem_prop=0)<entry>, stream = 0 // __mye130_conv
         | n_groups: 1  lpad: {3, 3}  rpad: {3, 3}  pad_mode: 0 strides: {2, 2}  dilations: {1, 1}
[07/12/2024-05:24:18] [V] [TRT] {ForeignNode[/fake_quantizer_7c435f9c02917de57484db91f86bbbaf/QuantizeLinear.../fake_quantizer_5d22154b57438817000ba0a1ea6159ca/DequantizeLinear]} (Myelin[0x80000023]) profiling completed in 0.118439 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[07/12/2024-05:24:18] [E] Error[10]: IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/fake_quantizer_7c435f9c02917de57484db91f86bbbaf/QuantizeLinear.../fake_quantizer_5d22154b57438817000ba0a1ea6159ca/DequantizeLinear]}.)
[07/12/2024-05:24:18] [E] Engine could not be created from network
[07/12/2024-05:24:18] [E] Building engine failed
[07/12/2024-05:24:18] [E] Failed to create engine from model or file.
[07/12/2024-05:24:18] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v100200] # trtexec --onnx=ResNet18_batch16_fp8.onnx --fp16 --fp8 --stronglyTyped --verbose --profilingVerbosity=detailed

@nvpohanh (Collaborator)

Did you insert the Q/DQ ops by using the TensorRT Model Optimizer toolkit? https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq

It should have avoided the Q/DQ ops before Convs whose C and K are not multiples of 16.
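
For illustration, a minimal check of this constraint on an ONNX model could look like the sketch below. This is not a TensorRT or Model Optimizer API; the file path and names are just examples, and it assumes an unquantized export where each Conv weight is a plain initializer with shape [K, C/groups, kH, kW]:

import onnx
from onnx import numpy_helper

m = onnx.load("model.onnx")  # hypothetical path to the unquantized ONNX export
weights = {init.name: numpy_helper.to_array(init) for init in m.graph.initializer}

for node in m.graph.node:
    if node.op_type != "Conv" or node.input[1] not in weights:
        continue  # skip Convs whose weight comes from Q/DQ or other nodes
    k, c_per_group = weights[node.input[1]].shape[:2]
    if k % 16 != 0 or c_per_group % 16 != 0:
        # e.g. ResNet's first conv has K=64, C=3, so C is not a multiple of 16
        print(f"{node.name}: K={k}, C/groups={c_per_group} -> leave this Conv unquantized")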

@nvpohanh (Collaborator)

But thanks for pointing this out. I will add this limitation to our release notes.

@junstar92 (Author) commented Jul 12, 2024

> Did you insert the Q/DQ ops by using the TensorRT Model Optimizer toolkit? https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq
>
> It should have avoided the Q/DQ ops before Convs whose C and K are not multiples of 16.

@nvpohanh Thanks for the quick answer.
I inserted the Q/DQ ops using modelopt, but the same error also occurs with native Q/DQ ops.

In the case of using modelopt:
[image: model with modelopt-inserted Q/DQ ops]
In the case of native Q/DQ ops:
[image: model with native Q/DQ ops]

@nvpohanh (Collaborator)

Filed an internal tracker: id 4744383

We will debug this and find out how this is different from our FP8 ResNet50 testing in our CI/CD.

@junstar92 (Author)

This is my quantization and ONNX-export code.

import torch
import torchvision
import modelopt.torch.quantization as mtq

# FP8 (E4M3) per-tensor quantization config for weights and activations.
FP8_DEFAULT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
        "*output_quantizer": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"num_bits": (4, 3), "axis": None},
    },
    "algorithm": "max",
}

model = torchvision.models.resnet18(pretrained=True).cuda()

def calib_loop():
    # A few forward passes with random inputs to collect calibration statistics.
    for _ in range(10):
        model(torch.randn(16, 3, 224, 224, device='cuda'))

# Insert quantizers into the model and calibrate them with the loop above.
mtq.quantize(model, FP8_DEFAULT_CFG, forward_loop=calib_loop)

# Export the quantized model (with Q/DQ ops) to ONNX.
torch.onnx.export(
    model,
    torch.randn(16, 3, 224, 224, device='cuda'),
    'resnet18_fp8.onnx',
    input_names=['input'],
    output_names=['output'],
)

@nvpohanh (Collaborator)

@junstar92 Oh I see, you are using modelopt.torch.quantization while I was referring to modelopt.onnx.quantization. Could you first export the original model to ONNX and then use modelopt.onnx.quantization to add Q/DQ nodes?
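
For illustration, that flow might look roughly like the sketch below (the export step mirrors the code above; the quantization command follows the linked onnx_ptq README, so the exact flag names are assumptions and may differ between versions):

import torch
import torchvision

# 1) Export the original, unquantized ResNet18 to ONNX.
model = torchvision.models.resnet18(pretrained=True).cuda().eval()
torch.onnx.export(
    model,
    torch.randn(16, 3, 224, 224, device='cuda'),
    'resnet18.onnx',
    input_names=['input'],
    output_names=['output'],
)

# 2) Insert FP8 Q/DQ nodes with the ONNX PTQ tool, e.g.:
# $ python -m modelopt.onnx.quantization --onnx_path=resnet18.onnx --quantize_mode=fp8 --output_path=resnet18_qdq_fp8.onnx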

I will check internally about modelopt.torch.quantization vs modelopt.onnx.quantization differences.

@junstar92 (Author)

@nvpohanh Okay, here is the ONNX model quantized with modelopt.onnx.quantization; the first convolution is not quantized.
[image: model quantized with modelopt.onnx.quantization, with the first conv left unquantized]

Building this ONNX model succeeded, which is consistent with the first conv op having no FP8 implementation.

@junstar92 (Author)

@nvpohanh My question has been resolved, so I am closing this issue.
I appreciate your support.
