
YOLO-NAS post-training quantization and quantization-aware training models take longer inference time than the original model #2060

Open
anazkhan opened this issue Oct 23, 2024 · 10 comments

Comments

@anazkhan

💡 Your Question

I have followed exactly the same steps for model training followed by PTQ and QAT as described in the official super-gradients notebook:
https://github.com/Deci-AI/super-gradients/blob/master/notebooks/yolo_nas_custom_dataset_fine_tuning_with_qat.ipynb

I am facing a latency issue with the quantized models produced by QAT and PTQ: these models have a longer inference time than the non-quantized version.

Kindly let me know what's going wrong!

Versions

No response

@freedreamer82

Same problem here... how is this possible?

M.

@BloodAxe
Contributor

BloodAxe commented Nov 6, 2024

Because quantized models are not meant to be run in eager execution mode using native PyTorch.
Yes, you can run such a model within PyTorch, but it will be slow by design. You can check pytorch-quantization for more details.

Once you have a quantized model, you want to export it to ONNX and from there build your TensorRT engine from that quantized model.
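
A minimal sketch of that ONNX → TensorRT step, assuming the quantized (Q/DQ) ONNX file has already been exported as in the notebook; the file names below are placeholders and the API shown is the TensorRT 8.x Python builder:

```python
# Hedged sketch: build an INT8 TensorRT engine from a Q/DQ ONNX export.
# "yolo_nas_qat.onnx" / "yolo_nas_qat.engine" are placeholder file names.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)  # TRT 8.x style
)
parser = trt.OnnxParser(network, logger)

with open("yolo_nas_qat.onnx", "rb") as f:  # quantized (Q/DQ) ONNX export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)       # let TRT honor the Q/DQ nodes

engine = builder.build_serialized_network(network, config)
with open("yolo_nas_qat.engine", "wb") as f:
    f.write(engine)
```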

@freedreamer82

Thanks for the reply. Actually, I have been running inference with ONNX, but the performance is poor as well.

Any ideas?

M.

@BloodAxe
Contributor

BloodAxe commented Nov 6, 2024

With what inference engine?

@freedreamer82

ONNX Runtime.

@BloodAxe
Contributor

BloodAxe commented Nov 6, 2024

I don't know whether ONNX Runtime supports accelerated inference of quantized models. TensorRT does. In fact, we are using the pytorch-quantization package, which was specifically built for TRT.

@freedreamer82

I see... unfortunately, in my case I need to use ONNX Runtime.
I have had experience with PTQ quantization of an Ultralytics model (quantizing only the last layers) and got a boost in inference time with the ONNX CPU/GPU runtime.

Do you think that is something I can also achieve with your model? It seems that with the same approach I am not able to get any kind of improvement.

Thanks for your reply.

@BloodAxe
Contributor

BloodAxe commented Nov 7, 2024

The problem with the ONNX Runtime engine is that it does not understand the Q/DQ layers in the graph and treats them as regular layers instead of fusing them. So my recommendation is to export the regular (non-quantized) model to ONNX and then use ONNX Runtime's own quantization methods: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html.
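
Roughly what that looks like with ONNX Runtime's static (QDQ) quantization API; the file names, input tensor name, and input shape below are assumptions, and the random calibration batches stand in for real preprocessed images:

```python
# Hedged sketch of ONNX Runtime static quantization of an fp32 ONNX export.
# "input", the 1x3x640x640 shape, and the file names are assumptions - adjust to your model.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    quantize_static,
)

class DummyCalibrationReader(CalibrationDataReader):
    """Yields a few random NCHW batches; replace with real calibration images."""
    def __init__(self, input_name="input", num_batches=16):
        self._batches = iter(
            {input_name: np.random.rand(1, 3, 640, 640).astype(np.float32)}
            for _ in range(num_batches)
        )

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    model_input="yolo_nas_fp32.onnx",      # regular (non-quantized) ONNX export
    model_output="yolo_nas_int8.onnx",
    calibration_data_reader=DummyCalibrationReader(),
    quant_format=QuantFormat.QDQ,          # insert Q/DQ nodes that ORT knows how to fuse
)
```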

@freedreamer82

It makes sense. Thanks a lot!

@freedreamer82

For those who are interested, an update: I manually did the quantization with the ONNX tools. I got a good compromise in inference time only when setting activations to uint8 and weights to int8. With the other options it seems to stay the same.

M.
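
A quick way to sanity-check that kind of result: a hypothetical latency comparison between the fp32 export and a model quantized with `activation_type=QuantType.QUInt8` and `weight_type=QuantType.QInt8` (the uint8-activation / int8-weight combination described above). The file names, input shape, and execution provider are assumptions:

```python
# Hedged sketch: average per-inference latency with ONNX Runtime on CPU.
import time
import numpy as np
import onnxruntime as ort

def bench(path, n=50):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    x = np.random.rand(1, 3, 640, 640).astype(np.float32)  # shape is an assumption
    sess.run(None, {inp.name: x})                           # warm-up run
    t0 = time.perf_counter()
    for _ in range(n):
        sess.run(None, {inp.name: x})
    return (time.perf_counter() - t0) / n

print("fp32 :", bench("yolo_nas_fp32.onnx"))   # placeholder file names
print("u8/s8:", bench("yolo_nas_u8s8.onnx"))
```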
