
YOLO-NAS post-training quantization and quantization-aware training models take longer inference time than the original model #2060

Open
anazkhan opened this issue Oct 23, 2024 · 10 comments

Comments

@anazkhan

💡 Your Question

I have followed exactly the same steps for model training followed by PTQ and QAT as described in the official super-gradients notebook:
https://github.com/Deci-AI/super-gradients/blob/master/notebooks/yolo_nas_custom_dataset_fine_tuning_with_qat.ipynb

I am facing a latency issue with the quantized models produced by QAT and PTQ: these models have a longer inference time than the non-quantized version.

Kindly let me know what's going wrong!

Versions

No response

@freedreamer82

Same problem here... how is this possible?

M.

@BloodAxe
Contributor

BloodAxe commented Nov 6, 2024

Because quantized models are not meant to be run in eager execution mode using native PyTorch.
Yes, you can run such a model within PyTorch, but it will be slow by design. You can check pytorch-quantization for more details.

Once you have a quantized model, you want to export it to ONNX and from there build your TensorRT engine from that quantized model.
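
A minimal sketch of that ONNX → TensorRT step, assuming the quantized (Q/DQ) ONNX file has already been exported as in the notebook; the file names below are placeholders and the API shown is the TensorRT 8.x Python builder:

```python
# Hedged sketch: build an INT8 TensorRT engine from a Q/DQ ONNX export.
# "yolo_nas_qat.onnx" / "yolo_nas_qat.engine" are placeholder file names.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)  # TRT 8.x style
)
parser = trt.OnnxParser(network, logger)

with open("yolo_nas_qat.onnx", "rb") as f:  # quantized (Q/DQ) ONNX export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)       # let TRT honor the Q/DQ nodes

engine = builder.build_serialized_network(network, config)
with open("yolo_nas_qat.engine", "wb") as f:
    f.write(engine)
```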

@freedreamer82

Thanks for the reply. Actually, I have been running inference with ONNX, but the performance is poor as well.

Any ideas?

M.

@BloodAxe
Contributor

BloodAxe commented Nov 6, 2024

With what inference engine?

@freedreamer82

ONNX Runtime.

@BloodAxe
Contributor

BloodAxe commented Nov 6, 2024

I don't know whether ONNX Runtime supports accelerated inference of quantized models. TensorRT does. In fact, we are using the pytorch-quantization package, which was specifically built for TRT.

@freedreamer82

I see... unfortunately, in my case I need to use ONNX Runtime.
I have had experience with PTQ quantization of an Ultralytics model (quantizing only the last layers) and got a boost in inference time with the ONNX CPU/GPU runtime.

Do you think that is something I can also achieve with your model? It seems that with the same approach I am not able to get any kind of improvement.

Thanks for your reply.

@BloodAxe
Contributor

BloodAxe commented Nov 7, 2024

The problem with the ONNX Runtime engine is that it does not understand the Q/DQ layers in the graph and treats them as regular layers instead of fusing them. So my recommendation is to export the regular (non-quantized) model to ONNX and then use ONNX Runtime's own quantization methods: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html.
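
Roughly what that looks like with ONNX Runtime's static (QDQ) quantization API; the file names, input tensor name, and input shape below are assumptions, and the random calibration batches stand in for real preprocessed images:

```python
# Hedged sketch of ONNX Runtime static quantization of an fp32 ONNX export.
# "input", the 1x3x640x640 shape, and the file names are assumptions - adjust to your model.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    quantize_static,
)

class DummyCalibrationReader(CalibrationDataReader):
    """Yields a few random NCHW batches; replace with real calibration images."""
    def __init__(self, input_name="input", num_batches=16):
        self._batches = iter(
            {input_name: np.random.rand(1, 3, 640, 640).astype(np.float32)}
            for _ in range(num_batches)
        )

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    model_input="yolo_nas_fp32.onnx",      # regular (non-quantized) ONNX export
    model_output="yolo_nas_int8.onnx",
    calibration_data_reader=DummyCalibrationReader(),
    quant_format=QuantFormat.QDQ,          # insert Q/DQ nodes that ORT knows how to fuse
)
```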

@freedreamer82

It makes sense. Thanks a lot!

@freedreamer82

For those who are interested, an update: I manually did the quantization with the ONNX tools. I got a good compromise in inference time only when setting activations to uint8 and weights to int8. With the other options it seems to stay the same.

M.
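
A quick way to sanity-check that kind of result: a hypothetical latency comparison between the fp32 export and a model quantized with `activation_type=QuantType.QUInt8` and `weight_type=QuantType.QInt8` (the uint8-activation / int8-weight combination described above). The file names, input shape, and execution provider are assumptions:

```python
# Hedged sketch: average per-inference latency with ONNX Runtime on CPU.
import time
import numpy as np
import onnxruntime as ort

def bench(path, n=50):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    x = np.random.rand(1, 3, 640, 640).astype(np.float32)  # shape is an assumption
    sess.run(None, {inp.name: x})                           # warm-up run
    t0 = time.perf_counter()
    for _ in range(n):
        sess.run(None, {inp.name: x})
    return (time.perf_counter() - t0) / n

print("fp32 :", bench("yolo_nas_fp32.onnx"))   # placeholder file names
print("u8/s8:", bench("yolo_nas_u8s8.onnx"))
```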
