YOLO-NAS post-training quantization and quantization-aware training models take longer inference time than the original model #2060
Comments
Same problem here... how is this possible? M.
Because quantized models are not meant to be run in eager execution mode using native PyTorch. Once you have a quantized model, you want to export it to ONNX and from there build your TensorRT engine from that quantized model.
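A minimal sketch of that export-then-TensorRT workflow, assuming a plain torch.onnx.export followed by trtexec; the stand-in model, file names, and input shape below are placeholders for illustration, not taken from this issue:

```python
import torch
import torch.nn as nn

# Stand-in for the quantized YOLO-NAS model; replace with your own quantized model.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 640, 640)  # assumed 640x640 input resolution

torch.onnx.export(
    model,
    dummy_input,
    "yolo_nas_quantized.onnx",   # hypothetical output path
    input_names=["input"],
    output_names=["output"],
    opset_version=13,            # opset 13+ is commonly used for Q/DQ graphs
)

# Then build a TensorRT engine from the exported graph, for example:
#   trtexec --onnx=yolo_nas_quantized.onnx --int8 --saveEngine=yolo_nas_quantized.engine
```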
Thanks for the reply. Actually, I have been running inference with ONNX, but the performance is poor as well. Any idea? M.
With what inference engine?
ONNX Runtime.
I don't know whether ONNX Runtime supports accelerated inference of quantized models. TensorRT does. In fact, we are using the pytorch-quantization package, which was specifically built for TRT.
I see... Unfortunately, in my case I need to use ONNX Runtime. Do you think this is something I can achieve with your model as well? Using the same approach, I'm not able to get any kind of improvement. Thanks for your reply.
The problem with ONNX Runtime is that it does not understand the Q/DQ layers in the graph and treats them as regular layers instead of fusing them. So my recommendation is to export the regular model to ONNX and then use their quantization methods: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html.
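A minimal sketch of the simplest ONNX Runtime quantization entry point described in that documentation, applied to a plain FP32 export; the file names are placeholders:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the weights of an FP32 ONNX export to int8 (dynamic quantization).
quantize_dynamic(
    model_input="yolo_nas_fp32.onnx",   # hypothetical FP32 export of the regular model
    model_output="yolo_nas_int8.onnx",  # hypothetical quantized output
    weight_type=QuantType.QInt8,        # int8 weights
)
```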
It makes sense. Thanks a lot!
For those who are interested, an update: I manually did the quantization with ONNX tools. I got a good compromise in inference time only when setting activations to uint8 and weights to int8. With other options it seems to stay the same. M.
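For reference, a minimal sketch of that configuration using ONNX Runtime static quantization with uint8 activations and int8 weights. The calibration reader below feeds random data purely for illustration and should be replaced with real preprocessed images; the paths, input name, and input shape are assumptions.

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random batches for calibration; replace with real images."""

    def __init__(self, input_name="input", num_batches=8):
        self.input_name = input_name
        self.batches = (
            np.random.rand(1, 3, 640, 640).astype(np.float32)  # assumed input shape
            for _ in range(num_batches)
        )

    def get_next(self):
        batch = next(self.batches, None)
        return None if batch is None else {self.input_name: batch}


quantize_static(
    model_input="yolo_nas_fp32.onnx",        # hypothetical FP32 export
    model_output="yolo_nas_uint8_act.onnx",  # hypothetical quantized output
    calibration_data_reader=RandomCalibrationReader(),
    activation_type=QuantType.QUInt8,        # uint8 activations
    weight_type=QuantType.QInt8,             # int8 weights
)
```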
💡 Your Question
I have followed exactly the same steps for model training followed by PTQ and QAT mentioned in the official SuperGradients notebook:
https://github.com/Deci-AI/super-gradients/blob/master/notebooks/yolo_nas_custom_dataset_fine_tuning_with_qat.ipynb.
I am facing a latency issue with the quantized models produced by PTQ and QAT: these models have longer inference times than the non-quantized version.
Kindly let me know what's going wrong!
Versions
No response