Quantization for transformer-based models #10006
Replies: 4 comments 12 replies
-
I was able to get it working by updating spacy_transformers/pipeline_component.py, and it gives a 20-25% improvement in inference speed. Of course, this might cause a loss of precision, so we would need to make it configurable.
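For readers who want to see what the change amounts to, here is a minimal sketch of the quantization step itself, applied with PyTorch dynamic quantization to roberta-base (the model behind en_core_web_trf) outside of spaCy. Wiring the quantized module back into the pipeline, presumably what the patched pipeline_component.py does, is not shown, since the exact path through the thinc wrapper depends on the spacy-transformers version.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-alone illustration of the quantization step; integrating it into the
# spaCy pipeline component is the part done by editing pipeline_component.py.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

# Dynamic quantization swaps nn.Linear layers for int8 kernels at inference
# time (CPU only); activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Berlin is the capital of Germany.", return_tensors="pt")
with torch.no_grad():
    outputs = quantized(**inputs)
print(outputs.last_hidden_state.shape)
```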
-
Thanks for reporting that this works and benchmarking the performance improvement! Quantization is certainly something we want to support out of the box in the future. We are working on support for model distillation to be able to extract smaller and faster models from large transformer models. We also plan to experiment with quantization during distillation, which may result in a lower impact on accuracy than quantizing the model after training, since the model is then optimized with the lower precision.
-
@danieldk Any news about quantization support? I am looking forward to using it with spaCy :D
-
Any news about quantization? My 3060 Ti is not able to run Mistral without 8-bit mode.
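For context, "8-bit mode" here presumably refers to bitsandbytes int8 loading in Hugging Face transformers rather than anything spaCy-specific. A minimal sketch, assuming bitsandbytes is installed and a CUDA GPU is available:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load Mistral with int8 weights via bitsandbytes so it fits in ~8 GB of VRAM.
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("Quantization lets smaller GPUs", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```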
-
Hi,
I have been using the transformer-based models (the pretrained ones shipped with spaCy), primarily for NER, and I need to improve the inference speed. Two options I have been looking at:

1. Quantization: due to the thinc wrapper I was not able to figure out how to load the PyTorch model and apply quantization. https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html
2. The TurboTransformers inference engine: https://github.com/Tencent/TurboTransformers

Any pointers on how I could use either of these would be very helpful!
Thanks!
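For anyone landing here with the same question, the linked tutorial boils down to a prepare/calibrate/convert flow. Below is a toy sketch of that flow using a stand-in module rather than the actual spaCy transformer (which would first have to be pulled out of the thinc wrapper); in practice, dynamic quantization of the Linear layers, as in the first reply above, is usually the simpler route for transformer encoders.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in; static quantization needs explicit quant/dequant stubs."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.linear = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.linear(x))
        return self.dequant(x)

model = TinyEncoder().eval()
# Pick the quantization backend (fbgemm for x86 CPUs) and insert observers.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run representative inputs so the observers can choose scales.
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(4, 16))

# Replace observed modules with their int8 counterparts.
quantized = torch.quantization.convert(prepared)
print(quantized)
```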