Quantization for transformer-based models #10006
Replies: 4 comments 12 replies
-
I was able to get it working by updating spacy_transformers/pipeline_component.py, and it gives a 20-25% improvement in inference speed. Of course, this might cause a loss of precision, so we would need to make it configurable.
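For readers who want to see what the change amounts to, here is a minimal sketch of the quantization step itself, applied with PyTorch dynamic quantization to roberta-base (the model behind en_core_web_trf) outside of spaCy. Wiring the quantized module back into the pipeline, presumably what the patched pipeline_component.py does, is not shown, since the exact path through the thinc wrapper depends on the spacy-transformers version.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-alone illustration of the quantization step; integrating it into the
# spaCy pipeline component is the part done by editing pipeline_component.py.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

# Dynamic quantization swaps nn.Linear layers for int8 kernels at inference
# time (CPU only); activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Berlin is the capital of Germany.", return_tensors="pt")
with torch.no_grad():
    outputs = quantized(**inputs)
print(outputs.last_hidden_state.shape)
```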
-
Thanks for reporting that this works and benchmarking the performance improvement! Quantization is certainly something we want to support out of the box in the future. We are working on support for model distillation to be able to extract smaller and faster models from large transformer models. We also plan to experiment with quantization during distillation, which may result in a lower impact on accuracy than quantizing the model after training, since the model is then optimized with the lower precision.
-
@danieldk Any news about quantization support? I am looking forward to using it with spaCy :D
-
Any news about quantization? My 3060 Ti is not able to run Mistral without 8-bit mode.
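For context, "8-bit mode" here presumably refers to bitsandbytes int8 loading in Hugging Face transformers rather than anything spaCy-specific. A minimal sketch, assuming bitsandbytes is installed and a CUDA GPU is available:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load Mistral with int8 weights via bitsandbytes so it fits in ~8 GB of VRAM.
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("Quantization lets smaller GPUs", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```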
-
Hi,
I have been using the transformer-based models (the pretrained ones shipped with spaCy), primarily for NER, and I need to improve the inference speed. Two options I have been looking at:

1. Quantization: due to the thinc wrapper I was not able to figure out how to load the PyTorch model and apply quantization. https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html
2. The TurboTransformers inference engine: https://github.com/Tencent/TurboTransformers

Any pointers on how I could use either of these would be very helpful!
Thanks!
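For anyone landing here with the same question, the linked tutorial boils down to a prepare/calibrate/convert flow. Below is a toy sketch of that flow using a stand-in module rather than the actual spaCy transformer (which would first have to be pulled out of the thinc wrapper); in practice, dynamic quantization of the Linear layers, as in the first reply above, is usually the simpler route for transformer encoders.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in; static quantization needs explicit quant/dequant stubs."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.linear = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.linear(x))
        return self.dequant(x)

model = TinyEncoder().eval()
# Pick the quantization backend (fbgemm for x86 CPUs) and insert observers.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run representative inputs so the observers can choose scales.
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(4, 16))

# Replace observed modules with their int8 counterparts.
quantized = torch.quantization.convert(prepared)
print(quantized)
```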