
Support TensorRT conversion and serving feature #32

Open
redrussianarmy opened this issue Nov 6, 2020 · 3 comments
redrussianarmy commented Nov 6, 2020

I realized that TensorFlow Lite does not support inference using an Nvidia GPU. I have an Nvidia Jetson Xavier. My current inference runs the unoptimized transformers model on the GPU, and it is faster than inference with the TF Lite model on the CPU.

After some research, I found two model-optimization options: TensorRT and TF-TRT. I made several attempts to convert a fine-tuned transformers model to TensorRT, but I could not get it working. It would be great if dialog-nlu supported TensorRT conversion and serving.
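For reference, the TF-TRT route I tried looks roughly like this (a minimal sketch; the SavedModel paths are placeholders, and it assumes a TF 2.x build with TensorRT support, e.g. from JetPack):

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a fine-tuned SavedModel into a TF-TRT optimized SavedModel.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)  # FP16 suits Jetson GPUs
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="path/to/saved_model",  # placeholder path
    conversion_params=params)
converter.convert()
converter.save("path/to/trt_saved_model")  # placeholder path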

MahmoudWahdan commented Nov 6, 2020

Hi @redrussianarmy
Thank you for sharing your experience.
I'll give it a try and let you know.

TFLite doesn't support serving on PC GPUs, but it does support mobile GPUs. I don't know whether it supports the GPUs of all edge devices.
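For context, on platforms that ship the TFLite GPU delegate, loading it typically looks like this (a rough sketch; the delegate library name is platform-specific and is an assumption here, not something verified on Jetson):

import tensorflow as tf

# Load a tflite model with the GPU delegate (library name varies per platform).
delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",  # placeholder path
    experimental_delegates=[delegate])
interpreter.allocate_tensors()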

One question that came to my mind:
Did you try combining the transformers layer_pruning feature with tflite conversion using hybrid_quantization?

from dialognlu import TransformerNLU  # import path as used in the dialog-nlu README

k_layers_to_prune = 4  # try different values
config = {
    # ... rest of your usual config keys ...
    "layer_pruning": {
        "strategy": "top",      # prune layers from the top of the encoder
        "k": k_layers_to_prune  # number of layers to prune
    }
}

nlu = TransformerNLU.from_config(config)
nlu.train(train_dataset, val_dataset, epochs, batch_size)

# save the model together with a hybrid-quantized tflite version
nlu.save(save_path, save_tflite=True, conversion_mode="hybrid_quantization")

# load the quantized tflite model for inference
nlu = TransformerNLU.load(model_path, quantized=True, num_process=4)

utterance = "add sabrina salerno to the grime instrumentals playlist"
result = nlu.predict(utterance)
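For what it's worth, hybrid (dynamic-range) quantization stores the weights in 8-bit while keeping activations in float, so it shrinks the model and speeds up CPU inference, but it does not by itself enable GPU execution.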

@MahmoudWahdan MahmoudWahdan added the feature Request for new feature label Nov 6, 2020
@MahmoudWahdan MahmoudWahdan self-assigned this Nov 6, 2020
redrussianarmy commented Nov 6, 2020

Hi @MahmoudWahdan
Thank you for your quick reply.

I tried combining the transformers layer_pruning feature with tflite conversion using hybrid_quantization as you suggested. Unfortunately, the result is the same: prediction does not run on the GPU of the Nvidia Jetson Xavier.

I am looking forward to seeing the new TensorRT conversion feature :)

MahmoudWahdan commented

Hi @redrussianarmy
Sure, this is something new that I'll try, and it will certainly be useful.
I'll keep you updated.
