This repository contains setup examples for hosting model inference using NVIDIA Triton Inference Server.
- Set up your model and tokenizer files
  - move model.onnx to hf-embedding-template/onnx_model/1/
  - move any other model files (model and tokenizer config) to hf-embedding-template/preprocessing/1/
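After these steps the model repository should look roughly like the sketch below. The config.pbtxt files and the model.py in the preprocessing model are assumptions based on the usual Triton layout (an ONNX model plus a Python-backend tokenizer); the template's actual files may differ.

```
hf-embedding-template/
├── onnx_model/
│   ├── config.pbtxt          # assumed: standard Triton model configuration
│   └── 1/
│       └── model.onnx
└── preprocessing/
    ├── config.pbtxt          # assumed: standard Triton model configuration
    └── 1/
        ├── model.py          # assumed: Python-backend tokenization script
        └── tokenizer / model config files
```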
- Start Triton Server and attach shell
  docker run --shm-size=16g --gpus all -it --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /hf-embedding-template:/models nvcr.io/nvidia/tritonserver:24.08-py3 bash
- Run inside the Triton container
  pip install transformers
  tritonserver --model-repository=/models
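Once tritonserver reports its models as READY, the deployment can be checked from the host, since the docker run command above publishes the HTTP port. A minimal readiness probe with tritonclient (installed in the client step below) might look like the sketch here; the model names preprocessing and onnx_model are guesses derived from the directory names and may differ in the template.

```python
# Minimal readiness probe (sketch). Assumes the default HTTP port 8000 and
# model names taken from the directory names of the model repository.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
for model_name in ("preprocessing", "onnx_model"):  # assumed model names
    print(f"{model_name} ready:", client.is_model_ready(model_name))
```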
- Run client
  pip install tritonclient[http]
  python client.py
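client.py in this repository implements the actual request; purely as orientation, a request against the HTTP endpoint with tritonclient looks roughly like the sketch below. The model name and the tensor names are placeholders and must be replaced with the names defined in the template's config.pbtxt files and client.py.

```python
# Illustrative sketch only -- the repository's client.py is the reference.
# MODEL_NAME, INPUT_NAME and OUTPUT_NAME are placeholders.
import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "text-embedder"                   # placeholder model name
INPUT_NAME, OUTPUT_NAME = "TEXT", "EMBEDDING"  # placeholder tensor names

client = httpclient.InferenceServerClient(url="localhost:8000")

# Triton transports strings as BYTES tensors, i.e. numpy object arrays.
texts = np.array([["An example sentence to embed."]], dtype=object)

infer_input = httpclient.InferInput(INPUT_NAME, list(texts.shape), "BYTES")
infer_input.set_data_from_numpy(texts)
requested_output = httpclient.InferRequestedOutput(OUTPUT_NAME)

response = client.infer(MODEL_NAME, inputs=[infer_input], outputs=[requested_output])
embedding = response.as_numpy(OUTPUT_NAME)
print("embedding shape:", embedding.shape)
```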
This repo also comes with ready-to-run Helm charts, which can be found under /helm. For example, text-embedder-trion is readily configured to run a Triton embedding server.