
Building Complex Pipelines: Stable Diffusion

Navigate to | Part 5: Building Model Ensembles | Part 7: Iterative Scheduling Tutorial | Documentation: BLS

Watch this explainer video, which discusses the pipeline, before proceeding with the example. This example focuses on showcasing two of Triton Inference Server's features:

Using Multiple Backends

Building a pipeline powered by deep learning models is a collaborative effort that often involves multiple contributors. Contributors often have differing development environments, which can lead to issues when combining their work into a single pipeline. Triton users can solve this challenge by using the Python or C++ backend along with the Business Logic Scripting (BLS) API to trigger model execution.

Pipeline

In this example, the models are being run on:

  • ONNX Backend
  • TensorRT Backend
  • Python Backend

Both models deployed on framework backends (ONNX and TensorRT) can be triggered from the Python backend using the following API:

# Build a BLS request targeting the text encoder model
encoding_request = pb_utils.InferenceRequest(
    model_name="text_encoder",
    requested_output_names=["last_hidden_state"],
    inputs=[input_ids_1],
)

# Execute the request synchronously and pull the requested output tensor
response = encoding_request.exec()
text_embeddings = pb_utils.get_output_tensor_by_name(response, "last_hidden_state")
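
As a minimal sketch (not taken verbatim from model.py), the BLS response can be checked for errors and the output tensor converted to a NumPy array before further processing:

# Sketch: inspect the BLS response before using it (assumes the request above)
if response.has_error():
    raise pb_utils.TritonModelException(response.error().message())

# pb_utils.Tensor objects expose their data as NumPy arrays
text_embeddings_np = text_embeddings.as_numpy()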

Refer to model.py in the pipeline model for a complete example.

Stable Diffusion Example

Before starting, clone this repository and navigate to the root folder. Use three different terminals for an easier user experience.

Step 1: Prepare the Server Environment

  • First, run the Triton Inference Server Container.
# Replace yy.mm with the year and month of the release, e.g. 22.08
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
  • Next, install all the dependencies required by the models running in the Python backend and log in with your HuggingFace token (a HuggingFace account is required).
# PyTorch & Transformers Lib
pip install torch torchvision torchaudio
pip install transformers ftfy scipy accelerate
pip install diffusers==0.9.0
pip install transformers[onnxruntime]
huggingface-cli login
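
Optionally, confirm that the key packages import correctly and that the GPU is visible inside the server container. This is a minimal sanity check, not part of the original steps:

# Optional sanity check (run with python3 inside the server container)
import torch
import transformers
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)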

Step 2: Exporting and converting the models

Use the NGC PyTorch container to export and convert the models.

docker run -it --gpus all -p 8888:8888 -v ${PWD}:/mount nvcr.io/nvidia/pytorch:yy.mm-py3

pip install transformers ftfy scipy
pip install transformers[onnxruntime]
pip install diffusers==0.9.0
huggingface-cli login
cd /mount
python export.py
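
If the trtexec command below complains about input names or shapes, one way to double-check the exported VAE model is to inspect its inputs with onnxruntime (installed above via transformers[onnxruntime]). This is an optional sketch; the file name vae.onnx is the one produced by export.py:

# Optional sketch: list the VAE ONNX inputs expected by the trtexec command below
import onnxruntime as ort

sess = ort.InferenceSession("vae.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    # Dynamic dimensions appear as strings/None; compare with --minShapes/--maxShapes
    print(inp.name, inp.shape)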

# Accelerating VAE with TensorRT
trtexec --onnx=vae.onnx --saveEngine=vae.plan --minShapes=latent_sample:1x4x64x64 --optShapes=latent_sample:4x4x64x64 --maxShapes=latent_sample:8x4x64x64 --fp16

# Place the models in the model repository
mkdir model_repository/vae/1
mkdir model_repository/text_encoder/1
mv vae.plan model_repository/vae/1/model.plan
mv encoder.onnx model_repository/text_encoder/1/model.onnx

Step 3: Launch the Server

From the server container, launch the Triton Inference Server.

tritonserver --model-repository=/models
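
Once the server logs show the models as READY, you can verify readiness from the host through Triton's KServe-compatible health endpoint on the HTTP port mapped earlier (8000). A minimal sketch using only the Python standard library:

# Optional check: the endpoint returns HTTP 200 once the server and models are ready
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v2/health/ready") as resp:
    print("Server ready:", resp.status == 200)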

Step 4: Run the client

Use the client container and run the client.

docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Client with no GUI
python3 client.py

# Client with GUI
pip install gradio packaging
python3 gui/client.py --triton_url="localhost:8001"
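
For reference, a minimal gRPC request to the BLS pipeline model might look like the sketch below. It uses the tritonclient package that ships with the SDK container; the model name ("pipeline") and tensor names ("prompt", "generated_image") are assumptions here, so check client.py and the pipeline's config.pbtxt for the exact names:

# Minimal sketch of a gRPC request; model/tensor names are assumptions (see client.py)
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Text prompts are sent as a BYTES tensor
prompt = np.array(["a photo of an astronaut riding a horse"], dtype=object)
text_input = grpcclient.InferInput("prompt", list(prompt.shape), "BYTES")
text_input.set_data_from_numpy(prompt)

result = client.infer(model_name="pipeline", inputs=[text_input])
image = result.as_numpy("generated_image")
print(image.shape)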

Note: The first inference query may take more time than successive queries.