Best Practices to Optimize Inferentia Utilization with FastAPI on Amazon EC2 Inf2 and Inf1 Instances
Production workloads often have high throughput, low latency, and cost requirements. Inefficient architectures that sub-optimally utilize accelerators can lead to unnecessarily high production costs. In this repo, we show how to optimally utilize NeuronCores with FastAPI to maximize throughput at minimum latency. In the following sections, we show how to set up this solution on Inf2 and Inf1 instances and walk through how to compile models for NeuronCores, deploy models with FastAPI, and monitor NeuronCore utilization. An overview of the solution architecture is depicted in Fig. 1 (Inf2) and Fig. 2 (Inf1).
Fig. 1 - Solution Architecture diagram using Amazon EC2 Inf2 instance type
Fig. 2 - Solution Architecture diagram using Amazon EC2 Inf1 instance type
Each Inferentia chip has 4 NeuronCores available that share the system vCPUs and memory. The table below shows a breakdown of the NeuronCores-v1 available for the different Inf1 instance sizes.
Instance Size | # Accelerators | # NeuronCores-v1 | vCPUs | Memory (GiB) |
---|---|---|---|---|
Inf1.xlarge | 1 | 4 | 4 | 8 |
Inf1.2xlarge | 1 | 4 | 8 | 16 |
Inf1.6xlarge | 4 | 16 | 24 | 48 |
Inf1.24xlarge | 16 | 64 | 96 | 192 |
Similarly, this is the breakdown of Inf2 instance sizes with the latest NeuronCores-v2.
Instance Size | # Accelerators | # NeuronCores-v2 | vCPUs | Accelerator Memory (GB) |
---|---|---|---|---|
Inf2.xlarge | 1 | 2 | 4 | 32 |
Inf2.8xlarge | 1 | 2 | 32 | 32 |
Inf2.24xlarge | 6 | 12 | 96 | 192 |
Inf2.48xlarge | 12 | 24 | 192 | 384 |
Neuron Runtime is responsible for executing models on Neuron devices. The Neuron Runtime determines which NeuronCore will execute which model and how to execute it. Configuration of the Neuron Runtime is controlled through environment variables at the process level. Two commonly used environment variables are NEURON_RT_NUM_CORES and NEURON_RT_VISIBLE_CORES. You can find the full list of environment variables in the Neuron Runtime Configuration documentation.
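To illustrate the process-level scoping, here is a minimal sketch (not code from this repo) that sets these variables in Python before the Neuron runtime initializes; in a containerized setup such as this one, they would typically be passed to each container instead.

```python
import os

# NEURON_RT_NUM_CORES: how many NeuronCores this process may allocate.
# NEURON_RT_VISIBLE_CORES: which specific NeuronCore indices this process may use.
# Both take effect only if set before the Neuron runtime initializes in the
# process, i.e. before the first Neuron model is loaded.
os.environ["NEURON_RT_NUM_CORES"] = "1"
# os.environ["NEURON_RT_VISIBLE_CORES"] = "0"  # e.g. pin this process to NeuronCore 0
```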
To set up the solution in a repeatable, reusable way, we use Docker containers and provide a configuration file for user inputs. This configuration file needs user-defined name prefixes for the Docker image and Docker containers. The build.sh scripts in the fast-api and trace-model folders use these prefixes to create the Docker images.
Once you have provisioned an appropriate EC2 instance (with the proper IAM role to get access to Amazon ECR), clone this repository. Start by specifying the CHIP_TYPE variable (default "inf2") and the AWS_DEFAULT_REGION (default "us-east-2") you are working in, in the .env file. The .env file automatically figures out your ECR registry information, so there is no need to provide it.

Note: There are two .env files with the same variables, one in the trace-model directory and one in the fast-api directory. They are kept separate so that tracing and deployment can be two separate processes and, if need be, can run in two separate regions.
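For reference, a minimal .env might look like the following (the values shown are the documented defaults; any other variables in the actual file are not reproduced here):

```
CHIP_TYPE=inf2
AWS_DEFAULT_REGION=us-east-2
```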
First, we need a model compiled with AWS Neuron to get started. The trace-model folder provides all the scripts necessary to trace a bert-base-uncased model on Inferentia; the same script can be used for most models available on Hugging Face. The Dockerfile has all the dependencies needed to run models on AWS Neuron and runs the trace-model.py code as its entrypoint. You can build this container by simply running build.sh and push it to Amazon ECR with push.sh. The push script will create a repository in ECR for you and push the container image.
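For orientation, the core of such a tracing script typically looks like the sketch below. This is an illustration rather than the exact trace-model.py in this repo; it assumes the torch-neuronx flow used on Inf2 (on Inf1, torch.neuron.trace from the torch-neuron package plays the same role), and the sequence length of 128 is an arbitrary choice.

```python
import torch
import torch_neuronx  # Inf2 flow; on Inf1 use `import torch.neuron` and torch.neuron.trace
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()

# Example input with batch size 1, matching the compiled-bert-bs-1.pt artifact
inputs = tokenizer("Neuron compilation example", max_length=128,
                   padding="max_length", truncation=True, return_tensors="pt")
example = (inputs["input_ids"], inputs["attention_mask"])

# Compile for NeuronCores and save the traced graph as TorchScript
neuron_model = torch_neuronx.trace(model, example)
neuron_model.save("compiled-bert-bs-1.pt")
```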
To make things easier, we rely on the pre-built Neuron Deep Learning Docker images provided by AWS. Pulling these images requires temporary credentials; the fetch-credential.sh script contains the command to fetch them. Run the following commands in order to build the image and then run it as a container, which starts the compilation:
cd ./trace-model
./fetch-credential.sh
./build.sh
./run.sh
Once the models are compiled, the TorchScript model file (.pt) will land under the trace-models folder. For this example, the file name is hard-coded as compiled-bert-bs-1.pt in the config.properties file.
The fast-api folder provides all the necessary scripts to deploy models with FastAPI. To deploy the models without any changes, simply execute the deploy.sh script. This will build a FastAPI container image, run containers on the specified number of NeuronCores, and deploy the specified number of models in each FastAPI model server.
cd ./fast-api
./deploy.sh
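Each FastAPI model server follows a pattern similar to the minimal sketch below. This is an illustration under assumptions rather than the exact code in this repo: the /predict endpoint name, the query-parameter interface, and the local model path are hypothetical, and only a single model is loaded for brevity.

```python
import os

# Restrict this server process to a single NeuronCore before any model is loaded
os.environ.setdefault("NEURON_RT_NUM_CORES", "1")

import torch
import torch_neuronx  # registers Neuron ops so the traced model can load; torch-neuron on Inf1
from fastapi import FastAPI
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = torch.jit.load("compiled-bert-bs-1.pt")  # traced model from the previous step

@app.post("/predict")
def predict(text: str):
    inputs = tokenizer(text, max_length=128, padding="max_length",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs["input_ids"], inputs["attention_mask"])[0]
    return {"label": int(logits.argmax(dim=-1))}
```

Running one such server process per NeuronCore, each with its own NEURON_RT_* settings, is what keeps each server's models pinned to a single core.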
Once the containers are deployed, we use the run_apis.py script to call the APIs in parallel threads. The code is set up to call 6 deployed models, one on each NeuronCore, but can easily be changed to a different setting.
python3 run_apis.py
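The parallel-calling logic amounts to something like the sketch below (illustrative only; the host ports and endpoint path are assumptions that depend on how the containers were published):

```python
import concurrent.futures
import requests

# Assume one FastAPI container per NeuronCore, published on consecutive host ports
PORTS = [8081 + i for i in range(6)]
TEXT = "This is a sample sentence for load testing."

def call_api(port: int) -> float:
    resp = requests.post(f"http://localhost:{port}/predict", params={"text": TEXT})
    resp.raise_for_status()
    return resp.elapsed.total_seconds()

# One thread per endpoint so all NeuronCores receive requests concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=len(PORTS)) as pool:
    latencies = list(pool.map(call_api, PORTS))

print("per-endpoint latency (s):", latencies)
```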
Once the model servers are deployed, you can use neuron-top to monitor NeuronCore utilization in real time. neuron-top is a CLI tool in the Neuron SDK that provides information such as NeuronCore, vCPU, and memory utilization. In a separate terminal, enter the following command:
neuron-top
Your output should be similar to the following figure. In this scenario, we specified 2 NeuronCores and 2 models per server on an Inf2.xlarge instance. The screenshot below shows 2 models of 675.3 MB each loaded on 2 NeuronCores; with a total of 4 models loaded, the Device Memory Used is 1.3 GB. Use the arrow keys to move between the NeuronCores on different devices.
Similarly, this screenshot shows an Inf1 instance with 6 NeuronCores and 2 models per server, with 2.1 GB of device memory used.
Once you run the run_apis.py script, you can see the percentage utilization of each of the 2 NeuronCores, as shown below. You can also see the system vCPU usage and runtime vCPU usage.
The next screenshot shows the utilization on an Inf1 instance type with 6 NeuronCores.
To clean up all the Docker containers created in this work, we provide a cleanup.sh script that removes all running and stopped containers. Don't use this script if you wish to keep some containers running.
cd ./fast-api
./cleanup.sh
See CONTRIBUTING for more information. Prior to any production deployment, customers should work with their local security teams to evaluate any additional controls.
This library is licensed under the MIT-0 License. See the LICENSE file.