Best Practices to Optimize Inferentia Utilization with FastAPI on Amazon EC2 Inf2 and Inf1 Instances
Production workloads often have high throughput, low latency, and cost requirements. Inefficient architectures that sub-optimally utilize accelerators can lead to unnecessarily high production costs. In this repo, we show how to optimally utilize NeuronCores with FastAPI to maximize throughput at minimum latency. In the following sections, we show how to set up this solution on Inf2 and Inf1 instances and walk through how to compile models for NeuronCores, deploy models with FastAPI, and monitor NeuronCore utilization. An overview of the solution architecture is depicted in Fig. 1 (Inf2) and Fig. 2 (Inf1).
Fig. 1 - Solution Architecture diagram using Amazon EC2 Inf2 instance type
Fig. 2 - Solution Architecture diagram using Amazon EC2 Inf1 instance type
Each Inferentia chip has 4 NeuronCores available that share the system vCPUs and memory. The table below shows a breakdown of the NeuronCores-v1 available for the different Inf1 instance sizes.
Instance Size | # Accelerators | # NeuronCores-v1 | vCPUs | Memory (GiB) |
---|---|---|---|---|
Inf1.xlarge | 1 | 4 | 4 | 8 |
Inf1.2xlarge | 1 | 4 | 8 | 16 |
Inf1.6xlarge | 4 | 16 | 24 | 48 |
Inf1.24xlarge | 16 | 64 | 96 | 192 |
Similarly, this is the breakdown of Inf2 instance sizes with the latest NeuronCores-v2.
Instance Size | # Accelerators | # NeuronCores-v2 | vCPUs | Accelerator Memory (GB) |
---|---|---|---|---|
Inf2.xlarge | 1 | 2 | 4 | 32 |
Inf2.8xlarge | 1 | 2 | 32 | 32 |
Inf2.24xlarge | 6 | 12 | 96 | 192 |
Inf2.48xlarge | 12 | 24 | 192 | 384 |
Neuron Runtime is responsible for executing models on Neuron devices. The Neuron Runtime determines which NeuronCore will execute which model and how to execute it. Configuration of the Neuron Runtime is controlled through environment variables at the process level. Two commonly used environment variables are NEURON_RT_NUM_CORES and NEURON_RT_VISIBLE_CORES. You can find the full list of environment variables in the Neuron Runtime Configuration documentation.
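To illustrate the process-level scoping, here is a minimal sketch (not code from this repo) that sets these variables in Python before the Neuron runtime initializes; in a containerized setup such as this one, they would typically be passed to each container instead.

```python
import os

# NEURON_RT_NUM_CORES: how many NeuronCores this process may allocate.
# NEURON_RT_VISIBLE_CORES: which specific NeuronCore indices this process may use.
# Both take effect only if set before the Neuron runtime initializes in the
# process, i.e. before the first Neuron model is loaded.
os.environ["NEURON_RT_NUM_CORES"] = "1"
# os.environ["NEURON_RT_VISIBLE_CORES"] = "0"  # e.g. pin this process to NeuronCore 0
```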
To set up the solution in a repeatable, reusable way, we use Docker containers and provide a configuration file for user inputs. This configuration file needs user-defined name prefixes for the Docker image and Docker containers. The build.sh scripts in the fast-api and trace-model folders use these prefixes to create the Docker images.
Once you have provisioned an appropriate EC2 instance (with the proper IAM role to get access to Amazon ECR), clone this repository. Start by specifying the CHIP_TYPE variable (default "inf2") and the AWS_DEFAULT_REGION (default "us-east-2") you are working in, in the .env file. The .env file automatically figures out your ECR registry information, so there is no need to provide it.

Note: There are two .env files with the same variables, one in the trace-model directory and one in the fast-api directory. They are kept separate so that tracing and deployment can be two separate processes and, if need be, can run in two separate regions.
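For reference, a minimal .env might look like the following (the values shown are the documented defaults; any other variables in the actual file are not reproduced here):

```
CHIP_TYPE=inf2
AWS_DEFAULT_REGION=us-east-2
```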
First, we need a model compiled with AWS Neuron to get started. The trace-model folder provides all the scripts necessary to trace a bert-base-uncased model on Inferentia; the same script can be used for most models available on Hugging Face. The Dockerfile has all the dependencies needed to run models on AWS Neuron and runs the trace-model.py code as its entrypoint. You can build this container by simply running build.sh and push it to Amazon ECR with push.sh. The push script will create a repository in ECR for you and push the container image.
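For orientation, the core of such a tracing script typically looks like the sketch below. This is an illustration rather than the exact trace-model.py in this repo; it assumes the torch-neuronx flow used on Inf2 (on Inf1, torch.neuron.trace from the torch-neuron package plays the same role), and the sequence length of 128 is an arbitrary choice.

```python
import torch
import torch_neuronx  # Inf2 flow; on Inf1 use `import torch.neuron` and torch.neuron.trace
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()

# Example input with batch size 1, matching the compiled-bert-bs-1.pt artifact
inputs = tokenizer("Neuron compilation example", max_length=128,
                   padding="max_length", truncation=True, return_tensors="pt")
example = (inputs["input_ids"], inputs["attention_mask"])

# Compile for NeuronCores and save the traced graph as TorchScript
neuron_model = torch_neuronx.trace(model, example)
neuron_model.save("compiled-bert-bs-1.pt")
```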
To make things easier, we rely on the pre-built Neuron Deep Learning Docker images provided by AWS. Pulling these images requires temporary credentials; the fetch-credential.sh script contains the command to fetch them. Run the following commands in order to build the image and then run it as a container, which starts the compilation:
cd ./trace-model
./fetch-credential.sh
./build.sh
./run.sh
Once the models are compiled, the TorchScript model file (.pt) will land under the trace-models folder. For this example, the file name is hard-coded as compiled-bert-bs-1.pt in the config.properties file.
The fast-api folder provides all the necessary scripts to deploy models with FastAPI. To deploy the models without any changes, simply execute the deploy.sh script. This will build a FastAPI container image, run containers on the specified number of NeuronCores, and deploy the specified number of models in each FastAPI model server.
cd ./fast-api
./deploy.sh
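Each FastAPI model server follows a pattern similar to the minimal sketch below. This is an illustration under assumptions rather than the exact code in this repo: the /predict endpoint name, the query-parameter interface, and the local model path are hypothetical, and only a single model is loaded for brevity.

```python
import os

# Restrict this server process to a single NeuronCore before any model is loaded
os.environ.setdefault("NEURON_RT_NUM_CORES", "1")

import torch
import torch_neuronx  # registers Neuron ops so the traced model can load; torch-neuron on Inf1
from fastapi import FastAPI
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = torch.jit.load("compiled-bert-bs-1.pt")  # traced model from the previous step

@app.post("/predict")
def predict(text: str):
    inputs = tokenizer(text, max_length=128, padding="max_length",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs["input_ids"], inputs["attention_mask"])[0]
    return {"label": int(logits.argmax(dim=-1))}
```

Running one such server process per NeuronCore, each with its own NEURON_RT_* settings, is what keeps each server's models pinned to a single core.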
Once the containers are deployed, we use the run_apis.py script to call the APIs in parallel threads. The code is set up to call 6 deployed models, one on each NeuronCore, but can easily be changed to a different setting.
python3 run_apis.py
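The parallel-calling logic amounts to something like the sketch below (illustrative only; the host ports and endpoint path are assumptions that depend on how the containers were published):

```python
import concurrent.futures
import requests

# Assume one FastAPI container per NeuronCore, published on consecutive host ports
PORTS = [8081 + i for i in range(6)]
TEXT = "This is a sample sentence for load testing."

def call_api(port: int) -> float:
    resp = requests.post(f"http://localhost:{port}/predict", params={"text": TEXT})
    resp.raise_for_status()
    return resp.elapsed.total_seconds()

# One thread per endpoint so all NeuronCores receive requests concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=len(PORTS)) as pool:
    latencies = list(pool.map(call_api, PORTS))

print("per-endpoint latency (s):", latencies)
```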
Once the model servers are deployed, you can use neuron-top to monitor NeuronCore utilization in real time. neuron-top is a CLI tool in the Neuron SDK that provides information such as NeuronCore, vCPU, and memory utilization. In a separate terminal, enter the following command:
neuron-top
Your output should be similar to the following figure. In this scenario, we specified 2 NeuronCores and 2 models per server on an Inf2.xlarge instance. The screenshot below shows 2 models of 675.3 MB each loaded on 2 NeuronCores; with a total of 4 models loaded, the Device Memory Used is 1.3 GB. Use the arrow keys to move between the NeuronCores on different devices.
Similarly, this screenshot shows an Inf1 instance with 6 NeuronCores and 2 models per server, with 2.1 GB of device memory used.
Once you run the run_apis.py script, you can see the percentage utilization of each of the 2 NeuronCores, as shown below. You can also see the system vCPU usage and runtime vCPU usage.
The next screenshot shows the utilization on an Inf1 instance type with 6 NeuronCores.
To clean up all the Docker containers created in this work, we provide a cleanup.sh script that removes all running and stopped containers. Don't use this script if you wish to keep some containers running.
cd ./fast-api
./cleanup.sh
See CONTRIBUTING for more information. Prior to any production deployment, customers should work with their local security teams to evaluate any additional controls.
This library is licensed under the MIT-0 License. See the LICENSE file.