
aws-samples/aws-do-openfold-inference

Amazon EKS Architecture For OpenFold Inference

1. Overview

OpenFold, developed at Columbia University, is an open-source protein structure prediction model implemented in PyTorch. It is a faithful reproduction of the AlphaFold2 protein structure prediction model while delivering performance improvements over AlphaFold2. OpenFold contains a number of training- and inference-specific optimizations that exploit different memory-time tradeoffs depending on protein length. For training, OpenFold supports FlashAttention optimizations that accelerate the multiple sequence alignment (MSA) attention component. FlashAttention, combined with JIT compilation, also accelerates the inference pipeline, delivering twice the performance of AlphaFold2 on shorter protein sequences.

Columbia University has publicly released the model weights and training data, consisting of 400,000 multiple sequence alignments (MSAs) and PDB70 template hit files, under a permissive license. The model weights are available via scripts in the OpenFold GitHub repository, while the MSAs are hosted by the Registry of Open Data on AWS (RODA). Because it is implemented in Python and PyTorch, OpenFold has access to a large ecosystem of ML libraries and developers, which helps ensure its continued improvement and optimization.

In this repo, we show how to deploy OpenFold models on Amazon EKS and how to scale the EKS cluster to drastically reduce multiple sequence alignment (MSA) computation and protein structure inference times. We demonstrate the performance of this architecture by running alignment computation and inference on the popular open-source Cameo dataset. Running this workload end to end on all 92 proteins in the Cameo dataset takes a total of 8 hours, including downloading the required data, computing alignments, and running inference. Figure 1 shows a sample EKS architecture for inference with OpenFold.


Fig. 1 - Sample EKS infrastructure for OpenFold inference workload

2. Prerequisites

It is assumed that an EKS cluster exists and contains node groups of the desired target instance types. You can use the aws-do-eks repo to create the cluster; aws-do-eks also includes steps to create and mount an FSx for Lustre volume on an EKS cluster. Also update docker.properties with your ECR registry path, as sketched below.
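
As a rough illustration, docker.properties is a small file of variable assignments that the build and push scripts read. The variable names and values below are assumptions for illustration only; use the actual file in this repo as the source of truth and substitute your own account, region, and image name.

# docker.properties - illustrative values only
registry=123456789012.dkr.ecr.us-west-2.amazonaws.com/
image=openfold
tag=latest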

3. Download OpenFold Data

The download-openfold-data folder contains all the scripts necessary to download data from the S3 buckets s3://aws-batch-architecture-for-alphafold-public-artifacts/ and s3://pdbsnapshots/ into the FSx for Lustre file system. To download the data, cd into the download-openfold-data folder, update and run ./build.sh to build the Docker image, and do the same for ./push.sh. Once that is done, run kubectl apply -f fsx-data-prep-pod.yaml to kick off the data download jobs. Clone the OpenFold model files from https://huggingface.co/nz/OpenFold and download them into an S3 bucket, and from there into the FSx for Lustre file system using the same steps.

cd ./download-openfold-data
./build.sh
./push.sh
kubectl apply -f fsx-data-prep-pod.yaml
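
The data download can take a while, so it is useful to monitor progress with standard kubectl commands. The pod name below is an assumption based on the manifest file name; adjust it to match what kubectl get pods reports.

kubectl get pods
kubectl logs -f fsx-data-prep-pod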

4. Run OpenFold Inference

Once the data and model files are downloaded, the run-openfold-inference folder provides all the scripts necessary to run the [run_pretrained_openfold.py](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) script on EKS. Use the ./build.sh and ./push.sh scripts to build and push the Docker image to ECR. You can then start an inference pod by running kubectl apply -f run-openfold-inference.yaml.

cd ./run-openfold-inference
./build.sh
./push.sh
kubectl apply -f run-openfold-inference.yaml
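
To confirm that the inference pod was scheduled onto a GPU node and to follow its progress, you can use standard kubectl commands. The pod name below is an assumption based on the manifest file name; adjust it to match what kubectl get pods reports.

kubectl get pods -o wide
kubectl describe pod run-openfold-inference
kubectl logs -f run-openfold-inference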

5. Deploy OpenFold Models as APIs

The inference_config.properties file gives you a configuration script to specify OpenFold parameters and the hardware specification used to pack the OpenFold model into a container and deploy it. In addition to this config, the pack folder exposes the alignment computation and inference calls from [run_pretrained_openfold.py](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) as APIs using the FastAPI framework. Run the ./deploy.sh script to deploy the models on EKS; a sample API call is sketched after the commands below.

./build.sh
./push.sh
./deploy.sh
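
Once the deployment is up, you can exercise the APIs directly. The snippet below is only a sketch: the service name, port, endpoint path, and payload shape are assumptions, so check the FastAPI app in the pack folder and the Kubernetes service created by ./deploy.sh for the actual values.

# Forward a local port to the deployed service (service name and port are assumptions)
kubectl port-forward svc/openfold-api 8080:80 &
# Call a hypothetical inference endpoint with a FASTA payload
curl -X POST http://localhost:8080/predict \
    -H "Content-Type: application/json" \
    -d '{"fasta": ">example\nMSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKE"}'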

6. Alignment Computation

In this repo, we share an example where we run alignment computation on all 92 proteins in the Cameo dataset. The cameo folder contains all the scripts necessary for alignment computation and inference tests on the Cameo dataset. Follow these steps to set up the alignment computation jobs on EKS:

a. Build and push the Docker image in the cameo-fastas folder, then run kubectl apply -f temp-fasta.yaml to preprocess the Cameo data into individual FASTA files (one per protein sequence) on the FSx for Lustre file system.

cd ./cameo/cameo-fastas/
./build.sh
./push.sh
kubectl apply -f temp-fasta.yaml

b. Run run-grid.py, which uses run-cameo.yaml as a template to create 92 YAML files (one per protein sequence) and saves them in the cameo-yamls folder. A quick sanity check is shown after the commands below.

cd ./cameo/
./build.sh
./push.sh
python run-grid.py
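
Before submitting the jobs, you can verify that the expected 92 manifests were generated. This assumes the cameo-yamls folder was created inside the cameo folder, as in the step above.

ls ./cameo-yamls | wc -l    # should print 92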

c. Run kubectl apply -f cameo-yamls to kick off the 92 alignment computation pods. You can monitor their progress as shown after the commands below.

cd ./cameo/
kubectl apply -f cameo-yamls
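
With 92 pods running in parallel, it helps to watch their status and count how many have finished. These are plain kubectl commands; if the manifests label the pods, you can narrow the output with a label selector.

kubectl get pods --watch
kubectl get pods | grep -c Completed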

7. Inference with OpenFold APIs

The inference-tests folder contains all the scripts necessary to call OpenFold Model APIs and run inference.

cd ./cameo/inference-tests/
./build.sh
./push.sh
kubectl apply -f cameo-inference.yaml

Security

See CONTRIBUTING for more information. Prior to any production deployment, customers should work with their local security teams to evaluate any additional controls.

License

This library is licensed under the MIT-0 License. See the LICENSE file.