Merge branch 'deepmind:main' into main

dialvarezs · Jan 12, 2023 · 8bb51c5 · 8bb51c5
2 parents a0c15e5 + a394167
commit 8bb51c5
Show file tree

Hide file tree

Showing 11 changed files with 241 additions and 129 deletions.
diff --git a/README.md b/README.md
@@ -24,16 +24,15 @@
 # AlphaFold
 
 This package provides an implementation of the inference pipeline of AlphaFold
-v2.0. For simplicity, we refer to this model as AlphaFold throughout the rest
-of this document.
+v2. For simplicity, we refer to this model as AlphaFold throughout the rest of
+this document.
 
 We also provide:
 
 1.  An implementation of AlphaFold-Multimer. This represents a work in progress
     and AlphaFold-Multimer isn't expected to be as stable as our monomer
-    AlphaFold system.
-    [Read the guide](#updating-existing-installation)
-    for how to upgrade and update code.
+    AlphaFold system. [Read the guide](#updating-existing-installation) for how
+    to upgrade and update code.
 2.  The [technical note](docs/technical_note_v2.3.0.md) containing the models
     and inference procedure for an updated AlphaFold v2.3.0.
 3.  A [CASP15 baseline](docs/casp15_predictions.zip) set of predictions along
@@ -58,21 +57,51 @@ If you have any questions, please contact the AlphaFold team at
 
 ![CASP14 predictions](imgs/casp14_predictions.gif)
 
-## First time setup
+## Installation and running your first prediction
 
 You will need a machine running Linux, AlphaFold does not support other
-operating systems.
+operating systems. Full installation requires up to 3 TB of disk space to keep
+genetic databases (SSD storage is recommended) and a modern NVIDIA GPU (GPUs
+with more memory can predict larger protein structures).
 
-The following steps are required in order to run AlphaFold:
+Please follow these steps:
 
 1.  Install [Docker](https://www.docker.com/).
     *   Install
         [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
         for GPU support.
     *   Setup running
         [Docker as a non-root user](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user).
-1.  Download genetic databases (see below).
-1.  Download model parameters (see below).
+
+1.  Clone this repository and `cd` into it.
+
+    ```bash
+    git clone https://github.com/deepmind/alphafold.git
+    cd ./alphafold
+    ```
+
+1.  Download genetic databases and model parameters:
+
+    *   Install `aria2c` (on most Linux distributions it is available via the
+    package manager).
+
+    *   Please use the script `scripts/download_all_data.sh` to download
+    and set up full databases. This may take substantial time (download size is
+    556 GB), so we recommend running this script in the background:
+
+    ```bash
+    scripts/download_all_data.sh <DOWNLOAD_DIR> > download.log 2> download_all.log &
+    ```
+
+    *   **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
+    subdirectory in the AlphaFold repository directory.** If it is, the Docker
+    build will be slow as the large databases will be copied into the docker
+    build context.
+
+    *   It is possible to run AlphaFold with reduced databases; please refer to
+    the [complete documentation](#genetic-databases).
+
+
 1.  Check that AlphaFold will be able to use a GPU by running:
 
     ```bash
@@ -85,10 +114,58 @@ The following steps are required in order to run AlphaFold:
     or take a look at the following
     [NVIDIA Docker issue](https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573).
 
-If you wish to run AlphaFold using Singularity (a common containerization
-platform on HPC systems) we recommend using some of the third party Singularity
-setups as linked in https://github.com/deepmind/alphafold/issues/10 or
-https://github.com/deepmind/alphafold/issues/24.
+    If you wish to run AlphaFold using Singularity (a common containerization
+    platform on HPC systems) we recommend using some of the third party Singularity
+    setups as linked in https://github.com/deepmind/alphafold/issues/10 or
+    https://github.com/deepmind/alphafold/issues/24.
+
+1.  Build the Docker image:
+
+    ```bash
+    docker build -f docker/Dockerfile -t alphafold .
+    ```
+
+    If you encounter the following error:
+
+    ```
+    W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
+    E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
+    ```
+
+    use the workaround described in
+    https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
+
+1.  Install the `run_docker.py` dependencies. Note: You may optionally wish to
+    create a
+    [Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
+    to prevent conflicts with your system's Python environment.
+
+    ```bash
+    pip3 install -r docker/requirements.txt
+    ```
+
+1.  Make sure that the output directory exists (the default is `/tmp/alphafold`)
+    and that you have sufficient permissions to write into it.
+
+1.  Run `run_docker.py` pointing to a FASTA file containing the protein
+    sequence(s) for which you wish to predict the structure (`--fasta_paths`
+    parameter). AlphaFold will search for the available templates before the
+    date specified by the `--max_template_date` parameter; this could be used to
+    avoid certain templates during modeling. `--data_dir` is the directory with
+    downloaded genetic databases and `--output_dir` is the absolute path to the
+    output directory.
+
+    ```bash
+    python3 docker/run_docker.py \
+      --fasta_paths=your_protein.fasta \
+      --max_template_date=2022-01-01 \
+      --data_dir=$DOWNLOAD_DIR \
+      --output_dir=/home/user/absolute_path_to_the_output_dir
+    ```
+
+1.  Once the run is over, the output directory shall contain predicted
+    structures of the target protein. Please check the documentation below for
+    additional options and troubleshooting tips.
 
 ### Genetic databases
 
@@ -108,32 +185,34 @@ AlphaFold needs multiple genetic (sequence) databases to run:
 We provide a script `scripts/download_all_data.sh` that can be used to download
 and set up all of these databases:
 
-*   Default:
+*   Recommended default:
 
     ```bash
     scripts/download_all_data.sh <DOWNLOAD_DIR>
     ```
 
     will download the full databases.
 
-*   With `reduced_dbs`:
+*   With `reduced_dbs` parameter:
 
     ```bash
     scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
     ```
 
     will download a reduced version of the databases to be used with the
-    `reduced_dbs` database preset.
+    `reduced_dbs` database preset. This shall be used with the corresponding
+    AlphaFold parameter `--db_preset=reduced_dbs` later during the AlphaFold run
+    (please see [AlphaFold parameters](#running-alphafold) section).
 
 :ledger: **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
 subdirectory in the AlphaFold repository directory.** If it is, the Docker build
 will be slow as the large databases will be copied during the image creation.
 
 We don't provide exactly the database versions used in CASP14 – see the
-[note on reproducibility](#note-on-casp14-reproducibility). Some of the databases are
-mirrored for speed, see [mirrored databases](#mirrored-databases).
+[note on reproducibility](#note-on-casp14-reproducibility). Some of the
+databases are mirrored for speed, see [mirrored databases](#mirrored-databases).
 
-:ledger: **Note: The total download size for the full databases is around 415 GB
+:ledger: **Note: The total download size for the full databases is around 556 GB
 and the total size when unzipped is 2.62 TB. Please make sure you have a large
 enough hard drive space, bandwidth and time to download. We recommend using an
 SSD for better genetic search performance.**
@@ -252,58 +331,11 @@ To use the deprecated v2.1.0 AlphaFold-Multimer model weights:
 **The simplest way to run AlphaFold is using the provided Docker script.** This
 was tested on Google Cloud with a machine using the `nvidia-gpu-cloud-image`
 with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
-3 TB disk, and an A100 GPU.
-
-1.  Clone this repository and `cd` into it.
+3 TB disk, and an A100 GPU. For your first run, please follow the instructions
+from [Installation and running your first prediction](#installation-and-running-your-first-prediction)
+section.
 
-    ```bash
-    git clone https://github.com/deepmind/alphafold.git
-    ```
-
-1.  Build the Docker image:
-
-    ```bash
-    docker build -f docker/Dockerfile -t alphafold .
-    ```
-
-    If you encounter the following error:
-
-    ```
-    W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
-    E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
-    ```
-
-    use the workaround described in
-    https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
-
-1.  Install the `run_docker.py` dependencies. Note: You may optionally wish to
-    create a
-    [Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
-    to prevent conflicts with your system's Python environment.
-
-    ```bash
-    pip3 install -r docker/requirements.txt
-    ```
-
-1.  Make sure that the output directory exists (the default is `/tmp/alphafold`)
-    and that you have sufficient permissions to write into it.
-
-1.  Run `run_docker.py` pointing to a FASTA file containing the protein
-    sequence(s) for which you wish to predict the structure. If you are
-    predicting the structure of a protein that is already in PDB and you wish to
-    avoid using it as a template, then `max_template_date` must be set to be
-    before the release date of the structure. You must also provide the path to
-    the directory containing the downloaded databases. For example, for the
-    T1050 CASP14 target:
-
-    ```bash
-    python3 docker/run_docker.py \
-      --fasta_paths=T1050.fasta \
-      --max_template_date=2020-05-14 \
-      --data_dir=$DOWNLOAD_DIR
-    ```
-
-    By default, Alphafold will attempt to use all visible GPU devices. To use a
+1.  By default, Alphafold will attempt to use all visible GPU devices. To use a
     subset, specify a comma-separated list of GPU UUID(s) or index(es) using the
     `--gpu_devices` flag. See
     [GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
@@ -347,9 +379,24 @@ with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
       --max_template_date=2020-05-14 \
       --model_preset=monomer \
       --db_preset=reduced_dbs \
-      --data_dir=$DOWNLOAD_DIR
+      --data_dir=$DOWNLOAD_DIR \
+      --output_dir=/home/user/absolute_path_to_the_output_dir
     ```
 
+1.  After generating the predicted model, by default AlphaFold runs a relaxation
+    step to improve geometrical quality. You can control this via `--run_relax=true`
+    (default) or `--run_relax=false`.
+
+1.  The relaxation step can be run on GPU (faster, but could be less stable) or
+    CPU (slow, but stable). This can be controlled with `--enable_gpu_relax=true`
+    (default) or `--enable_gpu_relax=false`.
+
+1.  AlphaFold can re-use MSAs (multiple sequence alignments) for the same
+    sequence via `--use_precomputed_msas=true` option; this can be useful for
+    trying different AlphaFold parameters. This option assumes that the
+    directory structure generated by the first AlphaFold run in the output
+    directory exists and that the protein sequence is the same.
+
 ### Running AlphaFold-Multimer
 
 All steps are the same as when running the monomer system, but you will have to
@@ -364,14 +411,47 @@ python3 docker/run_docker.py \
   --fasta_paths=multimer.fasta \
   --max_template_date=2020-05-14 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 By default the multimer system will run 5 seeds per model (25 total predictions)
 for a small drop in accuracy you may wish to run a single seed per model. This
 can be done via the `--num_multimer_predictions_per_model` flag, e.g. set it to
 `--num_multimer_predictions_per_model=1` to run a single seed per model.
 
+### AlphaFold prediction speed
+
+The table below reports prediction runtimes for proteins of various lengths. We
+only measure unrelaxed structure prediction with three recycles while
+excluding runtimes from MSA and template search. When running
+`docker/run_docker.py` with `--benchmark=true`, this runtime is stored in
+`timings.json`. All runtimes are from a single A100 NVIDIA GPU. Prediction
+speed on A100 for smaller structures can be improved by increasing
+`global_config.subbatch_size` in `alphafold/model/config.py`.
+
+No. residues | Prediction time (s)
+-----------: | ------------------:
+100          | 4.9
+200          | 7.7
+300          | 13
+400          | 18
+500          | 29
+600          | 36
+700          | 53
+800          | 60
+900          | 91
+1,000        | 96
+1,100        | 140
+1,500        | 280
+2,000        | 450
+2,500        | 969
+3,000        | 1,240
+3,500        | 2,465
+4,000        | 5,660
+4,500        | 12,475
+5,000        | 18,824
+
 ### Examples
 
 Below are examples on how to use AlphaFold in different scenarios.
@@ -392,7 +472,8 @@ python3 docker/run_docker.py \
   --fasta_paths=monomer.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=monomer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding a homomer
@@ -416,7 +497,8 @@ python3 docker/run_docker.py \
   --fasta_paths=homomer.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding a heteromer
@@ -444,7 +526,8 @@ python3 docker/run_docker.py \
   --fasta_paths=heteromer.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding multiple monomers one after another
@@ -458,7 +541,8 @@ python3 docker/run_docker.py \
   --fasta_paths=monomer1.fasta,monomer2.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=monomer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding multiple multimers one after another
@@ -472,7 +556,8 @@ python3 docker/run_docker.py \
   --fasta_paths=multimer1.fasta,multimer2.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 ### AlphaFold output
@@ -746,3 +831,4 @@ reference to the following:
     restrictions and made fully and freely available for both non-commercial and
     commercial use under
     [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
+
diff --git a/alphafold/common/residue_constants.py b/alphafold/common/residue_constants.py
@@ -120,7 +120,7 @@
 # 4,5,6,7: 'chi1,2,3,4-group'
 # The atom positions are relative to the axis-end-atom of the corresponding
 # rotation axis. The x-axis is in direction of the rotation axis, and the y-axis
-# is defined such that the dihedral-angle-definiting atom (the last entry in
+# is defined such that the dihedral-angle-defining atom (the last entry in
 # chi_angles_atoms above) is in the xy-plane (with a positive y-coordinate).
 # format: [atomname, group_idx, rel_position]
 rigid_group_atom_positions = {
@@ -772,10 +772,10 @@ def _make_rigid_transformation_4x4(ex, ey, translation):
 # and an array with (restype, atomtype, coord) for the atom positions
 # and compute affine transformation matrices (4,4) from one rigid group to the
 # previous group
-restype_atom37_to_rigid_group = np.zeros([21, 37], dtype=np.int)
+restype_atom37_to_rigid_group = np.zeros([21, 37], dtype=int)
 restype_atom37_mask = np.zeros([21, 37], dtype=np.float32)
 restype_atom37_rigid_group_positions = np.zeros([21, 37, 3], dtype=np.float32)
-restype_atom14_to_rigid_group = np.zeros([21, 14], dtype=np.int)
+restype_atom14_to_rigid_group = np.zeros([21, 14], dtype=int)
 restype_atom14_mask = np.zeros([21, 14], dtype=np.float32)
 restype_atom14_rigid_group_positions = np.zeros([21, 14, 3], dtype=np.float32)
 restype_rigid_group_default_frame = np.zeros([21, 8, 4, 4], dtype=np.float32)