
Commit

Merge remote-tracking branch 'upstream/main'
dialvarezs committed Dec 13, 2022
2 parents 13e9d2a + 569eb4f commit d210535
Showing 32 changed files with 935 additions and 525 deletions.
245 changes: 131 additions & 114 deletions README.md

Large diffs are not rendered by default.

49 changes: 31 additions & 18 deletions afdb/README.md
@@ -11,6 +11,11 @@
Google Cloud account is required for the download, but the data can be freely
used under the terms of the
[CC-BY 4.0 Licence](http://creativecommons.org/licenses/by/4.0/legalcode).

This document provides an overview of how to access and download the dataset for
different use cases. Please refer to the [AlphaFold database FAQ](https://www.alphafold.com/faq)
for further information on what proteins are in the database and a changelog of
releases.

:ledger: **Note: The full dataset is difficult to manipulate without significant
computational resources (the size of the dataset is 23 TiB, 3 * 214M files).**

@@ -62,25 +67,26 @@ accession]-F[a fragment number]`.

Three files are provided for each entry:

* **model_v3.cif** – contains the atomic coordinates for the predicted protein
* **model_v4.cif** – contains the atomic coordinates for the predicted protein
structure, along with some metadata. Useful references for this file format
are the [ModelCIF](https://github.com/ihmwg/ModelCIF) and
[PDBx/mmCIF](https://mmcif.wwpdb.org) project sites.
* **confidence_v3.json** – contains a confidence metric output by AlphaFold
* **confidence_v4.json** – contains a confidence metric output by AlphaFold
called pLDDT. This provides a number for each residue, indicating how
confident AlphaFold is in the *local* surrounding structure. pLDDT ranges
from 0 to 100, where 100 is most confident. This is also contained in the
CIF file.
* **predicted_aligned_error_v3.json** – contains a confidence metric output by
* **predicted_aligned_error_v4.json** – contains a confidence metric output by
AlphaFold called PAE. This provides a number for every pair of residues,
which is lower when AlphaFold is more confident in the relative position of
the two residues. PAE is more suitable than pLDDT for judging confidence in
relative domain placements.
[See here](https://alphafold.ebi.ac.uk/faq#faq-7) for a description of the
format.
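For a quick sanity check after downloading, the confidence file can be inspected with a few lines of Python. This is a minimal sketch, not official tooling, and the `confidenceScore` field name is an assumption based on the format description linked above; verify it against a real downloaded file.

```python
import json


def mean_plddt(confidence_json_text):
    """Mean per-residue pLDDT from a confidence_v4.json payload.

    Assumes the payload carries one pLDDT value per residue under
    'confidenceScore' -- check a downloaded file before relying on this.
    """
    data = json.loads(confidence_json_text)
    scores = data["confidenceScore"]
    return sum(scores) / len(scores)


# Tiny fabricated payload (not real AFDB data):
example = json.dumps({"confidenceScore": [90.0, 70.0, 50.0]})
print(mean_plddt(example))  # 70.0
```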

Predictions grouped by NCBI taxonomy ID are available as
`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID].tar` within the same bucket.
`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID]_v4.tar` within the same
bucket.

There are also two extra files stored in the bucket:

@@ -91,7 +97,7 @@ There are also two extra files stored in the bucket:
* First residue index (UniProt numbering), e.g. 1
* Last residue index (UniProt numbering), e.g. 199
* AlphaFold DB identifier, e.g. AF-A8H2R3-F1
* Latest version, e.g. 3
* Latest version, e.g. 4
* `sequences.fasta` – This file contains sequences for all proteins in the
current database version in FASTA format. The identifier rows start with
">AFDB", followed by the AlphaFold DB identifier and the name of the
@@ -141,7 +147,7 @@ for the services that you use to avoid any surprises.**
The data is available from:

* GCS data bucket:
[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
[gs://public-datasets-deepmind-alphafold-v4](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4)

## Bulk download

@@ -158,12 +164,12 @@ are some suggested approaches for downloading the dataset. Please reach out to
questions.

The recommended way of downloading the whole database is by downloading
1,015,808 sharded proteome tar files using the command below. This is
1,015,797 sharded proteome tar files using the command below. This is
significantly faster than downloading all of the individual files because of
large constant per-file latency.

```bash
gsutil -m cp -r gs://public-datasets-deepmind-alphafold/proteomes/ .
gsutil -m cp -r gs://public-datasets-deepmind-alphafold-v4/proteomes/ .
```

You will then have to un-tar all of the proteomes and un-gzip all of the
@@ -208,8 +214,8 @@ Swiss-Prot are available on the
want other species, or *all* proteins for a particular species, please continue
reading.

We provide 1,015,808 sharded tar files for all species in
[gs://public-datasets-deepmind-alphafold/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold/proteomes/).
We provide 1,015,797 sharded tar files for all species in
[gs://public-datasets-deepmind-alphafold-v4/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/proteomes/).
We shard each proteome so that each shard contains at most 10,000 proteins
(which corresponds to 30,000 files per shard, since there are 3 files per
protein). To download a proteome of your choice, you have to do the following
@@ -218,14 +224,14 @@ steps:
1. Find the [NCBI taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy)
(`[TAX_ID]`) of the species in question.
2. Run `gsutil -m cp
gs://public-datasets-deepmind-alphafold/proteomes/proteome-tax_id-[TAX
ID]-*.tar .` to download all shards for this proteome.
gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-[TAX
ID]-*_v4.tar .` to download all shards for this proteome.
3. Un-tar all of the downloaded files and un-gzip all of the individual files.
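The un-tar and un-gzip step can be scripted with the standard library alone. This is a minimal sketch, not official tooling: the directory names are placeholders, and it assumes the shards were already downloaded with `gsutil` as above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_shards(shard_dir, out_dir):
    """Un-tar every downloaded proteome shard, then un-gzip each member."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Unpack each proteome-tax_id-...tar shard into out_dir.
    for tar_path in sorted(Path(shard_dir).glob("*.tar")):
        with tarfile.open(tar_path) as tar:
            tar.extractall(out)
    # Decompress each member, e.g. AF-...-model_v4.cif.gz -> AF-...-model_v4.cif.
    for gz_path in out.glob("*.gz"):
        with gzip.open(gz_path, "rb") as src, \
                open(gz_path.with_suffix(""), "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # drop the compressed copy
```

Calling `extract_shards("proteomes/", "extracted/")` would then leave one uncompressed file per entry under `extracted/`.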

### File manifests

Pre-made lists of files (manifests) are available at
[gs://public-datasets-deepmind-alphafold/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold/manifests/).
[gs://public-datasets-deepmind-alphafold-v4/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/manifests/).
Note that these filenames do not include the bucket prefix, but this can be
added once the files have been downloaded to your filesystem.
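Prepending the bucket prefix is a one-line transformation per manifest entry; a sketch (the filename shown is the example identifier used elsewhere in this README, not real manifest contents):

```python
BUCKET_PREFIX = "gs://public-datasets-deepmind-alphafold-v4/"


def add_bucket_prefix(manifest_lines):
    """Turn bare manifest filenames into full cloud paths."""
    return [BUCKET_PREFIX + line.strip()
            for line in manifest_lines
            if line.strip()]


print(add_bucket_prefix(["AF-A8H2R3-F1-model_v4.cif\n"]))
```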

@@ -299,6 +305,8 @@ fractionPlddtVeryLow | `FLOAT64` | Fraction of the residues in the predi
gene | `STRING` | The name of the gene if known, e.g. "COII"
geneSynonyms | `ARRAY<STRING>` | Additional synonyms for the gene
globalMetricValue | `FLOAT64` | The mean pLDDT of this prediction
isReferenceProteome | `BOOL` | Is this protein part of the reference proteome?
isReviewed | `BOOL` | Has this protein been reviewed, i.e. is it part of SwissProt?
latestVersion | `INT64` | The latest AFDB version for this prediction
modelCreatedDate | `DATE` | The date of creation for this entry, e.g. "2022-06-01"
organismCommonNames | `ARRAY<STRING>` | List of common organism names
@@ -345,15 +353,15 @@ given below:
```sql
with file_rows AS (
with file_cols AS (
SELECT
CONCAT(entryID, '-model_v', latestVersion, '.cif') as m,
CONCAT(entryID, '-predicted_aligned_error_v', latestVersion, '.json') as p
CONCAT(entryID, '-model_v4.cif') as m,
CONCAT(entryID, '-predicted_aligned_error_v4.json') as p
FROM bigquery-public-data.deepmind_alphafold.metadata
WHERE organismScientificName = "Homo sapiens"
AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
)
SELECT * FROM file_cols UNPIVOT (files for filetype in (m, p))
)
SELECT CONCAT('gs://public-datasets-deepmind-alphafold/', files) as files
SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', files) as files
from file_rows
```

@@ -362,7 +370,7 @@ sapiens* for which over half the residues are confident or better (>70 pLDDT).

This creates a table with one column "files", where each row is the cloud
location of one of the two file types that has been provided for each protein.
There is an additional `confidence_v[version].json` file which contains the
There is an additional `confidence_v4.json` file which contains the
per-residue pLDDT. This information is already in the CIF file but may be
preferred if only this information is required.
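The same path construction that the query's `CONCAT` performs can be mirrored locally once entry IDs are known. A sketch (the entry ID is the example from the accession list above; nothing here queries BigQuery):

```python
BUCKET = "gs://public-datasets-deepmind-alphafold-v4"


def entry_files(entry_id, version=4):
    """Cloud paths of the three files provided for one prediction."""
    return [
        f"{BUCKET}/{entry_id}-model_v{version}.cif",
        f"{BUCKET}/{entry_id}-confidence_v{version}.json",
        f"{BUCKET}/{entry_id}-predicted_aligned_error_v{version}.json",
    ]


for path in entry_files("AF-A8H2R3-F1"):
    print(path)
```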

Expand All @@ -375,3 +383,8 @@ documentation should be followed to download these file subsets locally, as the
most appropriate approach will depend on the filesize. Note that it may be
easier to download large files using [Colab](https://colab.research.google.com/)
(e.g. pandas to_csv).

#### Previous versions
Previous versions of AFDB will remain available at
[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
to enable reproducible research. We recommend using the latest version (v4).
14 changes: 7 additions & 7 deletions alphafold/data/pipeline.py
@@ -117,7 +117,7 @@ def __init__(self,
               uniref90_database_path: str,
               mgnify_database_path: str,
               bfd_database_path: Optional[str],
               uniclust30_database_path: Optional[str],
               uniref30_database_path: Optional[str],
               small_bfd_database_path: Optional[str],
               template_searcher: TemplateSearcher,
               template_featurizer: templates.TemplateHitFeaturizer,
@@ -135,9 +135,9 @@ def __init__(self,
          binary_path=jackhmmer_binary_path,
          database_path=small_bfd_database_path)
    else:
      self.hhblits_bfd_uniclust_runner = hhblits.HHBlits(
      self.hhblits_bfd_uniref_runner = hhblits.HHBlits(
          binary_path=hhblits_binary_path,
          databases=[bfd_database_path, uniclust30_database_path])
          databases=[bfd_database_path, uniref30_database_path])
      self.jackhmmer_mgnify_runner = jackhmmer.Jackhmmer(
          binary_path=jackhmmer_binary_path,
          database_path=mgnify_database_path)
@@ -211,14 +214,14 @@ def process(self, input_fasta_path: str, msa_output_dir: str) -> FeatureDict:
          use_precomputed_msas=self.use_precomputed_msas)
      bfd_msa = parsers.parse_stockholm(jackhmmer_small_bfd_result['sto'])
    else:
      bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniclust_hits.a3m')
      hhblits_bfd_uniclust_result = run_msa_tool(
          msa_runner=self.hhblits_bfd_uniclust_runner,
      bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniref_hits.a3m')
      hhblits_bfd_uniref_result = run_msa_tool(
          msa_runner=self.hhblits_bfd_uniref_runner,
          input_fasta_path=input_fasta_path,
          msa_out_path=bfd_out_path,
          msa_format='a3m',
          use_precomputed_msas=self.use_precomputed_msas)
      bfd_msa = parsers.parse_a3m(hhblits_bfd_uniclust_result['a3m'])
      bfd_msa = parsers.parse_a3m(hhblits_bfd_uniref_result['a3m'])

    templates_result = self.template_featurizer.get_templates(
        query_sequence=input_sequence,
61 changes: 61 additions & 0 deletions alphafold/model/common_modules.py
@@ -128,3 +128,64 @@ def __call__(self, inputs):

    return output


class LayerNorm(hk.LayerNorm):
  """LayerNorm module.

  Equivalent to hk.LayerNorm but with different parameter shapes: they are
  always vectors rather than possibly higher-rank tensors. This makes it easier
  to change the layout whilst keeping the model weight-compatible.
  """

  def __init__(self,
               axis,
               create_scale: bool,
               create_offset: bool,
               eps: float = 1e-5,
               scale_init=None,
               offset_init=None,
               use_fast_variance: bool = False,
               name=None,
               param_axis=None):
    super().__init__(
        axis=axis,
        create_scale=False,
        create_offset=False,
        eps=eps,
        scale_init=None,
        offset_init=None,
        use_fast_variance=use_fast_variance,
        name=name,
        param_axis=param_axis)
    self._temp_create_scale = create_scale
    self._temp_create_offset = create_offset

  def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
    is_bf16 = (x.dtype == jnp.bfloat16)
    if is_bf16:
      x = x.astype(jnp.float32)

    param_axis = self.param_axis[0] if self.param_axis else -1
    param_shape = (x.shape[param_axis],)

    param_broadcast_shape = [1] * x.ndim
    param_broadcast_shape[param_axis] = x.shape[param_axis]
    scale = None
    offset = None
    if self._temp_create_scale:
      scale = hk.get_parameter(
          'scale', param_shape, x.dtype, init=self.scale_init)
      scale = scale.reshape(param_broadcast_shape)

    if self._temp_create_offset:
      offset = hk.get_parameter(
          'offset', param_shape, x.dtype, init=self.offset_init)
      offset = offset.reshape(param_broadcast_shape)

    out = super().__call__(x, scale=scale, offset=offset)

    if is_bf16:
      out = out.astype(jnp.bfloat16)

    return out
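Outside Haiku, the arithmetic this module performs reduces to the standard layer-norm formula. A plain-NumPy sketch of the same computation, for reference only (not the module itself, and without the bfloat16 round-trip; the test inputs are arbitrary):

```python
import numpy as np


def layer_norm(x, scale, offset, axis=-1, eps=1e-5):
    """Normalise x over `axis`, then apply a per-feature vector scale and
    offset, mirroring the class above with create_scale=create_offset=True."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * scale + offset


x = np.array([[1.0, 2.0, 3.0]])
out = layer_norm(x, scale=np.ones(3), offset=np.zeros(3))
# The output has zero mean and (roughly) unit variance along the last axis.
```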
