
Commit

Merge remote-tracking branch 'upstream/main'
dialvarezs committed Dec 13, 2022
2 parents 13e9d2a + 569eb4f commit d210535
Showing 32 changed files with 935 additions and 525 deletions.
245 changes: 131 additions & 114 deletions README.md

Large diffs are not rendered by default.

49 changes: 31 additions & 18 deletions afdb/README.md
@@ -11,6 +11,11 @@
Google Cloud account is required for the download, but the data can be freely
used under the terms of the
[CC-BY 4.0 Licence](http://creativecommons.org/licenses/by/4.0/legalcode).

This document provides an overview of how to access and download the dataset for
different use cases. Please refer to the [AlphaFold database FAQ](https://www.alphafold.com/faq)
for further information on what proteins are in the database and a changelog of
releases.

:ledger: **Note: The full dataset is difficult to manipulate without significant
computational resources (the size of the dataset is 23 TiB, 3 * 214M files).**

@@ -62,25 +67,26 @@ accession]-F[a fragment number]`.

Three files are provided for each entry:

* **model_v3.cif** – contains the atomic coordinates for the predicted protein
* **model_v4.cif** – contains the atomic coordinates for the predicted protein
structure, along with some metadata. Useful references for this file format
are the [ModelCIF](https://github.com/ihmwg/ModelCIF) and
[PDBx/mmCIF](https://mmcif.wwpdb.org) project sites.
* **confidence_v3.json** – contains a confidence metric output by AlphaFold
* **confidence_v4.json** – contains a confidence metric output by AlphaFold
called pLDDT. This provides a number for each residue, indicating how
confident AlphaFold is in the *local* surrounding structure. pLDDT ranges
from 0 to 100, where 100 is most confident. This is also contained in the
CIF file.
* **predicted_aligned_error_v3.json** – contains a confidence metric output by
* **predicted_aligned_error_v4.json** – contains a confidence metric output by
AlphaFold called PAE. This provides a number for every pair of residues,
which is lower when AlphaFold is more confident in the relative position of
the two residues. PAE is more suitable than pLDDT for judging confidence in
relative domain placements.
[See here](https://alphafold.ebi.ac.uk/faq#faq-7) for a description of the
format.
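For a quick sanity check after downloading, the confidence file can be inspected with a few lines of Python. This is a minimal sketch, not official tooling, and the `confidenceScore` field name is an assumption based on the format description linked above; verify it against a real downloaded file.

```python
import json


def mean_plddt(confidence_json_text):
    """Mean per-residue pLDDT from a confidence_v4.json payload.

    Assumes the payload carries one pLDDT value per residue under
    'confidenceScore' -- check a downloaded file before relying on this.
    """
    data = json.loads(confidence_json_text)
    scores = data["confidenceScore"]
    return sum(scores) / len(scores)


# Tiny fabricated payload (not real AFDB data):
example = json.dumps({"confidenceScore": [90.0, 70.0, 50.0]})
print(mean_plddt(example))  # 70.0
```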

Predictions grouped by NCBI taxonomy ID are available as
`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID].tar` within the same bucket.
`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID]_v4.tar` within the same
bucket.

There are also two extra files stored in the bucket:

@@ -91,7 +97,7 @@ There are also two extra files stored in the bucket:
* First residue index (UniProt numbering), e.g. 1
* Last residue index (UniProt numbering), e.g. 199
* AlphaFold DB identifier, e.g. AF-A8H2R3-F1
* Latest version, e.g. 3
* Latest version, e.g. 4
* `sequences.fasta` – This file contains sequences for all proteins in the
current database version in FASTA format. The identifier rows start with
">AFDB", followed by the AlphaFold DB identifier and the name of the
@@ -141,7 +147,7 @@ for the services that you use to avoid any surprises.**
The data is available from:

* GCS data bucket:
[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
[gs://public-datasets-deepmind-alphafold-v4](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4)

## Bulk download

@@ -158,12 +164,12 @@ are some suggested approaches for downloading the dataset. Please reach out to
questions.

The recommended way of downloading the whole database is by downloading
1,015,808 sharded proteome tar files using the command below. This is
1,015,797 sharded proteome tar files using the command below. This is
significantly faster than downloading all of the individual files because of
large constant per-file latency.

```bash
gsutil -m cp -r gs://public-datasets-deepmind-alphafold/proteomes/ .
gsutil -m cp -r gs://public-datasets-deepmind-alphafold-v4/proteomes/ .
```

You will then have to un-tar all of the proteomes and un-gzip all of the
@@ -208,8 +214,8 @@ Swiss-Prot are available on the
want other species, or *all* proteins for a particular species, please continue
reading.

We provide 1,015,808 sharded tar files for all species in
[gs://public-datasets-deepmind-alphafold/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold/proteomes/).
We provide 1,015,797 sharded tar files for all species in
[gs://public-datasets-deepmind-alphafold-v4/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/proteomes/).
We shard each proteome so that each shard contains at most 10,000 proteins
(which corresponds to 30,000 files per shard, since there are 3 files per
protein). To download a proteome of your choice, you have to do the following
@@ -218,14 +224,14 @@ steps:
1. Find the [NCBI taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy)
(`[TAX_ID]`) of the species in question.
2. Run `gsutil -m cp
gs://public-datasets-deepmind-alphafold/proteomes/proteome-tax_id-[TAX
ID]-*.tar .` to download all shards for this proteome.
gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-[TAX
ID]-*_v4.tar .` to download all shards for this proteome.
3. Un-tar all of the downloaded files and un-gzip all of the individual files.
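The un-tar and un-gzip step can be scripted with the standard library alone. This is a minimal sketch, not official tooling: the directory names are placeholders, and it assumes the shards were already downloaded with `gsutil` as above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_shards(shard_dir, out_dir):
    """Un-tar every downloaded proteome shard, then un-gzip each member."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Unpack each proteome-tax_id-...tar shard into out_dir.
    for tar_path in sorted(Path(shard_dir).glob("*.tar")):
        with tarfile.open(tar_path) as tar:
            tar.extractall(out)
    # Decompress each member, e.g. AF-...-model_v4.cif.gz -> AF-...-model_v4.cif.
    for gz_path in out.glob("*.gz"):
        with gzip.open(gz_path, "rb") as src, \
                open(gz_path.with_suffix(""), "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # drop the compressed copy
```

Calling `extract_shards("proteomes/", "extracted/")` would then leave one uncompressed file per entry under `extracted/`.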

### File manifests

Pre-made lists of files (manifests) are available at
[gs://public-datasets-deepmind-alphafold/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold/manifests/).
[gs://public-datasets-deepmind-alphafold-v4/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/manifests/).
Note that these filenames do not include the bucket prefix, but this can be
added once the files have been downloaded to your filesystem.
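Prepending the bucket prefix is a one-line transformation per manifest entry; a sketch (the filename shown is the example identifier used elsewhere in this README, not real manifest contents):

```python
BUCKET_PREFIX = "gs://public-datasets-deepmind-alphafold-v4/"


def add_bucket_prefix(manifest_lines):
    """Turn bare manifest filenames into full cloud paths."""
    return [BUCKET_PREFIX + line.strip()
            for line in manifest_lines
            if line.strip()]


print(add_bucket_prefix(["AF-A8H2R3-F1-model_v4.cif\n"]))
```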

@@ -299,6 +305,8 @@ fractionPlddtVeryLow | `FLOAT64` | Fraction of the residues in the predi
gene | `STRING` | The name of the gene if known, e.g. "COII"
geneSynonyms | `ARRAY<STRING>` | Additional synonyms for the gene
globalMetricValue | `FLOAT64` | The mean pLDDT of this prediction
isReferenceProteome | `BOOL` | Is this protein part of the reference proteome?
isReviewed | `BOOL` | Has this protein been reviewed, i.e. is it part of SwissProt?
latestVersion | `INT64` | The latest AFDB version for this prediction
modelCreatedDate | `DATE` | The date of creation for this entry, e.g. "2022-06-01"
organismCommonNames | `ARRAY<STRING>` | List of common organism names
@@ -345,15 +353,15 @@ given below:
```sql
with file_rows AS (
with file_cols AS (
SELECT
CONCAT(entryID, '-model_v', latestVersion, '.cif') as m,
CONCAT(entryID, '-predicted_aligned_error_v', latestVersion, '.json') as p
CONCAT(entryID, '-model_v4.cif') as m,
CONCAT(entryID, '-predicted_aligned_error_v4.json') as p
FROM bigquery-public-data.deepmind_alphafold.metadata
WHERE organismScientificName = "Homo sapiens"
AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
)
SELECT * FROM file_cols UNPIVOT (files for filetype in (m, p))
)
SELECT CONCAT('gs://public-datasets-deepmind-alphafold/', files) as files
SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', files) as files
from file_rows
```

@@ -362,7 +370,7 @@ sapiens* for which over half the residues are confident or better (>70 pLDDT).

This creates a table with one column "files", where each row is the cloud
location of one of the two file types that has been provided for each protein.
There is an additional `confidence_v[version].json` file which contains the
There is an additional `confidence_v4.json` file which contains the
per-residue pLDDT. This information is already in the CIF file but may be
preferred if only this information is required.
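The same path construction that the query's `CONCAT` performs can be mirrored locally once entry IDs are known. A sketch (the entry ID is the example from the accession list above; nothing here queries BigQuery):

```python
BUCKET = "gs://public-datasets-deepmind-alphafold-v4"


def entry_files(entry_id, version=4):
    """Cloud paths of the three files provided for one prediction."""
    return [
        f"{BUCKET}/{entry_id}-model_v{version}.cif",
        f"{BUCKET}/{entry_id}-confidence_v{version}.json",
        f"{BUCKET}/{entry_id}-predicted_aligned_error_v{version}.json",
    ]


for path in entry_files("AF-A8H2R3-F1"):
    print(path)
```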

Expand All @@ -375,3 +383,8 @@ documentation should be followed to download these file subsets locally, as the
most appropriate approach will depend on the filesize. Note that it may be
easier to download large files using [Colab](https://colab.research.google.com/)
(e.g. pandas to_csv).

#### Previous versions
Previous versions of AFDB will remain available at
[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
to enable reproducible research. We recommend using the latest version (v4).
14 changes: 7 additions & 7 deletions alphafold/data/pipeline.py
@@ -117,7 +117,7 @@ def __init__(self,
               uniref90_database_path: str,
               mgnify_database_path: str,
               bfd_database_path: Optional[str],
               uniclust30_database_path: Optional[str],
               uniref30_database_path: Optional[str],
               small_bfd_database_path: Optional[str],
               template_searcher: TemplateSearcher,
               template_featurizer: templates.TemplateHitFeaturizer,
@@ -135,9 +135,9 @@ def __init__(self,
          binary_path=jackhmmer_binary_path,
          database_path=small_bfd_database_path)
    else:
      self.hhblits_bfd_uniclust_runner = hhblits.HHBlits(
      self.hhblits_bfd_uniref_runner = hhblits.HHBlits(
          binary_path=hhblits_binary_path,
          databases=[bfd_database_path, uniclust30_database_path])
          databases=[bfd_database_path, uniref30_database_path])
      self.jackhmmer_mgnify_runner = jackhmmer.Jackhmmer(
          binary_path=jackhmmer_binary_path,
          database_path=mgnify_database_path)
@@ -211,14 +214,14 @@ def process(self, input_fasta_path: str, msa_output_dir: str) -> FeatureDict:
          use_precomputed_msas=self.use_precomputed_msas)
      bfd_msa = parsers.parse_stockholm(jackhmmer_small_bfd_result['sto'])
    else:
      bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniclust_hits.a3m')
      hhblits_bfd_uniclust_result = run_msa_tool(
          msa_runner=self.hhblits_bfd_uniclust_runner,
      bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniref_hits.a3m')
      hhblits_bfd_uniref_result = run_msa_tool(
          msa_runner=self.hhblits_bfd_uniref_runner,
          input_fasta_path=input_fasta_path,
          msa_out_path=bfd_out_path,
          msa_format='a3m',
          use_precomputed_msas=self.use_precomputed_msas)
      bfd_msa = parsers.parse_a3m(hhblits_bfd_uniclust_result['a3m'])
      bfd_msa = parsers.parse_a3m(hhblits_bfd_uniref_result['a3m'])

    templates_result = self.template_featurizer.get_templates(
        query_sequence=input_sequence,
61 changes: 61 additions & 0 deletions alphafold/model/common_modules.py
@@ -128,3 +128,64 @@ def __call__(self, inputs):

    return output


class LayerNorm(hk.LayerNorm):
  """LayerNorm module.

  Equivalent to hk.LayerNorm but with different parameter shapes: they are
  always vectors rather than possibly higher-rank tensors. This makes it easier
  to change the layout whilst keeping the model weight-compatible.
  """

  def __init__(self,
               axis,
               create_scale: bool,
               create_offset: bool,
               eps: float = 1e-5,
               scale_init=None,
               offset_init=None,
               use_fast_variance: bool = False,
               name=None,
               param_axis=None):
    super().__init__(
        axis=axis,
        create_scale=False,
        create_offset=False,
        eps=eps,
        scale_init=None,
        offset_init=None,
        use_fast_variance=use_fast_variance,
        name=name,
        param_axis=param_axis)
    self._temp_create_scale = create_scale
    self._temp_create_offset = create_offset

  def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
    is_bf16 = (x.dtype == jnp.bfloat16)
    if is_bf16:
      x = x.astype(jnp.float32)

    param_axis = self.param_axis[0] if self.param_axis else -1
    param_shape = (x.shape[param_axis],)

    param_broadcast_shape = [1] * x.ndim
    param_broadcast_shape[param_axis] = x.shape[param_axis]
    scale = None
    offset = None
    if self._temp_create_scale:
      scale = hk.get_parameter(
          'scale', param_shape, x.dtype, init=self.scale_init)
      scale = scale.reshape(param_broadcast_shape)

    if self._temp_create_offset:
      offset = hk.get_parameter(
          'offset', param_shape, x.dtype, init=self.offset_init)
      offset = offset.reshape(param_broadcast_shape)

    out = super().__call__(x, scale=scale, offset=offset)

    if is_bf16:
      out = out.astype(jnp.bfloat16)

    return out
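Outside Haiku, the arithmetic this module performs reduces to the standard layer-norm formula. A plain-NumPy sketch of the same computation, for reference only (not the module itself, and without the bfloat16 round-trip; the test inputs are arbitrary):

```python
import numpy as np


def layer_norm(x, scale, offset, axis=-1, eps=1e-5):
    """Normalise x over `axis`, then apply a per-feature vector scale and
    offset, mirroring the class above with create_scale=create_offset=True."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * scale + offset


x = np.array([[1.0, 2.0, 3.0]])
out = layer_norm(x, scale=np.ones(3), offset=np.zeros(3))
# The output has zero mean and (roughly) unit variance along the last axis.
```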
