Track provenance information for all modeling inputs #976

Open
benmwebb opened this issue Apr 18, 2017 · 1 comment

benmwebb commented Apr 18, 2017

IMP currently takes as input files in a variety of formats, but doesn't record where those files originate. This becomes a problem when we come to publish a modeling study and deposit the files (e.g. at PDB-Dev): it's a lot of work to backtrack and figure out where such files came from. It would be much simpler if IMP tracked this information from day one, reading it in some standardized way from the files themselves (or the Python script), storing it in the Model, and also storing it in RMF files.

Since this is prerequisite information for outputting mmCIF files, solving this issue would be a step towards addressing #968. Much of this information is currently stored outside of the Model, mostly in PMI 1 data structures, and so currently outputting mmCIF requires PMI 1.

Only input atomic models are explicitly considered here, but similar considerations should apply to restraints (e.g. where an EM map comes from), sequences (e.g. UniProt identifier), etc. More generally, any transformation of the model, such as sampling, filtering or clustering, should also be recorded.

Input files

  • All input PDB files should contain suitable headers identifying their source.
    • "Official PDB structures": we can simply keep the existing PDB headers.
    • Modeller models: need a header pointing to the model's ModelArchive or PMP ID, or headers that provide similar metadata (e.g. a path to the alignment file used).
    • Other comparative models (e.g. Phyre2) need similar header information.
    • Derived models (e.g. an experimental model that has been rotated and translated) need a header to point to the original model with an explanation of what was done (X-ray example; comparative modeling example).
  • Headers should be parsed by atom::read_pdb
  • Files without headers should result in a warning; in future this can be upgraded to a hard error.
  • Files that come from multiple sources (e.g. two crystal structures and a comparative model docked together) should not be allowed. Such files should be split into their constituents and suitable headers added to each.
  • Other input files should contain similar custom headers as appropriate. Where the file format doesn't permit this (e.g. .mrc or .pgm files), the metadata will need to be stored somewhere else; one solution would be an accompanying JSON file (e.g. foo.mrc is described by foo.mrc.json) with domain-specific metadata.
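
As a rough illustration of the input-file rules above, here is a minimal sketch in plain Python (not the IMP API). It looks for header records in a PDB file, falls back to a `<file>.json` sidecar for other formats, and warns when nothing is found. The function names, the choice of record names in `SOURCE_RECORDS`, and the use of the `.pdb` suffix to pick a strategy are all assumptions made for this example.

```python
import json
import warnings
from pathlib import Path

# Hypothetical choice of PDB record names that count as "source" headers.
SOURCE_RECORDS = ("HEADER", "TITLE", "EXPDTA", "REMARK")


def read_pdb_headers(path):
    """Return the header lines that precede the first ATOM/HETATM record."""
    headers = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(("ATOM", "HETATM")):
                break
            if line.startswith(SOURCE_RECORDS):
                headers.append(line.rstrip("\n"))
    return headers


def read_sidecar(path):
    """Return metadata from PATH.json (e.g. foo.mrc.json), or None if absent."""
    sidecar = Path(str(path) + ".json")
    if not sidecar.exists():
        return None
    with open(sidecar) as fh:
        return json.load(fh)


def get_source_metadata(path):
    """Collect provenance metadata for an input file, warning if there is none."""
    path = Path(path)
    if path.suffix == ".pdb":
        metadata = read_pdb_headers(path) or None
    else:
        metadata = read_sidecar(path)
    if metadata is None:
        # A warning for now; this could later be upgraded to a hard error.
        warnings.warn("%s has no provenance metadata" % path)
    return metadata
```

Calling `get_source_metadata("foo.mrc")` would read `foo.mrc.json` if it exists, and emit a warning otherwise.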

Storage in Model

  • atom::read_pdb should add suitable decorators to the created Hierarchy particles to identify their source. atom::StructureSource (see #894, "Add a Source decorator") is one example, although more data should be stored here (e.g. path to the file, PDB ID, version, descriptive text).
  • Functions like atom::create_simplified_along_backbone and PMI's generation of initial models should copy or otherwise preserve this information in the newly-generated hierarchies, while adding additional information about the simplification applied (e.g. the resolution).
  • Add a core::ProvenanceHierarchy decorator to track a separate hierarchy from the atom::Hierarchy. The root of this tree is the current state of the object, while children are inputs or previous states (and so will be both core::ProvenanceHierarchy and some other decorator such as StructureSource). A particle can be decorated as both core::ProvenanceHierarchy and atom::Hierarchy. Example hierarchies include:
    • System (atom::Hierarchy), also core::ProvenanceHierarchy root
      • Citation foo et al.
      • Clustered with k-means from
        • Ensemble 1 stored in output.1/rmfs/0.rmf3
          • Generated by MD/MC with
            • EM2D restraint using image foo.pgm
            • Crosslinks read from foo.csv
        • Ensemble 2 stored in output.2/rmfs/0.rmf3
          • ...
    • Chain (atom::Hierarchy), also core::ProvenanceHierarchy root
      • Read from PDB file foo.pdb, chain A, heavy atoms only
        • Built from template 1xyzA with alignment foo.ali
  • core::ProvenanceHierarchy should be static (i.e. the same for all frames in a trajectory) so that it doesn't need to be updated during a simulation, and can be stored efficiently in an RMF file. Care also needs to be taken to avoid unnecessary duplication (e.g. each ensemble contributing to a cluster likely has the exact same set of inputs).
  • Functions that change the state of the model (such as samplers, filters, or clustering algorithms) should add to the provenance hierarchy appropriately.
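
The provenance tree described above can be sketched in plain Python to show its shape. This is only an illustration and not the IMP decorator API: `ProvenanceNode` and its methods are hypothetical names, and the sketch assumes that both ensembles were sampled with the same inputs, so a single shared node is referenced twice rather than duplicated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvenanceNode:
    """One step in an object's history; children are inputs or earlier states."""
    description: str
    children: List["ProvenanceNode"] = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, description) pairs, root (current state) first."""
        yield depth, self.description
        for child in self.children:
            yield from child.walk(depth + 1)


# Assumption: both ensembles share the same sampling inputs, so the node is
# created once and referenced from each ensemble instead of being copied.
sampling = ProvenanceNode("Generated by MD/MC with", [
    ProvenanceNode("EM2D restraint using image foo.pgm"),
    ProvenanceNode("Crosslinks read from foo.csv"),
])

system = ProvenanceNode("System")
cluster = system.add(ProvenanceNode("Clustered with k-means from"))
for i in (1, 2):
    ensemble = cluster.add(ProvenanceNode(
        "Ensemble %d stored in output.%d/rmfs/0.rmf3" % (i, i)))
    ensemble.add(sampling)

# Print the tree, indented by depth.
for depth, text in system.walk():
    print("  " * depth + text)
```

Because the tree does not change between frames, a structure like this would only need to be written once per file rather than once per frame.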

Storage in RMF

  • Support for any new decorators should be added to RMF, so that generated output files incorporate this information.
  • Code that reads and writes RMF files (e.g. clustering in PMI) should preserve this information.
@benmwebb benmwebb self-assigned this Apr 18, 2017
benmwebb added a commit that referenced this issue May 5, 2017
Prior to actually having atom::read_pdb pull in
provenance information about PDB files, describe what
will be necessary in the documentation. Relates #976.
benmwebb added a commit to salilab/rmf that referenced this issue Oct 17, 2017
These allow a structure to be tagged with a tree of provenance
nodes, that explain how the structure was created.
Relates salilab/imp#976.
benmwebb added a commit that referenced this issue Oct 18, 2017
This adds basic support for provenance tracking to
IMP and RMF. Relates #976. Operations that alter parts
of the structure (such as reading in from a PDB file,
sampling, filtering, clustering) can now be recorded
directly in the Model itself by means of provenance
decorators, attached to atom::Hierarchy nodes. This
provenance information is also stored in RMF files.
Where possible, IMP and PMI should fill in this information
automatically.
@benmwebb
Member Author

For tracking provenance of most experimental information, some additional information needs to be stored in the RMF file, namely the set of restraints, which particles they act on, and which restraints were used in each sampling step.

Proposal: RMF already stores basic information about decomposed restraints. Make each set of decomposed restraints children of the 'real' restraint, which holds serialized information on the restraint itself (e.g. filename where the EM map was read from, cross-correlation information, total score). The SampleProvenance decorator then contains an RMF Alias node child for each restraint used in that sampling. This information is already stored in IMP (partly in the Model, and partly in the ScoringFunction).
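
To make the proposed layout more concrete, here is a small plain-Python sketch of the relationships involved. It does not use the RMF or IMP APIs; `Restraint`, `SampleStep` and `restraints_used` are hypothetical names, and the file names and step names are placeholders. Each sampling step holds references to shared restraint objects, standing in for the Alias nodes, rather than copies of them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Restraint:
    """The 'real' restraint: metadata plus the names of its decomposed parts."""
    name: str
    info: Dict[str, str]
    decomposed: List[str] = field(default_factory=list)


@dataclass
class SampleStep:
    """Stand-in for a sampling provenance node.

    `restraints` holds references (aliases) to shared Restraint objects.
    """
    method: str
    restraints: List[Restraint] = field(default_factory=list)


# Placeholder restraints; the file names are illustrative only.
em = Restraint("EM", {"filename": "foo.mrc"}, ["EM fit 0", "EM fit 1"])
xl = Restraint("Crosslinks", {"filename": "foo.csv"}, ["XL 0"])

steps = [
    SampleStep("Monte Carlo", [em, xl]),
    SampleStep("Molecular dynamics", [em]),
]


def restraints_used(steps):
    """Return the names of distinct restraints across steps, in first-use order."""
    seen = []
    for step in steps:
        for restraint in step.restraints:
            if restraint.name not in seen:
                seen.append(restraint.name)
    return seen
```

With this arrangement, the metadata for a restraint is stored once, and each sampling step only records which restraints it used.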

benmwebb added a commit that referenced this issue Nov 16, 2017
This adds a new IMP.mmcif module, which is similar in
concept to the IMP.rmf module - it adds support for the
mmCIF file format. It is intended to be used to convert
sets of IMP models (generally read from intermediate
RMF files) into a single mmCIF file, for deposition in
PDB-Dev, and relies on provenance information (see #976)
being present in the models. Relates #968.