Discuss Output format #2

Open
kyleabeauchamp opened this issue Sep 16, 2014 · 16 comments

Comments

@kyleabeauchamp
Collaborator

Continued from #1.

So I'm hesitant about the separate metadata files, as IMHO it's very important to keep the metadata attached to the coordinates throughout the pipeline. Ideally, attaching the metadata to the data would be an "atomic" operation, in the sense that the link between them could never be broken. That's more effort than I can invest in this right now, but it could probably be done later via a context-manager-type approach.
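
One possible reading of the "context manager" idea, sketched under assumptions: stage the coordinates and their metadata together in a temporary directory, and only publish them with a single rename once both are written. The `atomic_output` name and the directory-per-trajectory layout are hypothetical and not part of the pull request.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_output(final_dir):
    """Stage files in a temporary directory and publish them together.

    Yields a staging directory. If the block finishes without error, the
    staging directory is renamed to final_dir in one step; otherwise it
    is deleted and final_dir is never created.
    """
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Stage next to the target so the rename stays on one filesystem.
    staging = tempfile.mkdtemp(dir=parent, prefix=".staging-")
    try:
        yield staging
        # Raises OSError if final_dir already exists and is not empty.
        os.rename(staging, final_dir)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

Inside `with atomic_output("run0-clone0") as staging:` one would write both the trajectory and its metadata into `staging`; a reader never sees one without the other.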

If we can make this metadata an official part of the MDTraj format, I'd be happy to follow that route as well.

The current pull request separates the directory structure from the trajectory structure. fah.py contains the tools for concatenating a single "CLONE", "stream", or "trajectory" object. automation.py contains the tools for iterating over FAH projects.

Obviously one can engineer these things ad nauseam. Right now, I just want something that works, as we're generating 6 datasets and grabbing the coordinates from the bzips takes about 7 days.

@kyleabeauchamp
Collaborator Author

Here is a schematic of how we're going to lay things out for automated analysis.

[Image: fah_scheme]

@kyleabeauchamp
Collaborator Author

We also have some open discussions here: https://github.com/choderalab/fah-projects/issues

@rmcgibbo

Love the whiteboard.

@kyleabeauchamp
Collaborator Author

The idea is that #1 generates the protein HDF5 files, which we then rsync to your desktop or cluster for analysis.

@rmcgibbo

My interest isn't so much in the automation side. I'm thinking about things like on-boarding new students and standardizing protocols. For all of its downsides, I think the MSMBuilder2 "Project" format has been pretty useful. In particular, it helps me jump into debugging another student's MSM work, because I'm already pretty familiar with how their stuff is laid out.

I like that most of the Mixtape API doesn't insist that you structure your files according to any particular layout, but I think that more opinionated conversion / munging code is good.

@kyleabeauchamp
Collaborator Author

So my pipeline can be converted into MSMB2 format with like 10 lines of Python. However, I think we can improve on MSMB2 in several ways:

  1. Arbitrary MDTraj formats (e.g. dcd)
  2. Meaningful filenames (e.g. run0-clone0.h5 instead of trj0.h5)
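
As a rough sketch of the renaming step such a conversion would need, the mapping from meaningful filenames to MSMB2-style sequential names might look like this. The `run{R}-clone{C}.h5` pattern follows the example in the list above; the numbering order and the `msmb2_names` helper are assumptions for illustration, not part of the actual pipeline.

```python
import re

# Hypothetical pattern, based on the "run0-clone0.h5" example above.
RUN_CLONE = re.compile(r"run(\d+)-clone(\d+)\.h5$")


def msmb2_names(filenames):
    """Map run/clone filenames to sequential trj{i}.h5 names.

    Files are ordered numerically by (run, clone); names that do not
    match the pattern are ignored.
    """
    keyed = []
    for name in filenames:
        match = RUN_CLONE.search(name)
        if match:
            run, clone = int(match.group(1)), int(match.group(2))
            keyed.append((run, clone, name))
    keyed.sort()
    return {name: f"trj{i}.h5" for i, (_, _, name) in enumerate(keyed)}


if __name__ == "__main__":
    files = ["run10-clone0.h5", "run0-clone1.h5", "run0-clone0.h5", "notes.txt"]
    for src, dst in msmb2_names(files).items():
        print(src, "->", dst)
```

Keeping the returned mapping around (for example, saved alongside the converted files) preserves the run/clone identity that the sequential names throw away.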

@rmcgibbo

I definitely see that linking the trajectory files and their output provenance has the advantage that it's harder to get them out of sync, but that's not the only concern.

Putting the provenance inside the trajectory file

  • Ties you to HDF5. The MDTraj HDF5 format is a not-very-widely-used custom format, and I feel kind of ambivalent about it for that reason.
  • And the current method of putting it as an extra field in the HDF5 file is even more "custom" and unstandardized.

Have you looked at any of the more 'standardized' formats for expressing provenance information? Maybe we should be using JSON-LD or something, for example.
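
One way to keep provenance outside the trajectory file while still noticing when the two drift apart is a sidecar file that records a checksum of the trajectory. The sketch below uses plain JSON rather than JSON-LD, and the `.provenance.json` suffix, the field names, and the helper functions are all made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def _sha256(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def sidecar_path(traj_path):
    # Hypothetical naming convention: foo.h5 -> foo.h5.provenance.json
    return Path(str(traj_path) + ".provenance.json")


def write_provenance(traj_path, provenance):
    """Write provenance plus a checksum of the trajectory next to it."""
    record = {"sha256": _sha256(traj_path), "provenance": provenance}
    sidecar_path(traj_path).write_text(json.dumps(record, indent=2))


def read_provenance(traj_path):
    """Return the provenance, or raise if the trajectory has changed."""
    record = json.loads(sidecar_path(traj_path).read_text())
    if record["sha256"] != _sha256(traj_path):
        raise ValueError(f"{traj_path} does not match its provenance record")
    return record["provenance"]
```

This works for any trajectory format, at the cost of a second file that has to travel with the first.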

@kyleabeauchamp
Collaborator Author

Using HDF5 format is not a real barrier if it's treated as a "munging intermediate". If we're just interested in the protein coordinates, it's quite fast to convert to the output format of choice. So I'm not really sure we're "tied".

I agree that more standardized provenance is desirable, but the HDF5 fields are a reasonable near-term solution.

@rmcgibbo

Another thing is that, for the purpose of statistical analysis, the FAH project layout can essentially be thought of as "a bunch of nested directories in a tree, the leaves of which each contain an MD trajectory saved in 'chunks' following some filename pattern". For classic FAH, the directories are RUN/CLONE, and the pattern is 'frame*.xtc'. For siegetank, you've got 'results-*.tar.bz2', but in general this stuff is not so different.

@rmcgibbo

> I agree that more standardized provenance is desirable, but the HDF5 fields are a reasonable near-term solution.

Yeah, this is kind of the conversation I want to have. Like, what does the 'right' long-term solution look like?

@jchodera
Member

I kind of like the idea of cramming everything into an HDF5 file, including provenance information. It is extremely quick to slice and reslice to extract the bits you want, and packaging everything as a single object makes the data easy to distribute and analyze.

The biggest drawback seems to be that if the HDF5 file keeps growing, you can't easily rsync it to another machine periodically for backup purposes. That is where a many-files-in-a-directory-tree approach would have an advantage.

@kyleabeauchamp
Collaborator Author

(regarding the tree form) I agree, but IMHO there is considerable utility in constructing "continuous" trajectories for simplified visualization etc. It's quite useful to massage the data into the most "human meaningful" form.

The right "long-term solution" should be brilliant and implemented by someone who's not me...

@proteneer

Siegetank directories are not quite result-*.tar.bz2

Siegetank doesn't even try to auto-tar or auto-compress anything. Frames are laid out as raw xtc files. (Checkpoints, however, are compressed.) The directory structure is laid out as fragments, which have a very non-trivial layout in order to guarantee atomic transactions.

Yutong Zhao
Stanford University

www.proteneer.com | simbios.stanford.edu

@rmcgibbo

FWIW, here's the script I'm using to convert an old FAH project into something I can analyze:

https://gist.github.com/rmcgibbo/32ca6845c5d415b8e784

@rmcgibbo

It's nothing fancy, but using os.walk to walk down the filesystem and a configurable glob pattern for matching trajectories makes it pretty easy to do a variety of things.
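
A minimal sketch of that idea, not the gist itself: walk the tree with `os.walk`, match chunk files against a glob pattern with `fnmatch`, and group them by the directory that contains them. The function name and the natural-sort helper are assumptions for illustration, and the default pattern is an assumed spelling of the classic FAH chunk names.

```python
import fnmatch
import os
import re


def _natural_key(name):
    """Sort key so that frame2.xtc comes before frame10.xtc."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def find_trajectories(root, pattern="frame*.xtc"):
    """Return {directory: [chunk paths in order]} for matching files."""
    trajectories = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        chunks = sorted(fnmatch.filter(filenames, pattern), key=_natural_key)
        if chunks:
            trajectories[dirpath] = [os.path.join(dirpath, c) for c in chunks]
    return trajectories
```

Because the pattern is just an argument, the same walk can be pointed at a different layout by changing the glob rather than the code.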

@kyleabeauchamp
Collaborator Author

So this is fine too. I'm happy with either approach.

Maybe we should just add both approaches to this github, test them out, pick a winner, fill it out with the additional functionality, and deprecate the loser?
