Discuss Output format #2

kyleabeauchamp · 2014-09-16T01:01:45Z

Continued from #1.

So I'm hesitant about the separate metadata files, as IMHO it's very important to keep the metadata attached to the coordinates throughout the pipeline. Ideally, the metadata-data attachment would be an "atomic" operation, in the sense that the two could never be separated. That's more effort than I can invest in this right now, but it could probably be done later via a context-manager-type approach.
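One way the context-manager idea might look is sketched below. It is only an illustration under stated assumptions: `atomic_output` and `write_bundle` are hypothetical names, and plain JSON stands in for the HDF5 file so the example runs with the standard library alone. The point is that coordinates and metadata are written into a single temporary file, which is renamed into place only if the whole write succeeds.

```python
import contextlib
import json
import os
import tempfile


@contextlib.contextmanager
def atomic_output(final_path):
    """Yield a temporary path; publish it as final_path only on success."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Keep the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        yield tmp_path
        os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Any failure discards the partial file, so readers never see half a bundle.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def write_bundle(path, coordinates, metadata):
    """Hypothetical writer: coordinates and metadata go into the same file."""
    with atomic_output(path) as tmp_path:
        with open(tmp_path, "w") as handle:
            json.dump({"metadata": metadata, "coordinates": coordinates}, handle)
```

With this shape, a crash in the middle of a write leaves either the previous complete file or nothing at all, never coordinates without their metadata.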

If we can make this metadata an official part of the MDTraj format, I'd be happy to follow that route as well.

The current pull request separates the directory structure from the trajectory structure. fah.py contains the tools for concatenating a single "CLONE", "stream", or "trajectory" object. automation.py contains the tools for iterating over FAH projects.

Obviously one can engineer these things ad nauseam. Right now, I just want something that works, as we're generating 6 datasets and grabbing the coordinates from the bzips takes about 7 days.


kyleabeauchamp · 2014-09-16T01:04:43Z

Here is a schematic of how we're going to lay things out for automated analysis.

kyleabeauchamp · 2014-09-16T01:05:24Z

We also have some open discussions here: https://github.com/choderalab/fah-projects/issues

rmcgibbo · 2014-09-16T01:05:51Z

Love the whiteboard.

kyleabeauchamp · 2014-09-16T01:07:10Z

The idea is that #1 generates the protein HDF5 files, which we then rsync to your desktop or cluster for analysis.

rmcgibbo · 2014-09-16T01:12:34Z

My interest isn't so much in the automation side. I'm thinking about things like on-boarding new students, and standardizing protocols. For all of its downsides, I think the MSMBuilder2 "Project" format has been pretty useful. In particular, it helps me jump into debugging another student's MSM work, because I'm already pretty familiar with how their stuff is laid out.

I like that most of the Mixtape API doesn't insist that you structure your files according to any particular layout, but I think that more opinionated conversion / munging code is good.

kyleabeauchamp · 2014-09-16T01:19:19Z

So my pipeline can be converted into MSMB2 format with like 10 lines of Python. However, I think we can improve on MSMB2 in several ways:

Arbitrary MDTraj formats (e.g. dcd)
Meaningful filenames (e.g. run0-clone0.h5 instead of trj0.h5)
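As a rough sketch of the naming scheme in the list above, the helpers below build and parse names of the form `run0-clone0.h5`. The function names and the regular expression are assumptions made for illustration; they are not taken from the pull request.

```python
import re

# Matches names such as "run0-clone0.h5" or "run12-clone3.dcd".
_NAME_PATTERN = re.compile(r"^run(\d+)-clone(\d+)\.(\w+)$")


def trajectory_filename(run, clone, extension="h5"):
    """Build a filename that encodes the run and clone indices."""
    return f"run{run}-clone{clone}.{extension}"


def parse_trajectory_filename(filename):
    """Recover (run, clone, extension) from a filename built as above."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a run/clone trajectory name: {filename!r}")
    run, clone, extension = match.groups()
    return int(run), int(clone), extension
```

Keeping the extension as a parameter leaves room for any format MDTraj can write, which is the first point in the list.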

rmcgibbo · 2014-09-16T01:20:39Z

I definitely see that linking the trajectory files and their output provenance has the advantage that it's harder to get them out of sync, but that's not the only concern.

Putting the provenance inside the trajectory file:

Ties you to HDF5. The MDTraj HDF5 format is a not-very-widely-used custom format, and I feel kind of ambivalent about it for that reason.
And the current method of putting it as an extra field in the HDF5 file is even more "custom" and unstandardized.

Have you looked at any of the more 'standardized' formats for expressing provenance information? Maybe we should be using JSON-LD or something, for example.
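For a sense of what a JSON-LD record could look like, here is a minimal sketch that uses terms from the W3C PROV vocabulary (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`). The function name, the file names, and the agent identifier are hypothetical, and nothing here reflects a format that anyone in this thread has agreed on.

```python
import json


def provenance_record(trajectory_file, source_files, software_id, started_at):
    """Describe how one trajectory file was produced, as a JSON-LD document."""
    return {
        "@context": {"prov": "http://www.w3.org/ns/prov#"},
        "@id": trajectory_file,
        "@type": "prov:Entity",
        "prov:wasGeneratedBy": {
            "@type": "prov:Activity",
            "prov:startedAtTime": started_at,
            "prov:wasAssociatedWith": {
                "@id": software_id,
                "@type": "prov:SoftwareAgent",
            },
            # Each raw input chunk is listed as an entity the activity consumed.
            "prov:used": [
                {"@id": name, "@type": "prov:Entity"} for name in source_files
            ],
        },
    }


record = provenance_record(
    trajectory_file="run0-clone0.h5",          # hypothetical output name
    source_files=["results-000.tar.bz2"],      # hypothetical input chunk
    software_id="urn:example:munging-script",  # hypothetical agent identifier
    started_at="2014-09-16T01:20:39Z",
)
document = json.dumps(record, indent=2)
```

The same dictionary could be written as a sidecar file or serialized into a string field of the trajectory file, so the choice of vocabulary is independent of where the record lives.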

kyleabeauchamp · 2014-09-16T01:30:21Z

Using HDF5 format is not a real barrier if it's treated as a "munging intermediate". If we're just interested in the protein coordinates, it's quite fast to convert to the output format of choice. So I'm not really sure we're "tied".

I agree that more standardized provenance is desirable, but the HDF5 fields are a reasonable near-term solution.

rmcgibbo · 2014-09-16T01:33:47Z

Another thing is that, essentially, the FAH project layout can be thought of, for the purpose of statistical analysis, as "a bunch of nested directories in a tree, the leaves of which each contain an MD trajectory which is saved in 'chunks' following some filename pattern". For classic FAH, the directories are RUN/CLONE, and the pattern is 'frame*.xtc'. For siegetank, you've got 'results-*.tar.bz2', but in general this stuff is not so different.

rmcgibbo · 2014-09-16T01:36:59Z

> I agree that more standardized provenance is desirable, but the HDF5 fields are a reasonable near-term solution.

Yeah, this is kind of the conversation I want to have. Like, what does the 'right' long-term solution look like?

jchodera · 2014-09-16T01:37:46Z

I kind of like the idea of cramming everything into an HDF5 file, including provenance information. It is enormously quick to slice and reslice to extract bits you want out, and distributing it as a single object makes it easy to distribute and analyze the data.

The biggest drawback seems to be that if the HDF5 file keeps growing, you can't easily rsync it to another machine periodically for backup purposes. That is where a many-files-in-a-directory-tree approach would have an advantage.

kyleabeauchamp · 2014-09-16T01:38:45Z

(regarding the tree form) I agree, but IMHO there is considerable utility in constructing "continuous" trajectories for simplified visualization etc. It's quite useful to massage the data into the most "human meaningful" form.

The right "long-term solution" should be brilliant and implemented by someone who's not me...

proteneer · 2014-09-16T01:45:50Z

Siegetank directories are not quite result-*.tar.bz2

Siegetank doesn't even try to auto-tar or auto-compress anything. Frames
are laid out as raw xtc files. (Checkpoints, however, are compressed.) The
directory structure is laid out as fragments, which have a very non-trivial
layout in order to guarantee atomic transactions.

On Mon, Sep 15, 2014 at 9:33 PM, Robert McGibbon wrote:

Another thing is that, essentially, the FAH project layout can be thought
of, for the purpose of statistical analysis, as "a bunch of nested
directories in a tree, the leaves of which each contain an MD trajectory
which is saved in 'chunks' following some filename pattern". For classic
FAH, the directories are RUN/CLONE, and the pattern is 'frame*.xtc'. For
siegetank, you've got 'results-*.tar.bz2', but in general this stuff is not
so different.


Yutong Zhao
Stanford University

www.proteneer.com | simbios.stanford.edu

rmcgibbo · 2014-09-16T02:33:58Z

FWIW, here's the script I'm using to convert an old FAH project into something I can analyze:

https://gist.github.com/rmcgibbo/32ca6845c5d415b8e784

rmcgibbo · 2014-09-16T02:41:13Z

It's nothing fancy, but using os.walk to walk down the filesystem and a configurable glob pattern for matching trajectories makes it pretty easy to do a variety of things.
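The general shape of that approach can be sketched as follows. This is not the linked gist; it is a small illustration, with hypothetical function names, of walking a tree with `os.walk` and collecting the chunks in each directory that match a configurable glob pattern. The default pattern assumes chunks are named like `frame0.xtc`.

```python
import fnmatch
import os
import re


def _chunk_key(filename):
    """Sort key that orders frame2.xtc before frame10.xtc."""
    parts = re.split(r"(\d+)", filename)
    return [int(part) if part.isdigit() else part for part in parts]


def find_trajectories(root, pattern="frame*.xtc"):
    """Map each directory under root that holds matching chunks to its ordered chunk paths."""
    trajectories = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        chunks = sorted(fnmatch.filter(filenames, pattern), key=_chunk_key)
        if chunks:
            relative = os.path.relpath(dirpath, root)
            trajectories[relative] = [os.path.join(dirpath, name) for name in chunks]
    return trajectories
```

Changing `pattern` is enough to point the same walk at a different chunk naming scheme, and the numeric sort key keeps chunks in simulation order rather than plain string order.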

kyleabeauchamp · 2014-09-16T03:22:20Z

So this is fine too. I'm happy with either approach.

Maybe we should just add both approaches to this github, test them out, pick a winner, fill it out with the additional functionality, and deprecate the loser?

kyleabeauchamp mentioned this issue Sep 17, 2014

Initial Draft of Automated Data Munging #1

Merged

steven-albanese mentioned this issue Jan 18, 2016

Parallel Munge Script #16

Closed
