Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smaller variant of multicol files for 3D grids (and beyond) #348

Open
jhenin opened this issue Apr 27, 2020 · 10 comments
Open

Smaller variant of multicol files for 3D grids (and beyond) #348

jhenin opened this issue Apr 27, 2020 · 10 comments
Assignees

Comments

@jhenin
Copy link
Member

jhenin commented Apr 27, 2020

Multicol files in their current states are great in 1D and 2D because they are read seamlessly by many plotting programs. In 3D they can get quite bulky, and plotting becomes less straightforward anyway. To reduce the size and the time spend reading and writing them, I see two options: using DX files, which have their own constraints, or a variant of multicol that doesn't have the x values (which can be completely generated based on the data in the header). Thoughts?

@giacomofiorin
Copy link
Member

A semi-standard format for a 3D image is OpenDX itself (or at least the subset that is used by APBS, VMD, NAMD, Pymol, etc.). Like "multicol" it contains a header and is text-based, which makes it easy to inspect. I guess that the only constraint is not having the ability to encode a "periodic" flag?

Another option is the Matlab format, which is binary but at least is just as widely supported also by open-source code (Octave, SciPy, ...). Because it encodes multiple variables, there is the advantage of putting multiple fields in, e.g. mean gradients alongside histogram, periodic flag etc. It is probably possible to just reuse the open-source implementations.
https://docs.scipy.org/doc/scipy/reference/tutorial/io.html#matlab-files
NumPy itself doesn't seem to support 3D meshes easily. loadtxt/savetxt are 1D or 2D only, and the native .npy binary format is then specific to NumPy (good for many, but Matlab hasn't just disappeared yet).

@giacomofiorin
Copy link
Member

Reading more about the format from the SciPy page and other open source libraries like this one:
https://github.com/tbeu/matio
It seems that since version 7.3 (released in 2006) the format is HDF5-based. I presume later versions are backward compatible with pre-7.3 files, but that alone makes it tricky to embed a writer or reader for code that should be used for several years 👎

My original idea was to register the grids currently allocated in the various biases as elements in colvarparams, so that they can be fetched in scripts using their name. The inspiration for that was the extract() method of many LAMMPS classes, though I named it differently because the semantics are not the same. Then we'd have a bit more freedom to decide later how to use them in post-processing. Unless there is a tool ready that can easily read those files right after they're written (like GNUplot for 2D multicol), I wouldn't worry about changing the format. Does that sound like an option?

To reduce the space used by high-dimensional grids, I would also suggest in the doc to use outputFreq 0 for the respective bias, thus only write the state file, which is compact enough as it is for a text format. Do you see any scenario where disabling outputFreq would have unintended consequences?

@HanatoK
Copy link
Member

HanatoK commented Apr 27, 2020

What about just using an internal format to record the boundaries and widths, and then providing an post-processing tool to convert it to the current grid format?

@giacomofiorin
Copy link
Member

That's along the lines of what I meant in the previous comments: the state file has all that information encoded. The scripting interface could easily provide access to the grid array, its boundaries, widths and periodic flags associated with it. Pros: leverages existing code. Cons: requires running the module for post-processing, which currently means using an updated VMD and a molecular structure file for the system.

If these requirements are too heavy, a post-processing script could be used as well. But then, the same script would need to replicate some of the internal functionality (particularly, the multicolumn grid I/O code).

@HanatoK
Copy link
Member

HanatoK commented Apr 27, 2020

@giacomofiorin I think NAMD with an updated colvars should be enough to read the state file and write the grid. Users can run simulation of "0" step with dummy psf and pdb files, only the colvars input and state files are from an actual simulation. It's just like merging ABF windows with inputPrefix.

@jhenin
Copy link
Member Author

jhenin commented Sep 1, 2022

Here is a plan:

  • take the logic currently in the ABF bias and move it into: in dimension higher than 2, write a DX file in addition to multicol.
  • add a DX parser to colvars_grid.py (see the multimap analysis tools for an example)
  • remove redundant multicol output, use multicol if dimension <= 2 and DX otherwise.

@HanatoK @fabsugar Would this work for you?

@giacomofiorin
Copy link
Member

@jhenin I pinged @fabsugar in person and he agrees with switching format for dimensions 3 and above.

One issue I didn't think about immediately when you wrote your last comment is how would you define the behavior for inputPrefix. Currently multicolumn is expected, but if all output is switched to DX for higher dimensions, then you'd need a DX reader in the C++ code as well, which is undesirable. We'd just end up defining yet another dialect of the DX format.

Are you okay with limiting use of the DX format for output/visualization only and perhaps begin using the state file format for input?

@jhenin
Copy link
Member Author

jhenin commented Sep 13, 2022

Yes, I agree that we don't want to add a DX parser in there. The state file format seems ok, but we'll need the same flexibility offered by multicol: specifically the option of reading multiple grids, potentially with mismatched parameters, into the same bias.

@giacomofiorin
Copy link
Member

giacomofiorin commented Sep 13, 2022

I checked, in the colvargrid::read_restart() function there is this condition:

colvars/src/colvargrid.h

Lines 959 to 966 in 7f850e0

if ((is >> key) && (key == std::string("grid_parameters"))) {
is.seekg(start_pos, std::ios::beg);
is >> colvarparse::read_block("grid_parameters", &conf);
parse_params(conf, colvarparse::parse_silent);
} else {
cvm::log("Grid parameters are missing in the restart file, using those from the configuration.\n");
is.seekg(start_pos, std::ios::beg);
}

that mimics a previous check first introduced in 2009 (before the start of the Git history), when the state file began containing the grid_parameters block to support metadynamics grid rebinning.

It's safe to say that nowadays the if () condition never evaluates to false, i.e. every instance of a grid object in a state file comes prefixed with a grid_parameters block.

So I think you can safely adapt the existing logic that read_multicol() currently has:

if ( remap ) {

(To be precise, compared to the multicolumn format the state file lacks the periodicity flag, but that's not used to do the remapping anyway).

@giacomofiorin
Copy link
Member

BTW @HanatoK I rememberd that there is also this case of inputting data through a grid. I hope it's fine to keep the format as multi-column, otherwise a DX parser would be needed :-(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants