
CMD (idea): compress #21

Closed
yarikoptic opened this issue Oct 11, 2019 · 5 comments
Labels
enhancement, UX

Comments

@yarikoptic
Member

I have noticed that network traffic while rclone-ing Svoboda's data is only about 10% of the local "write" IO.

That observation is confirmed by simply compressing the obtained .nwb files using tar/gz:

smaug:/mnt/btrfs/datasets/datalad/crawl-misc/svoboda-rclone/Exported NWB 2.0
$> du -scm Chen\ 2017*
35113   Chen 2017
3298    Chen 2017.tgz
38410   total

so indeed -- a ~10x factor!

Apparently hdmf/pynwb do not compress the data arrays stored in .nwb files by default. Both document the ability to pass compression parameters down (to h5py, I guess), but as far as I saw, compression is not on by default. Sure, the HDF5-side compression ratio might not reach 10x since not all data will be compressed, but I expect it will be notable.
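Not from the thread, just a stdlib sketch of why such data compresses so well: regular, slowly varying arrays are exactly what gzip/zlib exploit. The payload below is synthetic and illustrative, not the Svoboda files:

```python
import zlib

# Synthetic stand-in for a repetitive acquisition trace: a regular byte
# pattern repeated many times, the kind of redundancy gzip exploits.
raw = bytes(range(256)) * 4096          # 1 MiB of highly regular data

compressed = zlib.compress(raw, 5)      # zlib level 5, comparable to GZIP=5
ratio = len(raw) / len(compressed)
print(f"{len(raw)} -> {len(compressed)} bytes, ratio ~{ratio:.0f}x")
```

Real .nwb files are less regular than this, so the achievable ratio depends on the data, as noted above.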

As we keep running into those, it might be valuable to provide a dandi compress command which would take care of (re)compressing the provided .nwb files (in place or into a new file).
Prospective interface:

dandi compress [-i|--inplace] [-o|--output FILE] [-c|--compression METHOD (default gzip)] [-l|--level LEVEL (default 5)] [FILES]
  • --inplace to explicitly request (re)compressing each file in place (we might not do it truly "inplace" but rather write to a new file and then replace the old one -- this would provide a better workflow for git-annex'ed files, where the originals are read-only by default)
  • --output FILE - where to store the output file (a single input FILE is then expected)
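Not from the thread: a minimal argparse sketch of the proposed interface above, with option names and defaults taken from the usage line (no such command exists in dandi-cli; this is purely illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the proposed "dandi compress" usage line; hypothetical command.
    p = argparse.ArgumentParser(prog="dandi compress")
    p.add_argument("-i", "--inplace", action="store_true",
                   help="(re)compress each file in place")
    p.add_argument("-o", "--output", metavar="FILE",
                   help="output file (a single input FILE is then expected)")
    p.add_argument("-c", "--compression", default="gzip", metavar="METHOD",
                   help="compression method (default: gzip)")
    p.add_argument("-l", "--level", type=int, default=5, metavar="LEVEL",
                   help="compression level (default: 5)")
    p.add_argument("files", nargs="*", metavar="FILES")
    return p

args = build_parser().parse_args(["-i", "a.nwb", "b.nwb"])
print(args.compression, args.level, args.inplace, args.files)
```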
@yarikoptic yarikoptic added the enhancement New feature or request label Oct 11, 2019
@yarikoptic
Member Author

woohoo. Using h5repack, an external tool which is already available, with gzip level 5 compression:

for f in Chen\ 2017/*nwb; do h5repack -v -f GZIP=5 "$f" "${f/\//_comp-gzip5/}"; done

I got

$> du -scm Chen\ 2017*
35113   Chen 2017
3324    Chen 2017_comp-gzip5
3298    Chen 2017.tgz

so almost exactly the same compression factor as with external tar/gz! Testing with levels 1 and 9 now to see the spread, and then I will chime in with the pynwb folks.
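The level spread can be previewed cheaply with stdlib zlib before running full h5repack passes (illustrative only; HDF5 GZIP compresses each chunk independently, so absolute numbers will differ):

```python
import zlib

# Compressible but non-trivial payload: repeated English text as a stand-in.
data = b"spike times and voltage traces tend to be redundant " * 20000

# Higher levels trade CPU time for (usually modest) extra size reduction.
for level in (1, 5, 9):
    size = len(zlib.compress(data, level))
    print(f"level {level}: {size} bytes")
```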

@yarikoptic
Member Author

@bendichter (attn @satra) as you have recently explored compression within NWB: do you think it would be worthwhile to have dandi compress expose it, or should we defer this functionality to more specialized NWB tools, since dandi-cli deals with all kinds of data types (so in principle we could add compress functionality for zarrs and tifs too, I guess)?

@yarikoptic yarikoptic added the UX label Jul 27, 2022
@satra
Member

satra commented Jul 27, 2022

i would leave this in the nwb validator to inform the user and in the nwb conversion tools. i'm not sure this should be a functionality in dandi.

@yarikoptic
Member Author

I believe there were some ideas along these lines in NWB Inspector, right @bendichter? Overall, this probably should not be in the dandi client since it is not DANDI-specific, so I will close.

@bendichter
Member

  • Compression is now the default behavior for NeuroConv (and NWB GUIDE)
  • When you have an NWB file in memory, you can use NeuroConv to automatically apply our recommended chunking and compression to each dataset before writing to disk
  • NWB Inspector will inform the user if they have large datasets that are not compressed
  • One thing we'd like to do, but have not done yet, is to have a function to "repack" an NWB file. This function would take an uncompressed NWB file as input and produce a file where each dataset is compressed according to our recommendations. See the issue here: [Feature]: repack NWB file catalystneuro/neuroconv#892
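Not part of the thread: a hedged sketch of what such a repack helper could look like, shelling out to h5repack and replacing the original via a temp file, as suggested for the --inplace workflow earlier in this issue. The function names and the `_comp-gzip5` naming convention are invented for illustration; the real design is discussed in catalystneuro/neuroconv#892:

```python
import os
import shutil
import subprocess
from pathlib import Path

def repacked_name(path: Path, method: str = "gzip", level: int = 5) -> Path:
    # Hypothetical naming convention: foo.nwb -> foo_comp-gzip5.nwb
    return path.with_name(f"{path.stem}_comp-{method}{level}{path.suffix}")

def repack_inplace(path: Path, level: int = 5) -> None:
    # Repack into a temp file first, then replace the original, so a failed
    # repack never clobbers the source (friendlier to git-annex'ed files,
    # where originals are read-only by default).
    if shutil.which("h5repack") is None:
        raise RuntimeError("h5repack (HDF5 tools) not found on PATH")
    tmp = path.with_name(path.name + ".repack-tmp")
    try:
        subprocess.run(
            ["h5repack", "-f", f"GZIP={level}", str(path), str(tmp)],
            check=True,
        )
        os.replace(tmp, path)   # atomic on the same filesystem
    finally:
        if tmp.exists():        # clean up only if the repack failed midway
            tmp.unlink()
```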
