
Add auto_compress=True option to HDF5IO.write() #183

Closed · yarikoptic opened this issue Oct 21, 2019 · 4 comments

yarikoptic commented Oct 21, 2019
Per our discussion at SfN with @oruebel, it might be greatly beneficial to enable compression for at least some datasets within HDF5 files. An initial inspection of .nwb files in the wild turns up instances where even gzip level 1 compression brings ~10x storage savings. For example, for http://datasets.datalad.org/?dir=/labs/svoboda/Chen_2017 I saw:

```
smaug:/mnt/btrfs/datasets/datalad/crawl-misc/svoboda-rclone/Exported NWB 2.0
$> du -scm Chen*
35113	Chen 2017
3579	Chen 2017_comp-gzip1
3324	Chen 2017_comp-gzip5
3319	Chen 2017_comp-gzip9
35114	Chen_2017-ds
3298	Chen 2017.tgz
83744	total
```

Ref: our very initial "discussion" at @dandiarchive: dandi/dandiarchive-legacy#15
Attn: @bendichter @mgrauer
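
For concreteness, here is a minimal sketch of what an `auto_compress=True` option might amount to. The `H5DataIO` wrapper and its compression arguments are hdmf's existing API; the `auto_compress` helper itself is hypothetical and just names what this issue proposes `HDF5IO.write()` would do automatically:

```python
# Hypothetical sketch: auto_compress() does not exist in hdmf; it illustrates
# what an auto_compress=True option to HDF5IO.write() could apply per dataset.
import numpy as np
from hdmf.backends.hdf5 import H5DataIO  # hdmf's per-dataset HDF5 options wrapper

def auto_compress(data):
    """Wrap a raw array so HDF5IO writes it as a gzip level-1 compressed dataset."""
    return H5DataIO(
        data,
        compression='gzip',   # portable filter; level 1 already gave ~10x above
        compression_opts=1,
        shuffle=True,         # byte-shuffle typically helps gzip on numeric data
    )

# Any container field that accepts an array also accepts the wrapped version;
# HDF5IO.write() then creates that dataset with the gzip filter enabled.
raw = np.random.randint(0, 2**12, size=(64, 100_000), dtype=np.int16)
wrapped = auto_compress(raw)
```

With a helper like this, `auto_compress=True` would amount to applying the wrapper to every dataset that is not already wrapped.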

@yarikoptic yarikoptic added the category: enhancement improvements of code or code behavior label Oct 21, 2019
yarikoptic commented
On a related note: are you aware of any way to "index" compressed (in HDF5) datasets? Indexing could allow efficient access to portions of a dataset without chunking it, and without having to decompress all the preceding data just to get to the desired offset.


But maybe, instead of indexing, it would be better to take advantage of the domain knowledge here and chunk the data wisely (e.g. separate channels, and then some reasonable time duration within each chunk, e.g. 10k samples), as in the sketch below. I expect the extra chunking "metadata" would not be a significant storage cost. I am not yet sure how much of a performance hit it would be to reconstruct entire dataset(s) on load, but this report suggests that increasing the chunk cache can be of tremendous benefit. In summary, chunking should allow quite efficient data access regardless of the usage scheme (slicing in time or in channel space).
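
A minimal h5py sketch of that layout, with illustrative names and shapes: one channel and 10k samples per chunk, gzip level 1 on write, and an enlarged raw-data chunk cache on read (h5py's `rdcc_nbytes`; the HDF5 default cache is only 1 MiB per dataset):

```python
import h5py
import numpy as np

n_channels, n_samples = 64, 1_000_000
data = np.zeros((n_channels, n_samples), dtype=np.int16)  # placeholder signal

# Write: each chunk holds one channel x 10k samples, so slicing in time or
# across channels both touch a bounded, predictable set of chunks.
with h5py.File('ephys.h5', 'w') as f:
    f.create_dataset('ephys', data=data,
                     chunks=(1, 10_000),
                     compression='gzip', compression_opts=1)

# Read: grow the per-dataset chunk cache to 64 MiB so repeated slicing does
# not keep re-decompressing the same chunks.
with h5py.File('ephys.h5', 'r', rdcc_nbytes=64 * 1024**2) as f:
    first_segment = f['ephys'][:, :10_000]  # one chunk per channel
```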

oruebel commented Dec 3, 2019

Chunking and indexing address two orthogonal problems: chunking mostly addresses data loading, whereas indexing is concerned with data search. Chunking is something we already support (see the sketch below).

For indexing in HDF5, FastBit may be an attractive option.
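
On the note that chunking is already supported: in hdmf the chunk shape can be requested per dataset through the same `H5DataIO` wrapper used for compression. A minimal sketch, with illustrative shapes:

```python
import numpy as np
from hdmf.backends.hdf5 import H5DataIO

raw = np.zeros((64, 1_000_000), dtype=np.int16)

# chunks takes an explicit shape, or True to let HDF5 auto-select one;
# it can be combined with compression, as discussed above.
wrapped = H5DataIO(raw, chunks=(1, 10_000), compression='gzip', compression_opts=1)
```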

@rly rly assigned rly and oruebel and unassigned rly Jan 5, 2023
@rly rly added this to the Next Release milestone Jan 5, 2023
@rly rly added the priority: low alternative solution already working and/or relevant to only specific user(s) label Jan 5, 2023
@mavaylon1 mavaylon1 assigned mavaylon1 and unassigned oruebel Aug 1, 2024
@mavaylon1 mavaylon1 modified the milestones: 3.14.4, Future Aug 1, 2024
mavaylon1 commented
Related to #1158; however, we will close this for now until it comes up again. We already have neuroconv to do automatic compression. Feel free to reopen.

bendichter commented