
upload: option to compress traffic #23

Closed
yarikoptic opened this issue Oct 11, 2019 · 15 comments

@yarikoptic
Member

In light of #21 it might be highly beneficial to compress files during upload.

@mgrauer - does girder support receiving compressed payload?

@bendichter - do you have a quick way/code to assess whether an HDF5 file used compression, so we could include that in the ls output and dynamically decide whether to compress the payload sent to girder?

@bendichter
Member

@yarikoptic HDF5 compression is done per dataset. You could query each dataset to see if it is compressed. Is that what you want to do? pynwb does not have a flag that compresses all datasets. Alternatively, we could just automatically compress the entire HDF5 file, which would not require any HDF5 programming.

@yarikoptic
Member Author

Yep, it is per dataset, but it is unlikely any user would want to choose it for each dataset. I guess for detection it would be enough to check some (e.g. 10) datasets within a file and treat compression of any of them as an indicator that compression was likely used.
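The sampling heuristic described above could be sketched with h5py, which exposes a dataset's compression filter via `Dataset.compression` (the filter name, or `None` if uncompressed). A minimal sketch; `sample_compression` is a hypothetical helper name, not part of any existing tool:

```python
import h5py

def sample_compression(path, limit=10):
    """Visit up to ``limit`` datasets in an HDF5 file and report how many
    use any compression filter. Returns (n_compressed, n_checked)."""
    checked = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            checked.append(obj.compression is not None)
        if len(checked) >= limit:
            return True  # any non-None return value stops visititems early

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sum(checked), len(checked)
```

If any of the sampled datasets reports a filter, the file could be treated as "likely compressed" and skipped for transfer-time compression.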

@yarikoptic
Member Author

As for pynwb, I will suggest enabling compression by default (and thus for all datasets) unless a targeted investigation of typical cases shows a significant performance hit on typical operations.

@bendichter
Member

It has been a design decision of pynwb to leave datasets plain by default. That means no compression and no chunking. If a user wants compression or chunking they must specify that for each dataset. What is the motivation behind checking datasets to see if some of them have been compressed?

@yarikoptic
Member Author

The motivation is the observed waste of up to 90% in storage, and possibly in traffic.

@yarikoptic
Member Author

Related observation - in neuroimaging the majority of data is compressed (.nii.gz), although uncompressed storage is an option and is used (rarely) for memory-mapped access.

@yarikoptic
Member Author

Re the design decision - was there some open discussion or document describing the reasoning? I might indeed be fighting windmills if compression would complicate some use cases or cause significant performance degradation, but it would be great to see the reasoning.

@bendichter
Member

Ok, well, checking a few isn't going to tell you whether the biggest ones are compressed, since the setting is made for each dataset. I can't think of any strong reasons why datasets shouldn't be compressed by default. I like the idea of chunking by default because it would allow us to grow datasets in append mode. Good luck.

@yarikoptic
Member Author

yarikoptic commented Oct 11, 2019

ok then, we will add a mode to ls to report the % of compressed datasets. My wild bet is that it is either 0 or very rarely close to 100%, with nothing in the middle ;-) Since you are the one producing many of them, you can beat me to it and prove that I am wrong! ;-)
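Such an ls-style percentage could come from a full walk over the file rather than a sample, since per-dataset settings mean a sample can miss the large datasets. A sketch under the same h5py assumptions; `percent_compressed` is a hypothetical helper name:

```python
import h5py

def percent_compressed(path):
    """Walk every dataset in an HDF5 file and return the percentage
    that carries any compression filter (0.0 for a file with no datasets)."""
    total = 0
    compressed = 0

    def visit(name, obj):
        nonlocal total, compressed
        if isinstance(obj, h5py.Dataset):
            total += 1
            if obj.compression is not None:
                compressed += 1

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return 100.0 * compressed / total if total else 0.0
```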

@bendichter
Member

Since it's optional, you would probably only expect to see it on the large datasets. You would have to do things in an awkward way to get the datasets in DynamicTables, for instance, to be compressed. Is there a reason you can't just compress the whole HDF5 file when transferring?

@yarikoptic
Member Author

yarikoptic commented Oct 12, 2019

For transfer - that was the original question to @mgrauer. But built-in compression would help for any storage, and I would not be surprised if it actually sped up some operations (e.g. ls). Yet to be investigated in practice.

@mgrauer
Contributor

mgrauer commented Oct 12, 2019

@yarikoptic

I'm not sure what you specifically mean by

> does girder support receiving compressed payload

Girder considers files to be opaque blobs, so if you want to upload a compressed file or an uncompressed file, Girder doesn't care, nor will it know that the file is compressed or not.

This relates a bit to the discussion on ingest.

@yarikoptic
Member Author

I meant something like https://en.m.wikipedia.org/wiki/HTTP_compression, where the original file/blob is not compressed; the client compresses it for the transfer and lets the server (girder) know that the file/blob needs to be uncompressed upon receipt.
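The transfer-only compression described here amounts to gzipping the request body and flagging it with a `Content-Encoding: gzip` header, which a cooperating server decompresses on receipt. A minimal sketch of the client side, assuming only that the server honors the header (which, as noted below, girder does not out of the box); `gzip_payload` is a hypothetical helper:

```python
import gzip

def gzip_payload(path):
    """Read a file and return a gzip-compressed body plus the HTTP headers
    telling the server the payload is compressed only in transit."""
    with open(path, "rb") as f:
        raw = f.read()
    body = gzip.compress(raw)
    headers = {
        "Content-Encoding": "gzip",
        "Content-Type": "application/octet-stream",
    }
    return body, headers

# A client would then upload with e.g. requests (hypothetical endpoint):
#   body, headers = gzip_payload("sub-01.nwb")
#   requests.put("https://girder.example/api/v1/file/chunk",
#                data=body, headers=headers)
```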

@mgrauer
Contributor

mgrauer commented Oct 13, 2019

Girder does not support this behavior out of the box.

Why not just have the client compress the file and upload and store the compressed file? What is the need to store the uncompressed file on the server?

This discussion has been good for generating requirements for describing an ingest pipeline! We can discuss more when we meet up in person at SfN.

@yarikoptic yarikoptic changed the title Upload: option to compress traffic upload: option to compress traffic Mar 13, 2020
@yarikoptic
Member Author

I don't think we will pursue any extra compression ATM.
