
upload: option to compress traffic #23

Closed
yarikoptic opened this issue Oct 11, 2019 · 15 comments

@yarikoptic
Member

In light of #21 it might be highly beneficial to compress files during upload.

@mgrauer - does girder support receiving compressed payload?

@bendichter - do you have a quick way/code to assess whether an HDF5 file used compression, so we could include that in the ls output and dynamically decide whether to compress the payload sent to girder?

@bendichter
Member

@yarikoptic HDF5 compression is done per dataset. You could query each dataset to see if it is compressed. Is that what you want to do? pynwb does not have a flag that compresses all datasets. Alternatively, we could just automatically compress the entire HDF5 file, which would not require any HDF5 programming.

@yarikoptic
Member Author

Yep, it is per dataset, but it is unlikely any user would want to choose it for each dataset. I guess for detection it would be enough to check some (e.g. 10) datasets within a file and treat compression of any of them as an indicator that compression was likely used.
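The sampling heuristic described above could be sketched with h5py, which exposes a dataset's compression filter via `Dataset.compression` (the filter name, or `None` if uncompressed). A minimal sketch; `sample_compression` is a hypothetical helper name, not part of any existing tool:

```python
import h5py

def sample_compression(path, limit=10):
    """Visit up to ``limit`` datasets in an HDF5 file and report how many
    use any compression filter. Returns (n_compressed, n_checked)."""
    checked = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            checked.append(obj.compression is not None)
        if len(checked) >= limit:
            return True  # any non-None return value stops visititems early

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sum(checked), len(checked)
```

If any of the sampled datasets reports a filter, the file could be treated as "likely compressed" and skipped for transfer-time compression.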

@yarikoptic
Member Author

As for pynwb, I will suggest enabling compression by default (and thus for all datasets) unless a targeted investigation of typical cases shows a significant performance hit on typical operations.

@bendichter
Member

It has been a design decision of pynwb to leave datasets plain by default. That means no compression and no chunking. If a user wants compression or chunking they must specify that for each dataset. What is the motivation behind checking datasets to see if some of them have been compressed?

@yarikoptic
Member Author

The motivation is the observed waste of up to 90% in storage, and possibly in traffic.

@yarikoptic
Member Author

Related observation - in neuroimaging the majority of data is compressed (.nii.gz), although uncompressed storage is an option and is used (rarely) for memory-mapped access.

@yarikoptic
Member Author

Re the design decision - was there some open discussion or document describing the reasoning? I might indeed be fighting windmills if compression would complicate some use cases or cause significant performance degradation, but it would be great to see the reasoning.

@bendichter
Member

Ok, well, checking a few isn't going to tell you whether the biggest ones are compressed, since the setting is made for each dataset. I can't think of any strong reasons why datasets shouldn't be compressed by default. I like the idea of chunking by default because it would allow us to grow datasets in append mode. Good luck.

@yarikoptic
Member Author

yarikoptic commented Oct 11, 2019

ok then, we will add a mode to ls to report the % of compressed datasets. My wild bet is that it is either 0 or very rarely close to 100%, with nothing in the middle ;-) Since you are the one producing many of them, you can beat me to it and prove that I am wrong! ;-)
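Such an ls-style percentage could come from a full walk over the file rather than a sample, since per-dataset settings mean a sample can miss the large datasets. A sketch under the same h5py assumptions; `percent_compressed` is a hypothetical helper name:

```python
import h5py

def percent_compressed(path):
    """Walk every dataset in an HDF5 file and return the percentage
    that carries any compression filter (0.0 for a file with no datasets)."""
    total = 0
    compressed = 0

    def visit(name, obj):
        nonlocal total, compressed
        if isinstance(obj, h5py.Dataset):
            total += 1
            if obj.compression is not None:
                compressed += 1

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return 100.0 * compressed / total if total else 0.0
```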

@bendichter
Member

Since it's optional, you would probably only expect to see it on the large datasets. You would have to do things in an awkward way to get the datasets in DynamicTables, for instance, to be compressed. Is there a reason you can't just compress the whole HDF5 file when transferring?

@yarikoptic
Member Author

yarikoptic commented Oct 12, 2019

For transfer - that was the original question to @mgrauer. But built-in compression would help for any storage, and I would not be surprised if it actually sped up some operations (e.g. ls). Yet to be investigated in practice.

@mgrauer
Contributor

mgrauer commented Oct 12, 2019

@yarikoptic

I'm not sure what you specifically mean by

> does girder support receiving compressed payload

Girder considers files to be opaque blobs, so if you want to upload a compressed file or an uncompressed file, Girder doesn't care, nor will it know that the file is compressed or not.

This relates a bit to the discussion on ingest.

@yarikoptic
Member Author

I meant something like https://en.m.wikipedia.org/wiki/HTTP_compression, where the original file/blob is not compressed; the client compresses it for the transfer and lets the server (girder) know that the file/blob needs to be uncompressed upon receipt.
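The transfer-only compression described here amounts to gzipping the request body and flagging it with a `Content-Encoding: gzip` header, which a cooperating server decompresses on receipt. A minimal sketch of the client side, assuming only that the server honors the header (which, as noted below, girder does not out of the box); `gzip_payload` is a hypothetical helper:

```python
import gzip

def gzip_payload(path):
    """Read a file and return a gzip-compressed body plus the HTTP headers
    telling the server the payload is compressed only in transit."""
    with open(path, "rb") as f:
        raw = f.read()
    body = gzip.compress(raw)
    headers = {
        "Content-Encoding": "gzip",
        "Content-Type": "application/octet-stream",
    }
    return body, headers

# A client would then upload with e.g. requests (hypothetical endpoint):
#   body, headers = gzip_payload("sub-01.nwb")
#   requests.put("https://girder.example/api/v1/file/chunk",
#                data=body, headers=headers)
```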

@mgrauer
Contributor

mgrauer commented Oct 13, 2019

Girder does not support this behavior out of the box.

Why not just have the client compress the file and upload and store the compressed file? What is the need to store the uncompressed file on the server?

This discussion has been good for generating requirements for describing an ingest pipeline! We can discuss more when we meet up in person at SfN.

@yarikoptic yarikoptic changed the title Upload: option to compress traffic upload: option to compress traffic Mar 13, 2020
@yarikoptic
Member Author

I don't think we will pursue any extra compression ATM.
