[Feature]: Write zarr without using pickle #171

Open
magland opened this issue Feb 29, 2024 · 5 comments

Labels: category: enhancement (improvements of code or code behavior), priority: low (alternative solution already working and/or relevant to only specific user(s))

Comments

magland commented Feb 29, 2024

What would you like to see added to HDMF-ZARR?

I was working on adding nwb-zarr support to neurosift and I ran into a problem where many of the datasets within the zarr were pickle-encoded. It is difficult to read pickled data in languages other than Python, and it is practically impossible using JavaScript in the browser. So I would like to request that hdmf-zarr be updated to avoid using pickling when writing datasets.
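As a rough illustration of the problem (my own sketch, not from the issue; assumes zarr-python 2.x and an illustrative example.zarr path), this is what a pickle-encoded dataset looks like on disk:

# Sketch: writing an object-dtype dataset the way hdmf-zarr currently
# does by default, with numcodecs.Pickle as the object codec.
import numcodecs
import zarr

root = zarr.open_group("example.zarr", mode="w")
dset = root.create_dataset(
    "strings",
    shape=(3,),
    dtype=object,
    object_codec=numcodecs.Pickle(),
)
dset[:] = ["a", "b", "c"]

# The .zarray metadata lists {"id": "pickle", ...} under "filters", and each
# chunk file contains a raw pickle stream, which only Python can decode.
print(dset.filters)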

Is your feature request related to a problem?

No response

What solution would you like?

@bendichter and I created a script for doing the h5->zarr conversion without using pickle:
https://gist.github.com/magland/39c5c5c5afabb982b5dd5d2367f2179b

and we confirmed that (a) the output is loadable in hdmf-zarr and (b) it works with Neurosift.

We have only tried it on one NWB file so far.

Perhaps this script could be helpful in making the modifications.
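For reference, the core idea of the conversion (this is my own rough sketch under assumed names, not the gist's actual code; assumes h5py 3.x and zarr-python 2.x) is to copy datasets over while giving object-dtype data a portable codec:

# Hypothetical helper illustrating the approach.
import h5py
import numcodecs
import zarr

def copy_dataset(h5_dset, zarr_group, name):
    if h5py.check_string_dtype(h5_dset.dtype):
        # Variable-length strings become object-dtype Zarr data; use JSON
        # instead of Pickle so non-Python readers can decode the chunks.
        out = zarr_group.create_dataset(
            name,
            shape=h5_dset.shape,
            dtype=object,
            object_codec=numcodecs.JSON(),
        )
        out[...] = h5_dset.asstr()[...]
    else:
        # Numeric data round-trips without an object codec.
        zarr_group.create_dataset(name, data=h5_dset[...])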

I perused the hdmf-zarr source code, but it wasn't clear to me where the changes needed to take place.

Tagging: @rly

Do you have any interest in helping implement the feature?

Yes.

oruebel (Contributor) commented Feb 29, 2024

Thanks for the suggestion. I think this means updating the calls to require_dataset in backend.py to set the object_codec. I believe the object_codec is currently not being specified, so Zarr falls back to its default. The question is which encoder is best to use, i.e., which is most portable and most performant in terms of encoding/decoding speed and size. Do you have any thoughts on the encoder to use?
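To frame that trade-off, here is a quick, non-authoritative comparison one could run; the candidate codecs and the string data are illustrative, not a recommendation:

# Compare encoded sizes of candidate object codecs for string data.
# Pickle is Python-only; JSON and VLenUTF8 have documented, portable layouts.
import numcodecs
import numpy as np

data = np.array(["hello", "world"] * 1000, dtype=object)
for codec in (numcodecs.Pickle(), numcodecs.JSON(), numcodecs.VLenUTF8()):
    encoded = bytes(codec.encode(data))
    print(f"{type(codec).__name__}: {len(encoded)} bytes")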

oruebel (Contributor) commented Feb 29, 2024

@magland is this something you are interested in helping with by contributing a PR or would you like us to take a look at this?

magland (Author) commented Feb 29, 2024

@oruebel I am happy to create a PR, but I will need some help knowing what to change.

Now that you've pointed me to require_dataset, I am seeing lines like this:

dset = parent.require_dataset(name,
                              shape=shape,
                              dtype=object,
                              object_codec=self.__codec_cls(),
                              **options['io_settings'])

and

self.__codec_cls = numcodecs.pickles.Pickle if object_codec_class is None else object_codec_class

That should give me something to go on.

I'll work on putting something together for you to look at.

oruebel (Contributor) commented Feb 29, 2024

I'll work on putting something together for you to look at.

Great! Thanks for the help!

A first test could be to just set the object_codec_class parameter when creating the ZarrIO object via ZarrIO.__init__. If that produces files that work for you, then this may be as simple as changing the default for the object codec here:

self.__codec_cls = numcodecs.pickles.Pickle if object_codec_class is None else object_codec_class
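Concretely, that first test might look like this (a sketch; `container` stands in for whatever pre-built NWB/HDMF container is being written, and depending on your setup you may also need to pass a type manager):

# Sketch of overriding the object codec per-file instead of changing the default.
import numcodecs
from hdmf_zarr import ZarrIO

# Note: pass the codec *class*, not an instance; hdmf-zarr instantiates
# it internally via self.__codec_cls() as shown above.
with ZarrIO("out.nwb.zarr", mode="w", object_codec_class=numcodecs.JSON) as io:
    io.write(container)  # `container` is an assumed, pre-built HDMF container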

I think there may be a couple of places where the object_codec is not being set. I'm not sure right now whether we need to add the object codec as an additional argument to those calls, or whether those calls never create datasets that require object encoding. However, I believe Zarr ignores the object codec for all other data (i.e., if we don't store objects), so it may be safe to just add it to those calls. From a quick look, these are the two calls that don't specify the object codec:

dset = parent.require_dataset(name, shape=data_shape, dtype=dtype, **io_settings)

dset = parent.require_dataset(name, shape=(1, ), dtype=dtype, **io_settings)
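A hedged sketch of the change those two call sites might need (whether they ever actually hit object dtypes is the open question above):

# Proposed shape of the fix; per the comment above, Zarr ignores the
# object_codec for non-object dtypes, so adding it unconditionally may be safe.
dset = parent.require_dataset(name,
                              shape=data_shape,
                              dtype=dtype,
                              object_codec=self.__codec_cls(),
                              **io_settings)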

In addition to datasets, another place where encoding could potentially come up is when writing attributes. I don't think this should be an issue, because a) Zarr stores attributes in JSON, and b) I don't think NWB stores objects in attributes. However, I wanted to mention it just in case. If attribute encoding should come up as an issue, those are written here:

def write_attributes(self, **kwargs):
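For completeness, a tiny check (illustrative, assuming zarr-python 2.x with a directory store) that Zarr attributes are stored as plain JSON:

# Attributes land in a .zattrs JSON document, so no object codec is involved.
import json
import zarr

g = zarr.open_group("attrs_demo.zarr", mode="w")
g.attrs["namespace"] = "core"
with open("attrs_demo.zarr/.zattrs") as f:
    print(json.load(f))  # {'namespace': 'core'}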

magland (Author) commented Mar 5, 2024

Thanks @oruebel

I tried out changing the default codec class in ZarrIO to numcodecs.JSON as you suggested, and indeed that worked! At least for the dataset I tested.

I'll make a PR and we'll see if it passes the tests.

mavaylon1 self-assigned this Apr 16, 2024
mavaylon1 added the "category: enhancement" and "priority: low" labels Apr 16, 2024
mavaylon1 added this to the 0.8.0 milestone Apr 16, 2024