[Feature]: repack NWB file #892

Open · 2 tasks done
bendichter opened this issue Jun 9, 2024 · 2 comments · May be fixed by #1003

bendichter (Contributor) commented Jun 9, 2024

What would you like to see added to NeuroConv?

It would be nice to have a workflow that takes an NWB file that has already been saved to disk and repacks it with recommended chunking and compression.

The first step would be to fetch the current backend configuration from the existing on-disk datasets. Maybe this could be a function in _dataset_configuration.py:

get_existing_backend_configuration(nwbfile) -> BackendConfiguration

where the nwbfile must be linked to an on-disk NWB file. The backend can be detected automatically from the types of the existing datasets, so it does not need to be a separate argument.
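
As a rough illustration, the function could walk the neurodata objects in the file and read each dataset's current settings. The body below is a guess at the mechanics, and it returns a plain dict rather than a real BackendConfiguration, just to keep the sketch self-contained:

import h5py
from pynwb import NWBHDF5IO

def get_existing_backend_configuration(nwbfile) -> dict:
    """Collect the on-disk chunking/compression of every dataset.

    Hypothetical sketch: only checks the "data" and "timestamps" fields;
    a real implementation would reuse the known-dataset-field machinery in
    _dataset_configuration.py and detect the backend from the dataset
    types (h5py.Dataset vs. zarr.Array).
    """
    configurations = {}
    for neurodata_object in nwbfile.objects.values():
        for field_name in ("data", "timestamps"):
            candidate = getattr(neurodata_object, field_name, None)
            if isinstance(candidate, h5py.Dataset):
                configurations[candidate.name] = dict(
                    chunk_shape=candidate.chunks,
                    compression_method=candidate.compression,
                    compression_options=candidate.compression_opts,
                )
    return configurations

with NWBHDF5IO("existing_file.nwb", "r") as io:
    print(get_existing_backend_configuration(io.read()))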

Then we should have a way to get the recommended configuration for that file. This already works in some cases with get_default_backend_configuration(nwbfile, backend), but not in all of them. If you have an ImageSeries with an external file and a (0, 0, 0)-shaped dataset, this triggers an error when the dataset is an h5py.Dataset:

from pynwb import NWBHDF5IO
from pynwb.image import ImageSeries
from pynwb.testing.mock.file import mock_NWBFile

from neuroconv.tools.nwb_helpers import get_default_backend_configuration

# Build a minimal file whose only acquisition is an ImageSeries that points
# to an external video; its data dataset is written with shape (0, 0, 0).
nwbfile = mock_NWBFile()
im_series = ImageSeries(
    name="my_video",
    external_file=["my_video.mp4"],
    starting_frame=[0],
    format="external",
    rate=30.0,
)
nwbfile.add_acquisition(im_series)

with NWBHDF5IO("this_test4.nwb", "w") as io:
    io.write(nwbfile)

# Read it back, keeping the file open so the datasets are h5py.Dataset
# objects, then request the recommended configuration.
io = NWBHDF5IO("this_test4.nwb", "r")
nwbfile = io.read()

get_default_backend_configuration(nwbfile, "hdf5")
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[22], line 1
----> 1 get_default_backend_configuration(nwbfile, "hdf5")

File ~/dev/neuroconv/src/neuroconv/tools/nwb_helpers/_backend_configuration.py:19, in get_default_backend_configuration(nwbfile, backend)
     16 """Fill a default backend configuration to serve as a starting point for further customization."""
     18 BackendConfigurationClass = BACKEND_CONFIGURATIONS[backend]
---> 19 return BackendConfigurationClass.from_nwbfile(nwbfile=nwbfile)

File ~/dev/neuroconv/src/neuroconv/tools/nwb_helpers/_configuration_models/_base_backend.py:61, in BackendConfiguration.from_nwbfile(cls, nwbfile)
     58 @classmethod
     59 def from_nwbfile(cls, nwbfile: NWBFile) -> Self:
     60     default_dataset_configurations = get_default_dataset_io_configurations(nwbfile=nwbfile, backend=cls.backend)
---> 61     dataset_configurations = {
     62         default_dataset_configuration.location_in_file: default_dataset_configuration
     63         for default_dataset_configuration in default_dataset_configurations
     64     }
     66     return cls(dataset_configurations=dataset_configurations)

File ~/dev/neuroconv/src/neuroconv/tools/nwb_helpers/_configuration_models/_base_backend.py:61, in <dictcomp>(.0)
     58 @classmethod
     59 def from_nwbfile(cls, nwbfile: NWBFile) -> Self:
     60     default_dataset_configurations = get_default_dataset_io_configurations(nwbfile=nwbfile, backend=cls.backend)
---> 61     dataset_configurations = {
     62         default_dataset_configuration.location_in_file: default_dataset_configuration
     63         for default_dataset_configuration in default_dataset_configurations
     64     }
     66     return cls(dataset_configurations=dataset_configurations)

File ~/dev/neuroconv/src/neuroconv/tools/nwb_helpers/_dataset_configuration.py:154, in get_default_dataset_io_configurations(nwbfile, backend)
    151 if isinstance(candidate_dataset, np.ndarray) and candidate_dataset.size == 0:
    152     continue
--> 154 dataset_io_configuration = DatasetIOConfigurationClass.from_neurodata_object(
    155     neurodata_object=neurodata_object, dataset_name=known_dataset_field
    156 )
    158 yield dataset_io_configuration

File ~/dev/neuroconv/src/neuroconv/tools/nwb_helpers/_configuration_models/_base_dataset_io.py:272, in DatasetIOConfiguration.from_neurodata_object(cls, neurodata_object, dataset_name)
    270     compression_method = "gzip"
    271 elif dtype != np.dtype("object"):
--> 272     chunk_shape = SliceableDataChunkIterator.estimate_default_chunk_shape(
    273         chunk_mb=10.0, maxshape=full_shape, dtype=np.dtype(dtype)
    274     )
    275     buffer_shape = SliceableDataChunkIterator.estimate_default_buffer_shape(
    276         buffer_gb=0.5, chunk_shape=chunk_shape, maxshape=full_shape, dtype=np.dtype(dtype)
    277     )
    278     compression_method = "gzip"

File ~/dev/neuroconv/src/neuroconv/tools/hdmf.py:38, in GenericDataChunkIterator.estimate_default_chunk_shape(chunk_mb, maxshape, dtype)
     35 chunk_bytes = chunk_mb * 1e6
     37 min_maxshape = min(maxshape)
---> 38 v = tuple(math.floor(maxshape_axis / min_maxshape) for maxshape_axis in maxshape)
     39 prod_v = math.prod(v)
     40 while prod_v * itemsize > chunk_bytes and prod_v != 1:

File ~/dev/neuroconv/src/neuroconv/tools/hdmf.py:38, in <genexpr>(.0)
     35 chunk_bytes = chunk_mb * 1e6
     37 min_maxshape = min(maxshape)
---> 38 v = tuple(math.floor(maxshape_axis / min_maxshape) for maxshape_axis in maxshape)
     39 prod_v = math.prod(v)
     40 while prod_v * itemsize > chunk_bytes and prod_v != 1:

ZeroDivisionError: division by zero

We should either adjust this function so it handles such datasets, or create a separate function dedicated to the repacking use case.
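
One way to adjust it, assuming that skipping empty datasets is the desired behavior, would be to generalize the zero-size check that get_default_dataset_io_configurations already applies to in-memory numpy arrays (lines 151-152 in the traceback above) so it also catches datasets with a zero-length axis regardless of type. The helper name below is made up:

import math

def has_zero_length_axis(candidate_dataset) -> bool:
    """Return True for any dataset-like object (np.ndarray, h5py.Dataset,
    zarr.Array, ...) with a zero-length axis, e.g. shape (0, 0, 0), which
    would otherwise hit the division by min(maxshape) shown above."""
    shape = getattr(candidate_dataset, "shape", None)
    return shape is not None and math.prod(shape) == 0

The existing isinstance(candidate_dataset, np.ndarray) and candidate_dataset.size == 0 guard could then call this helper instead, so zero-sized h5py datasets are skipped as well.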

Then, finally, we would need a way to write the result to a new file, probably using pynwb's export functionality.
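
NWBHDF5IO.export already covers the write-to-a-new-file step; the remaining work would be wiring the chosen configuration into the exported datasets. A minimal round-trip, with placeholder file names, looks like this:

from pynwb import NWBHDF5IO

# Read the original file and export its contents to a new file. Repacking
# would additionally re-wrap each dataset with the chosen chunking and
# compression before the export call.
with NWBHDF5IO("original.nwb", mode="r") as read_io:
    nwbfile = read_io.read()
    with NWBHDF5IO("repacked.nwb", mode="w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile)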

It would be nice to have two usage modes: one that completely automates everything, and one that allows users to repack specific datasets with specific parameters, as sketched below.
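
For the manual mode, the existing pattern of editing a default configuration and applying it with configure_backend could carry over; the dataset location key below is a made-up example:

from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

backend_configuration = get_default_backend_configuration(nwbfile, backend="hdf5")

# Tweak a single dataset's settings; "acquisition/my_video/data" is a
# placeholder location.
dataset_configuration = backend_configuration.dataset_configurations["acquisition/my_video/data"]
dataset_configuration.chunk_shape = (64, 64, 64)
dataset_configuration.compression_method = "gzip"

configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

Whether configure_backend can be applied to datasets that already live on disk as h5py.Dataset objects is part of what this feature would need to sort out.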

This workflow should also allow the user to switch from one backend to another.
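
Putting the pieces together, a top-level entry point might look like the following; the function name, signature, and defaults are all hypothetical:

from pynwb import NWBHDF5IO
from neuroconv.tools.nwb_helpers import get_default_backend_configuration

def repack_nwbfile(input_path, output_path, backend="hdf5", backend_configuration=None):
    """Hypothetical sketch of the proposed workflow.

    Fully automated mode: leave backend_configuration as None and the
    recommended settings are computed from the file. Manual mode: pass an
    edited BackendConfiguration. Passing backend="zarr" for an HDF5 input
    would switch backends on the way out.
    """
    with NWBHDF5IO(input_path, mode="r") as read_io:
        nwbfile = read_io.read()
        if backend_configuration is None:
            backend_configuration = get_default_backend_configuration(nwbfile, backend)
        # Applying the configuration to the datasets and writing output_path
        # would build on configure_backend and pynwb's export, as above.
        ...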

Is your feature request related to a problem?

It's somewhat common for users to upload sub-optimally packaged NWB files. This would also be a suitable workflow for users who create NWB files in MatNWB and don't know how to configure the datasets properly there.

Do you have any interest in helping implement the feature?

No.

pauladkisson (Member) commented Aug 15, 2024

@bendichter, I can't replicate this error with get_default_backend_configuration, so maybe it has been fixed in the time since you raised this issue?

bendichter (Contributor, Author) commented
yeah, maybe
