Filters should be a list of dictionaries #65

abarciauskas-bgse · 2024-03-29T17:41:22Z

I believe filters should be an optional list of dictionaries, at least in the case of netCDF4 files, which kerchunk reads with the h5py library. Further, the Zarr spec indicates that filters should be a list of JSON objects.

Without this datatype change, I get pydantic type errors which I first reported in #60.
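To make the expected shape concrete, here is a minimal stdlib-only sketch of what I mean by "optional list of dictionaries". The helper name `validate_filters` and the example metadata values are hypothetical, made up for illustration; this is not the virtualizarr or pydantic code. It only assumes what the Zarr v2 spec says: `filters` is either null or a list of JSON objects, each with an `id` key.

```python
import json


def validate_filters(filters):
    """Hypothetical check: accept None or a list of dicts that each carry an 'id' key."""
    if filters is None:
        return None
    if not isinstance(filters, list):
        raise TypeError(f"filters must be a list or None, got {type(filters).__name__}")
    for codec in filters:
        if not isinstance(codec, dict):
            raise TypeError(f"each filter must be a dict, got {type(codec).__name__}")
        if "id" not in codec:
            raise ValueError("each filter dict must contain an 'id' key")
    return filters


# Illustrative Zarr v2 array metadata (values invented for this example)
zarray = {
    "shape": [100, 100],
    "chunks": [100, 100],
    "dtype": "<f4",
    "compressor": {"id": "zlib", "level": 4},
    "filters": [{"id": "shuffle", "elementsize": 4}],
    "fill_value": None,
    "order": "C",
    "zarr_format": 2,
}

# The filters survive a JSON round trip as a list of objects
round_tripped = json.loads(json.dumps(zarray))
checked = validate_filters(round_tripped["filters"])
print(checked)
```

A single dict (rather than a list of dicts) is rejected by this check, which is the kind of mismatch I suspect is behind the type errors.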

Reproducible example

In this example, I created an artificial dataset with filters and also used the air dataset from the Usage docs, since I knew that worked. Interestingly, the netCDF4 library appears to read filters from both files, while the h5py library only reads filters from the artificially generated dataset. I have not yet tracked down why this is.

import h5py
import numpy as np
import xarray as xr
from netCDF4 import Dataset
from virtualizarr import open_virtual_dataset

# Create some artificial data
data = np.random.rand(100, 100)  # 100x100 array of random numbers

# Create a new NetCDF file
nc_filename = 'artificial_with_filter.nc'
nc_file = Dataset(nc_filename, 'w', format='NETCDF4')

# Define the dimensions of the data
nc_file.createDimension('x', data.shape[0])
nc_file.createDimension('y', data.shape[1])

# Create a variable with zlib compression
data_var = nc_file.createVariable('data', np.float32, ('x', 'y'), zlib=True)

# Assign the data to the variable
data_var[:] = data

# Close the file
nc_file.close()

print(f"NetCDF file '{nc_filename}' created successfully with zlib compression.")

# create an example netCDF4 file from xarray dataset
ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')

files = ['air.nc', 'artificial_with_filter.nc']
var_keys = ['air', 'data']
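for file in files:
    # Open the same file with both libraries and close the handles when done
    with h5py.File(file, 'r') as h5file, Dataset(file, 'r') as nc_file:
        for group_name in h5file.keys():
            if group_name in var_keys:
                group = h5file[group_name]

                # _filters is a private h5py attribute listing the filters applied to the dataset
                h5filters = group._filters
                print(f"Filters found with hdf5 for '{group_name}': {h5filters}")

                var = nc_file.variables[group_name]
                ncfilters = var.filters()
                print(f"Filters found with netcdf for '{group_name}': {ncfilters}")

    # This is the call that raises the pydantic type errors
    open_virtual_dataset(file)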

abarciauskas-bgse · 2024-03-29T20:58:34Z

closed via #66

abarciauskas-bgse mentioned this issue Mar 29, 2024

Ab/filters dtype #66

Merged

TomNicholas added the bug label Mar 29, 2024

abarciauskas-bgse closed this as completed Mar 29, 2024

abarciauskas-bgse mentioned this issue Mar 29, 2024

Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays... #60

Closed

