Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open_virtual_dataset with dmr++ #113

Merged
merged 28 commits into from
Aug 26, 2024
Merged

Conversation

ayushnag
Copy link
Contributor

@ayushnag ayushnag commented May 14, 2024

@TomNicholas TomNicholas added references generation Reading byte ranges from archival files enhancement New feature or request labels May 14, 2024
@ayushnag ayushnag changed the title basic dmr parsing functionality open_dataset with dmr++ May 14, 2024
@TomNicholas TomNicholas changed the title open_dataset with dmr++ open_virtual_dataset with dmr++ May 14, 2024
virtualizarr/xarray.py Outdated Show resolved Hide resolved
virtualizarr/xarray.py Outdated Show resolved Hide resolved
virtualizarr/dmrpp.py Outdated Show resolved Hide resolved
virtualizarr/dmrpp.py Outdated Show resolved Hide resolved
@agoodm
Copy link
Contributor

agoodm commented May 15, 2024

Thanks for taking a look and giving my suggested changes to the chunk key parsing a try @ayushnag !

Continuing the discussion on performance I think the remaining bottlenecks (aside from your point about I/O in the cloud maybe) with this now lie primarily outside the scope of this work, and I don't expect changing XML readers to make a significant improvement.

Comment on lines 72 to 73
group : str, default None
Group path within the dataset to open. For example netcdf4 and hdf5 groups
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to separate out the addition of this kwarg into a separate pull request, and implement it for the existing HDF5 reader. Then this PR wouldn't need to change the API of open_virtual_dataset.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #165

virtualizarr/readers/dmrpp.py Show resolved Hide resolved
virtualizarr/readers/dmrpp.py Show resolved Hide resolved
Copy link
Contributor Author

@ayushnag ayushnag Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The int32 to int64 change had to be made since I ran into some large byte offsets with the Atlas ICE-SAT dataset. Here is an example error: OverflowError: Python integer 6751178683 out of bounds for int32

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that justifies #177!

@ayushnag
Copy link
Contributor Author

ayushnag commented Jun 27, 2024

Some questions about writing unit tests:

  • How to load test dmrpp’s?
    • These files are available over https but need netrc login (NASA Earthdata authentication)
    • I will check how earthaccess gets creds and does testing
  • What should I compare my result to?

@ayushnag
Copy link
Contributor Author

ayushnag commented Aug 4, 2024

@TomNicholas this PR should be ready to go now. You can take a look at the code again if you wish and I can add updates. I have also added docs and release notes.

Copy link
Collaborator

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I'm still unclear if you prefer to merge into EarthAccess or here.

attrs.update(self._parse_attribute(attr_tag))
return xr.Dataset(
data_vars=data_vars,
coords=xr.Coordinates(coords=coord_vars, indexes={}),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the indexes variable here technically be passed through from open_virtual_dataset? Right now it doesn't matter, but it would in order to support #18.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right, I will fix that

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and #113 (comment) are the only comment that I think actually need to be addressed before merging, because they affect public API. Everything else can be left until later if you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to get this fixed however I think there is an issue which is that the indexes cannot be created since we don't have the path to the actual data file. Technically the dmrpp file does have a data path but it is only the file name and not the full path. E.g. "data.nc" instead of "s3://.../data.nc". So in that case if I add support for the indexes param it will fail when indexes=None when usually in virtualizarr indexes=None indicates auto-create indexes.

What I could do instead is to treat indexes=None and indexes={} as the same but raise a warning if indexes=None that the behavior is not as expected. However I can still accept manually created indexes (e.g. {"time": Index})

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just raise a NotImplementedError on indexes={}. But at least something that isn't silently doing an inconsistent behaviour to the other readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought indexes={} would be acceptable since that is how most people indicate that indexes should not be created?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant indexes=None sorry

attr[attr_tag.attrib["name"]] = values[0] if len(values) == 1 else values
return attr

def _parse_filters(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these methods don't technically need to be methods, because they don't use self. It could all be functional instead. But this isn't very important, just a note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right these can be standalone. Even many of the functions that use self are just using the class constants. I will update after the initial merge

"offset": int(chunk_tag.attrib["offset"]),
"length": int(chunk_tag.attrib["nBytes"]),
}
return ChunkManifest(entries=chunkmanifest)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you could rewrite this to use ChunkManifest.from_arrays()? It would potentially be a lot more performant. (But this could also be left to a later PR.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice didn't know about that function. I will give it a try

@mdsumner mdsumner mentioned this pull request Aug 20, 2024
docs/api.rst Outdated Show resolved Hide resolved
@ayushnag
Copy link
Contributor Author

@TomNicholas do you know why the test is failing? I have added dmrpp as a valid file type so it should be recognized in the unit test. It also passes when I run network tests locally

@TomNicholas
Copy link
Collaborator

I'm just looking at that. I suspect it's just a bad merge from main. Have you tried pulling the latest changes to this branch and running that locally?

virtualizarr/xarray.py Outdated Show resolved Hide resolved
virtualizarr/xarray.py Outdated Show resolved Hide resolved
@TomNicholas TomNicholas merged commit d7f0c57 into zarr-developers:main Aug 26, 2024
8 checks passed
@TomNicholas
Copy link
Collaborator

Amazing piece of work here @ayushnag !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request references generation Reading byte ranges from archival files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading from dmrcp index files?
3 participants