Opening virtual datasets (dmr-adapter) #606
Draft
ayushnag wants to merge 5 commits into nsidc:main from ayushnag:dmr-adapter
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
from __future__ import annotations | ||
|
||
import fsspec | ||
import xarray as xr | ||
|
||
import earthaccess | ||
|
||
|
||
def _parse_dmr( | ||
fs: fsspec.AbstractFileSystem, | ||
data_path: str, | ||
dmr_path: str | None = None | ||
) -> xr.Dataset: | ||
""" | ||
Parse a granule's DMR++ file and return a virtual xarray dataset | ||
|
||
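Parameters | ||
---------- | ||
fs : fsspec.AbstractFileSystem | ||
The file system to use to open the DMR++ | ||
data_path : str | ||
The path to the data file described by the DMR++ | ||
dmr_path : str | None | ||
The path to the DMR++ file. If None, it is derived by appending ".dmrpp" to data_path | ||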
|
||
Returns | ||
---------- | ||
xr.Dataset | ||
The virtual dataset (with virtualizarr ManifestArrays) | ||
|
||
Raises | ||
---------- | ||
Exception | ||
If the DMR++ file is not found or if there is an error parsing the DMR++ | ||
""" | ||
from virtualizarr.readers.dmrpp import DMRParser | ||
|
||
dmr_path = data_path + ".dmrpp" if dmr_path is None else dmr_path | ||
with fs.open(dmr_path) as f: | ||
parser = DMRParser(f.read(), data_filepath=data_path) | ||
return parser.parse() | ||
|
||
|
||
def open_virtual_mfdataset( | ||
granules: list[earthaccess.results.DataGranule], | ||
access: str = "indirect", | ||
preprocess: callable | None = None, | ||
parallel: bool = True, | ||
**xr_combine_nested_kwargs, | ||
) -> xr.Dataset: | ||
""" | ||
Open multiple granules as a single virtual xarray Dataset | ||
|
||
Parameters | ||
---------- | ||
granules : list[earthaccess.results.DataGranule] | ||
The granules to open | ||
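access : str | ||
The access method to use. One of "direct" or "indirect". Direct is for S3/cloud access, indirect is for HTTPS access. | ||
preprocess : callable, optional | ||
A function applied to each virtual dataset before the datasets are combined | ||
parallel : bool | ||
If True, wrap opening and preprocessing in dask.delayed and compute all granules together | ||
xr_combine_nested_kwargs : dict | ||
Keyword arguments for xarray.combine_nested. | ||
See https://docs.xarray.dev/en/stable/generated/xarray.combine_nested.html | ||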
|
||
Returns | ||
---------- | ||
xr.Dataset | ||
The virtual dataset | ||
""" | ||
if access == "direct": | ||
fs = earthaccess.get_s3fs_session(results=granules) | ||
else: | ||
fs = earthaccess.get_fsspec_https_session() | ||
if parallel: | ||
# wrap _parse_dmr and preprocess with delayed | ||
import dask | ||
open_ = dask.delayed(_parse_dmr) | ||
if preprocess is not None: | ||
preprocess = dask.delayed(preprocess) | ||
else: | ||
open_ = _parse_dmr | ||
vdatasets = [open_(fs=fs, data_path=g.data_links(access=access)[0]) for g in granules] | ||
if preprocess is not None: | ||
vdatasets = [preprocess(ds) for ds in vdatasets] | ||
if parallel: | ||
vdatasets = dask.compute(vdatasets)[0] | ||
if len(vdatasets) == 1: | ||
vds = vdatasets[0] | ||
else: | ||
vds = xr.combine_nested(vdatasets, **xr_combine_nested_kwargs) | ||
return vds | ||
|
||
|
||
def open_virtual_dataset( | ||
granule: earthaccess.results.DataGranule, access: str = "indirect" | ||
) -> xr.Dataset: | ||
""" | ||
Open a granule as a single virtual xarray Dataset | ||
|
||
Parameters | ||
---------- | ||
granule : earthaccess.results.DataGranule | ||
The granule to open | ||
access : str | ||
The access method to use. One of "direct" or "indirect". Direct is for S3/cloud access, indirect is for HTTPS access. | ||
|
||
Returns | ||
---------- | ||
xr.Dataset | ||
The virtual dataset | ||
""" | ||
return open_virtual_mfdataset( | ||
granules=[granule], access=access, parallel=False, preprocess=None | ||
) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
import logging | ||
import os | ||
import unittest | ||
|
||
import earthaccess | ||
import pytest | ||
|
||
pytest.importorskip("virtualizarr") | ||
pytest.importorskip("dask") | ||
|
||
logger = logging.getLogger(__name__) | ||
assertions = unittest.TestCase("__init__") | ||
|
||
assertions.assertTrue("EARTHDATA_USERNAME" in os.environ) | ||
assertions.assertTrue("EARTHDATA_PASSWORD" in os.environ) | ||
|
||
logger.info(f"Current username: {os.environ['EARTHDATA_USERNAME']}") | ||
logger.info(f"earthaccess version: {earthaccess.__version__}") | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def granules(): | ||
granules = earthaccess.search_data( | ||
count=2, | ||
short_name="MUR-JPL-L4-GLOB-v4.1", | ||
cloud_hosted=True | ||
) | ||
return granules | ||
|
||
|
||
@pytest.mark.parametrize("output", ["memory"]) | ||
def test_open_virtual_mfdataset(tmp_path, granules, output): | ||
xr = pytest.importorskip("xarray") | ||
# Open directly with `earthaccess.open` | ||
expected = xr.open_mfdataset(earthaccess.open(granules), concat_dim="time", combine="nested", combine_attrs="drop_conflicts") | ||
|
||
result = earthaccess.open_virtual_mfdataset(granules=granules, access="indirect", concat_dim="time", parallel=True, preprocess=None) | ||
# dimensions | ||
assert result.sizes == expected.sizes | ||
# variable names, variable dimensions | ||
assert result.variables.keys() == expected.variables.keys() | ||
# attributes | ||
assert result.attrs == expected.attrs | ||
# coordinates | ||
assert result.coords.keys() == expected.coords.keys() | ||
# chunks | ||
assert result.chunks == expected.chunks | ||
# encoding | ||
assert result.encoding == expected.encoding |
It's interesting that you're not actually using the `filetype='dmr++'` option to `virtualizarr.open_virtual_dataset` here. It seems to me that one alternative option would be for everything in zarr-developers/VirtualiZarr#113 to also live in this library, as it already pretty much entirely uses public virtualizarr API... But I guess that depends whether you think the dmr++ option to `virtualizarr.open_virtual_dataset` is likely to be useful outside of the context of the `earthaccess` library.

The only reason is that the parser has an additional kwarg, `data_filepath`, which is required in cases where the dmr path cannot be derived simply by appending `.dmrpp`. If there is a way to pass `engine`-specific args in `virtualizarr.open_dataset`, then I can switch to that.

I'm not sure I understand - why is the main `filepath` you pass to `virtualizarr.open_virtual_dataset` not sufficient?

For this case, where the dmr path is independent of the data path:
virtualizarr.open_dataset(filepath="s3://air.dmrpp", data_filepath="s3://datafiles/air.nc", engine="dmr++")
The chunk manifest needs to store the `data_filepath` instead of the dmr filepath.

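To make the two cases concrete, here is a minimal sketch of the path resolution that `_parse_dmr` in this PR performs. The name `resolve_dmr_path` is hypothetical and not part of earthaccess or virtualizarr. The paths are the illustrative ones from this thread.

```python
from __future__ import annotations


def resolve_dmr_path(data_path: str, dmr_path: str | None = None) -> str:
    """Return the DMR++ location for a data file (hypothetical helper)."""
    if dmr_path is None:
        # Sidecar convention: the DMR++ sits next to the data file.
        return data_path + ".dmrpp"
    # Independent location: the caller must supply it explicitly.
    return dmr_path


# Case 1: the DMR++ path can be derived from the data path.
sidecar = resolve_dmr_path("s3://datafiles/air.nc")

# Case 2: the DMR++ lives elsewhere, so both paths are needed.
independent = resolve_dmr_path("s3://datafiles/air.nc", dmr_path="s3://air.dmrpp")
```

In the second case, nothing in the DMR++ path points back to the data file, which is why the data path has to be passed separately.
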
Huh - does the dmr++ data not contain the path to the original data??

No, the dmr file only contains the file name and not the full path. This is an example of the name that a dmr file contains:
name="20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"

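As a rough illustration of this point, the sketch below reads the `name` attribute from a trimmed DMR++ root element using only the standard library. The XML snippet is a hypothetical, heavily shortened stand-in for a real DMR++ file; only the `name` value is taken from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed DMR++ document: a real file also holds dimensions,
# variables, and chunk byte ranges inside the root element.
DMRPP_SNIPPET = """<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#"
         name="20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc">
</Dataset>
"""

root = ET.fromstring(DMRPP_SNIPPET)
recorded_name = root.attrib["name"]

# The recorded name has no scheme, bucket, or directory component,
# so it cannot be used on its own to locate the data file.
has_location = "/" in recorded_name
```

Because `has_location` is false for a name like this, the full data location has to come from somewhere else, such as the granule's data links.
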
I noticed that path renaming was added to virtualizarr, which can solve this issue. I will just call `vz.open_dataset` and then rename the data paths using `earthaccess` results. Then I can switch to the public `virtualizarr` API now and remove the `_parse_dmr` function.

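One way the renaming step could look is sketched below, in plain Python. `make_path_renamer` is a hypothetical helper, and the example URL is made up. In this PR the links would come from `g.data_links(access=access)`. Whether virtualizarr's renaming accepts a callable like this is an assumption to check against its documentation.

```python
from __future__ import annotations

import posixpath
from typing import Callable


def make_path_renamer(data_links: list[str]) -> Callable[[str], str]:
    """Map a bare file name to its full data link (hypothetical helper)."""
    by_name = {posixpath.basename(link): link for link in data_links}

    def rename(old_path: str) -> str:
        # Leave the path unchanged when no link has a matching file name.
        return by_name.get(posixpath.basename(old_path), old_path)

    return rename


# Made-up link standing in for an earthaccess data link.
links = [
    "https://example.com/data/20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
]
rename = make_path_renamer(links)
renamed = rename("20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")
unmatched = rename("other.nc")
```

Matching on the base name keeps the mapping independent of whatever prefix the DMR++ reader recorded, at the cost of assuming file names are unique within the set of granules.
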
So what does that imply for my original question:

@betolink How about if I move the parser code into `earthaccess`? Now that I'm thinking about it, it makes more sense for it to be in the NASA-related repository. It would also make unit tests easier, since `earthaccess` can easily access NASA dmrpp files. Then this PR will add `dmrpp.py` and `virtualizarr.py`.