
Move to new cds #62

Status: Open
Wants to merge 39 commits into base branch main.
Conversation

@SarahAlidoost (Member) commented Oct 16, 2024

This PR introduces the following main changes:

  • support the new CDS API
  • support the new changes in xarray_regrid
  • add support for selecting spatial bounds for land cover
  • update the docs
  • fix a few issues

The recipe STEMMUS_SCOPE_input.yml ran successfully. Although I reduced the time range in the recipe (1 month instead of 6 months), the new CDS seems fast now.

closes #59
closes #63

@SarahAlidoost marked this pull request as ready for review on October 22, 2024, 14:23.
@SarahAlidoost (Member, Author) commented:

@BSchilperoort when you have time, can you please review this PR? I explained most of the changes in this comment.

@BSchilperoort (Contributor) left a review comment:

I did not try to run it yet, but here are some things I noticed in the changes.

Resolved review threads on: docs/index.md (outdated), src/zampy/datasets/cds_utils.py, src/zampy/datasets/ecmwf_dataset.py
@@ -150,8 +150,7 @@ def load(
        variable_names: list[str],
    ) -> xr.Dataset:
        files = list((ingest_dir / self.name).glob("*.nc"))

-       ds = xr.open_mfdataset(files, parallel=True)
+       ds = xr.open_mfdataset(files)
@BSchilperoort (Contributor) commented:

The parallel kwarg was there to ensure the dataset is opened as Dask arrays, to avoid any memory issues. Will that go OK now?

@SarahAlidoost (Member, Author) commented:

If the parallel kwarg is True, the open and preprocess steps are performed in parallel using Dask. So we need to configure Dask with the number of jobs or CPUs for parallel processing; otherwise dask.distributed.Client() uses all available cores by default, which causes a segmentation fault. This is the case in cli.py, where n_workers is not set, see here.
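A minimal sketch of what capping the workers could look like, assuming the file list from the load() snippet above (the file names and worker counts below are made up, not zampy's actual configuration):

from dask.distributed import Client
import xarray as xr

# Hypothetical stand-in for the NetCDF files gathered in load().
files = ["era5_2020-01.nc", "era5_2020-02.nc"]

# Start a local cluster with an explicit, small number of workers instead of
# letting dask.distributed claim every available core by default.
client = Client(n_workers=2, threads_per_worker=2)

# With parallel=True, the per-file open and preprocess steps run on that cluster.
ds = xr.open_mfdataset(files, parallel=True)

client.close()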

@BSchilperoort (Contributor) commented:

OK. I do think that making use of dask.distributed is quite important for performance. You can configure the defaults with a config file: https://docs.dask.org/en/latest/configuration.html#specify-configuration

Regarding "dask.distributed.Client() will use all available cores by default and this causes a segmentation fault error": I guess on some systems? It seems like something else must be causing the segfault. On my laptop it is no problem to start as many workers as there are CPU threads, and that is the default behavior of Dask for a reason.
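For reference, a hedged sketch of setting Dask defaults from Python; the dotted keys mirror what the linked config file would contain, and the threshold values are only illustrative:

import dask

# Equivalent to entries in a dask config file (e.g. under ~/.config/dask/):
# tune distributed worker memory thresholds so workers spill to disk before
# they run out of memory.
dask.config.set(
    {
        "distributed.worker.memory.target": 0.60,     # start spilling to disk
        "distributed.worker.memory.spill": 0.70,
        "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
        "distributed.worker.memory.terminate": 0.95,  # restart the worker
    }
)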

@SarahAlidoost (Member, Author) commented:

Thanks for explaining. I did some tests, and it looks like the Dask configuration didn't fix the issue. When I added parallel=True back, the segmentation fault errors showed up again on macOS and Linux, but not on Windows; see the GA workflows below. We only use parallel=True for fapar_lai.py and not for the others. Do you know why it's needed there? After refactoring the code as below, the tests pass locally. What do you think?

from dask.distributed import Client

client = Client()  # manage the cluster lifetime explicitly around the parallel open
ds = xr.open_mfdataset(files, parallel=True)
client.close()

if "flag_values" in da.attrs:
regrid_values = da.attrs["flag_values"]
else:
regrid_values = np.unique(da.values)
@BSchilperoort (Contributor) commented:

Do note that this is a very expensive operation, as every single value in the DataArray has to be checked...

It might be better to drop some variables or handle this differently; otherwise it won't work on a large area (or will be very slow).

The variables current_pixel_state and change_count have valid_min and valid_max attributes.

observation_count does too, but its range is very large, and it is probably irrelevant after regridding (it no longer makes physical sense).
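One way to read that suggestion, as a hypothetical helper (not code from this PR): prefer flag_values, then a valid_min/valid_max range, and only scan the data as a last resort.

import numpy as np
import xarray as xr

def candidate_values(da: xr.DataArray) -> np.ndarray:
    """Hypothetical helper: collect regrid values without scanning the data if possible."""
    if "flag_values" in da.attrs:  # e.g. lccs_class
        return np.asarray(da.attrs["flag_values"])
    if "valid_min" in da.attrs and "valid_max" in da.attrs:
        # e.g. current_pixel_state and change_count
        return np.arange(int(da.attrs["valid_min"]), int(da.attrs["valid_max"]) + 1)
    # Last resort: check every value in the array (expensive for large areas).
    return np.unique(da.values)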

@SarahAlidoost (Member, Author) commented:

Good point, I replaced np.unique with da.unique.
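As a hedged illustration of what a lazier unique looks like when the DataArray is Dask-backed (a runnable toy example, not the exact line from the PR):

import dask.array
import numpy as np
import xarray as xr

# Toy Dask-backed stand-in for a land-cover DataArray.
data = dask.array.from_array(np.array([[10, 10, 20], [20, 30, 30]]), chunks=(1, 3))
da = xr.DataArray(data, dims=("lat", "lon"), name="lccs_class")

# The unique values are reduced chunk by chunk on the Dask graph instead of
# pulling the whole array into memory as np.unique(da.values) would.
regrid_values = dask.array.unique(da.data).compute()
print(regrid_values)  # [10 20 30]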

Resolved review thread on: src/zampy/datasets/land_cover.py
raw_variable = "lccs_class"
var_list.remove(raw_variable)
ds = ds.drop_vars(var_list) # noqa: PLW2901

ds = ds.sortby(["lat", "lon"]) # noqa: PLW2901
ds = ds.rename({"lat": "latitude", "lon": "longitude"}) # noqa: PLW2901
new_grid = xarray_regrid.Grid(
north=90,
@BSchilperoort (Contributor) commented:

I originally regridded the full world, as this would then only have to be done once during ingestion to the zampy format. Now it has to be redone every time the config is different.

@SarahAlidoost (Member, Author) commented:

Now that spatial bounds can be specified for land cover (see issue #63), regridding the whole globe is not needed; it is also expensive in memory.
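For illustration, a hedged sketch of building the regridding target only over the requested bounds instead of the whole globe; the bound values and 0.25° resolution are made up, and only xarray_regrid.Grid and create_regridding_dataset are taken from the snippet above and the library's public helpers:

import xarray_regrid

# Hypothetical spatial bounds (degrees): north, east, south, west.
north, east, south, west = 54.0, 7.0, 50.0, 3.0

# Target grid restricted to the area of interest; regridding onto this is far
# cheaper in memory than regridding the full globe first.
new_grid = xarray_regrid.Grid(
    north=north,
    east=east,
    south=south,
    west=west,
    resolution_lat=0.25,
    resolution_lon=0.25,
)
target = xarray_regrid.create_regridding_dataset(new_grid)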

@BSchilperoort (Contributor) commented:

The test data is still based on the old CDS files. It might be good to replace those with the new format...

@SarahAlidoost (Member, Author) commented:

Fixed.

sonarcloud (bot) commented Nov 15, 2024:

Quality Gate failed

Failed conditions:
72.9% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud.

Successfully merging this pull request may close these issues:

  • spatial_bounds in cds_utils.cds_request_land_cover
  • Move to new CDS and ADS