Move to new cds #62
base: main
Conversation
…version xarray_regrid v0.4.0
…n ingest function of land cover to avoid memory problems
@BSchilperoort when you have time, can you please review this PR? I explained most of the changes in this comment.
I did not try to run it yet, but here are some things I noticed in the changes.
src/zampy/datasets/fapar_lai.py
```diff
@@ -150,8 +150,7 @@ def load(
         variable_names: list[str],
     ) -> xr.Dataset:
         files = list((ingest_dir / self.name).glob("*.nc"))
-
-        ds = xr.open_mfdataset(files, parallel=True)
+        ds = xr.open_mfdataset(files)
```
The `parallel` kwarg was there to ensure the dataset is opened as Dask arrays, to avoid any memory issues. Will that go OK now?
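For reference, a minimal sketch of keeping the data Dask-backed even without `parallel=True`, by passing an explicit `chunks` argument; the path and chunk value are placeholders, not taken from this PR:

```python
from pathlib import Path
import xarray as xr

files = list(Path("ingest_dir/fapar-lai").glob("*.nc"))  # placeholder path

# chunks="auto" (or an explicit dict) keeps the variables as Dask arrays,
# while the individual files are still opened serially (no distributed
# client is required for the open step).
ds = xr.open_mfdataset(files, chunks="auto")
```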
If the `parallel` kwarg is True, the open and preprocess steps will be performed in parallel using Dask. So we need to configure Dask with the number of jobs or CPUs to use for parallel processing. Otherwise `dask.distributed.Client()` will use all available cores by default, and this causes a segmentation fault error. This is the case in `cli.py`, where `n_workers` is not set, see here.
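A minimal sketch of pinning the worker count explicitly when creating the client; the numbers below are arbitrary placeholders, not values from this PR:

```python
from dask.distributed import Client

# Start a local cluster with a fixed number of workers instead of the
# default of one worker per available core.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="2GB")
```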
OK. I do think that making use of dask.distributed is quite important for performance. You can configure the defaults with a config file: https://docs.dask.org/en/latest/configuration.html#specify-configuration

> `dask.distributed.Client()` will use all available cores by default and this causes a segmentation fault error

I guess on some systems? That suggests something else is causing the segfault. On my laptop it's no problem to start as many workers as there are CPU threads, and that is the default behavior of Dask for a reason.
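As an illustration of the configuration mechanism in the linked docs, the same defaults can also be set programmatically; the keys below are standard dask.distributed worker-memory settings, used here only as an example:

```python
import dask

# Equivalent to placing these values in a YAML file under ~/.config/dask/.
dask.config.set({
    "distributed.worker.memory.target": 0.80,
    "distributed.worker.memory.spill": 0.90,
})
```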
Thanks for explaining. I did some tests, and it looks like the Dask configuration didn't fix the issue. When I added `parallel=True` back, the segmentation fault errors showed up again on macOS and Linux, but not on Windows, see the GA workflows below. We only use `parallel=True` for `fapar_lai.py` and not for the other datasets. Do you know why it's needed there? After refactoring the code as below, the tests pass locally. What do you think?
```python
from dask.distributed import Client

# Explicitly create (and close) the client, instead of relying on
# whatever scheduler Dask picks up by default.
client = Client()
ds = xr.open_mfdataset(files, parallel=True)
client.close()
```
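A small variation on the snippet above, using the client as a context manager so it is closed even if opening the files raises; the path and worker count are placeholders:

```python
from pathlib import Path
from dask.distributed import Client
import xarray as xr

files = list(Path("ingest_dir/fapar-lai").glob("*.nc"))  # placeholder path

with Client(n_workers=2):  # worker count is an arbitrary placeholder
    ds = xr.open_mfdataset(files, parallel=True)
```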
src/zampy/datasets/land_cover.py
if "flag_values" in da.attrs: | ||
regrid_values = da.attrs["flag_values"] | ||
else: | ||
regrid_values = np.unique(da.values) |
Do note that this is a very expensive operation, as every single value in the dataarray has to be checked...
It might be better to drop some variables or handle this differently, otherwise it won't work on a large area (or will be very slow).
The vars `current_pixel_state` and `change_count` have a valid_min and valid_max attr. `observation_count` does too, but the range is very large, and it is probably irrelevant after regridding (it doesn't make physical sense anymore).
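A minimal sketch of the attribute-based alternative suggested above; the helper name is made up, and the final branch is the expensive path being discussed:

```python
import numpy as np

def candidate_regrid_values(da):
    """Derive the category values without scanning the full array when possible."""
    if "flag_values" in da.attrs:
        return da.attrs["flag_values"]
    if "valid_min" in da.attrs and "valid_max" in da.attrs:
        # e.g. current_pixel_state and change_count carry valid_min/valid_max.
        return np.arange(da.attrs["valid_min"], da.attrs["valid_max"] + 1)
    # Fallback: expensive, every value in the array has to be checked.
    return np.unique(da.values)
```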
Good point, I replaced `np.unique` with `da.unique`.
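If `da.unique` here refers to `dask.array.unique` applied to the underlying Dask array (an assumption, not something visible in the diff), a lazy version might look like this sketch:

```python
import dask.array

# Unique values are computed chunk-wise on the Dask array instead of
# materializing the full NumPy array first via np.unique(da.values).
regrid_values = dask.array.unique(da.data).compute()
```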
```python
raw_variable = "lccs_class"
var_list.remove(raw_variable)
ds = ds.drop_vars(var_list)  # noqa: PLW2901

ds = ds.sortby(["lat", "lon"])  # noqa: PLW2901
ds = ds.rename({"lat": "latitude", "lon": "longitude"})  # noqa: PLW2901
new_grid = xarray_regrid.Grid(
    north=90,
```
I originally regridded the full world, as this would then only have to be done once, during ingestion to the zampy format. Now this has to be redone every time the config is different.
Now that spatial bounds can be specified for land cover (see issue #63), regridding to the whole globe, which is also memory expensive, is no longer needed.
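A minimal sketch of what a bounds-based target grid could look like, assuming the `xarray_regrid.Grid` seen in the diff above also takes east/south/west and resolution arguments and that a `create_regridding_dataset` helper is available; the bound and resolution values are placeholders, not taken from this PR:

```python
import xarray_regrid

# Bounds would come from the recipe/config; these numbers are placeholders.
new_grid = xarray_regrid.Grid(
    north=54.0,
    east=7.5,
    south=50.0,
    west=3.0,
    resolution_lat=0.25,
    resolution_lon=0.25,
)
target_ds = xarray_regrid.create_regridding_dataset(new_grid)
```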
tests/test_datasets/test_era5.py
The test data is still based on the old cds files. Might be good to replace those with the new format...
fixed.
Co-authored-by: Bart Schilperoort <[email protected]>
Quality Gate failed: Failed conditions
This PR introduces the main changes:

The recipe `STEMMUS_SCOPE_input.yml` ran successfully. Although I reduced the time span in the recipe (1 month instead of 6 months), it seems that the CDS is fast now.

closes #59
closes #63