Move to new cds #62
base: main
The parallel kwarg was there to ensure the dataset is opened as Dask arrays, to avoid memory issues. Will that still be OK now?
If the `parallel` kwarg is True, the open and preprocess steps are performed in parallel using Dask, so we need to configure Dask with the number of jobs or CPUs to use for parallel processing. Otherwise `dask.distributed.Client()` will use all available cores by default, and this causes a segmentation fault error. This is the case in `cli.py`, where `n_workers` is not set, see here.
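A minimal sketch of the workaround being described, with an explicitly sized client; the worker counts, file pattern, and chunking below are illustrative, not taken from `cli.py`:

```python
from dask.distributed import Client
import xarray as xr

# Start a local cluster with a fixed, small number of workers instead of
# letting Client() claim every available core (illustrative values).
client = Client(n_workers=2, threads_per_worker=1)

# parallel=True then distributes the open/preprocess steps over those workers.
ds = xr.open_mfdataset("lai_*.nc", parallel=True, chunks={"time": 1})
```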
OK. I do think that making use of dask.distributed is quite important for performance. You can configure the defaults with a config file: https://docs.dask.org/en/latest/configuration.html#specify-configuration

I guess the segfault only happens on some systems? That suggests something else is causing it. On my laptop there is no problem starting as many workers as there are CPU threads, and that is the default behavior of Dask for a reason.
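For reference, a hedged sketch of overriding such defaults from Python; the same keys can instead go in a YAML file in the Dask config directory as described in the linked docs, and the values below are purely illustrative:

```python
import dask

# Illustrative overrides of documented dask.distributed memory settings;
# equivalent YAML can be placed in ~/.config/dask/ instead of setting them in code.
dask.config.set({
    "distributed.worker.memory.target": 0.6,  # start spilling to disk at 60% of memory
    "distributed.worker.memory.pause": 0.8,   # pause accepting new tasks at 80%
})
```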
Thanks for explaining. I did some tests, and it looks like the Dask configuration didn't fix the issue. When I added `parallel=True` back, the segmentation fault errors showed up again on macOS and Linux, but not on Windows, see the GA workflows below. We only use `parallel=True` for `fapar_lai.py` and not for the other datasets. Do you know why it's needed there? After refactoring the code as below, the tests pass locally. What do you think?
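(The refactor referred to above lives in the PR diff and is not reproduced here; the sketch below only illustrates the general shape of such a change, with an assumed file pattern and chunk size.)

```python
import xarray as xr

# Keep the data lazy as dask arrays via `chunks`, but open the files
# sequentially by leaving out parallel=True (and thus the distributed Client).
ds = xr.open_mfdataset(
    "fapar_lai_*.nc",    # assumed input files
    chunks={"time": 1},  # dask-backed arrays with a modest memory footprint
)
```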
Do note that this is a very expensive operation, as every single value in the dataarray has to be checked... It might be better to drop some variables or handle this differently, otherwise it won't work on a large area (or will be very slow).

The vars `current_pixel_state` and `change_count` have `valid_min` and `valid_max` attrs. `observation_count` does too, but its range is very large, and it is probably irrelevant after regridding (it doesn't make physical sense anymore).
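A sketch of the metadata-based alternative being suggested; the variable name follows the comment above, while the helper itself is illustrative and assumes CF-style integer `valid_min`/`valid_max` attributes:

```python
import numpy as np
import xarray as xr


def possible_class_values(variable: xr.DataArray) -> np.ndarray:
    """Enumerate possible values from the attrs instead of scanning every data value."""
    vmin = int(variable.attrs["valid_min"])
    vmax = int(variable.attrs["valid_max"])
    return np.arange(vmin, vmax + 1)


# e.g. possible_class_values(ds["current_pixel_state"])
```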
Good point, I replaced `np.unique` with `da.unique`.
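For context, a minimal sketch of the lazy variant, assuming the variable is dask-backed (opened with `chunks`); the wrapper and variable name are illustrative:

```python
import dask.array as da
import xarray as xr


def unique_values(variable: xr.DataArray) -> "da.Array":
    """Lazy alternative to np.unique: builds a task graph and evaluates chunk by chunk,
    so the full array never has to sit in memory at once."""
    return da.unique(variable.data)


# e.g. unique_values(ds["current_pixel_state"]).compute()
```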
I originally regridded the full world, as this would then only have to be done once during ingestion to the zampy format. Now this has to be redone every time the config is different.
Now that spatial bounds can be specified for landcover (see issue #63), regridding to the whole globe, which is also memory-expensive, is no longer needed.
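A sketch of the idea, assuming latitude/longitude coordinates sorted north-to-south; the bounds class and function names here are illustrative stand-ins, not the PR's actual code:

```python
from dataclasses import dataclass

import xarray as xr


@dataclass
class Bounds:  # illustrative stand-in for a spatial-bounds object
    north: float
    east: float
    south: float
    west: float


def subset_to_bounds(ds: xr.Dataset, b: Bounds) -> xr.Dataset:
    """Select only the requested region, so regridding stays cheap (descending lats assumed)."""
    return ds.sel(
        latitude=slice(b.north, b.south),
        longitude=slice(b.west, b.east),
    )
```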
The test data is still based on the old cds files. Might be good to replace those with the new format...
fixed.