Spatial partitioning, sorting and shuffling #84

benbovy · 2024-09-27T10:41:04Z

When dealing with large sets of geometries it would be nice if we could partition (chunk) the geometry coordinate and the GeometryIndex based on spatial locality (thus requiring spatial sorting or shuffling), like explained for geo-dataframes in dask-geopandas' spatial partitioning user guide.

This would require a good amount of work both here and upstream, though:

refactor xarray.IndexVariable to allow dask arrays or other lazy arrays More flexible index variables pydata/xarray#8124 (although we can work around this by using regular xarray.Variable objects for now)
add some sort of generic, out-of-core index in Xarray Low memory/out-of-core index? pydata/xarray#1650, or add our own ad-hoc version here
compute space-filling curve values from geometries (copied or factored out from dask-geopandas?)
implement spatial shuffling, probably reusing what is being added in Xarray Add GroupBy.shuffle() pydata/xarray#9320

The text was updated successfully, but these errors were encountered:

dcherian · 2024-09-27T13:19:19Z

spatial shuffling,

Can you add a comment to pydata/xarray#9546 describing the workload and what API might be useful please

benbovy · 2024-09-27T15:09:21Z

Yes sure, I don't have any precise idea yet but I will do when I'll think more about it.

martinfleis · 2024-09-27T16:10:17Z

Doesn't this just mean computing Hilbert distance (we can use vanilla Geopandas if needed or vendor that code) and using sortby along that? That way you can partition the data along the spatial order without any changes on xarray side.

That is also how dask_geopandas does that and how sort_values(by="geometry") works in geopandas. But I may as well be missing something.

dcherian · 2024-09-27T16:22:21Z

sort_values(by="geometry")

yeah probably best to reuse dask-geopandas here, we'd need some new dask array API to shuffle by unknown values.

benbovy · 2024-09-30T14:35:36Z

To be honest I didn't look closely into either dask_geopandas' GeoDataFrame.spatial_shuffle, dask.array.shuffle or pydata/xarray#9546.

I have looked into that a bit more now but probably not enough yet.

we'd need some new dask array API to shuffle by unknown values.

@dcherian -- What do you mean exactly by "unknown values"?

What we want to shuffle here are "distance" values (encoded as integer indices) along a 1-dimensional space filling curve of a fixed resolution (level), after computing it from the geometries. For very coarse resolutions we could probably use those integer indices directly as group labels, but since the range of possible index values grows very rapidly with level we might better want binning the data (where I guess the number of bins and their extents could be calculated from the chunks).

Would it be possible to use something like ds.shuffle(ds.groupby_bins(...)) with pydata/xarray#9320?

Doesn't this just mean computing Hilbert distance (we can use vanilla Geopandas if needed or vendor that code) and using sortby along that? That way you can partition the data along the spatial order without any changes on xarray side. That is also how dask_geopandas does that and how sort_values(by="geometry") works in geopandas. But I may as well be missing something.

@martinfleis -- Hmm from what I see dask_geopandas.GeoDataFrame.spatial_shuffle relies on dask.dataframe.DataFrame.set_index, which involves complex sorting / shuffling logic implemented in https://github.com/dask/dask/blob/main/dask/dataframe/shuffle.py.

I guess we might as well reuse dask.dataframe here (i.e., convert dask.array.Array <-> dask.dataframe.DataFrame) or reuse dask-geopandas.

Computing Hilbert distance is more embarrassingly parallel, if we can do it with vanilla Geopandas we should do it!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spatial partitioning, sorting and shuffling #84

Spatial partitioning, sorting and shuffling #84

benbovy commented Sep 27, 2024

dcherian commented Sep 27, 2024

benbovy commented Sep 27, 2024

martinfleis commented Sep 27, 2024

dcherian commented Sep 27, 2024

benbovy commented Sep 30, 2024 •

edited

Loading

Spatial partitioning, sorting and shuffling #84

Spatial partitioning, sorting and shuffling #84

Comments

benbovy commented Sep 27, 2024

dcherian commented Sep 27, 2024

benbovy commented Sep 27, 2024

martinfleis commented Sep 27, 2024

dcherian commented Sep 27, 2024

benbovy commented Sep 30, 2024 • edited Loading

benbovy commented Sep 30, 2024 •

edited

Loading