Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: allow other predicates for determining leaf nodes than node capacity #6

Open
asinghvi17 opened this issue Oct 24, 2024 · 4 comments

Comments

@asinghvi17
Copy link

I want to partition a dataset of geometries into, finally, leaf nodes whose areas are approximately the same. I don't really care about the number of geometries per node, since my problem is I/O on the raster dataset I'm using.

My understanding is that this would need a custom predicate that decides when to split, and it could receive a vector of the extents of the geometries as input.

Would there be interest in a PR that adds this feature? I could code it up, but curious to hear your thoughts @maxfreu.

@maxfreu
Copy link
Owner

maxfreu commented Oct 24, 2024

Hi! :) Why do you think this would increase the io perf on your ds? Can you explain a bit more where the bottleneck is?

@asinghvi17
Copy link
Author

asinghvi17 commented Oct 24, 2024

I did a bit more investigating, it seems my problem is that the extents of my STRtree's leaf nodes look like this:

download-9

(note the large y-extents when it doesn't really seem necessary)

@asinghvi17
Copy link
Author

I have a problem where I need to compute a statistic across a large, lazy, very slow to load raster (a 30m height map) over 250,000 polygons. I want to partition these polygons in space such that I can load sufficiently small segments of the raster, process that over the polygons in that segment, and then move on to the next segment. This drastically increases my processing times and can be made generic to essentially any dataset.

This means that I basically do a depth-first search across a tree, and stop the search for a particular path down the tree if I either encounter a leaf node, or a node with small enough extent that it would fit into memory.

The problem is that if the leaf node is too large, I can't fit my data into memory, and so I end up needing to process each polygon individually and load data within its bounding box from the raster. This is a time-consuming and painful operation.

For reference, if I manually map the computation across 2x2 degree segments of the raster, I can get this done in an hour. Using the tiling approach, it takes me around 11 hours for the same thing.

@maxfreu
Copy link
Owner

maxfreu commented Oct 25, 2024

Ok, I see. The striping is bad indeed. Maybe an R* tree would perform better, might be worth to look into https://infolab.usc.edu/csci599/Fall2001/paper/rstar-tree.pdf

R* trees seem to look less striped (best viewed with white background):

Actually, I'm not an expert in spatial trees, just coded this up in an afternoon cause I needed some minimal solution. So there could also be a bug in the splitting logic, but I'm pretty sure I nailed https://apps.dtic.mil/sti/tr/pdf/ADA324493.pdf Feel free to check!

I don't know what the best way forward is:

  • Implement and try the R* tree
  • Implement custom, area-based splitting heuristic
  • Put things in separate package or here?

The easiest way to achieve what you need would probably be to implement the custom predicate you mentioned. Maybe like so:

  1. Ingest all objects and their extents
  2. Sort them by x center
  3. Aggregate objects until they cover at most a set maximum x range
  4. Sort these objects by y center
  5. Make new leaf node with these objects and aggregate again, until they cover at most a y range that is computed based on the chosen x split. Then start the next leaf node until all objects for this x split are consumed.

Or how did you imagine it?

If you can come up with a clean idea how different splitting logics could be implemented, I wouldn't hold up the merge ;-) However, this sounds like it's rather something for a more abstract package like an imaginary "SpatialTrees".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants