Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepared Geometries #803

Open
urschrei opened this issue Apr 13, 2022 · 11 comments · May be fixed by #1197
Open

Prepared Geometries #803

urschrei opened this issue Apr 13, 2022 · 11 comments · May be fixed by #1197

Comments

@urschrei
Copy link
Member

JTS (and, subsequently, GEOS and several of its related downstream tools such as Shapely and PostGIS) has had the concept of a "prepared geometry" for a long time now. The idea is that upon creation, the "prepared" geometry creates a spatial index (an rstar r*-tree, in our case), which is a relatively expensive operation whose cost is amortised by the time savings from repeated calls to the index when e.g. checking intersections against a large number of geometries.

The types of "real-world" operations that benefit from this approach are pretty common, so it's worth experimenting with something similar in geo. Having thought about it for 30 seconds, I think it would have to be a struct rather than a trait.

This what the geos crate does, currently: https://docs.rs/geos/latest/geos/struct.PreparedGeometry.html

@audunska
Copy link

audunska commented May 18, 2022

This would be great for us at cognite. We have an intersection operation that amounts to taking a big Geometry object, and intersecting it with a bunch of different Lines. Being able to cache all the graph information in a prepared geometry would probably speed things up a lot.

@apendleton
Copy link
Contributor

@urschrei I've gotten started on a possible implementation of this, but wanted to check and see if I'm on the right track before sinking too much time into it.

From profiling, it looks like a big chunk of the time, at least for the contains operations that are a bottleneck in my current application, comes from building the GeometryGraph objects at the beginning of the RelateOperation constructor. I was thinking of creating a PreparedGeometry<T> struct that would wrap geometries, and would prebuild one of these GeometryGraph and stash it at build time. Conveniently, the only operation that currently needs to mutate a GeometryGraph after construction at present looks like the compute_self_nodes method, and it doesn't depend on having access to the other geometry being compared, so I ought to just be able to move that operation into the GeometryGraph constructor and then have everything else be able to operate on an immutable graph (I'd probably then add a trait to get a GeometryGraph for a given geometry, which would build one on the fly for regular geometries and return a reference to the already-built one for the PreparedGeometry, and then use that in the RelateOperation constructor).

One potentially serious issue with this approach, though: currently the RelateOperation constructor builds its left argument's graph with the index 0 and its right with index 1 as arguments, and these then get used all over the place, to index into arrays in Label and elsewhere, to know which slots to write results into in the IntersectionMatrix, etc. In this new scheme the same GeometryGraph might sometimes be in the left position and other times in the right position, so it seems like that won't work. I was thinking of changing these indexes from 0 and 1 to an automatically-assigned globally auto-incrementing usize ID (maybe a static newtype-wrapped AtomicUsize?). Any methods that currently take a geom_index: usize would need to take a graph_id: GraphId or whatnot instead, and internally map from that to a position. For things like labels, it seems like we pretty much always know the ID of at least one graph, so we could store that ID on each label, and use slot 0 for that ID, and slot 1 for not-that ID, if that makes sense. For intersection matrixes and the like, when we construct them, we'll need to keep track of which ID should be used for the left side and which for the right, and then map from input IDs to array slots (I think we can do this when we construct intersection matrixes inside the RelateOperation constructor, when we know the IDs of both graphs, whether prebuilt or not).

All of this is looking like it's going to be pretty invasive, though: lots of places are going to need to accept IDs that don't currently, and there's going to need to be a new layer of indirection introduced in several places to map from IDs to 0/1 indexes. Does this generally seem like an okay approach, or should I be pre-building/caching something at a different layer of abstraction? (Or alternatively, still pre-building a GeometryGraph, but changing its design in some more-fundamental way so it doesn't need IDs or indexes or whatnot.)

@urschrei
Copy link
Member Author

I was thinking of creating a PreparedGeometry struct that would wrap geometries, and would prebuild one of these GeometryGraph and stash it at build time. Conveniently, the only operation that currently needs to mutate a GeometryGraph after construction at present looks like the compute_self_nodes method, and it doesn't depend on having access to the other geometry being compared, so I ought to just be able to move that operation into the GeometryGraph constructor and then have everything else be able to operate on an immutable graph (I'd probably then add a trait to get a GeometryGraph for a given geometry, which would build one on the fly for regular geometries and return a reference to the already-built one for the PreparedGeometry, and then use that in the RelateOperation constructor).

This seems like a good approach.

All of this is looking like it's going to be pretty invasive, though: lots of places are going to need to accept IDs that don't currently, and there's going to need to be a new layer of indirection introduced in several places to map from IDs to 0/1 indexes. Does this generally seem like an okay approach, or should I be pre-building/caching something at a different layer of abstraction? (Or alternatively, still pre-building a GeometryGraph, but changing its design in some more-fundamental way so it doesn't need IDs or indexes or whatnot.)

My first instinct would be to revisit the design of GeometryGraph to try to simplify these operations if that lets us avoid some of the complexity of the approach you outlined above. It may be the case that the unique ID design you sketched out is necessary, but as you say it involves quite a lot of invasive incidental complexity. I'm going to tag @michaelkirk in here as he's the author of the Relate functionality and its underlying mechanism, and may have some input here. Could you also link to a WIP branch if you have one?

@michaelkirk
Copy link
Member

michaelkirk commented Jul 13, 2022 via email

@michaelkirk
Copy link
Member

michaelkirk commented Jul 14, 2022

I had just a little bit of time to look at this before next week, so here are some of my thoughts so far:

Some background: the existing Relate trait and GeometryGraph is strongly inspired by the approach used in JTS. So it might be useful to see how JTS or GEOS handles this (if it indeed handles this at all).

a bottleneck in my current application, comes from building the GeometryGraph

Yes indeed, the GeometryGraph is pretty much the whole enchilada, and I think it makes sense to see what work within there we can cache.

After briefly reviewing the implementation in geo: You're right - the notion of geom_index is used heavily throughout the implementation, which makes it non-trivial to cache on the geometry. The idea of some shared global state to track the ID of each polygon sounds plausible, but "shared global" gives me pause. I'm not 100% allergic to it, but it'd be really great if we can find an approach which doesn't require any kind of global mutex thing to protect a counter.

the only operation that currently needs to mutate a GeometryGraph after construction at present looks like the compute_self_nodes method

In theory, I think caching the "self-noding" work could be helpful. However note that, the current data structure holds some other pieces of state on the graph that later get mutated (after compute_self_nodes). This is sometimes done via RefCell's. We're using RefCells due to a somewhat literal translation of the Java code to rust. The Java implementation is happy to handle all these aliased references, but in Rust we have to do runtime checks, so beyond just the &mut graph, we'll also need to check for borrow_mut calls when considering when the graphs get mutated: https://github.com/georust/geo/blob/main/geo/src/algorithm/relate/relate_operation.rs#L374

changing [the geometry graph] design in some more-fundamental way so it doesn't need IDs or indexes or whatnot.)

Inuitively this sounds appealing! But I haven't quite worked out how this would work. Like maybe instead of caching the entire GeometryGraph on the PreparedGeometry, we extract some subset of cacheable state called something like SelfNoding which can be used to more cheaply build a GeometryGraph.

I can respond more in depth next week!

@rmanoka
Copy link
Contributor

rmanoka commented Jul 14, 2022

Assuming the prepared geom holds a reference to the original geom, we could use the reference pointer casted as usize. Would that work as a pseudo unique index?

@apendleton
Copy link
Contributor

@urschrei

Could you also link to a WIP branch if you have one?

Still thinking through some other design aspects and don't have anything working yet, but definitely will drop a link when I do.

@michaelkirk

it might be useful to see how JTS or GEOS handles this

Yeah will do. I did notice when profiling the application whose slowness motivated me to jump down this rabbithole that when attempting the same set of operations with geos instead to see if that was faster, it looks like contains on prepared-geometry polygons in GEOS just delegates to the underlying ordinary (non-prepared) polygon implementation, so I sort of assumed they had punted on trying to solve this problem, but I should actually look instead of just assuming.

but "shared global" gives me pause

Oh, me too 😅 . I'm not sure I have a better idea yet, but will keep thinking about it.

it'd be really great if we can find an approach which doesn't require any kind of global mutex thing to protect a counter

FWIW, the types in std::atomic typically use instruction-level compare-and-swap on most architectures from what I understand, and are guaranteed to be lock-free per their docs, so there shouldn't be any concerns about contention around a mutex.

beyond just the &mut graph, we'll also need to check for borrow_mut calls when considering when the graphs get mutated

This is a good call-out... I had gotten something kind-of-sort-of-working with a subset of the changes proposed above, and tests pass, but I think it's because none of the tests try intersecting the same geometries more than once, so probably I've just lucked out, and need to think more about how to deal with this potential mutation.

@rmanoka

Assuming the prepared geom holds a reference to the original geom, we could use the reference pointer casted as usize. Would that work as a pseudo unique index?

If possible I'd love for it to at least be possible for the prepared geometry to own the underlying geometry... I think as a practical matter, a big motivation for having a mechanism like this will be because someone wants to hang onto a prepared geometry for awhile so they can repeatedly query it, and I think if the prepared geometry just has a reference to some other geometry that you also have to keep around (but can't move because of the outstanding reference, and can't put in the same struct as the prepared geometry due to self-referential structs being disallowed), it will pretty significantly diminish the utility. It seems like maybe we'd want regular geometries to have an .into_prepared() that moves them into an owning prepared structure? It might also be possible to have the whole structure be generic over Borrow<Something> to allow using it either way, but I haven't quite figured out how that will work in practice yet. It seems intuitively like Cow/GeometryCow would fit the bill, but it seems like in practice the ergonomics of using Cow for owned things aren't great (you still end up with weird lifetime bounds everywhere).

Anyhow, I think that'd make it hard to do the pointer-casting ID thing; with owned data, there's at least some risk that more than one geometry gets created with the same memory address as an earlier one that had been created and subsequently moved, so I don't think we'd be able to guarantee uniqueness.

@michaelkirk
Copy link
Member

michaelkirk commented Jul 17, 2022

I have been experimenting with some things here:
https://github.com/georust/geo/compare/mkirk/prepared-geom?expand=1

That branch is currently quite a mess and includes a lot of dead ends that I'd want to remove, but I was curious just to get an idea for the kinds of savings we could hope to see.

By caching both the RTree and the self-intersection step, and being smart about when we clone things, I'm seeing savings of about 80%.

    relate prepared polygon time:   [210.15 ms 210.28 ms 210.42 ms]
    relate unprepared polygon
                            time:   [1.0569 s 1.0574 s 1.0579 s]

The basic approach I took was that a PreparedGeometry stores a partially constructed GeometryGraph, from which we can efficiently create a jump-started Relate process. Notably it stores the precomputed RTree and the PlanarGraph after self noding.

My solution to the array indexing problem @apendleton highlighted was to clone the partially constructed geometry graph and just swap all the labels around as necessary depending on wether the geometry is in the A or B position. It's admittedly brain dead, but it seems to work.

I'm not tied to much of the exact approach laid out in that branch — in fact I'm not really happy with a lot of the changes as they currently exist, but I think it serves as a good POC of the savings we can hope to achieve.

Beyond cleanup, additional work exists to make sure it's ergonomic. Like I expect people might want to mix and match prepared and unprepared polygons.

currently you can either:

prepared_polygon_a.relate(&prepared_polygon_b);
// or
unprepared_polygon_a.relate(&unprepared_polygon_b);

but I expect people would want to mix-and-match, like:

prepared_polygon_a.relate(&unprepared_polygon_b);
unprepared_polygon_a.relate(&prepared_polygon_b);

@JosiahParry
Copy link
Contributor

I'd be keen to see this in geo crate :)

@gauteh
Copy link

gauteh commented May 21, 2024

Chipping in to say that this would be very useful! (#1184)

@michaelkirk michaelkirk linked a pull request Jul 4, 2024 that will close this issue
2 tasks
@frewsxcv
Copy link
Member

frewsxcv commented Jul 6, 2024

#1197

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants