Remove suspicious wait set locking code #119
base: rolling
Conversation
Since inuse is not volatile, it cannot be used instead of a lock.
The way I read it, it serves a dual purpose. The first is to prevent concurrent calls to rmw_wait on the same waitset. Preventing this is useful because it guarantees that no-one is mucking about with the waitset while it is blocked in the underlying DDS wait. Secondly, there is the caching of attached entries: the assumption that the set of subscriptions, clients, services, guard conditions and events will usually not change from call to call, and that for realistic use-cases it is cheaper to compare the arrays on entry than it is to attach/detach each entry on every call.

That's based on an experiment I once did (must've been a microbenchmark, like a hacked talker/listener pair). It would be good to re-do that check: the price paid in complexity is not insignificant. The worst part of the complexity is having some rather delicate dependencies between entities, waitsets and the behaviour of rcl. That's also the source of the "I'm assuming one is not allowed to delete an entity while it is still being used ..." comment in clean_waitset_caches.

The change you propose has an issue of itself as well: by extending the scope of the lock to cover the entire wait, clean_waitset_caches can end up blocking until a concurrent rmw_wait returns.

In short, I'm not quite ready to accept your PR. But if you'd have the ability to do a bit of measuring of the cost of doing attach/detach for each call vs this caching strategy, that'd be great. Then we can either throw it away or we'll at least have decent evidence of the cost.
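(As a rough illustration of the caching strategy described above, with hypothetical names rather than the actual rmw_node.cpp code: keep a copy of the entity-pointer arrays from the previous call and only redo the attach/detach work when they differ.)

#include <vector>

// Hypothetical sketch of the "compare arrays on entry" idea; not the real code.
struct cached_waitset_sketch {
  std::vector<const void *> prev;   // entity pointers seen on the previous wait call

  // true if the cached DDS attachments can be reused as-is
  bool cache_is_valid (const std::vector<const void *> & cur) const {
    return cur == prev;             // element-wise pointer compare, akin to a memcmp
  }

  void refresh (const std::vector<const void *> & cur) {
    // in the real code: detach everything previously attached, then attach 'cur'
    prev = cur;
  }
};

// usage on entry to a wait call:
//   if (!ws.cache_is_valid (cur)) ws.refresh (cur);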
The issue I see is that the lock is only held for a short time, so it doesn’t prevent two threads from using the same waitset concurrently. And I think the compiler can optimize out the write to inuse in a way that other threads never see it.
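(To make that point concrete with a generic sketch, not the rmw_cyclonedds code: a plain bool flag written and read by multiple threads is a data race under the C++ memory model, whereas an atomic flag gives a well-defined hand-off.)

#include <atomic>

struct inuse_flag_sketch {
  // bool inuse;                     // racy: the compiler/CPU may reorder or elide accesses
  std::atomic<bool> inuse {false};   // well-defined for concurrent readers and writers

  bool try_acquire () {
    bool expected = false;
    // atomically claim the flag; fails if another thread already holds it
    return inuse.compare_exchange_strong (expected, true);
  }

  void release () { inuse.store (false); }
};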
I want to make sure I understand your aside "(The rmw_wait interface really is a disastrously bad one — and one that the Unix select call also suffers from and that has prompted umpteen different interface to try and avoid it …)". The idea behind waitsets seems to be "I'm interested in when one of these conditions becomes true, I don't care which one", and having to iterate through (and god forbid poll) all the events is expensive. This is compared to an event queue, where, when a condition becomes true, that change becomes immediately visible with minimal work to see what has changed. I assume this is what "attaching" a condition to the waitset with dds_waitset_attach gives you. And where does the specific horror of the rmw_wait interface come in?
Yes, although I wouldn't call it an event queue: you wait until any of the attached conditions become true, and the idea is that the set of conditions you want to wait for changes only rarely. On return, you get a list of all conditions that triggered. Typically it's very few of them, but there can be multiple. The spec'd DDS API then returns the objects that triggered; Cyclone DDS instead returns the associated attachment value (the intptr_t you pass to dds_waitset_attach), so you can do something like:

struct closure { void (*func) (void *arg); void *arg; };

/* attach a reader with a heap-allocated closure as its attachment value */
struct closure *c = malloc (sizeof (*c));
c->func = mycallback;
c->arg = myarg;
dds_waitset_attach (ws, reader, (intptr_t) c);

/* wait, then invoke the callback of every condition that triggered */
dds_attach_t xs[8];
const int32_t nxs = dds_waitset_wait (ws, xs, sizeof (xs) / sizeof (xs[0]), timeout);
for (int32_t i = 0; i < nxs; i++) {
  struct closure *c = (struct closure *) xs[i];
  c->func (c->arg);
}

but this is definitely something that is not directly supported by the spec'd DDS API.

And all this matches exactly with what the upper layers of rcl are trying to achieve ... create a waitset, add some objects to it with callbacks you want to invoke when there's some work to do, then invoke them. But instead of building on a sensible primitive (even the one in the DDS API is sensible, all it takes is a small map to locate the callback given the object pointer), it constructs that waitset object, deconstructs it into arrays of pointers to objects, and passes those to rmw_wait. The RMW layers therefore have to assume the set of pointers can change from call to call; and one further has to assume that memory may be reused (hence the need for clean_waitset_caches). Literally all of it could have been skipped ...

A good cleaning up would be valuable, but that amounts to a pretty massive rewrite of the ROS2 core executor code, so I guess it's not going to happen. Which leaves as the only realistic option extending the RMW interface to inform the lower layer of the attach/detach events.
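(Purely as a sketch of what that last option could look like, with invented names that are not part of the actual rmw API: the upper layer would notify the RMW layer whenever the membership of a wait set changes, so attach/detach happens once per change instead of being re-derived from pointer arrays on every wait.)

// Hypothetical interface sketch only; nothing like this exists in rmw today.
struct rmw_wait_set_membership_listener_sketch {
  virtual ~rmw_wait_set_membership_listener_sketch () = default;
  virtual void on_attach (const void * entity) = 0;   // entity added to the wait set
  virtual void on_detach (const void * entity) = 0;   // entity removed or about to be deleted
};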
rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -2221,8 +2221,10 @@ static void clean_waitset_caches()
        used ... */
     std::lock_guard<std::mutex> lock(gcdds.lock);
     for (auto && ws : gcdds.waitsets) {
-      std::lock_guard<std::mutex> lock2(ws->lock);
-      waitset_detach(ws);
+      std::unique_lock<std::mutex> lock2(ws->lock, std::try_to_lock);
"try lock" is generally a tricky game: all a failure to lock proves is that the lock was held at the time of trying. By the time you start executing the next instruction it may already have been released. And I would argue that this gives you essentially the same behaviour as the original code.
In the original code there is the issue that the schedule:
T0: rmw_wait / set busy / do attach / wait
T1: delete X / clean_waitset_caches / skip waitset_detach
T0: wakeup / erase slots for non-triggering entities / clear busy
ends up with the (rmw) waitset entry cache containing a dangling pointer to a deleted rmw entity [*]. What will happen most of the time is that a subsequent call to rmw_wait will notice a change in the set of pointers (because it does a memcmp, the fact that the pointer is dangling is not really an issue), and, by accident more than by design, the events will squeeze through without dereferencing a dangling pointer, too.
The interesting bit is when we then continue as follows:
- T1: create X', and it happens to get the same address as X
- T0: rmw_wait
because the cache hasn't been invalidated and X is at the same address as X' (and presumably may moreover be in the same position in the arguments to rmw_wait), the cache-checking logic can end up concluding that the cache is correct for the set being waited for, skipping the attach calls, and therefore not waiting for X' to trigger.
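(A tiny self-contained illustration of why address reuse defeats a pointer-comparison cache check. Whether the allocator actually reuses the address is not guaranteed, hence the two branches; the names here are invented for the example.)

#include <cstdio>
#include <cstdlib>
#include <cstring>

int main () {
  void * cached[1];
  void * x = std::malloc (64);      // "X", attached and remembered in the cache
  cached[0] = x;
  std::free (x);                    // X is deleted ...
  void * x2 = std::malloc (64);     // ... and X' may land at the very same address
  void * current[1] = { x2 };
  // the cache check only compares addresses, so if the allocator reused the
  // address it concludes nothing changed and would skip re-attaching X'
  if (std::memcmp (cached, current, sizeof (cached)) == 0)
    std::puts ("cache considered valid even though X was replaced by X'");
  else
    std::puts ("addresses differ; cache correctly invalidated");
  std::free (x2);
  return 0;
}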
Your change to unconditionally call waitset_detach, blocking until a concurrent call to rmw_wait had woken up, would fix that problem. But with this change (the try_to_lock), you're back to the original behaviour ... and thus get the original problem back.
But there may be a saving grace: it may well be (as my old comment noted) that ROS2 will never delete an entity while another thread is using it in rmw_wait. In that case, all this is a moot point. I think it is likely that this is the case, because at the RMW layer they're all simply pointers, not, e.g., weak references. (Cyclone doesn't make such assumptions in its API, but then it exposes 31-bit random numbers as handles in its API instead of pointers. It doesn't guarantee anything about entity handle reuse, but it does use a decent PRNG, and with the typical application having a few thousand handles at most, the likelihood of aliasing is pretty small.)
The other thing is that (as I mentioned before) the entire caching code exists because of a microbenchmark I once did. Perhaps it isn't worth it at all. If one simply rips it out, the problem disappears along with it.
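(If someone does want to redo that measurement, a rough starting point could be something like the following sketch. It only times bare dds_waitset_attach/dds_waitset_detach on a guard condition, which is a proxy for, not a faithful reproduction of, the real rmw_wait path.)

#include <chrono>
#include <cstdio>
#include "dds/dds.h"

int main ()
{
  const dds_entity_t pp = dds_create_participant (DDS_DOMAIN_DEFAULT, nullptr, nullptr);
  const dds_entity_t ws = dds_create_waitset (pp);
  const dds_entity_t gc = dds_create_guardcondition (pp);
  const int iters = 100000;
  const auto t0 = std::chrono::steady_clock::now ();
  for (int i = 0; i < iters; i++) {
    // what an uncached rmw_wait would effectively do on every call
    (void) dds_waitset_attach (ws, gc, 0);
    (void) dds_waitset_detach (ws, gc);
  }
  const auto t1 = std::chrono::steady_clock::now ();
  std::printf ("attach+detach: %.1f ns per iteration\n",
    std::chrono::duration<double, std::nano> (t1 - t0).count () / iters);
  dds_delete (pp);
  return 0;
}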
(What would also work, and what I would consider doing if the caching is worth keeping, is to retain the busy flag, and depending on whether it is clear or set, either invalidate the cache or mark it as invalid/cleanup-deferred. And then, in rmw_wait, always do the "require reattach" path if it was invalid/cleanup-deferred.)
[*] It has been detached from the DDS waitset, that bit is not an issue.
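(A sketch of that deferred-invalidation idea, with invented names; nothing like this exists in the current code.)

#include <mutex>

struct waitset_cache_sketch {
  std::mutex lock;
  bool busy = false;          // set while rmw_wait is using the waitset
  bool cache_dirty = false;   // set when an entity was deleted while the waitset was busy

  // called from clean_waitset_caches when an entity is deleted
  void invalidate () {
    std::lock_guard<std::mutex> g (lock);
    if (busy)
      cache_dirty = true;     // defer: rmw_wait rebuilds the cache on its next call
    else
      detach_all ();          // safe to drop the cached attachments right away
  }

  // called on entry to rmw_wait
  void on_wait_entry () {
    std::lock_guard<std::mutex> g (lock);
    if (cache_dirty) { detach_all (); cache_dirty = false; }  // force the reattach path
    busy = true;
  }

  // called when rmw_wait is done with the waitset
  void on_wait_exit () { std::lock_guard<std::mutex> g (lock); busy = false; }

  void detach_all () { /* drop the cached DDS attachments */ }
};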
I don't think there is a mistake in how I'm locking here; the code is correct as-is.

unique_lock and lock_guard are both RAII objects, that is, they continue to hold the lock until the object is destroyed. In the case of clean_waitset_caches, that's one iteration of the loop (no worry that the lock might be snatched from us during waitset_detach). In the case of rmw_wait, that's until the function returns. Again, no worry that someone else might acquire the lock.
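(To make the scoping argument concrete with a generic sketch, not the actual functions:)

#include <mutex>

std::mutex m;

void per_iteration_scope ()
{
  for (int i = 0; i < 3; i++) {
    std::lock_guard<std::mutex> g (m);  // acquired at the top of each iteration ...
    // ... work for this iteration ...
  }                                     // ... and released at the bottom of each iteration
}

void whole_function_scope ()
{
  std::lock_guard<std::mutex> g (m);    // acquired here ...
  // ... all work, including any blocking wait, runs under the lock ...
}                                       // ... and released only when the function returns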
I do think I can make this code much better by redoing how we cache, but that's a bigger PR.
There is, like I tried to explain in my comment, a detail in how this interacts with a concurrent call to wait, because of which it doesn't actually improve over what's currently in master. It is not that it accesses things without holding a lock or anything; it is rather about the consequence of not calling waitset_detach when the waitset is being waited on. Other than the bad smell of a busy flag, that's the only thing really wrong with the current code, and so the PR as it stands really just replaces that bad smell with the smell of holding a lock across a wait operation.
In my view, there are two options to move forward:
- Investigate whether there is any possibility of clean_waitset_caches running into a locked waitset where it actually needs to do any work. That function is called only when subscriptions or guard conditions get deleted, and waitsets are only locked in that function and in rmw_wait. Ergo, it can ignore any locked waitsets if it is certain that the subscription or guard condition being deleted is not currently in use in any waitset. If this is certain, you're effectively replacing the busy flag with a trylock with no downsides. I can live with that (see the sketch after this list).
- Investigate whether there is sufficient reason to do this caching in the first place. If there isn't, then I propose we throw the code being discussed away entirely. That would certainly remove a ton of code smell.
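(For reference, a sketch of what the first option would amount to inside clean_waitset_caches, assuming the "never deleted while in use" invariant holds. This is the pattern, reusing the identifiers from the diff above, not a standalone program.)

for (auto && ws : gcdds.waitsets) {
  std::unique_lock<std::mutex> lock2 (ws->lock, std::try_to_lock);
  if (lock2.owns_lock ())
    waitset_detach (ws);   // nobody is waiting on this waitset right now
  // else: skipped on purpose; a locked waitset is inside rmw_wait and, by the
  // invariant above, cannot be waiting on the entity that is being deleted
}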