Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cancel in exception handling code #96

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

parsonsmatt
Copy link
Contributor

@parsonsmatt parsonsmatt commented May 27, 2022

Initial discussion: fpco/safe-exceptions#3

Follow-up on GHC: https://gitlab.haskell.org/ghc/ghc/-/issues/18899

🤔

Fixes #95

Copy link
Contributor Author

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is a bit contrived, and possibly is just an abuse of the API. I've recommended that we switch to catch instead of withException as I wasn't aware the semantics were different.

To motivate it a bit more: our original code looked like this:

action `wtihException` sendAlert

sendAlert = runDB . sendMessageToSlack MyAlertChannel

Meanwhile, we've got some instrumenting code in runDB that adds to a timer when we're waiting on a connection.

runDB action = do
  let 
    timerAction = 
      forever do
        threadDelay 1_000
        recordAnotherMillisecond
  withAsync (timerAction `finally` flushMetrics) \a -> do
    runSqlPool (liftIO (cancel a) >> action) =<< askSqlPool

But, since withAsync can't actually be canceled, if we run any database action in a onException or withException handler, then we just... hang forever.

This can be fixed locally, using withAsyncWithUnmask (\unmask -> unmask $ timerThread 0) \a -> ...

This seems deeply unsatisfying to me. But hey it's Haskell exceptions 😂

@@ -422,7 +422,7 @@ withException thing after = withRunInIO $ \run -> EUnsafe.uninterruptibleMask $
case res1 of
Left e1 -> do
-- see explanation in bracket
_ :: Either SomeException b <- EUnsafe.try $ run $ after e1
_ :: Either SomeException b <- EUnsafe.try $ restore $ run $ after e1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this line fixes the test. But it also removes the guarantee that action `onException` cleanup will definitely complete cleanup.

when (n < 10) $ do
timerAction (n + 1)
withAsync (timerAction 0) $ \a -> do
cancel a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be replaced with pure (). withAsync is also unable to cancel the thread.

withAsync (timerAction 0) $ \a -> do
cancel a
eresult <-
race
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This race call also is unable to cancel one of the threads - so, with the original code, this test hangs for 10 seconds waiting for timerAction to complete before returning Right 10.

Comment on lines +78 to +84
withAsync a b = withRunInIO $ \run -> do
maskingState <- E.getMaskingState
case maskingState of
E.MaskedUninterruptible ->
A.withAsyncWithUnmask (\unmask -> unmask (run a)) (run . b)
_ ->
A.withAsync (run a) (run . b)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an alternative fix to using restore in the handler of withException and bracket.

The Async API is totally dependent on being able to throw async exceptions to the other thread. I don't think Async makes any sense if it cannot do that.

While this single fix works here, it won't work for eg race or any of the other code, so we may want to patch those as well.

I'm almost curious if this is a thing that needs to get raised with upstream Async.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented here simonmar/async#67

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be very worried about two things here:

  1. A very subtle change to behavior that can take production code down
  2. Diverging from async's handling of these cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream hasn't responded, so, we may want to handle this here?

I have a hard time imagining any code that depends on withAsync creating an unkillable zombie and being unable to exit.

This is a decently large footgun that is only really likely to happen if you are using unliftio or safe-exceptions. While I do think that async should really be launching threads in at most a MaskedInterruptible state, it's not super likely to be triggered - uninterruptibleMask is pretty rare to see, and has lots of warning labels.

Since unliftio uses it a bunch in cleanup functions, though, I think it does make sense to make the combination of UnliftIO.Exception and UnliftIO.Async not quite so dangerous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.

I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.

One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.

It's unlikely because most people read the docs for uninterruptibleMask and decide they probably don't need it. So the main way it becomes a part of people's programs is when safe-exceptions or unliftio style cleanups enter the picture and puts it in with bracket and friends.

But, yeah, anyone that runs uninterruptibleMask_ $ async $ do ... is going to create an unkillable Async.

I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.

I'm coming around that it should probably be unmasked in the async and withAsync functions, and definitely unmasked in race and race-like functions. If the intent is "one of these threads should die," then unmasking should happen. But concurrently poses a slightly different question: Simon Marlow points out that finally foo (a >> b) and finally foo (concurrently a b should have the same semantics, but if concurrently unmasks, then the action is no longer protected. So the appropriate default for concurrently may be different than for race, since the expectations/assumptions are different.

Likewise, withAsync seems to demand the unmasking much more - uninterruptibleMask_ $ withAsync foo bar cannot terminate if foo is a forever loop - you're deadlocked. But uninterruptibleMask_ $ async foo is less likely to deadlock, since it's entirely possible that you never cancel it, and thus never have the problem.

I'm not really sure the best way to approach it. Maybe separate modules with different behavior is the right answer? A separate opt-in package?

My chaos brain says something like UnliftIO.Async needs to export these functions with a MaskingBehavior argument, forcing all users to upgrade, and maybe give a warning/deprecation message that points to alternative modules. But that's a big breaking change for probably little benefit for those that aren't already experiencing the pain.

One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?

Yeah, that's a good pont - unfortunately, we're missing a primop.

I have opened an upstream issue: ideally, we'd have a primop unsafeUnUninterruptibleMask# that would downgrade a UninterruptibleMask to a InterruptibleMask. But we don't have that, so the alternative of asyncWithUnmask $ \unmask -> unmask $ mask_ action actually is vulnerable to an async exception at the unmask point, since unsafeUnmask (the function provided as unmask) triggers any queued async exceptions in the C--. In the case of async, this is really unlikely - I think you'd need to cancel =<< async action to observe the async exception slipping between the unmask $ mask_.

Switching from mask_ (which checks masking state; should be unnecessary after unmask) to block would also accomplish this.

@parsonsmatt
Copy link
Contributor Author

Hate to be a squeaky wheel, but this has now bitten us 3 times, resulting in non-graceful SIGKILL shutdowns

I've made the patch locally to ensure that all Async functions in our codebase are going to be unmasking their actions (if cancel is an important part of the API - so withAsync, race, etc).

@snoyberg
Copy link
Member

Fair enough, let's try to move forward on this then. Can I ask you to put together a ChangeLog entry in here that makes it clear that there's a subtle behavioral change in this release?

Also: what do you think of posting this on the Haskell Foundation Discourse and giving people a heads-up that this change will be coming before releasing? I'm a bit worried about encouraging a debate on it, but also a bit worried about this one sneaking under people's radars.

@parsonsmatt
Copy link
Contributor Author

I think getting some feedback would be great. At the very least, maybe we can get some more eyes on this ticket: https://gitlab.haskell.org/ghc/ghc/-/issues/22389 the solution to which would introduce a new primitive that should make this a bit safer.

@snoyberg
Copy link
Member

Would you be able to publish such a message? I'll be happy to boost with links from Twitter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Async.cancel does not work inside of withException
2 participants