Fix `cancel` in exception handling code #96

parsonsmatt · 2022-05-27T22:11:09Z

Initial discussion: fpco/safe-exceptions#3

Follow-up on GHC: https://gitlab.haskell.org/ghc/ghc/-/issues/18899

🤔

Fixes #95

parsonsmatt

The code here is a bit contrived, and possibly is just an abuse of the API. I've recommended that we switch to catch instead of withException as I wasn't aware the semantics were different.

To motivate it a bit more: our original code looked like this:

action `wtihException` sendAlert

sendAlert = runDB . sendMessageToSlack MyAlertChannel

Meanwhile, we've got some instrumenting code in runDB that adds to a timer when we're waiting on a connection.

runDB action = do
  let 
    timerAction = 
      forever do
        threadDelay 1_000
        recordAnotherMillisecond
  withAsync (timerAction `finally` flushMetrics) \a -> do
    runSqlPool (liftIO (cancel a) >> action) =<< askSqlPool

But, since withAsync can't actually be canceled, if we run any database action in a onException or withException handler, then we just... hang forever.

This can be fixed locally, using withAsyncWithUnmask (\unmask -> unmask $ timerThread 0) \a -> ...

This seems deeply unsatisfying to me. But hey it's Haskell exceptions 😂

parsonsmatt · 2022-05-27T22:19:59Z

unliftio/src/UnliftIO/Exception.hs

@@ -422,7 +422,7 @@ withException thing after = withRunInIO $ \run -> EUnsafe.uninterruptibleMask $
    case res1 of
        Left e1 -> do
            -- see explanation in bracket
-            _ :: Either SomeException b <- EUnsafe.try $ run $ after e1
+            _ :: Either SomeException b <- EUnsafe.try $ restore $ run $ after e1


So, this line fixes the test. But it also removes the guarantee that action `onException` cleanup will definitely complete cleanup.

parsonsmatt · 2022-05-27T22:20:36Z

unliftio/test/UnliftIO/ExceptionSpec.hs

+                  when (n < 10) $ do
+                    timerAction (n + 1)
+              withAsync (timerAction 0) $ \a -> do
+                cancel a


This line can be replaced with pure (). withAsync is also unable to cancel the thread.

parsonsmatt · 2022-05-27T22:21:41Z

unliftio/test/UnliftIO/ExceptionSpec.hs

+              withAsync (timerAction 0) $ \a -> do
+                cancel a
+      eresult <-
+        race


This race call also is unable to cancel one of the threads - so, with the original code, this test hangs for 10 seconds waiting for timerAction to complete before returning Right 10.

parsonsmatt · 2022-05-27T22:55:41Z

unliftio/src/UnliftIO/Internals/Async.hs

+withAsync a b = withRunInIO $ \run -> do
+  maskingState <- E.getMaskingState
+  case maskingState of
+    E.MaskedUninterruptible ->
+      A.withAsyncWithUnmask (\unmask -> unmask (run a)) (run . b)
+    _ ->
+      A.withAsync (run a) (run . b)


This is an alternative fix to using restore in the handler of withException and bracket.

The Async API is totally dependent on being able to throw async exceptions to the other thread. I don't think Async makes any sense if it cannot do that.

While this single fix works here, it won't work for eg race or any of the other code, so we may want to patch those as well.

I'm almost curious if this is a thing that needs to get raised with upstream Async.

Commented here simonmar/async#67

I'd be very worried about two things here:

A very subtle change to behavior that can take production code down

Diverging from async's handling of these cases

Upstream hasn't responded, so, we may want to handle this here?

I have a hard time imagining any code that depends on withAsync creating an unkillable zombie and being unable to exit.

This is a decently large footgun that is only really likely to happen if you are using unliftio or safe-exceptions. While I do think that async should really be launching threads in at most a MaskedInterruptible state, it's not super likely to be triggered - uninterruptibleMask is pretty rare to see, and has lots of warning labels.

Since unliftio uses it a bunch in cleanup functions, though, I think it does make sense to make the combination of UnliftIO.Exception and UnliftIO.Async not quite so dangerous.

I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.

I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.

One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?

I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.

It's unlikely because most people read the docs for uninterruptibleMask and decide they probably don't need it. So the main way it becomes a part of people's programs is when safe-exceptions or unliftio style cleanups enter the picture and puts it in with bracket and friends.

But, yeah, anyone that runs uninterruptibleMask_ $ async $ do ... is going to create an unkillable Async.

I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.

I'm coming around that it should probably be unmasked in the async and withAsync functions, and definitely unmasked in race and race-like functions. If the intent is "one of these threads should die," then unmasking should happen. But concurrently poses a slightly different question: Simon Marlow points out that finally foo (a >> b) and finally foo (concurrently a b should have the same semantics, but if concurrently unmasks, then the action is no longer protected. So the appropriate default for concurrently may be different than for race, since the expectations/assumptions are different.

Likewise, withAsync seems to demand the unmasking much more - uninterruptibleMask_ $ withAsync foo bar cannot terminate if foo is a forever loop - you're deadlocked. But uninterruptibleMask_ $ async foo is less likely to deadlock, since it's entirely possible that you never cancel it, and thus never have the problem.

I'm not really sure the best way to approach it. Maybe separate modules with different behavior is the right answer? A separate opt-in package?

My chaos brain says something like UnliftIO.Async needs to export these functions with a MaskingBehavior argument, forcing all users to upgrade, and maybe give a warning/deprecation message that points to alternative modules. But that's a big breaking change for probably little benefit for those that aren't already experiencing the pain.

One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?

Yeah, that's a good pont - unfortunately, we're missing a primop.

I have opened an upstream issue: ideally, we'd have a primop unsafeUnUninterruptibleMask# that would downgrade a UninterruptibleMask to a InterruptibleMask. But we don't have that, so the alternative of asyncWithUnmask $ \unmask -> unmask $ mask_ action actually is vulnerable to an async exception at the unmask point, since unsafeUnmask (the function provided as unmask) triggers any queued async exceptions in the C--. In the case of async, this is really unlikely - I think you'd need to cancel =<< async action to observe the async exception slipping between the unmask $ mask_.

Switching from mask_ (which checks masking state; should be unnecessary after unmask) to block would also accomplish this.

parsonsmatt · 2023-02-10T00:20:10Z

Hate to be a squeaky wheel, but this has now bitten us 3 times, resulting in non-graceful SIGKILL shutdowns

I've made the patch locally to ensure that all Async functions in our codebase are going to be unmasking their actions (if cancel is an important part of the API - so withAsync, race, etc).

snoyberg · 2023-02-10T13:00:18Z

Fair enough, let's try to move forward on this then. Can I ask you to put together a ChangeLog entry in here that makes it clear that there's a subtle behavioral change in this release?

Also: what do you think of posting this on the Haskell Foundation Discourse and giving people a heads-up that this change will be coming before releasing? I'm a bit worried about encouraging a debate on it, but also a bit worried about this one sneaking under people's radars.

parsonsmatt · 2023-02-10T18:00:33Z

I think getting some feedback would be great. At the very least, maybe we can get some more eyes on this ticket: https://gitlab.haskell.org/ghc/ghc/-/issues/22389 the solution to which would introduce a new primitive that should make this a bit safer.

snoyberg · 2023-02-12T12:07:28Z

Would you be able to publish such a message? I'll be happy to boost with links from Twitter.

parsonsmatt added 2 commits May 27, 2022 16:10

Test demonstrating problem

3e01da2

remove focus

4890a73

parsonsmatt commented May 27, 2022

View reviewed changes

alternate fix

f634175

parsonsmatt commented May 27, 2022

View reviewed changes

parsonsmatt mentioned this pull request May 27, 2022

withAsync hangs if called during uninterruptibleMask simonmar/async#67

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `cancel` in exception handling code #96

Fix `cancel` in exception handling code #96

parsonsmatt commented May 27, 2022 •

edited

Loading

parsonsmatt left a comment

parsonsmatt May 27, 2022

parsonsmatt May 27, 2022

parsonsmatt May 27, 2022

parsonsmatt May 27, 2022

parsonsmatt May 27, 2022

snoyberg May 31, 2022

parsonsmatt Oct 29, 2022

snoyberg Nov 1, 2022

parsonsmatt Nov 1, 2022

parsonsmatt commented Feb 10, 2023

snoyberg commented Feb 10, 2023

parsonsmatt commented Feb 10, 2023

snoyberg commented Feb 12, 2023

Fix cancel in exception handling code #96

Are you sure you want to change the base?

Fix cancel in exception handling code #96

Conversation

parsonsmatt commented May 27, 2022 • edited Loading

parsonsmatt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

parsonsmatt commented Feb 10, 2023

snoyberg commented Feb 10, 2023

parsonsmatt commented Feb 10, 2023

snoyberg commented Feb 12, 2023

Fix `cancel` in exception handling code #96

Fix `cancel` in exception handling code #96

parsonsmatt commented May 27, 2022 •

edited

Loading