
Builder: unsound reuse of buffers #690

Closed
adamgundry opened this issue Aug 2, 2024 · 2 comments · Fixed by #691

@adamgundry
Member

Summary

toLazyByteString :: Builder -> LazyByteString has a race condition that may generate wrong results if two threads concurrently evaluate the result. This originally manifested itself in an application using Cassava, which would produce CSV files in which part of the file was overwritten with data from later in the file. The underlying cause appears to be an interaction between unsafeDupablePerformIO and buffer reuse in buildStepToCIOS arising since version 0.11.5.0 (specifically #581).

Reproducer

Since the race condition is non-deterministic, it is not entirely trivial to reproduce, but the following module races two threads parsing the same lazy bytestring (which should get the same result) and typically hits a "bad" case after a few iterations.

Full code
#!/usr/bin/env cabal
{- cabal:
build-depends:
  base, async, bytestring, cassava
ghc-options: -threaded -with-rtsopts=-N
-}
{- project:
with-compiler: ghc-9.8.1
optimization: False
-}
-- It is important we disable optimization for this reproducer, as otherwise the
-- bug goes away (presumably due to inlining, CSE or suchlike).  But the issue
-- has been observed with optimizations enabled as well.
--
-- I've reproduced on bytestring 0.11.5.3 (GHC 9.6.4, cassava 0.5.3.0) and
-- bytestring HEAD (594966bbfa84255adfb644d43212e5e05f73c235) (GHC 9.8.1, cassava 0.5.3.1).

import qualified Control.Concurrent.Async as Async
import qualified Control.Monad            as Monad
import qualified Data.ByteString          as B
import qualified Data.ByteString.Builder  as BB
import           Data.ByteString.Builder  (Builder)
import qualified Data.ByteString.Char8    as BS
import qualified Data.ByteString.Lazy     as LBS
import qualified Data.Csv                 as CSV
import qualified Data.List as List
import Prelude hiding (unlines)

-- Generate some test data.  This is based on Cassava code but minimized to make
-- clear there is no dependency on anything unsafe.
build :: IO LBS.LazyByteString
build = encode <$> mkRecord

-- The magic number 1853 means that show [1..1853] occupies 8159 bytes.
-- Thus we end up with chunks close to maximalCopySize = 8160.
mkRecord :: IO [[BS.ByteString]]
mkRecord = pure $ let xs = [ 1 .. 1853 ] :: [Int]
                  in [[BS.pack (show xs), BS.pack (show xs)]]

encode :: [[BS.ByteString]] -> LBS.LazyByteString
encode = BB.toLazyByteString
       . unlines (BB.word8 10)
       . map (mconcat . intersperse (BB.word8 44) . map BB.byteString . map escape)

unlines :: Builder -> [Builder] -> Builder
unlines _ [] = mempty
unlines sep (b:bs) = b <> sep <> unlines sep bs

intersperse :: Builder -> [Builder] -> [Builder]
intersperse _   []      = []
intersperse sep (x:xs)  = x : prependToAll sep xs

prependToAll :: Builder -> [Builder] -> [Builder]
prependToAll _   []     = []
prependToAll sep (x:xs) = sep <> x : prependToAll sep xs

escape :: BS.ByteString -> BS.ByteString
escape s = LBS.toStrict . BB.toLazyByteString $
    BB.word8 dquote
    <> B.foldl
        (\ acc b -> acc <> if b == dquote
            then BB.byteString (BS.pack "\"\"")
            else BB.word8 b)
        mempty
        s
    <> BB.word8 dquote
  where
    dquote = 34


-- Consume the resulting bytestring by parsing the CSV file.  I haven't managed
-- to minimize the reproducer to avoid the need for this, but I don't believe
-- Cassava is doing anything relevant here; we just need a suitable demand
-- pattern from both threads to trigger the race.

consume :: LBS.LazyByteString -> IO CSV.Csv
consume = either fail pure . CSV.decode CSV.NoHeader

main :: IO ()
main = Monad.replicateM_ 100 $ do
    bs <- build
    -- Forcing the length of the bytestring makes the problem disappear
    -- print (LBS.length bs)
    (r1, r2) <- Async.concurrently (consume bs) (consume bs)
    if r1 == r2
        then putStrLn "Ok!"
        else fail "Bad!"

Diagnosis

Due to the use of unsafeDupablePerformIO for performance, it is critical that any IO actions executed when running a Builder can be interrupted or executed multiple times. In principle, filling a buffer is safe provided the buffer is used only once and the same bytes are written each time. However, wrapChunk in buildStepToCIOS re-uses a buffer in the trimming case.
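The invariant can be illustrated with a toy example (using an IORef as a stand-in for a pinned buffer; the names here are mine, not bytestring's):

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafeDupablePerformIO)

-- Dupable-safe: each run allocates its own private "buffer" and writes the
-- same value, so even if GHC duplicates the computation across threads,
-- every run observes identical contents.
fillFresh :: Int
fillFresh = unsafeDupablePerformIO $ do
  ref <- newIORef (0 :: Int)  -- fresh per run, never shared between runs
  writeIORef ref 42           -- deterministic contents on every run
  readIORef ref

-- By contrast, the trim path of wrapChunk handed the *same* fpbuf to the
-- continuation, so one duplicated run could overwrite bytes that another
-- run was still reading.
```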

More precisely, I believe what is happening is the following:

  • The Builder includes large chunks using insertChunk that hit the trim case of wrapChunk.
  • Two threads are racing to insert the same chunk, sharing a buffer fpbuf.
  • Thread A creates a fresh fpbuf' and starts copying bytes from fpbuf.
  • Thread B also creates a fresh fpbuf', finishes copying the bytes, yields the trimmed chunk bs, and then continues using the original buffer fpbuf.
  • Thread B starts filling fpbuf with more data from later in the output.
  • Thread A is still copying bytes from fpbuf, so it copies bogus data from later in the output to earlier in (its version of) the output.

It looks like the bug was introduced in #581 which is first present in 0.11.5.0 (see 5c4d236 and 0c030bb).

Solution

An easy fix is for the trim case of wrapChunk to unconditionally allocate a new buffer after trimming, rather than re-using the old buffer. This appears to fix the problem in my reproducer. (I think it is safe to use the same buffer if the chunk is actually empty, though.) I don't know how much impact this will have on performance, but presumably it will slightly reduce it for builders inserting many trimmed chunks.
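The shape of the fix can be sketched as follows (a simplified stand-in, not the actual wrapChunk code; trimAndRenew is a name I made up):

```haskell
import qualified Data.ByteString          as B
import qualified Data.ByteString.Internal as BI
import           Data.Word                (Word8)
import           Foreign.ForeignPtr       (ForeignPtr, withForeignPtr)
import           Foreign.Marshal.Utils    (copyBytes)

-- Sketch of the fixed trim path: copy the filled prefix into a trimmed
-- chunk, then hand the continuation a *fresh* buffer instead of reusing
-- fpbuf, so duplicated runs can no longer observe each other's writes.
trimAndRenew :: ForeignPtr Word8 -> Int -> Int
             -> IO (B.ByteString, ForeignPtr Word8)
trimAndRenew fpbuf used size = do
  fpTrimmed <- BI.mallocByteString used
  withForeignPtr fpbuf $ \src ->
    withForeignPtr fpTrimmed $ \dst ->
      copyBytes dst src used              -- copy out the filled prefix
  fpFresh <- BI.mallocByteString size     -- the fix: never reuse fpbuf
  pure (BI.fromForeignPtr fpTrimmed 0 used, fpFresh)
```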

Replacing the unsafeDupablePerformIO calls with unsafePerformIO should also fix the issue, but is likely to significantly impact performance (4ac4be0 claimed introducing unsafeDupablePerformIO "increases the performance of bytestring chunk insertion by 20%.").
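For reference, unsafePerformIO is essentially unsafeDupablePerformIO plus a noDuplicate barrier; this is roughly its definition in GHC.IO.Unsafe:

```haskell
import GHC.IO.Unsafe (noDuplicate, unsafeDupablePerformIO)

-- Roughly how GHC defines unsafePerformIO: the noDuplicate barrier is what
-- stops two threads from running the action simultaneously, and is also
-- the source of the extra cost that unsafeDupablePerformIO avoids.
unsafePerformIO' :: IO a -> a
unsafePerformIO' m = unsafeDupablePerformIO (noDuplicate >> m)
```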

Perhaps users who know that they will not evaluate the resulting bytestring in multiple threads concurrently could somehow indicate this and opt in to more aggressive buffer reuse, e.g. via a custom AllocationStrategy?

I notice that using customStrategy for AllocationStrategy can in principle choose to reuse buffers "if it can guarantee that this referentially transparent" (sic). It's not clear to me what the caller needs to guarantee, but this issue suggests that (amongst other things?) the caller must ensure that the resulting lazy bytestring is evaluated only by a single thread. Could we document this more clearly?
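As a sketch of what such an opt-in might look like with the existing internal API (freshStrategy is my own name; the trim predicate mirrors safeStrategy's, and I'm assuming the internal newBuffer allocator):

```haskell
import           Data.ByteString.Builder          (Builder, string8)
import           Data.ByteString.Builder.Extra    (toLazyByteStringWith)
import           Data.ByteString.Builder.Internal
                   (AllocationStrategy, customStrategy, newBuffer)
import qualified Data.ByteString.Lazy             as LBS

-- A conservative strategy that always allocates afresh.  Returning the old
-- buffer from `alloc` instead would only be sound if the caller guarantees
-- the resulting lazy ByteString is consumed by a single thread.
freshStrategy :: AllocationStrategy
freshStrategy = customStrategy alloc 4096 trim
  where
    alloc Nothing                = newBuffer 4096
    alloc (Just (_old, minSize)) = newBuffer (max 4096 minSize)
    trim used size = 2 * used < size  -- trim chunks less than half full

render :: Builder -> LBS.ByteString
render = toLazyByteStringWith freshStrategy LBS.empty
```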

adamgundry added a commit to adamgundry/bytestring that referenced this issue Aug 2, 2024
adamgundry added a commit to adamgundry/bytestring that referenced this issue Aug 8, 2024
`toLazyByteString :: Builder -> LazyByteString` had a race condition that could
generate wrong results if two threads concurrently evaluated the result.  This
bug was introduced in haskell#581 (5c4d236) and first
present in release 0.11.5.0 (as 0c030bb).

Due to the use of `unsafeDupablePerformIO` for performance, it is critical that
any IO actions executed when running a `Builder` can be interrupted or executed
multiple times.  In principle, filling a buffer is safe provided the buffer is
used only once and the same bytes are written each time. However, `wrapChunk` in
`buildStepToCIOS` would re-use a buffer in the trimming case after copying its
contents to produce a new trimmed chunk. This is safe when run in a single
thread, but if two threads simultaneously execute the code, one of them may
still be copying the contents while the other starts overwriting the buffer.

This patch fixes `wrapChunk` to unconditionally allocate a new buffer after
trimming, rather than re-using the old buffer. This will presumably come at a
slight performance cost for builders inserting many trimmed chunks.
@clyring
Member

clyring commented Aug 9, 2024

noDuplicate# as in unsafePerformIO is indeed a rather large and expensive hammer, if memory serves. We don't really need its full power since we don't care if the initial "allocate a new buffer" effect gets duplicated: I'm reasonably sure eagerly black-holing the "continue work on the next chunk" thunks would be sufficient, and this only requires a single CAS-write and branch. But I don't think there is currently any way to request eager black-holing of a specific thunk from Haskell code.

Perhaps users who know that they will not evaluate the resulting bytestring in multiple threads concurrently could somehow indicate this and opt in to more aggressive buffer reuse, e.g. via a custom AllocationStrategy?

I notice that using customStrategy for AllocationStrategy can in principle choose to reuse buffers "if it can guarantee that this referentially transparent" (sic). It's not clear to me what the caller needs to guarantee, but this issue suggests that (amongst other things?) the caller must ensure that the resulting lazy bytestring is evaluated only by a single thread. Could we document this more clearly?

In practice, I think the only ways for a client to guarantee this are:

  1. ...running the entire Haskell program single-threaded
  2. ...using buildStepToCIOS directly and consuming the chunks in an obviously single-threaded or non-duplicable way

Well, option 2 could have reasonable applications, but I'm not entirely sure about the merits of documenting that this is possible. It seems less dangerous for such applications to use their own analogue of the AllocationStrategy type, to emphasize that buffer reuse is generally not compatible with the consumers in the bytestring Builder internal modules.

@adamgundry
Member Author

I agree with the above. But I'm wondering what the concrete plan is to address this issue @clyring @Bodigrim? As a soundness bug leading to wrong results, it seems fairly severe. I've proposed a fix in #691, on which I'd appreciate review.

clyring pushed a commit that referenced this issue Sep 18, 2024
@clyring clyring added this to the 0.11.5.4 milestone Sep 18, 2024