Making ByteString.hGetLine behave like System.IO.hGetLine #327

dbramucci · 2020-11-22T03:14:20Z

This changes ByteString.hGetLine so that it respects the newline mode of the provided handle (specifically, the haInputNL field of Handle__).
Before, whether newlines were LF (Linux convention) or CRLF (Windows convention), ByteString.hGetLine would behave as though the only line ending was LF.

Now, when hGetLine is passed a handle whose input newline mode is CRLF, it treats both \r\n and \n as line endings, just as System.IO.hGetLine does.
Note that this means that even in CRLF mode, "foo\nbar" will be treated as 2 separate lines.
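For reference, the behavior of System.IO.hGetLine that this PR aims to match can be observed with a small program. This is only a sketch: the helper name `linesUnder` and the temp-file template are made up for illustration, and it assumes the `directory` package that ships with GHC.

```haskell
import System.Directory (removeFile)
import System.IO

-- | Hypothetical helper: write raw text to a temp file, then read it back
-- line by line with System.IO.hGetLine under the given newline mode.
linesUnder :: NewlineMode -> String -> IO [String]
linesUnder mode raw = do
  (path, h) <- openTempFile "." "crlf-demo.tmp"
  hSetNewlineMode h noNewlineTranslation -- write the characters untranslated
  hPutStr h raw
  hClose h
  ls <- withFile path ReadMode $ \r -> do
    hSetNewlineMode r mode
    readAll r
  removeFile path
  pure ls
  where
    -- Read strictly until EOF so every line is in memory before the handle closes.
    readAll r = do
      eof <- hIsEOF r
      if eof then pure [] else (:) <$> hGetLine r <*> readAll r

main :: IO ()
main = do
  -- CRLF input mode: both "\n" and "\r\n" end a line.
  linesUnder (NewlineMode CRLF LF) "foo\nbar\r\nbaz" >>= print
  -- No translation: only "\n" ends a line, so the '\r' stays in "bar\r".
  linesUnder noNewlineTranslation "foo\nbar\r\nbaz" >>= print
```

The first call yields three lines without any '\r', while the second keeps the carriage return as part of the middle line.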

The technical details are straightforward.

haveBuf now informs findEOL what haInputNL is.
haveBuf no longer assumes that findEOL consumes only 1 char at a time (a line ending can now be 2 chars, for \r\n).
findEOL checks for \r\n in CRLF mode and jumps 2 forward, otherwise inching forward as it did previously.

Furthermore, a property test and lined ascii text generator have been added to test that ByteString.hGetLine behaves like System.IO.hGetLine does.

This is intended to close issue #13

This closes issue haskell#13.

The changes can be summarized as updating `findEOL` to look for "\r\n" in CRLF mode and updating the logic of `haveBuf` to resize the buffer according to the size of the newline. Additionally, tests were added to verify that both `hGetLine`s produce the same behavior.

Some of the edge cases to worry about here include:

* '\n' still counts as a line end. Thus the length of a line ending varies between 1 and 2 in CRLF mode.
* "\r\r\n" can give a false start. This means you can't always skip 2 characters when `c' /= '\n'`.
* '\r' when not followed by '\n' isn't part of a newline.
* Not reading out of the buffer when '\r' is the last character.
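The edge cases listed above can be illustrated with a pure scan over a single chunk. This is a minimal sketch, not the PR's `findEOL` (which works on the handle's raw buffer); the name `findLineEnd` and its return shape are assumptions made for the example.

```haskell
import qualified Data.ByteString.Char8 as BC
import System.IO (Newline (..))

-- | Hypothetical helper: locate the first line ending in a chunk.
-- Returns (offset of the ending, length of the ending), or Nothing when the
-- chunk contains no complete line ending.
findLineEnd :: Newline -> BC.ByteString -> Maybe (Int, Int)
findLineEnd nl bs = go 0
  where
    n = BC.length bs
    go i
      | i >= n = Nothing
      -- A bare '\n' ends a line in both modes.
      | c == '\n' = Just (i, 1)
      -- In CRLF mode, '\r' counts only when '\n' follows inside the chunk;
      -- the bounds check avoids reading past the end when '\r' is last.
      | nl == CRLF && c == '\r' && i + 1 < n && BC.index bs (i + 1) == '\n' = Just (i, 2)
      -- Otherwise advance by one, so "\r\r\n" is still matched at the second '\r'.
      | otherwise = go (i + 1)
      where
        c = BC.index bs i

main :: IO ()
main = mapM_ (print . findLineEnd CRLF . BC.pack)
  ["foo\r\nbar", "foo\nbar", "a\r\r\nb", "a\rb", "foo\r"]
```

Note that a trailing '\r' yields `Nothing` here. A caller that reads chunk by chunk would have to carry that byte over and check whether the next chunk starts with '\n'; this sketch does not do that.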

The old test wrote a special file filled with strange line endings. Now that a reliable and consistent property test is available for hGetLine, this code can be removed at little cost.

dbramucci · 2020-11-22T03:18:19Z

Additional work to do includes

Benchmark this to make sure it doesn't harm performance
Improve the LinedASCII generator to also generate "normal sized" lines (~ 100 char) and occasional pathologically long lines. As it is now, it generates a lot of 1 to 5 char long lines.

Bodigrim

This looks very decent and well-thought! Here are some suggestions about tests.

Bodigrim · 2020-11-23T19:33:15Z

tests/Properties.hs

@@ -1614,6 +1614,36 @@ prop_read_write_file_D x = unsafePerformIO $ do
        (const $ do y <- D.readFile f
                    return (x==y))

+prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do
+    tid <- myThreadId
+    let f = "qc-test-"++show tid


Use openTempFile, e.g.,

bytestring/tests/LazyHClose.hs

Line 18 in 54133b3

(fn, h) <- openTempFile "." "lazy-hclose-test.tmp"
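A minimal sketch of that pattern in isolation, assuming a made-up file name template and the `directory` package that ships with GHC for cleanup:

```haskell
import System.Directory (removeFile)
import System.IO

-- | Hypothetical helper: write text to a fresh temp file, read it back
-- strictly, and clean up.
roundTrip :: String -> IO String
roundTrip txt = do
  -- openTempFile derives an unused name from the template and returns
  -- both the path and an open read/write handle.
  (fn, h) <- openTempFile "." "hgetline-test.tmp"
  hPutStr h txt
  hClose h
  contents <- withFile fn ReadMode $ \r -> do
    s <- hGetContents r
    length s `seq` pure s -- force the lazy read before the handle closes
  removeFile fn
  pure contents

main :: IO ()
main = roundTrip "line one\nline two\n" >>= putStr
```

Because the name is chosen by openTempFile, concurrent test runs do not need to build a unique name from the thread id.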

Bodigrim · 2020-11-23T19:34:56Z

tests/Properties.hs

+prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do
+    tid <- myThreadId
+    let f = "qc-test-"++show tid
+    let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF)


It's a pity that QuickCheck does not provide Arbitrary instances for Newline and NewlineMode.

I raised an issue (nick8325/quickcheck#322) for this, so hopefully QuickCheck will provide those instances soon.
But I don't know what the timeline would look like to get rid of the Bools here.

Yeah, it is not worth it for bytestring to depend on the very latest QuickCheck only for the sake of these instances, so we should probably keep if lineEndIn then LF else CRLF. But it would be nice to submit a PR to QuickCheck adding them, for generations to come.

Pull request made. In the process I realized I should flip the conditional: QuickCheck attempts to shrink True into False, and I would prefer shrinking CRLF into LF, given that CRLF is the more complicated case.
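The flipped mapping can be sketched without any QuickCheck dependency. The helper names `toNewline` and `toNewlineMode` are hypothetical and are not taken from the PR.

```haskell
import System.IO (Newline (..), NewlineMode (..))

-- | Hypothetical helper: True maps to CRLF, so that QuickCheck's Bool
-- shrinking (True shrinks to False) moves from CRLF towards the simpler LF.
toNewline :: Bool -> Newline
toNewline isCRLF = if isCRLF then CRLF else LF

-- | Build a NewlineMode from the pair of Bools the property receives.
toNewlineMode :: (Bool, Bool) -> NewlineMode
toNewlineMode (inCRLF, outCRLF) = NewlineMode (toNewline inCRLF) (toNewline outCRLF)

main :: IO ()
main = mapM_ (print . toNewlineMode) [(a, b) | a <- [False, True], b <- [False, True]]
```

With this orientation, a failing case found under CRLF is retried under LF during shrinking, which shows whether the failure is specific to CRLF handling.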

Bodigrim · 2020-11-23T19:38:26Z

tests/Properties.hs

+    tid <- myThreadId
+    let f = "qc-test-"++show tid
+    let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF)
+    bracket_


There is not much point in using bracket_ in tests. In this particular case it even makes things worse: if the test fails with an exception, I would rather not removeFile f, to facilitate investigation.

Bodigrim · 2020-11-23T19:39:42Z

tests/Properties.hs

+            sLines <- withFile f ReadMode (\h -> do
+                hSetNewlineMode h newlineMode
+                readByLines System.IO.hGetLine h
+              )


Would it be possible to reduce code duplication here?

One way would be to write a readFileByLines function.

readFileByLines filename getLine =
    withFile filename ReadMode (\h -> do
        hSetNewlineMode h newlineMode
        readByLines getLine h
      )

Alternatively, because it is expensive to open files on Windows it would be possible to use just the Handle from when the file was written and just seek the beginning after the initial write and read.
Something like

hPutStr h_ filetext
hFlush h_
let readByLinesFromStart getLine = hSeek h_ AbsoluteSeek 0 *> readByLines getLine h_
hSetNewlineMode h_ newlineMode
bsLines <- readByLinesFromStart C.hGetLine
sLines <- readByLinesFromStart System.IO.hGetLine

Then instead of opening 3 files per test case, we open 1 file.
I can try running that on Windows later to see if it is a large improvement (I would expect it is close to 3x for this short test).
Downside being that it isn't as obvious as opening files from scratch.

I do not particularly care about performance of the test suite, I'd rather keep tests as straightforward and focused as possible.

Bodigrim · 2020-11-23T19:48:05Z

tests/Properties.hs

+            return $ map C.unpack bsLines === sLines
+
+  where
+    readByLines getLine h_ = go []


Is it better than a more naive implementation?

readByLines getLine h_ = do
    isEnd <- hIsEOF h_
    if isEnd
        then return []
        else (:) <$> getLine h_ <*> readByLines getLine h_

tests/Properties.hs

… bug. On Windows, the test data would be written using the platform newlines. This means that any lone \n would become a \r\n. The consequence is that the property would fail to test the implementation on Linux line endings when developing on Windows. The fix is to set the newlineMode to noNewlineTranslation before writing the test data.
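The effect described in this commit message can be reproduced on any platform by setting the output newline mode explicitly. This is a sketch: the helper `writtenBytes` is made up for illustration, and `NewlineMode CRLF CRLF` stands in for what a Windows handle would use by default.

```haskell
import qualified Data.ByteString.Char8 as BC
import System.Directory (removeFile)
import System.IO

-- | Hypothetical helper: write a String under the given newline mode and
-- return the raw bytes that ended up in the file.
writtenBytes :: NewlineMode -> String -> IO BC.ByteString
writtenBytes mode txt = do
  (fn, h) <- openTempFile "." "newline-demo.tmp"
  hSetNewlineMode h mode
  hPutStr h txt
  hClose h
  bytes <- BC.readFile fn -- strict read, so the file can be removed right away
  removeFile fn
  pure bytes

main :: IO ()
main = do
  -- With CRLF output translation, the lone '\n' is written as "\r\n".
  writtenBytes (NewlineMode CRLF CRLF) "a\nb" >>= print
  -- With no translation, the bytes match the input text.
  writtenBytes noNewlineTranslation "a\nb" >>= print
```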

The variables were renamed to make the boolean correspondence clearer. Also, True was changed to CRLF in order to get QuickCheck to try shrinking from CRLF to LF if possible.

vdukhovni · 2020-11-27T22:18:40Z

This PR is incorrect. I did not see any code to handle the usual corner case that CRLF is split across two input buffers with CR in one and LF in the next. A quick test shows the case is not handled. Given a text file where the first character is "x" which is then followed by 40,000 CRLF pairs:

λ> import qualified System.IO as IO
λ> import qualified Data.ByteString.Char8 as BC
λ> import Control.Monad (forM)
λ> x <- IO.openFile "/tmp/test.txt" IO.ReadMode
λ> IO.hSetNewlineMode x $ IO.NewlineMode IO.CRLF IO.CRLF
λ> l <- forM [0..39000::Int] $ \i -> do { line <- BC.length <$> BC.hGetLine x; pure (i, line) }
λ> Prelude.filter ((/= 0) . snd) l
[(0,1),(4095,1),(8191,1),(12287,1),(16383,1),(20479,1),(24575,1),(28671,1),(32767,1),(36863,1)]
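The input file for this session is not shown in the report. A sketch that builds a file matching the description ("x" followed by 40,000 CRLF pairs) could look like the following; the function name `splitCrlfInput` is made up.

```haskell
import qualified Data.ByteString.Char8 as BC

-- | Hypothetical helper: "x" followed by n CRLF pairs. The leading byte
-- shifts every pair to an odd offset, so a buffer boundary at an even
-- offset falls between a '\r' and its '\n'.
splitCrlfInput :: Int -> BC.ByteString
splitCrlfInput n = BC.cons 'x' (BC.concat (replicate n (BC.pack "\r\n")))

main :: IO ()
main = BC.writeFile "/tmp/test.txt" (splitCrlfInput 40000)
```

The misread lines in the output above recur every 4096 lines, that is, every 8192 bytes. That is consistent with an 8192-byte handle buffer, although the buffer size itself is an inference from the output rather than something stated in the report.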

vdukhovni · 2020-11-28T23:39:18Z

I also wonder whether it really makes sense to fix the underlying issue. When the open file has a non-trivial encoding, the bytestring hGetLine does not make any effort to handle that, and the user is simply expected to set the input mode to binary. Why should CRLF be handled when encodings are not? Quote:

ByteString I/O uses binary mode, without any character decoding or newline conversion.
The fact that it does not respect the Handle newline mode is considered a flaw and may
be changed in a future version.

I am not entirely sure this is actually a flaw. Once the encoding is ignored, all bets are off. If the file is UTF-16, does it really help to try to handle CRLF, when it'll actually be (little-endian) \r\0\n\0?
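To make the byte layout concrete, here is a hand-rolled sketch of little-endian UTF-16 encoding. It is only valid for code points below 0x10000 outside the surrogate range, and the function name `utf16le` is made up for illustration.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Hypothetical helper: encode each character as two bytes, low byte first.
-- Assumes every character is a non-surrogate code point below 0x10000.
utf16le :: String -> [Word8]
utf16le = concatMap enc
  where
    enc c = [fromIntegral (ord c `mod` 256), fromIntegral (ord c `div` 256)]

main :: IO ()
main = print (utf16le "a\r\nb")
```

In the resulting bytes, the 10 that encodes '\n' is followed by a 0 byte, so a byte-level split at 10 leaves that 0 at the start of the next line.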

Bodigrim · 2020-11-29T22:00:33Z

If we do not make any assumptions about encoding at all, we should remove many functions, starting from words. In fact encoding assumptions are given at the very beginning:

bytestring/Data/ByteString/Char8.hs

Lines 21 to 23 in 101566e

    
-- More specifically these byte strings are taken to be in the
-- subset of Unicode covered by code points 0-255. This covers
-- Unicode Basic Latin, Latin-1 Supplement and C0+C1 Controls.

Why should CRLF be handled when encodings are not?

This is a pragmatic choice. We can honour the CRLF choice without getting into too much trouble, and there is user demand for it. Covering encodings is more challenging.

vdukhovni · 2020-11-30T00:55:40Z

I understand that this is a pragmatic choice, but we're in a state of sin, and perhaps papering over just some of the friction can be considered to be setting up false expectations...

Indeed it can be appropriate to check whether the input is in some non-trivial encoding where ISO-8859-1 or "binary" are trivial, as might be various other ISO-8859-X variants. On the other hand it is not entirely obvious to me how to determine which encodings are safe to ignore, and which should raise errors.

vdukhovni · 2020-12-02T02:44:32Z

Anyway, if there's general consensus this should move forward, despite the asymmetry with encoding, I won't stand in the way. Of course the implementation needs to be correct, and should not impose an undue performance cost, especially when the NewlineMode is noNewlineTranslation. On Unix systems one generally expects:

ghci> import System.IO
ghci> import GHC.IO.Handle.Types
ghci> import GHC.IO.Handle.Internals
ghci> withHandle_ "" stdin $ \h -> pure $ (== noNewlineTranslation) $ NewlineMode (haInputNL h) (haOutputNL h)
True

Bodigrim · 2021-01-06T19:30:28Z

Yes, I think we should move this forward. @dbramucci are you still interested in carrying on with this PR?

vdukhovni · 2022-12-08T07:44:41Z

This PR has stalled. The original author has not stepped forward to fix the split CRLF issue. Should we close this, or is someone else willing to take this over and fix it?

vdukhovni · 2022-12-08T14:44:35Z

Also, having just fixed strictness in lazy bytestring lines, and spent most of the effort also fine-tuning performance, the natural question is whether the newline handling of hGetLine should be mirrored in lines, which would then become more complex and probably slower.

Also the Streaming.ByteStream module from streaming-bytestring has lineSplit, which preserves the newline endings, leaving the stream content undisturbed, but still partitioned into per-line substreams. Perhaps ByteString should also have a lineSplit-like function, that leaves the decision of what if anything to trim at the ends of each line to the user.

Bodigrim · 2022-12-09T01:41:36Z

Also, having just fixed strictness in lazy bytestring lines, and spent most of the effort also fine-tuning performance, the natural question is whether the newline handling of hGetLine should be mirrored in lines, which would then become more complex and probably slower.

Data.List describes lines as splitting by \n, never by \r, and Data.ByteString.lines should stick to the same semantics.
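A quick illustration of those semantics, using `lines` from the Prelude (the same function Data.List exports) next to `lines` from Data.ByteString.Char8:

```haskell
import qualified Data.ByteString.Char8 as BC

sample :: String
sample = "foo\r\nbar\nbaz"

main :: IO ()
main = do
  -- Both split on '\n' only, so the '\r' stays attached to "foo".
  print (lines sample)
  print (BC.lines (BC.pack sample))
```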

Also the Streaming.ByteStream module from streaming-bytestring has lineSplit, which preserves the newline endings, leaving the stream content undisturbed, but still partitioned into per-line substreams. Perhaps ByteString should also have a lineSplit-like function, that leaves the decision of what if anything to trim at the ends of each line to the user.

Sounds like a good idea to me.
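As a rough sketch of what such a function could look like for strict ByteStrings: the name `splitKeepEnds` is hypothetical, and this is neither an existing bytestring function nor the streaming-bytestring API.

```haskell
import qualified Data.ByteString.Char8 as BC

-- | Hypothetical helper: split after each '\n', keeping the line ending
-- (including any preceding '\r') attached to its line.
splitKeepEnds :: BC.ByteString -> [BC.ByteString]
splitKeepEnds bs
  | BC.null bs = []
  | otherwise = case BC.elemIndex '\n' bs of
      Nothing -> [bs] -- final line without a terminator
      Just i ->
        let (line, rest) = BC.splitAt (i + 1) bs
         in line : splitKeepEnds rest

main :: IO ()
main = print (splitKeepEnds (BC.pack "a\r\nb\nc"))
```

Since nothing is dropped, concatenating the pieces gives back the input, and the caller decides whether to trim "\n", "\r\n", or nothing from each piece.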

dbramucci added 4 commits October 27, 2020 22:47

Added the property that hGetLine agrees with base

a70a788

Removed the redundant test for hGetLine.

537080c

Made hgetline test manage files like other IO tests

a3d4c5f

Bodigrim reviewed Nov 23, 2020

dbramucci marked this pull request as ready for review November 24, 2020 03:05

dbramucci marked this pull request as draft November 24, 2020 03:09

dbramucci commented Nov 24, 2020

dbramucci added 2 commits November 27, 2020 00:04

Slight tweak to hgetline property's newlines.

3e34f48

Bodigrim linked an issue Nov 29, 2020 that may be closed by this pull request

getLine doesn't honor handle's newline setting #13

Open
