Making ByteString.hGetLine behave like System.IO.hGetLine #327

Draft
wants to merge 6 commits into
base: master

dbramucci

This changes ByteString.hGetLine such that it respects the newline mode in the provided handle. (Specifically Handle__{haInputNL}).
Before, whether newlines were LF (Linux convention) or CRLF (Windows convention), ByteString.hGetLine would behave as though the only line ending was LF.

Now, when hGetLine is passed a handle, where the input newline is CRLF, it will treat both \r\n and \n as line feeds, just like System.IO.hGetLine does.
Note that this means that even in CRLF mode, "foo\nbar" will be treated as 2 separate lines.
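As a reference point, the System.IO behaviour described above can be observed with a small experiment. This is only a sketch: `readAllLines` and `linesWithInputNL` are hypothetical helper names, and the temporary file name is arbitrary. The data is written through a binary handle so that no newline translation happens on output.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Read every remaining line of a handle with System.IO.hGetLine.
readAllLines :: Handle -> IO [String]
readAllLines h = do
  eof <- hIsEOF h
  if eof then pure [] else (:) <$> hGetLine h <*> readAllLines h

-- | Hypothetical helper: write the given characters verbatim to a temporary
-- file, then read them back line by line with the given input newline mode.
linesWithInputNL :: Newline -> String -> IO [String]
linesWithInputNL inputNL contents = do
  tmp <- getTemporaryDirectory
  -- A binary handle performs no newline translation when writing.
  (path, hOut) <- openBinaryTempFile tmp "hgetline-demo.tmp"
  hPutStr hOut contents
  hClose hOut
  result <- withFile path ReadMode $ \h -> do
    hSetNewlineMode h (NewlineMode inputNL LF)
    readAllLines h
  removeFile path
  pure result
```

With `CRLF` as the input mode, `linesWithInputNL CRLF "foo\nbar\r\nba\rz\r\n"` should yield `["foo","bar","ba\rz"]`: the lone `\n` still ends a line, and the `\r` that is not followed by `\n` stays in the line.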

The technical details are straightforward.

  1. haveBuf now informs findEOL what haInputNL is.
  2. haveBuf no longer assumes that findEOL only reads 1 char at a time (it can be 2 chars for \r\n now)
  3. findEOL checks for \r\n in CRLF mode and jumps 2 forward, otherwise inching forward like it did previously.
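The scanning rule in these steps can be sketched on an in-memory ByteString. This is not the PR's `findEOL`, which works on the handle's raw buffer. `findLineEnd` and `splitLinesNL` are hypothetical names used only for illustration, and the whole input is assumed to be available at once.

```haskell
import qualified Data.ByteString.Char8 as C
import System.IO (Newline (..))

-- | Hypothetical helper: find the first line terminator.
-- Returns (length of the line, length of the terminator), or Nothing if the
-- input contains no terminator.
findLineEnd :: Newline -> C.ByteString -> Maybe (Int, Int)
findLineEnd nl bs = go 0
  where
    n = C.length bs
    go i
      | i >= n = Nothing
      -- A bare '\n' ends a line in both modes.
      | c == '\n' = Just (i, 1)
      -- In CRLF mode, "\r\n" is a two-character terminator. The bounds check
      -- avoids reading past the end when '\r' is the last character.
      | nl == CRLF, c == '\r', i + 1 < n, C.index bs (i + 1) == '\n' = Just (i, 2)
      -- Anything else, including a lone '\r', is ordinary line content.
      | otherwise = go (i + 1)
      where
        c = C.index bs i

-- | Split a ByteString into lines, dropping the terminators.
splitLinesNL :: Newline -> C.ByteString -> [C.ByteString]
splitLinesNL nl bs
  | C.null bs = []
  | otherwise = case findLineEnd nl bs of
      Nothing -> [bs]
      Just (len, term) -> C.take len bs : splitLinesNL nl (C.drop (len + term) bs)
```

Because the scan advances one character at a time when `'\r'` is not followed by `'\n'`, an input such as `"a\r\r\nb"` is split into `"a\r"` and `"b"` in CRLF mode.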

Furthermore, a property test and lined ascii text generator have been added to test that ByteString.hGetLine behaves like System.IO.hGetLine does.

This is intended to close issue #13

This closes issue haskell#13.

The changes can be summarized as
updating `findEOL` to look for "\r\n" in CRLF mode
and updating the logic of `haveBuf` to resize the buffer
according to the size of the newline.

Additionally, tests were added to verify that both
`hGetLine`s produce the same behavior.

Some of the edge-cases to worry about here include

* '\n' still counts as a line end.

    Thus the length of a line ending varies between 1 and 2 in CRLF mode.
* "\r\r\n" can give a false-start.

    This means you can't always skip 2 characters when `c' /= '\n'`.
* '\r' when not followed by '\n' isn't part of a newline.
* Not reading out of the buffer when '\r' is the last character.
The old test wrote a special file filled with strange line endings.
Now that there is a reliable and consistent property test
available for hGetLine, this code can be removed at little cost.
@dbramucci
Author

Additional work to do includes

  1. Benchmark this to make sure it doesn't harm performance
  2. Improve the LinedASCII generator to also generate "normal sized" lines (~ 100 char) and occasional pathologically long lines. As it is now, it generates a lot of 1 to 5 char long lines.

Contributor

@Bodigrim Bodigrim left a comment

This looks very decent and well-thought! Here are some suggestions about tests.

@@ -1614,6 +1614,36 @@ prop_read_write_file_D x = unsafePerformIO $ do
(const $ do y <- D.readFile f
return (x==y))

prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do
tid <- myThreadId
let f = "qc-test-"++show tid
Contributor

Use openTempFile, e. g.,

(fn, h) <- openTempFile "." "lazy-hclose-test.tmp"

prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do
tid <- myThreadId
let f = "qc-test-"++show tid
let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF)
Contributor

It's a pity that QuickCheck does not provide Arbitrary instances for Newline and NewlineMode.

Author

I raised an issue (nick8325/quickcheck#322) for this, so hopefully QuickCheck will provide those instances soon.
But I don't know what the timeline would look like to get rid of the Bools here.

Contributor

Yeah, it is not worth making bytestring depend on the very latest QuickCheck only for the sake of these instances, so we should probably keep if lineEndIn then LF else CRLF. But it would be nice to submit a PR to QuickCheck adding them, for generations to come.

Author

Pull request made. In the process I realized I should flip the conditional, because QuickCheck shrinks True to False and I would prefer shrinking CRLF to LF, given that CRLF is the more complicated case.
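The flipped mapping could look like the sketch below. `toNewline` and `toNewlineMode` are hypothetical helper names, not taken from the PR. The only QuickCheck fact relied on is that the `Bool` instance shrinks `True` to `False`, so mapping `True` to `CRLF` lets a failing CRLF case shrink towards the simpler LF case.

```haskell
import System.IO (Newline (..), NewlineMode (..))

-- | Hypothetical helper: True stands for CRLF, so QuickCheck's shrinking of
-- True to False moves a counterexample from CRLF towards LF.
toNewline :: Bool -> Newline
toNewline isCRLF = if isCRLF then CRLF else LF

-- | Build a NewlineMode from a pair of generated Bools (input, output).
toNewlineMode :: (Bool, Bool) -> NewlineMode
toNewlineMode (inIsCRLF, outIsCRLF) =
  NewlineMode (toNewline inIsCRLF) (toNewline outIsCRLF)
```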

tid <- myThreadId
let f = "qc-test-"++show tid
let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF)
bracket_
Contributor

@Bodigrim Bodigrim Nov 23, 2020

There is not much point in using bracket_ in tests. In this particular case it even makes things worse, because if the test fails with an exception, I would rather not removeFile f, to facilitate investigation.

sLines <- withFile f ReadMode (\h -> do
hSetNewlineMode h newlineMode
readByLines System.IO.hGetLine h
)
Contributor

Would it be possible to reduce code duplication here?

Author

One way would be to write a readFileByLines function.

readFileByLines filename getLine = withFile filename ReadMode (\h -> do
    hSetNewlineMode h newlineMode
    readByLines getLine h
  )

Alternatively, because opening files is expensive on Windows, it would be possible to reuse the Handle that was used to write the file and seek back to the beginning before each read.
Something like

hPutStr h_ filetext
hFlush h_

let readByLinesFromStart getLine = hSeek h_ AbsoluteSeek 0 *> readByLines getLine h_
   
hSetNewlineMode h_ newlineMode
bsLines <- readByLinesFromStart C.hGetLine
sLines <- readByLinesFromStart System.IO.hGetLine

Then instead of opening 3 files per test case, we open 1 file.
I can try running that on Windows later to see if it is a large improvement (I would expect it to be close to 3x for this short test).
The downside is that it isn't as obvious as opening files from scratch.

Contributor

I do not particularly care about performance of the test suite, I'd rather keep tests as straightforward and focused as possible.

return $ map C.unpack bsLines === sLines

where
readByLines getLine h_ = go []
Contributor

Is it better than a more naive implementation?

readByLines getLine h_ = do 
  isEnd <- hIsEOF h_ 
  if isEnd then return [] else (:) <$> getLine h_ <*> readByLines getLine h_

@dbramucci dbramucci marked this pull request as ready for review November 24, 2020 03:05
@dbramucci dbramucci marked this pull request as draft November 24, 2020 03:09
… bug.

On Windows, the test data would be written using the platform newlines.
This means that any lone \n would become a \r\n.
The consequence is that the property would fail to test the
implementation on Linux line endings when developing on Windows.
The fix is to set the newlineMode to noNewlineTranslation before
writing the test data.
The variables were renamed to make the boolean correspondence clearer.
Also, True was changed to CRLF in order to get QuickCheck to
try shrinking from CRLF to LF if possible.
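A minimal sketch of the fix described in this commit message, assuming a hypothetical helper `writeVerbatim` and an arbitrary temporary file name: the newline mode is set to `noNewlineTranslation` before writing, and the bytes are then read back in binary form to confirm that a lone `\n` was not expanded.

```haskell
import qualified Data.ByteString.Char8 as C
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Hypothetical helper: write text without newline translation and return
-- the bytes that actually ended up in the file.
writeVerbatim :: String -> IO C.ByteString
writeVerbatim contents = do
  tmp <- getTemporaryDirectory
  (path, h) <- openTempFile tmp "newline-demo.tmp"
  -- Without this call, a text handle on Windows would write "\r\n" for "\n".
  hSetNewlineMode h noNewlineTranslation
  hPutStr h contents
  hClose h
  -- ByteString's readFile reads the file in binary mode.
  bytes <- C.readFile path
  removeFile path
  pure bytes
```

For ASCII test data, `writeVerbatim "a\nb\r\nc"` is expected to return exactly the bytes of the input string on both Linux and Windows.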
@vdukhovni
Contributor

vdukhovni commented Nov 27, 2020

This PR is incorrect. I did not see any code to handle the usual corner case that CRLF is split across two input buffers with CR in one and LF in the next. A quick test shows the case is not handled. Given a text file where the first character is "x" which is then followed by 40,000 CRLF pairs:

λ> import qualified System.IO as IO
λ> import qualified Data.ByteString.Char8 as BC
λ> import Control.Monad (forM)
λ> x <- IO.openFile "/tmp/test.txt" IO.ReadMode
λ> IO.hSetNewlineMode x $ IO.NewlineMode IO.CRLF IO.CRLF
λ> l <- forM [0..39000::Int] $ \i -> do { line <- BC.length <$> BC.hGetLine x; pure (i, line) }
λ> Prelude.filter ((/= 0) . snd) l
[(0,1),(4095,1),(8191,1),(12287,1),(16383,1),(20479,1),(24575,1),(28671,1),(32767,1),(36863,1)]
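The non-empty lines after the first are 4096 lines apart. With two bytes per line after the leading "x", that would be consistent with a CR landing on the last byte of an 8192-byte buffer, although the buffer size is an assumption here and is not stated in the transcript. One way to handle such a split, sketched below on a list of chunks standing in for successive buffer fills, is to decide whether a `'\r'` belongs to the terminator only after the pending tail of the previous chunk has been joined with the start of the next one. `linesFromChunks` is a hypothetical name, and this is not code from the PR.

```haskell
import qualified Data.ByteString.Char8 as C

-- | Split a sequence of chunks into lines, treating both "\n" and "\r\n" as
-- terminators, even when "\r" and "\n" fall into different chunks.
linesFromChunks :: [C.ByteString] -> [C.ByteString]
linesFromChunks = go C.empty
  where
    -- 'pending' is the unterminated tail of earlier chunks. It may end in a
    -- '\r' whose matching '\n' has not been seen yet.
    go pending [] = [pending | not (C.null pending)]
    go pending (chunk : rest) =
      case C.elemIndex '\n' chunk of
        -- No terminator in this chunk: keep accumulating.
        Nothing -> go (C.append pending chunk) rest
        Just i ->
          let line = C.append pending (C.take i chunk)
           in dropCR line : go C.empty (C.drop (i + 1) chunk : rest)

    -- Drop one trailing '\r', which at this point directly precedes a '\n'.
    dropCR s
      | not (C.null s) && C.last s == '\r' = C.init s
      | otherwise = s
```

Feeding the chunks `"x\r"` and `"\n\r\n"` gives the same lines as feeding the single chunk `"x\r\n\r\n"`, namely `"x"` followed by an empty line.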

@vdukhovni
Contributor

I also wonder whether it really makes sense to fix the underlying issue. When the open file has a non-trivial encoding, the bytestring hGetLine does not make any effort to handle that, and the user is simply expected to set the input mode to binary. Why should CRLF be handled when encodings are not? Quote:

ByteString I/O uses binary mode, without any character decoding or newline conversion.
The fact that it does not respect the Handle newline mode is considered a flaw and may
be changed in a future version.

I am not entirely sure this is actually a flaw. Once the encoding is ignored, all bets are off. If the file is UTF-16, does it really help to try to handle CRLF, when it'll actually be (little-endian) \r\0\n\0?
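The point about wide encodings can be made concrete with a small sketch. `asciiToUtf16LE` is a hypothetical helper that only handles ASCII input, where the UTF-16LE form is each byte followed by a zero byte. A byte-level search for `"\r\n"` then no longer finds the line ending.

```haskell
import qualified Data.ByteString.Char8 as C

-- | Hypothetical helper, valid for ASCII-only input: the UTF-16LE encoding
-- of an ASCII character is that byte followed by a zero byte.
asciiToUtf16LE :: C.ByteString -> C.ByteString
asciiToUtf16LE = C.concatMap (\c -> C.pack [c, '\0'])

-- | Byte-level check for an adjacent CR LF pair.
containsCRLF :: C.ByteString -> Bool
containsCRLF = C.isInfixOf (C.pack "\r\n")
```

For the input `"a\r\nb"`, `containsCRLF` holds on the original bytes but not on `asciiToUtf16LE` of them, because a zero byte now sits between `\r` and `\n`.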

@Bodigrim Bodigrim linked an issue Nov 29, 2020 that may be closed by this pull request
@Bodigrim
Contributor

Bodigrim commented Nov 29, 2020

If we do not make any assumptions about encoding at all, we should remove many functions, starting from words. In fact encoding assumptions are given at the very beginning:

-- More specifically these byte strings are taken to be in the
-- subset of Unicode covered by code points 0-255. This covers
-- Unicode Basic Latin, Latin-1 Supplement and C0+C1 Controls.

Why should CRLF be handled when encodings are not?

This is a pragmatic choice. We can honour CRLF choice without getting into too much trouble, and there is users' demand for it. Covering encodings is more challenging.

@vdukhovni
Contributor

I understand that this is a pragmatic choice, but we're in a state of sin, and perhaps papering over just some of the friction can be considered to be setting up false expectations...

Indeed, it might be appropriate to check whether the input is in some non-trivial encoding, where ISO-8859-1 and "binary" count as trivial, as might various other ISO-8859-X variants. On the other hand, it is not entirely obvious to me how to determine which encodings are safe to ignore, and which should raise errors.

@vdukhovni
Copy link
Contributor

Anyway, if there's general consensus this should move forward, despite the asymmetry with encoding, I won't stand in the way. Of course the implementation needs to be correct, and should not impose an undue performance cost, especially when the NewlineMode is noNewlineTranslation. On Unix systems one generally expects:

ghci> import System.IO
ghci> import GHC.IO.Handle.Types
ghci> import GHC.IO.Handle.Internals
ghci> withHandle_ "" stdin $ \h -> pure $ (== noNewlineTranslation) $ NewlineMode (haInputNL h) (haOutputNL h)
True

@Bodigrim
Contributor

Bodigrim commented Jan 6, 2021

Yes, I think we should move this forward. @dbramucci are you still interested in carrying on with this PR?

@vdukhovni
Contributor

vdukhovni commented Dec 8, 2022

This PR has stalled. The original author has not stepped forward to fix the split CRLF issue. Should we close this, or is someone else willing to take this over and fix it?

Also, having just fixed strictness in lazy bytestring lines, and spent most of the effort also fine-tuning performance, the question is then naturally of whether the newline handling of hGetLine should be mirrored in lines, which then becomes more complex and probably slower.

Also the Streaming.ByteStream module from streaming-bytestring has lineSplit, which preserves the newline endings, leaving the stream content undisturbed, but still partitioned into per-line substreams. Perhaps ByteString should also have a lineSplit-like function, that leaves the decision of what if anything to trim at the ends of each line to the user.

@Bodigrim
Contributor

Bodigrim commented Dec 9, 2022

Also, having just fixed strictness in lazy bytestring lines, and spent most of the effort also fine-tuning performance, the question is then naturally of whether the newline handling of hGetLine should be mirrored in lines, which then becomes more complex and probably slower.

Data.List describes lines as splitting by \n, never by \r, and Data.ByteString.lines should stick to the same semantics.

Also the Streaming.ByteStream module from streaming-bytestring has lineSplit, which preserves the newline endings, leaving the stream content undisturbed, but still partitioned into per-line substreams. Perhaps ByteString should also have a lineSplit-like function, that leaves the decision of what if anything to trim at the ends of each line to the user.

Sounds like a good idea to me.
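A rough sketch of what such a function might look like for strict ByteStrings follows. The names `lineSplitKeep` and `trimEOL` are hypothetical, and this is neither the streaming-bytestring API nor anything that exists in bytestring. Each piece keeps its terminator, so concatenating the pieces gives back the input, and trimming is a separate, optional step.

```haskell
import qualified Data.ByteString.Char8 as C

-- | Split after every '\n', keeping the terminator on each piece, so that
-- C.concat (lineSplitKeep bs) == bs.
lineSplitKeep :: C.ByteString -> [C.ByteString]
lineSplitKeep bs
  | C.null bs = []
  | otherwise = case C.elemIndex '\n' bs of
      Nothing -> [bs]
      Just i ->
        let (line, rest) = C.splitAt (i + 1) bs
         in line : lineSplitKeep rest

-- | One possible trimming policy left to the caller: drop a trailing "\n",
-- and a "\r" only if it directly preceded that "\n".
trimEOL :: C.ByteString -> C.ByteString
trimEOL s
  | C.null s || C.last s /= '\n' = s
  | otherwise =
      let s' = C.init s
       in if not (C.null s') && C.last s' == '\r' then C.init s' else s'
```

A caller that wants CRLF-aware lines can use `map trimEOL . lineSplitKeep`, while a caller that needs the original bytes can skip the trimming step.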

Successfully merging this pull request may close these issues.

getLine doesn't honor handle's newline setting
3 participants