getLine doesn't honor handle's newline setting #13

erantapaa · 2014-03-15T01:43:13Z

It would be nice if BS.getLine would use the handle's newline setting so that in CRLF mode strings are returned with both the CR (if present) and LF removed.

An example of code that demonstrates the behavior:

http://stackoverflow.com/questions/22417171/bs-getline-and-crlf-endings

The text was updated successfully, but these errors were encountered:

dcoutts · 2014-12-19T10:49:56Z

Yeah. I have a complete rewrite of the bytestring IO code that's mostly done. The new code respects the newline mode. But since it is slightly different behaviour I'll probably include the IO actions in a new module (and eventually deprecate the existing ones).

int-index · 2016-02-01T09:43:56Z

What's the status on this?

fumieval · 2017-10-04T10:42:02Z

hGetLine from System.IO handles CR (http://hackage.haskell.org/package/base-4.10.0.0/docs/src/GHC.IO.Handle.Text.html#hGetLine); this discrepancy seems bad. Would it be acceptable to change the behaviour?

arkrost · 2020-05-07T10:50:26Z

Any updates?

sjakobi · 2020-07-04T22:45:02Z

Apparently there's quite a bit of demand for a fix.

@dcoutts What happened to the IO code rewrite you mentioned? It would be interesting to see how you solved this issue!

since it is slightly different behaviour I'll probably include the IO actions in a new module (and eventually deprecate the existing ones).

Yeah. I suspect that simply changing the behaviour of getLine and hGetLine could possibly cause some subtle bugs in existing code.

So how about adding variants like getLineCrossPlatform and hGetLineCrossPlatform that respect the handle's newline settings?

For reference, here's the relevant code:

bytestring/Data/ByteString.hs

Lines 1618 to 1668 in 39ad965

    
           -- | Read a line from stdin. 
        
           getLine :: IO ByteString 
        
           getLine = hGetLine stdin 
        
           -- | Read a line from a handle 
        
           hGetLine :: Handle -> IO ByteString 
        
           hGetLine h = 
        
             wantReadableHandle_ "Data.ByteString.hGetLine" h $ 
        
               \ h_@Handle__{haByteBuffer} -> do 
        
                 flushCharReadBuffer h_ 
        
                 buf <- readIORef haByteBuffer 
        
                 if isEmptyBuffer buf 
        
                    then fill h_ buf 0 [] 
        
                    else haveBuf h_ buf 0 [] 
        
            where 
        
             fill h_@Handle__{haByteBuffer,haDevice} buf !len xss = do 
        
               (r,buf') <- Buffered.fillReadBuffer haDevice buf 
        
               if r == 0 
        
                  then do writeIORef haByteBuffer buf{ bufR=0, bufL=0 } 
        
                          if len > 0 
        
                             then mkBigPS len xss 
        
                             else ioe_EOF 
        
                  else haveBuf h_ buf' len xss 
        
             haveBuf h_@Handle__{haByteBuffer} 
        
                     buf@Buffer{ bufRaw=raw, bufR=w, bufL=r } 
        
                     len xss = 
        
               do 
        
                   off <- findEOL r w raw 
        
                   let new_len = len + off - r 
        
                   xs <- mkPS raw r off 
        
                 -- if eol == True, then off is the offset of the '\n' 
        
                 -- otherwise off == w and the buffer is now empty. 
        
                   if off /= w 
        
                       then do if w == off + 1 
        
                                       then writeIORef haByteBuffer buf{ bufL=0, bufR=0 } 
        
                                       else writeIORef haByteBuffer buf{ bufL = off + 1 } 
        
                               mkBigPS new_len (xs:xss) 
        
                       else fill h_ buf{ bufL=0, bufR=0 } new_len (xs:xss) 
        
             -- find the end-of-line character, if there is one 
        
             findEOL r w raw 
        
                   | r == w = return w 
        
                   | otherwise =  do 
        
                       c <- readWord8Buf raw r 
        
                       if c == fromIntegral (ord '\n') 
        
                           then return r -- NB. not r+1: don't include the '\n' 
        
                           else findEOL (r+1) w raw

Bodigrim · 2020-09-26T17:31:55Z

I think that we should bring Data.ByteString.hGetLine in sync with System.IO.hGetLine. Ignoring haInputNL is clearly a grave bug, and I cannot imagine anyone making sensible use of the current behaviour. So I'm in favor of bug fixing hGetLine instead of introducing a new function, but anyways a PR is very welcome.

dbramucci · 2020-10-23T01:31:55Z

I'd like to give this a shot, although it'll be my first time writing with low-level Haskell IO.

I intend to make Data.ByteString.hGetLine's handling of newlines consistent with System.IO.hGetLine.
So that

BS.getLine = fmap BS.pack getLine

holds.

dbramucci · 2020-10-23T01:37:32Z

Referencing issue #249, I'm thinking that it makes sense to put the fixed version of hGetLine into Data.ByteString.Char8.hGetLine.

Then, legacy code won't break (i.e. in case they were compensating for the extra \rs on their own, we don't want them to start chopping off the last meaningful character on Windows) and we can have the nice behaving Data.ByteString.Char8.hGetLine that we wanted in the first-place.
The depreciation of ByteString.hGetLine can then be for both reasons, the strange Windows behavior and the ASCII encoding assumption maybe with a message like.

 {-# DEPRECATED getLine 
     "Use Data.ByteString.Char8.getLine instead. (Data.ByteString.getLine does not correctly handle \r\n on Windows; Functions that rely on ASCII encodings belong in Data.ByteString.Char8)" 
   #-}

Bodigrim · 2020-10-23T23:59:43Z

I intend to make Data.ByteString.hGetLine's handling of newlines consistent with System.IO.hGetLine.

Agreed, feel free to go ahead.

Then, legacy code won't break...

The thing is that existing Data.ByteString.hGetLine is re-exported from Data.ByteString.Char8 as well. So one cannot confidently guarantee that legacy code, compenstating for \r on its own, does not break.

We can bikeshed whether this constitutes a bug fix or a breaking change later on.

This closes issue haskell#13. The changes can be summarized as updating `findEOL` to look for "\r\n" in CRLF mode and updating the logic of `haveBuf` to resize the buffer according to the size of the newline. Additionally, tests were added to verify that both `hGetLine`s produce the same behavior. Some of the edge-cases to worry about here include * '\n' still counts as a line end. Thus line endings' length vary between 1 and 2 in CRLF mode. * "\r\r\n" can give a false-start. This means you can't always skip 2 characters when `c' /= '\n'`. * '\r' when not followed by '\n' isn't part of a newline. * Not reading out of the buffer when '\r' is the last character.

dbramucci · 2020-10-28T03:38:25Z

Ah, that's unfortunate. I overlooked that re-export.

This closes issue haskell#13. The changes can be summarized as updating `findEOL` to look for "\r\n" in CRLF mode and updating the logic of `haveBuf` to resize the buffer according to the size of the newline. Additionally, tests were added to verify that both `hGetLine`s produce the same behavior. Some of the edge-cases to worry about here include * '\n' still counts as a line end. Thus line endings' length vary between 1 and 2 in CRLF mode. * "\r\r\n" can give a false-start. This means you can't always skip 2 characters when `c' /= '\n'`. * '\r' when not followed by '\n' isn't part of a newline. * Not reading out of the buffer when '\r' is the last character.

dbramucci · 2020-10-28T04:40:17Z

I've implemented this along with some tests to verify it works but I still have 2 concerns with this change.

I haven't think of a good way to test the property that
```
BS.hGetLine = BS.pack . System.IO.hGetLine -- when all the data is ascii
```
because there is an IO in the way and I haven't thought of an efficient way to make handles, other than making files and verifying that they are the same by reading it both ways.

As of now, I just test the property over the same file, ending in 4 different ways, with all 4 choices of NewlineMode.
I currently have no benchmarks for BS.hGetLine

I'm concerned that this implementation may cause performance regressions but I don't want to start optimizing until I have some measurements to work off of.

dbramucci · 2020-10-28T05:03:49Z

Also, an interesting quirk to notice is that System.IO.hGetLine treats both, \n and \r\n as line endings when haInputNL=CRLF.

I'm not sure if that is "correct windows behavior" or not, specifically I recall old versions of notepad mushing all lines onto one line when separated by \ns.

Bodigrim · 2020-10-28T18:26:16Z

You can use IO in QuickCheck properties, see Test.QuickCheck.ioProperty
Let's have a correct implementation first and worry about performance later.

Also, an interesting quirk to notice is that System.IO.hGetLine treats both, \n and \r\n as line endings when haInputNL=CRLF.

Interesting. I'm in favor of mirroring this behaviour.

sjakobi · 2020-11-01T11:18:35Z

Also, an interesting quirk to notice is that System.IO.hGetLine treats both, \n and \r\n as line endings when haInputNL=CRLF.

Interesting. I'm in favor of mirroring this behaviour.

👍

dbramucci · 2020-11-07T04:13:40Z

I have written the property test along with a newtype LinedASCII that generates ASCII with a lot of newlines.
This was needed because one of the sequences that trips up badly behaved implementations, "\r\r\n" only has a 1/2^(21) ~ 1 in 2 million chance of occurring in a 3-char substring.

This meant that one had to run a few million tests to reliably catch bugs which is too slow, but the newtype generates a newline 50% of the time, so "\r\r\n" occurs 1 out of 8 3-char substrings, making the test fairly quick and reliable.
Of course, the chance of getting an ~80char long line is microscopic, so there's still room for improvement.

Bodigrim · 2020-11-07T10:41:58Z

@dbramucci nice!

Bodigrim · 2020-11-19T19:51:09Z

@dbramucci how is it going? Do not hesitate to create a draft PR, it could facilitate further discussions.

dbramucci · 2020-11-22T03:43:33Z

I just did a quick experiment and it turns out that System.IO.hGetLine, (the old) ByteString.hGetLine and Data.Text.IO.hGetLine all have different behavior in CRLF mode.

Data.Text.IO.hGetLine takes the string "\ra" and interprets this as the line "\na"

System.IO.hGetLine and the old version of ByteString.hGetLine both leave the \r alone and give the line "\ra".

Notably, I believe this means that Data.Text.IO.hGetLine is the only hGetLine that can return a string containing \n.
System.IO.hGetLine takes the string "foo\r\nbar" and interprets it as providing two lines "foo" and "bar".

The old ByteString.hGetLine gives "foo\r" as the first line and "bar" as the second line.

And while I don't know of any place that relies on this, I wouldn't be surprised to see people caught off-guard by the fact that "foo\nbar" is two lines, even in CRLF mode.

dbramucci · 2020-11-22T03:43:37Z

And on the note of inconsistencies: lines and unlines presume \n is the line feed, so the law we might expect to hold

head . lines <$> hGetContents = hGetLine

cannot hold if the handle is in CRLF input mode.
(At least this is somewhat obvious from the type signature of lines).

soulomoon · 2022-07-02T16:45:33Z

Encounter it in codeforces

sjakobi added bug Windows labels Jun 25, 2020

sjakobi mentioned this issue Jun 25, 2020

Is hSetBinaryMode required for ByteString I/O? #97

Closed

sjakobi modified the milestone: 0.10.12.0 Jul 2, 2020

Bodigrim added the pr-welcome label Sep 26, 2020

dbramucci linked a pull request Nov 22, 2020 that will close this issue

Making ByteString.hGetLine behave like System.IO.hGetLine #327

Draft

Bodigrim linked a pull request Nov 29, 2020 that will close this issue

Making ByteString.hGetLine behave like System.IO.hGetLine #327

Draft

Bodigrim mentioned this issue Jan 16, 2023

0.12.0.0 release planning #573

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getLine doesn't honor handle's newline setting #13

getLine doesn't honor handle's newline setting #13

erantapaa commented Mar 15, 2014

dcoutts commented Dec 19, 2014

int-index commented Feb 1, 2016

fumieval commented Oct 4, 2017

arkrost commented May 7, 2020

sjakobi commented Jul 4, 2020

Bodigrim commented Sep 26, 2020

dbramucci commented Oct 23, 2020

dbramucci commented Oct 23, 2020 •

edited

Loading

Bodigrim commented Oct 23, 2020

dbramucci commented Oct 28, 2020

dbramucci commented Oct 28, 2020

dbramucci commented Oct 28, 2020

Bodigrim commented Oct 28, 2020

sjakobi commented Nov 1, 2020

dbramucci commented Nov 7, 2020

Bodigrim commented Nov 7, 2020

Bodigrim commented Nov 19, 2020

dbramucci commented Nov 22, 2020

dbramucci commented Nov 22, 2020 •

edited

Loading

soulomoon commented Jul 2, 2022

getLine doesn't honor handle's newline setting #13

getLine doesn't honor handle's newline setting #13

Comments

erantapaa commented Mar 15, 2014

dcoutts commented Dec 19, 2014

int-index commented Feb 1, 2016

fumieval commented Oct 4, 2017

arkrost commented May 7, 2020

sjakobi commented Jul 4, 2020

Bodigrim commented Sep 26, 2020

dbramucci commented Oct 23, 2020

dbramucci commented Oct 23, 2020 • edited Loading

Bodigrim commented Oct 23, 2020

dbramucci commented Oct 28, 2020

dbramucci commented Oct 28, 2020

dbramucci commented Oct 28, 2020

Bodigrim commented Oct 28, 2020

sjakobi commented Nov 1, 2020

dbramucci commented Nov 7, 2020

Bodigrim commented Nov 7, 2020

Bodigrim commented Nov 19, 2020

dbramucci commented Nov 22, 2020

dbramucci commented Nov 22, 2020 • edited Loading

soulomoon commented Jul 2, 2022

dbramucci commented Oct 23, 2020 •

edited

Loading

dbramucci commented Nov 22, 2020 •

edited

Loading