Julep: Support universal newlines and make it default for Text IO #19785

mpastell · 2016-12-30T22:18:23Z

I think it would be useful to have universal newlines mode for open similar to Python. A short description of the mode for open from Python 3.6 docs https://docs.python.org/3.6/library/functions.html#open .

Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

When writing output to the stream any '\n' characters written are translated to the system default line separator, os.linesep.

Currently this needs to be handled manually in Julia which I think is a bit annoying and easy to enough to forget.

So I suggest the following:

Add support for universal newlines to open as a flag to mode or a new keyword argument
Make universal newlines the default for methods reading from a stream and returning a string e.g. readstring, readlines, eachline

The text was updated successfully, but these errors were encountered:

tkelman · 2016-12-31T04:54:29Z

Inserting \r bytes into input or output that weren't present in the contents of a file or string isn't a behavior we should perpetuate. Code that's written and executed on unix also needs to be able to deal correctly with \r\n files.

mpastell · 2016-12-31T07:23:52Z

I forgot to say that adding \r when writing doesn't seem that useful. Although Python only inserts '\r' when writing text files and that can also be changed.

When reading \r\n is translated to \n by default which I think is very useful and I have never had the need to change the default during 10 years.

It is also useful to read the original Python PEP https://www.python.org/dev/peps/pep-0278/

tkelman · 2016-12-31T07:49:38Z

I'd be fine with making any line-oriented functions more permissive in what they considered to be a line ending. But I think I/O should generally be binary-faithful. \r\n newlines are almost always best dealt with using dos2unix if you have write permissions on the file in question.

I wasn't much of a fan of #14073, but parsing of Julia source files does normalize line endings from \r\n to \n for the sake of multiline string literals. I don't think we should do the same with all non-Julia I/O.

mpastell · 2017-01-02T10:42:19Z

I think code should to be written in a way that there is no need to call dos2unix. I would find it strange to tell Windows users not to use \r\n in their files.

Of course using the Python way is not the only solution, but I think it works very well.

For instance I recently had a bug in Weave.jl (JunoLab/Weave.jl#72) related to this, which I fixed using replace(s, "\r\n", "\n") on all input files. Also #656 would have been avoided using the universal newlines approach.

Line-oriented functions do work with \r\n and \n (you just need to remember \r exists when you are developing on Windows). I'm not sure if they should also work with other EOL characters as well.

mpastell · 2017-01-02T11:00:48Z

I just opened #19815 . It would not be fixed with this proposal, but shows I'm not the only one who forgot to deal with \r ;)

mpastell · 2017-01-04T17:37:26Z

After giving it some thought I agree fixing line oriented functions as @tkelman suggested is a good solution.
I think it would be best to follow unicode standard (http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf ) and consider all of the following as line ends.

Then

Make readline to break on all of the above options
readlines and eachline use readline so it means they will follow same behavior
chomp should also be fixed to work with NEL and CR
I would like to add splitlines according to Add splitlines? #19759
I suggest adding newline2lf and that can be used to convert line endings to \n (usign utf8proc)

If this is OK I can try to implement it and make pull request.

mpastell · 2017-01-04T17:47:04Z

Actually the Unicode standard also has the following recommendation

I think it is more common to keep the separator in the return value as readline currently does. Any ideas?

nalimilan · 2017-01-04T18:24:46Z

+1, but I'd start with a PR to change eachline and chomp; whether we need splitlines and newline2lf is a different question.

Regarding whether to include the newlines character(s), could you check what popular languages do (in particular recent ones like Rust, Swift and Go)? We could always add an argument to choose the behavior.

mpastell · 2017-01-04T18:39:54Z

Rust and Swift seem to drop newlines by default, Python doesn't.

https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines
https://developer.apple.com/reference/swift/1641199-readline
https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

It'd be good to add dropping the character as an option to Julia readline as well, but making it the default and changing existing behavior would probably break a lot of code.

nalimilan · 2017-01-04T20:54:47Z

OK, thanks for checking. Changing the default would make sense to me. It should be possible by first adding the new argument and deprecating the one-argument method in 0.6, and then changing its meaning in 1.0. Maybe it would be accepted if you can make a PR shortly.

mpastell · 2017-01-04T22:14:37Z

@StefanKarpinski any thoughts on this?

I can try implement the changes to readline, but changing the default behavior breaks many thinks including the REPL. I think this would need strong agreement that changing the default behavior is the right thing to do and needs work from others as well.

I would be fine with keeping the current default (as I'm used to it in Python anyway) and just adding the variant that removes the newlines as an option.

mpastell · 2017-01-05T12:27:36Z

Would it be OK to implement readlineas

function readline(s::IO)
    linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        write(out, c)
        if c in linefeeds
            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
                write(out, read(s, Char))
            end
            break
        end
    end

    return String(take!(out))
end

I'm not sure if options to readline are actually needed as you can simply use

chomp.(readlines(file))

To get rid of EOL characters.

nalimilan · 2017-01-05T13:24:23Z

Anyway if you want to change the default, you'll have to first provide a version with an argument, to allow for a non-breaking deprecation period. So you may as well start with that. Then seeing how much code would need to be changed in Base (and whether it would simplify code or not) would be an interesting data point to make a decision.

The version using chomp is going to be quite slower currently as it needs to create SubString (not sure whether it could really be optimized in the future by changing the String type, cf. #16107). In general it seems to me that skipping the newline characters directly from readline should be faster, since you already look for them, which chomp would have to do once again.

mpastell · 2017-01-05T15:50:50Z

How about?

function readline(s::IO, chomp = true; nl2lf=false)
    nl2lf && (chomp = false)

    linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        if c in linefeeds
            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
                !(nl2lf || chomp) && write(out, c)
                c = read(s, Char)
                !chomp && write(out, c)
            else
                nl2lf && (c = '\n')
                !chomp && write(out, c)
            end

            break
        else
            write(out, c)
        end
    end

    return String(take!(out))
end

If chomp = true EOL will be omitted and if nl2lf = true EOL will be converted to '\n' .

Did I understand correctly that if we want to make chomp = true the default in the future we should define

function readline(s::IO)
    readline(s, false)
end

and add the approriate deprecation warning?

StefanKarpinski · 2017-01-06T22:22:52Z

I guess we could have a chomp::Bool=false argument to eachline and readlines. I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.

nalimilan · 2017-01-07T10:55:02Z

I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.

For now I/O streams are assumed to contain UTF-8 text, since they have no encoding information attached to them anyway. When passing a custom string type to readline, the conversion to UTF-8 is supposed to happen when creating the IOBuffer from the string object.

Custom string types are free to implement their own readline function if they do not use UTF-8. But it would make sense to provide a generic readline(::AbstractString) method which does not go through IOBuffer: that way custom string types do not have to implement streaming conversion to UTF-8, they just need to support getindex. Or we could provide a default IOBuffer(::AbstractString) implementation based on getindex.

mpastell · 2017-01-08T09:16:49Z

I have added support for \r and added chomp::Bool=false for readline` in #19877 . Adding support for other characters (if there is a need) based on that would not be too difficult.

The code that detects line ends also works for most encodings as they share the same control characters.

tkelman added the io Involving the I/O subsystem: libuv, read, write, etc. label Dec 31, 2016

giordano mentioned this issue Jan 2, 2017

Add splitlines? #19759

Closed

mpastell mentioned this issue Jan 5, 2017

RFC: Make EOL handling more general #19877

Closed

5 tasks

ViralBShah added the julep Julia Enhancement Proposal label Mar 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Julep: Support universal newlines and make it default for Text IO #19785

Julep: Support universal newlines and make it default for Text IO #19785

mpastell commented Dec 30, 2016

tkelman commented Dec 31, 2016

mpastell commented Dec 31, 2016 •

edited

Loading

tkelman commented Dec 31, 2016 •

edited

Loading

mpastell commented Jan 2, 2017

mpastell commented Jan 2, 2017

mpastell commented Jan 4, 2017

mpastell commented Jan 4, 2017

nalimilan commented Jan 4, 2017

mpastell commented Jan 4, 2017 •

edited

Loading

nalimilan commented Jan 4, 2017

mpastell commented Jan 4, 2017 •

edited

Loading

mpastell commented Jan 5, 2017

nalimilan commented Jan 5, 2017

mpastell commented Jan 5, 2017

StefanKarpinski commented Jan 6, 2017

nalimilan commented Jan 7, 2017

mpastell commented Jan 8, 2017

Julep: Support universal newlines and make it default for Text IO #19785

Julep: Support universal newlines and make it default for Text IO #19785

Comments

mpastell commented Dec 30, 2016

tkelman commented Dec 31, 2016

mpastell commented Dec 31, 2016 • edited Loading

tkelman commented Dec 31, 2016 • edited Loading

mpastell commented Jan 2, 2017

mpastell commented Jan 2, 2017

mpastell commented Jan 4, 2017

mpastell commented Jan 4, 2017

nalimilan commented Jan 4, 2017

mpastell commented Jan 4, 2017 • edited Loading

nalimilan commented Jan 4, 2017

mpastell commented Jan 4, 2017 • edited Loading

mpastell commented Jan 5, 2017

nalimilan commented Jan 5, 2017

mpastell commented Jan 5, 2017

StefanKarpinski commented Jan 6, 2017

nalimilan commented Jan 7, 2017

mpastell commented Jan 8, 2017

mpastell commented Dec 31, 2016 •

edited

Loading

tkelman commented Dec 31, 2016 •

edited

Loading

mpastell commented Jan 4, 2017 •

edited

Loading

mpastell commented Jan 4, 2017 •

edited

Loading