Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Julep: Support universal newlines and make it default for Text IO #19785

Open
mpastell opened this issue Dec 30, 2016 · 17 comments
Open

Julep: Support universal newlines and make it default for Text IO #19785

mpastell opened this issue Dec 30, 2016 · 17 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc. julep Julia Enhancement Proposal

Comments

@mpastell
Copy link
Contributor

I think it would be useful to have universal newlines mode for open similar to Python. A short description of the mode for open from Python 3.6 docs https://docs.python.org/3.6/library/functions.html#open .

Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

When writing output to the stream any '\n' characters written are translated to the system default line separator, os.linesep.

Currently this needs to be handled manually in Julia which I think is a bit annoying and easy to enough to forget.

So I suggest the following:

  1. Add support for universal newlines to open as a flag to mode or a new keyword argument
  2. Make universal newlines the default for methods reading from a stream and returning a string e.g. readstring, readlines, eachline
@tkelman
Copy link
Contributor

tkelman commented Dec 31, 2016

Inserting \r bytes into input or output that weren't present in the contents of a file or string isn't a behavior we should perpetuate. Code that's written and executed on unix also needs to be able to deal correctly with \r\n files.

@tkelman tkelman added the io Involving the I/O subsystem: libuv, read, write, etc. label Dec 31, 2016
@mpastell
Copy link
Contributor Author

mpastell commented Dec 31, 2016

I forgot to say that adding \r when writing doesn't seem that useful. Although Python only inserts '\r' when writing text files and that can also be changed.

When reading \r\n is translated to \n by default which I think is very useful and I have never had the need to change the default during 10 years.

It is also useful to read the original Python PEP https://www.python.org/dev/peps/pep-0278/

@tkelman
Copy link
Contributor

tkelman commented Dec 31, 2016

I'd be fine with making any line-oriented functions more permissive in what they considered to be a line ending. But I think I/O should generally be binary-faithful. \r\n newlines are almost always best dealt with using dos2unix if you have write permissions on the file in question.

I wasn't much of a fan of #14073, but parsing of Julia source files does normalize line endings from \r\n to \n for the sake of multiline string literals. I don't think we should do the same with all non-Julia I/O.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 2, 2017

I think code should to be written in a way that there is no need to call dos2unix. I would find it strange to tell Windows users not to use \r\n in their files.

Of course using the Python way is not the only solution, but I think it works very well.

For instance I recently had a bug in Weave.jl (JunoLab/Weave.jl#72) related to this, which I fixed using replace(s, "\r\n", "\n") on all input files. Also #656 would have been avoided using the universal newlines approach.

Line-oriented functions do work with \r\n and \n (you just need to remember \r exists when you are developing on Windows). I'm not sure if they should also work with other EOL characters as well.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 2, 2017

I just opened #19815 . It would not be fixed with this proposal, but shows I'm not the only one who forgot to deal with \r ;)

@mpastell
Copy link
Contributor Author

mpastell commented Jan 4, 2017

After giving it some thought I agree fixing line oriented functions as @tkelman suggested is a good solution.
I think it would be best to follow unicode standard (http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf ) and consider all of the following as line ends.

unicode

Then

  1. Make readline to break on all of the above options
  2. readlines and eachline use readline so it means they will follow same behavior
  3. chomp should also be fixed to work with NEL and CR
  4. I would like to add splitlines according to Add splitlines? #19759
  5. I suggest adding newline2lf and that can be used to convert line endings to \n (usign utf8proc)

If this is OK I can try to implement it and make pull request.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 4, 2017

Actually the Unicode standard also has the following recommendation

readline

I think it is more common to keep the separator in the return value as readline currently does. Any ideas?

@nalimilan
Copy link
Member

+1, but I'd start with a PR to change eachline and chomp; whether we need splitlines and newline2lf is a different question.

Regarding whether to include the newlines character(s), could you check what popular languages do (in particular recent ones like Rust, Swift and Go)? We could always add an argument to choose the behavior.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 4, 2017

Rust and Swift seem to drop newlines by default, Python doesn't.

https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines
https://developer.apple.com/reference/swift/1641199-readline
https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

It'd be good to add dropping the character as an option to Julia readline as well, but making it the default and changing existing behavior would probably break a lot of code.

@nalimilan
Copy link
Member

OK, thanks for checking. Changing the default would make sense to me. It should be possible by first adding the new argument and deprecating the one-argument method in 0.6, and then changing its meaning in 1.0. Maybe it would be accepted if you can make a PR shortly.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 4, 2017

@StefanKarpinski any thoughts on this?

I can try implement the changes to readline, but changing the default behavior breaks many thinks including the REPL. I think this would need strong agreement that changing the default behavior is the right thing to do and needs work from others as well.

I would be fine with keeping the current default (as I'm used to it in Python anyway) and just adding the variant that removes the newlines as an option.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 5, 2017

Would it be OK to implement readlineas

function readline(s::IO)
    linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        write(out, c)
        if c in linefeeds
            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
                write(out, read(s, Char))
            end
            break
        end
    end

    return String(take!(out))
end

I'm not sure if options to readline are actually needed as you can simply use

chomp.(readlines(file))

To get rid of EOL characters.

@nalimilan
Copy link
Member

Anyway if you want to change the default, you'll have to first provide a version with an argument, to allow for a non-breaking deprecation period. So you may as well start with that. Then seeing how much code would need to be changed in Base (and whether it would simplify code or not) would be an interesting data point to make a decision.

The version using chomp is going to be quite slower currently as it needs to create SubString (not sure whether it could really be optimized in the future by changing the String type, cf. #16107). In general it seems to me that skipping the newline characters directly from readline should be faster, since you already look for them, which chomp would have to do once again.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 5, 2017

How about?

function readline(s::IO, chomp = true; nl2lf=false)
    nl2lf && (chomp = false)

    linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        if c in linefeeds
            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
                !(nl2lf || chomp) && write(out, c)
                c = read(s, Char)
                !chomp && write(out, c)
            else
                nl2lf && (c = '\n')
                !chomp && write(out, c)
            end

            break
        else
            write(out, c)
        end
    end

    return String(take!(out))
end

If chomp = true EOL will be omitted and if nl2lf = true EOL will be converted to '\n' .

Did I understand correctly that if we want to make chomp = true the default in the future we should define

function readline(s::IO)
    readline(s, false)
end

and add the approriate deprecation warning?

@StefanKarpinski
Copy link
Member

I guess we could have a chomp::Bool=false argument to eachline and readlines. I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.

@nalimilan
Copy link
Member

I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.

For now I/O streams are assumed to contain UTF-8 text, since they have no encoding information attached to them anyway. When passing a custom string type to readline, the conversion to UTF-8 is supposed to happen when creating the IOBuffer from the string object.

Custom string types are free to implement their own readline function if they do not use UTF-8. But it would make sense to provide a generic readline(::AbstractString) method which does not go through IOBuffer: that way custom string types do not have to implement streaming conversion to UTF-8, they just need to support getindex. Or we could provide a default IOBuffer(::AbstractString) implementation based on getindex.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 8, 2017

I have added support for \r and added chomp::Bool=false for readline` in #19877 . Adding support for other characters (if there is a need) based on that would not be too difficult.

The code that detects line ends also works for most encodings as they share the same control characters.

@ViralBShah ViralBShah added the julep Julia Enhancement Proposal label Mar 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. julep Julia Enhancement Proposal
Projects
None yet
Development

No branches or pull requests

5 participants