Introduce Encoding parametric singleton type #9
base: master
Conversation
Force-pushed from 3801bec to 18fd160.
First step towards efficient encoders for common encodings, as well as towards providing information about encodings. This also allows adding convenience methods to base I/O functions taking an additional encoding parameter without risking ambiguities.
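(As a rough illustration of the idea only, not necessarily the exact code in this PR: a parametric singleton type carries the encoding in its type parameter, so an extra argument of that type cannot clash with existing method signatures. The names `Encoding`, `read_string` and `decode_bytes` below are made up for the sketch.)

```julia
# Minimal sketch of a parametric singleton encoding type.
struct Encoding{enc} end                       # `enc` is a Symbol naming the encoding
Encoding(name::AbstractString) = Encoding{Symbol(name)}()

# Convenience I/O method taking an extra encoding argument.  Since the new
# argument has its own type, it cannot create ambiguities with existing
# read-style methods.
function read_string(io::IO, enc::Encoding)
    bytes = read(io)                           # read all remaining bytes
    return decode_bytes(bytes, enc)            # hypothetical decoder dispatching on `enc`
end

# Trivial decoder for UTF-8, just to make the sketch self-contained.
decode_bytes(bytes::Vector{UInt8}, ::Encoding{Symbol("UTF-8")}) = String(bytes)
```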
I'll start reviewing this this weekend. Did you look at the discussions about making the encodings use traits? I think the encodings can be classified by the code unit, whether they are native or opposite endian (for cases where the code unit is 2 or 4 bytes), whether they take 1, 2, or more code units to represent each code point, and whether or not the code points are Unicode (UTF-8, UTF-16, UTF-32 and variants), |
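(A hedged sketch of how that classification could be expressed as simple query functions on a singleton encoding type; the function names and values below are illustrative, not this PR's API.)

```julia
struct Encoding{enc} end                       # as in the sketch above

# Width of one code unit, expressed as the corresponding integer type.
codeunit_type(::Encoding{:UTF8})    = UInt8
codeunit_type(::Encoding{:UTF16LE}) = UInt16
codeunit_type(::Encoding{:UTF32BE}) = UInt32

# Whether multi-byte code units are stored in the host's native byte order.
native_endian(::Encoding{:UTF16LE}) = ENDIAN_BOM == 0x04030201
native_endian(::Encoding{:UTF16BE}) = ENDIAN_BOM == 0x01020304

# Maximum number of code units needed to represent one code point.
max_units_per_codepoint(::Encoding{:UTF8})    = 4
max_units_per_codepoint(::Encoding{:UTF16LE}) = 2
max_units_per_codepoint(::Encoding{:UTF32BE}) = 1

# Whether the code points are Unicode code points.
is_unicode(::Encoding{:UTF8})  = true
is_unicode(::Encoding{:CP437}) = false
```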
"1026", "1046", "1047", "10646-1:1993", "10646-1:1993/UCS4", | ||
"437", "500", "500V1", "850", "851", "852", "855", "856", "857", | ||
"860", "861", "862", "863", "864", "865", "866", "866NAV", "869", | ||
"874", "8859_1", "8859_2", "8859_3", "8859_4", "8859_5", "8859_6", |
8859_1 is a synonym for ANSI Latin 1, which I think should be classified separately, as it is purely an 8-bit subset of Unicode.
Yes, as I noted, there's a lot of classification work to do here. I've just started moving a few of these to `encodings8` to test how it works.
Anyway, if we want to store more properties about each encoding, we should create an immutable with a few fields, and make an array of that, instead of storing only the name.
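(Something along these lines, with made-up field names and data, is what such an immutable-plus-table layout could look like; in current Julia a `struct` is immutable by default.)

```julia
# Hypothetical record describing one canonical encoding; the field set is
# purely illustrative.
struct EncodingInfo
    name::String            # canonical name
    codeunit_size::Int      # size of one code unit, in bytes
    ascii_compatible::Bool  # bytes 0x00-0x7F map to ASCII
    unicode::Bool           # code points are Unicode code points
end

# Array of such records instead of an array of bare names.
const ENCODINGS_TABLE = [
    EncodingInfo("UTF-8",    1, true,  true),
    EncodingInfo("UTF-16LE", 2, false, true),
    EncodingInfo("CP437",    1, true,  false),
]
```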
Yes. Maybe the table could hold just the properties of the canonical encodings, with a second table mapping all of the string names to the corresponding entry?
What ideas do you have for those sorts of structures?
I'll push a proposal shortly. Indeed, it sounds like keeping a separate list of aliases will make everything shorter and easier to maintain.
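(A possible shape for that separate alias list, sketched here with made-up names and only a handful of entries:)

```julia
# Map every known alias to its canonical encoding name; the canonical names
# would in turn index into the table of per-encoding properties.
const ENCODING_ALIASES = Dict(
    "UTF-8"      => "UTF-8",      "UTF8"   => "UTF-8",
    "ISO-8859-1" => "ISO-8859-1", "8859_1" => "ISO-8859-1", "LATIN1" => "ISO-8859-1",
)

canonical_name(name::AbstractString) = ENCODING_ALIASES[name]
```

Keeping the aliases in a single `Dict` like this means each canonical encoding's properties are stated exactly once, and adding a new spelling is a one-line change.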
This definitely looks like a good start! I hope you don't mind all the comments!
I was also just thinking that a lot of the classification I'd like to see can be done programmatically, for example checking whether an 8-bit character set is ASCII-compatible, and whether it is single-, double-, or multi-byte, by running through all of the characters in the Unicode character set and checking the results.
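(A rough sketch of such a programmatic check, assuming some `decode(bytes, encoding_name)` function is available, e.g. backed by iconv; the helper and its signature are hypothetical.)

```julia
# Check whether an 8-bit encoding maps the bytes 0x00-0x7F to the same
# characters as ASCII.  `decode` is assumed to take a byte vector and an
# encoding name and return a String.
function ascii_compatible(decode, encoding_name::AbstractString)
    for b in 0x00:0x7f
        s = try
            decode([b], encoding_name)
        catch
            return false            # the byte is not valid on its own
        end
        length(s) == 1 && codepoint(only(s)) == b || return false
    end
    return true
end
```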
Actually, I've just bumped into this: http://demo.icu-project.org/icu-bin/convexp?conv=hp-roman8 It seems that ICU provides information about all encodings, and in particular whether an encoding is ASCII-compatible.
Ah, that's great; I see it also has the information needed to decide whether an encoding is single, double, or multi code unit.
@ScottPJones Please have a look at the stub.
The new `encodinginfo` stuff looks much better, yes.
I think it would be easier to take the code that does the same thing in iconv-lite:
Yes, that was the idea. Using the Tim Holy traits trick based on the encodings info, it should be easy to override the current …
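(For reference, a minimal sketch of the Tim Holy traits trick applied to encodings; the trait names and the function being specialized are illustrative only.)

```julia
struct Encoding{enc} end

# Trait values describing the code unit width.
abstract type CodeUnitWidth end
struct SingleByte <: CodeUnitWidth end
struct DoubleByte <: CodeUnitWidth end

# The trait function maps each encoding to a trait value...
codeunitwidth(::Encoding{:latin1})  = SingleByte()
codeunitwidth(::Encoding{:UTF16LE}) = DoubleByte()

# ...and the public function forwards to an implementation selected by the
# trait, so a new encoding only needs a `codeunitwidth` method, not a full
# reimplementation.
codeunits_of(enc::Encoding, data::Vector{UInt8}) = codeunits_of(codeunitwidth(enc), data)
codeunits_of(::SingleByte, data::Vector{UInt8}) = data
codeunits_of(::DoubleByte, data::Vector{UInt8}) = reinterpret(UInt16, data)
```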
Those generators from iconv-lite look nice (even if they are in JS instead of Julia! ;-) ). I see he does what I'd been talking about and checks whether the first half of the table is the same as ASCII. It would really be nice for Julia to have best-in-class support for character sets, encodings & strings, even compared to Python 3 and Swift 2.0!
I think traits allow for exactly this kind of thing. You just need to add methods for …
Force-pushed from 36ce0a5 to a87aaa9.
bump (even though it's your own PR ;-) )
That's not at the top of my priorities right now, though I'd be happy to review a PR if you want to update it. Do you need a particular feature?
OK, I'm not sure how I'd make a PR on this PR though.
Just open a new PR. Anyway, only the second commit is useful here IIRC.
First step towards efficient encoders for common encodings, as well as towards providing information about encodings. This also allows adding convenience methods to base I/O functions taking an additional encoding parameter without risking ambiguities. See the new tests for an illustration of the API.
@ScottPJones What do you think of this PR? I've tried implementing most of the features from quinnj/Strings.jl#3, but with a parametric singleton type `Encoding`. This allows supporting arbitrary encodings, and generating methods on the fly without polluting the method table with support for all possible encodings. But I must say I don't know why you need these functions (like `codeunit` or `native_endian`), so I cannot tell whether this will work for you.

TODO:
- `encodings_other`: can all of the non-UTF/UCS encodings be considered as 8-bit?
- `UTF16LE`
- `AbstractString` convenience methods
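Purely as an illustration (the constructor and function names are hypothetical, reused from the sketch near the top of this page, not necessarily the API added by this PR), usage could look roughly like:

```julia
# Build an encoding singleton from its name and pass it as an extra
# argument to an encoding-aware convenience method.
enc = Encoding("UTF-8")               # Encoding{Symbol("UTF-8")}()
open("data.txt") do io
    println(read_string(io, enc))     # read_string as sketched earlier
end
```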