Introduce Encoding parametric singleton type #9
base: master
Conversation
Force-pushed from 3801bec to 18fd160.
First step towards efficient encoders for common encodings, as well as towards providing information about encodings. This also allows adding convenience methods to base I/O functions taking an additional encoding parameter without risking ambiguities.
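(As a rough illustration of the idea only, not necessarily the exact code in this PR: a parametric singleton type carries the encoding in its type parameter, so an extra argument of that type cannot clash with existing method signatures. The names `Encoding`, `read_string` and `decode_bytes` below are made up for the sketch.)

```julia
# Minimal sketch of a parametric singleton encoding type.
struct Encoding{enc} end                       # `enc` is a Symbol naming the encoding
Encoding(name::AbstractString) = Encoding{Symbol(name)}()

# Convenience I/O method taking an extra encoding argument.  Since the new
# argument has its own type, it cannot create ambiguities with existing
# read-style methods.
function read_string(io::IO, enc::Encoding)
    bytes = read(io)                           # read all remaining bytes
    return decode_bytes(bytes, enc)            # hypothetical decoder dispatching on `enc`
end

# Trivial decoder for UTF-8, just to make the sketch self-contained.
decode_bytes(bytes::Vector{UInt8}, ::Encoding{Symbol("UTF-8")}) = String(bytes)
```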
I'll start reviewing this this weekend. Did you look at the discussions about making the encodings use traits? I think the encodings can be classified by the code unit, whether they are native or opposite endian (for cases where the code unit is 2 or 4 bytes), whether they take 1, 2, or more code units to represent each code point, and whether or not the code points are Unicode (UTF-8, UTF-16, UTF-32 and variants), |
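(A hedged sketch of how that classification could be expressed as simple query functions on a singleton encoding type; the function names and values below are illustrative, not this PR's API.)

```julia
struct Encoding{enc} end                       # as in the sketch above

# Width of one code unit, expressed as the corresponding integer type.
codeunit_type(::Encoding{:UTF8})    = UInt8
codeunit_type(::Encoding{:UTF16LE}) = UInt16
codeunit_type(::Encoding{:UTF32BE}) = UInt32

# Whether multi-byte code units are stored in the host's native byte order.
native_endian(::Encoding{:UTF16LE}) = ENDIAN_BOM == 0x04030201
native_endian(::Encoding{:UTF16BE}) = ENDIAN_BOM == 0x01020304

# Maximum number of code units needed to represent one code point.
max_units_per_codepoint(::Encoding{:UTF8})    = 4
max_units_per_codepoint(::Encoding{:UTF16LE}) = 2
max_units_per_codepoint(::Encoding{:UTF32BE}) = 1

# Whether the code points are Unicode code points.
is_unicode(::Encoding{:UTF8})  = true
is_unicode(::Encoding{:CP437}) = false
```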
"1026", "1046", "1047", "10646-1:1993", "10646-1:1993/UCS4", | ||
"437", "500", "500V1", "850", "851", "852", "855", "856", "857", | ||
"860", "861", "862", "863", "864", "865", "866", "866NAV", "869", | ||
"874", "8859_1", "8859_2", "8859_3", "8859_4", "8859_5", "8859_6", |
8859_1 is a synonym for ANSI Latin 1, which I think should be classified separately, as it is purely an 8-bit subset of Unicode.
Yes, as I noted, there's a lot of classification work to do here. I've just started moving a few of these to `encodings8` to test how it works.
Anyway, if we want to store more properties about each encoding, we should create an immutable with a few fields, and make an array of that, instead of storing only the name.
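(Something along these lines, with made-up field names and data, is what such an immutable-plus-table layout could look like; in current Julia a `struct` is immutable by default.)

```julia
# Hypothetical record describing one canonical encoding; the field set is
# purely illustrative.
struct EncodingInfo
    name::String            # canonical name
    codeunit_size::Int      # size of one code unit, in bytes
    ascii_compatible::Bool  # bytes 0x00-0x7F map to ASCII
    unicode::Bool           # code points are Unicode code points
end

# Array of such records instead of an array of bare names.
const ENCODINGS_TABLE = [
    EncodingInfo("UTF-8",    1, true,  true),
    EncodingInfo("UTF-16LE", 2, false, true),
    EncodingInfo("CP437",    1, true,  false),
]
```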
Yes. Maybe the table could hold just the properties of the canonical encodings, with a second table mapping all of the string names to the corresponding entry?
What ideas do you have for those sorts of structures?
I'll push a proposal shortly. Indeed, it sounds like keeping a separate list of aliases will make everything shorter and easier to maintain.
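(A possible shape for that separate alias list, sketched here with made-up names and only a handful of entries:)

```julia
# Map every known alias to its canonical encoding name; the canonical names
# would in turn index into the table of per-encoding properties.
const ENCODING_ALIASES = Dict(
    "UTF-8"      => "UTF-8",      "UTF8"   => "UTF-8",
    "ISO-8859-1" => "ISO-8859-1", "8859_1" => "ISO-8859-1", "LATIN1" => "ISO-8859-1",
)

canonical_name(name::AbstractString) = ENCODING_ALIASES[name]
```

Keeping the aliases in a single `Dict` like this means each canonical encoding's properties are stated exactly once, and adding a new spelling is a one-line change.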
This definitely looks like a good start! I hope you don't mind all the comments!
I was also just thinking that a lot of the classification I'd like to see can be done programmatically, for example checking whether an 8-bit character set is ASCII-compatible, and whether it is single-, double-, or multi-byte, by running through all of the characters in the Unicode character set and checking the results.
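(A rough sketch of such a programmatic check, assuming some `decode(bytes, encoding_name)` function is available, e.g. backed by iconv; the helper and its signature are hypothetical.)

```julia
# Check whether an 8-bit encoding maps the bytes 0x00-0x7F to the same
# characters as ASCII.  `decode` is assumed to take a byte vector and an
# encoding name and return a String.
function ascii_compatible(decode, encoding_name::AbstractString)
    for b in 0x00:0x7f
        s = try
            decode([b], encoding_name)
        catch
            return false            # the byte is not valid on its own
        end
        length(s) == 1 && codepoint(only(s)) == b || return false
    end
    return true
end
```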
Actually, I've just bumped into this: http://demo.icu-project.org/icu-bin/convexp?conv=hp-roman8 It seems that ICU provides information about all encodings, and in particular whether an encoding is ASCII-compatible.
Ah, that's great; I see it also has the information needed to decide whether an encoding is single, double, or multi code unit.
@ScottPJones Please have a look at the stub.
The new `encodinginfo` stuff looks much better, yes.
I think it would be easier to take the code that does the same thing in iconv-lite:
Yes, that was the idea. Using the Tim Holy traits trick based on the encodings info, it should be easy to override the current …
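(For reference, a minimal sketch of the Tim Holy traits trick applied to encodings; the trait names and the function being specialized are illustrative only.)

```julia
struct Encoding{enc} end

# Trait values describing the code unit width.
abstract type CodeUnitWidth end
struct SingleByte <: CodeUnitWidth end
struct DoubleByte <: CodeUnitWidth end

# The trait function maps each encoding to a trait value...
codeunitwidth(::Encoding{:latin1})  = SingleByte()
codeunitwidth(::Encoding{:UTF16LE}) = DoubleByte()

# ...and the public function forwards to an implementation selected by the
# trait, so a new encoding only needs a `codeunitwidth` method, not a full
# reimplementation.
codeunits_of(enc::Encoding, data::Vector{UInt8}) = codeunits_of(codeunitwidth(enc), data)
codeunits_of(::SingleByte, data::Vector{UInt8}) = data
codeunits_of(::DoubleByte, data::Vector{UInt8}) = reinterpret(UInt16, data)
```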
Those generators from iconv-lite look nice (even if they are in JS instead of Julia! ;-) ). I see he does what I'd been talking about and checks whether the first half of the table is the same as ASCII. It would really be nice for Julia to have best-in-class support for character sets, encodings & strings, even compared to Python 3 and Swift 2.0!
I think traits allow for exactly this kind of thing. You just need to add methods for …
Force-pushed from 36ce0a5 to a87aaa9.
bump (even though it's your own PR ;-) )
That's not at the top of my priorities right now, though I'd be happy to review a PR if you want to update it. Do you need a particular feature?
OK, I'm not sure how I'd make a PR on this PR though.
Just open a new PR. Anyway, only the second commit is useful here IIRC.
First step towards efficient encoders for common encodings, as well as towards providing information about encodings. This also allows adding convenience methods to base I/O functions taking an additional encoding parameter without risking ambiguities. See the new tests for an illustration of the API.
@ScottPJones What do you think of this PR? I've tried implementing most of the features from quinnj/Strings.jl#3, but with a parametric singleton type `Encoding`. This allows supporting arbitrary encodings, and generating methods on the fly without polluting the method table with support for all possible encodings. But I must say I don't know why you need these functions (like `codeunit` or `native_endian`), so I cannot tell whether this will work for you.

TODO:
- `encodings_other`: can all of the non-UTF/UCS encodings be considered as 8-bit?
- `UTF16LE`
- `AbstractString` convenience methods
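Purely as an illustration (the constructor and function names are hypothetical, reused from the sketch near the top of this page, not necessarily the API added by this PR), usage could look roughly like:

```julia
# Build an encoding singleton from its name and pass it as an extra
# argument to an encoding-aware convenience method.
enc = Encoding("UTF-8")               # Encoding{Symbol("UTF-8")}()
open("data.txt") do io
    println(read_string(io, enc))     # read_string as sketched earlier
end
```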