Characters described as own simplified and traditional variant #408
Comments
These additions to the Unihan database were introduced in Unicode Version 15.0 (2022), and the late John Jenkins was maintaining these properties. I am unable to find a document for these additions, and because the properties are provisional, no document was technically necessary. These cases are described as the complex simplified/traditional relationship as documented in Section 3.7.1 of UAX #38, specifically the first sub-bullet of the fourth bullet of the first set of numbered bullets:
The only solid paper trail that I could find was document L2/22-255 that is in response to the last four sections of document L2/22-226. The following from the top of page 2 seems to be key:
I also found some feedback in PRI #433 (Unicode 14.0.0 Beta), and I suspect that the additions for Unicode Version 15.0 may have been in response to that.
@kenlunde after some thought, here is my take on the description of these two database fields for the first set of numbered bullet points:

I have no problem whatsoever with cases (1), (2), (3) and (4.2). The first three are exactly how I would expect the database to work. I hadn't run across any instances of (4.2), but the logic is sound.

I do have a problem with part of (4.1). When a simplified character can represent two or more traditional characters, that is important information and needs to be in the database. In that context it is logical to label a character its own traditional variant, as long as at least one more traditional variant is given. Preferably there should be two entries in kDefinition to clarify that the character is used in two senses, unless the traditional variants are basically interchangeable.

On the other hand, calling a character its own simplified variant doesn't make much sense to me, since there is no simplification involved. That part of (4.1) should work the same way as case (1) and leave kSimplifiedVariant empty. If the character is listed as one of its own traditional variants, that already implies the character is used in both simplified and traditional environments. Calling it its own simplified variant adds no more information: that is implicit from the data in the kTraditionalVariant field.

Removing the entries labeling a character its own simplified variant would impact the stakeholder in the feedback you mentioned, but only temporarily, since it would be easy to code around such a change.
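To make the suggested change to (4.1) concrete, here is a minimal sketch of one reading of it: drop a character from its own kSimplifiedVariant list, and keep it in its own kTraditionalVariant list only when at least one other traditional variant is present. The function name `apply_proposal` and the list-based representation are assumptions for illustration, not part of the Unihan tooling.

```python
def apply_proposal(cp, simplified, traditional):
    """Apply the suggested (4.1) rule to one character's variant lists.

    cp          -- code point string such as "U+5978"
    simplified  -- values of kSimplifiedVariant for cp
    traditional -- values of kTraditionalVariant for cp
    Returns the adjusted (simplified, traditional) lists.
    """
    # A character is never listed as its own simplified variant.
    new_simplified = [v for v in simplified if v != cp]

    # A self-reference in kTraditionalVariant is kept only if at least
    # one other traditional variant is also listed.
    others = [v for v in traditional if v != cp]
    new_traditional = list(traditional) if others else []

    return new_simplified, new_traditional


# Illustrative input only, not copied from the database.
print(apply_proposal("U+5978", ["U+5978"], ["U+5978", "U+59E6"]))
```

Under this reading, the self-reference in kTraditionalVariant alone signals that the character is used in both simplified and traditional text.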
Cases (1), (2), (3), (4.1), and (4.2) date back to UAX #38 Revision 11 for Unicode Version 6.1 (2012). For the upcoming UTC meeting, the only action that I am comfortable with is to research this issue more, and to work out a solution with the stakeholders.
The Unihan database currently contains 431 instances of characters described as their own variants. This is logically inconsistent. The correct traditional variants should of course remain, but the logically incorrect entries need to be removed.
I have previously reported four of these instances - U+575B 坛, U+5978 奸, U+6784 构 and U+9759 静 - through the official channel. Rather than doing this more than four hundred more times, I instead generated a complete list of all the instances, which is attached:
simplified.txt
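For anyone who wants to reproduce a list like this, here is a minimal sketch. It is not the script used to produce the attachment. It assumes the usual Unihan text layout of three tab-separated columns (code point, field name, space-separated values) with `#` comment lines; the function name `find_self_variants` and the sample lines are illustrative assumptions.

```python
from collections import defaultdict

FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")


def find_self_variants(lines):
    """Return {field: [code points that list themselves as a variant]}."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a code point / field / value record
        cp, field, value = parts
        if field not in FIELDS:
            continue
        # Some variant fields append "<source" tags to a value; strip any.
        targets = [v.split("<")[0] for v in value.split()]
        if cp in targets:
            hits[field].append(cp)
    return dict(hits)


# Illustrative records only; real input would be the lines of a local
# copy of Unihan_Variants.txt opened with encoding="utf-8".
sample = [
    "# sample",
    "U+5978\tkSimplifiedVariant\tU+5978",
    "U+5978\tkTraditionalVariant\tU+5978 U+59E6",
    "U+59E6\tkSimplifiedVariant\tU+5978",
]
print(find_self_variants(sample))
```

Running the same function over the full variants file should give the per-field lists of self-referencing code points, which can then be compared against the attachment.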
How would you like to proceed on this issue? Since kSimplifiedVariant and kTraditionalVariant are provisional fields, we could work through files already here in this repository, once updated to the current Unicode version. For a mass update like this, however, it might be simpler for you to update the official copy of the database directly, then regenerate files here for a final check.
Let me know.