Characters described as own simplified and traditional variant #408

paulmasson · 2024-10-05T00:49:22Z

The Unihan database currently contains 431 instances of characters described as their own variants. This is logically inconsistent. The correct traditional variants should of course remain, but the logically incorrect entries need to be removed.

I have previously reported four of these instances - U+575B 坛, U+5978 奸, U+6784 构 and U+9759 静 - through the official channel. Rather than repeating this more than four hundred times, I generated a complete list of all the instances, which is attached:

simplified.txt
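
For readers who want to reproduce a list like this, here is a minimal sketch. It is not the script used to produce the attachment. It assumes the variant data uses tab-separated lines of the form `code point<TAB>field<TAB>space-separated values`, with `#` comment lines, and the helper name `find_self_variants` and the sample lines are illustrative only (they mirror the U+540E 后 example from UAX #38).

```python
from collections import defaultdict

# The two fields discussed in this issue.
FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")


def find_self_variants(lines):
    """Return {field: [code points that list themselves as a value]}."""
    found = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        code_point, field, value = parts
        if field in FIELDS and code_point in value.split():
            found[field].append(code_point)
    return dict(found)


# Illustrative sample in the assumed format, not a copy of the real file.
SAMPLE = [
    "# illustrative sample",
    "U+540E\tkSimplifiedVariant\tU+540E",
    "U+540E\tkTraditionalVariant\tU+540E U+5F8C",
    "U+5F8C\tkSimplifiedVariant\tU+540E",
]

self_variants = find_self_variants(SAMPLE)
print(self_variants)
```

To run it against real data, the same function could be passed an open file handle instead of `SAMPLE`.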

How would you like to proceed on this issue? Since kSimplifiedVariant and kTraditionalVariant are provisional fields, we could work through files already here in this repository, once updated to the current Unicode version. For a mass update like this, however, it might be simpler for you to update the official copy of the database directly, then regenerate files here for a final check.

Let me know.

kenlunde · 2024-10-05T19:45:49Z

These additions to the Unihan database were introduced in Unicode Version 15.0 (2022), and the late John Jenkins was maintaining these properties. I am unable to find a document for these additions, and because the properties are provisional, no document was technically necessary.

These cases fall under the complex simplified/traditional relationship documented in Section 3.7.1 of UAX #38, specifically the first sub-bullet of the fourth bullet of the first set of numbered bullets:

X may be mapped to itself or to another ideograph when converting between SC and TC. In this case, the ideograph is its own simplification as well as the simplification for other ideographs. An example would be U+540E 后, which is the simplification for itself and for U+5F8C 後. When mapping TC to SC, it is left alone, but when mapping SC to TC it may or may not be changed, depending on context. In this case, both kTraditionalVariant and kSimplifiedVariant properties are defined and X is included among the values for both.
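
The mapping behaviour in that passage can be sketched roughly as follows. This is only an illustration under stated assumptions: the two dictionaries hold just the U+540E 后 / U+5F8C 後 example from the quoted text, and the function names are hypothetical, not part of any Unicode tooling.

```python
# Values as described in the quoted passage: U+540E is listed among its own
# simplified and traditional variants.
K_SIMPLIFIED = {"U+540E": ["U+540E"], "U+5F8C": ["U+540E"]}
K_TRADITIONAL = {"U+540E": ["U+540E", "U+5F8C"]}


def tc_to_sc_candidates(code_point):
    """Candidate targets when converting traditional to simplified."""
    # A character with no entry is left unchanged.
    return K_SIMPLIFIED.get(code_point, [code_point])


def sc_to_tc_candidates(code_point):
    """Candidate targets when converting simplified to traditional."""
    return K_TRADITIONAL.get(code_point, [code_point])


def needs_context(candidates):
    """More than one candidate means the choice depends on context."""
    return len(candidates) > 1


tc_result = tc_to_sc_candidates("U+540E")  # left alone
sc_result = sc_to_tc_candidates("U+540E")  # may or may not change
print(tc_result, sc_result, needs_context(sc_result))
```

Because U+540E appears among its own traditional variants, the second lookup yields two candidates, which is the context-dependent case the passage describes.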

The only solid paper trail that I could find was document L2/22-255 that is in response to the last four sections of document L2/22-226. The following from the top of page 2 seems to be key:

As is explained in UAX #38, a character may be listed as a simplified or traditional variant of itself. This is to satisfy the requirement that the variant fields define symmetric relationships. Should the UTC decide that the simplified-traditional variant relationships need not be symmetric, they could then be dropped.

I also found some feedback in PRI #433 (Unicode 14.0.0 Beta), and I suspect that the additions for Unicode Version 15.0 may have been in response to that.

paulmasson · 2024-10-10T00:09:14Z

@kenlunde after some thought, here is my take on the description of these two database fields for the first set of numbered bullet points:

I have no problem whatsoever with cases (1), (2), (3) and (4.2). The first three are exactly how I would expect the database to work. I hadn't run across any instances of (4.2), but the logic is sound. I do have a problem with part of (4.1).

When a simplified character can represent two or more traditional characters, then that is important information and needs to be in the database. In that context it is logical to label a character its own traditional variant, as long as there is at least one more traditional variant given. Preferably there should be two entries in kDefinition to clarify that the character is used in two senses, unless the traditional variants are basically interchangeable.

On the other hand, calling a character its own simplified variant doesn't make much sense to me, since there is no simplification involved. That part of (4.1) should work the same way as case (1) and leave kSimplifiedVariant empty. If the character is listed as one of its own traditional variants, that already implies the character is used in both simplified and traditional environments. Calling it its own simplified variant adds no more information: that is implicit from the data in the kTraditionalVariant field.

Removing the entries labeling a character its own simplified variant would impact the stakeholder in the feedback you mentioned, but only temporarily, since it would be easy to code around such a change.
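
As a rough illustration of what coding around the change might look like, the sketch below derives the self-simplification from kTraditionalVariant when it is no longer stored explicitly. The data is a hypothetical post-removal state for the U+540E 后 / U+5F8C 後 example only, and `effective_simplified` is an invented helper name, not an existing API.

```python
# Hypothetical data after the proposed removal: U+540E no longer lists itself
# under kSimplifiedVariant, but still lists itself under kTraditionalVariant.
K_SIMPLIFIED = {"U+5F8C": ["U+540E"]}
K_TRADITIONAL = {"U+540E": ["U+540E", "U+5F8C"]}


def effective_simplified(code_point, simplified, traditional):
    """Return the simplified variants, restoring the implied self entry."""
    values = list(simplified.get(code_point, []))
    # A character listed as its own traditional variant is also used as a
    # simplified form, so it is implicitly its own simplification.
    if code_point in traditional.get(code_point, []) and code_point not in values:
        values.insert(0, code_point)
    return values


hou_simplified = effective_simplified("U+540E", K_SIMPLIFIED, K_TRADITIONAL)
hou_traditional_form = effective_simplified("U+5F8C", K_SIMPLIFIED, K_TRADITIONAL)
print(hou_simplified, hou_traditional_form)
```

A consumer that relied on the explicit self entries could call a helper like this in place of a direct kSimplifiedVariant lookup.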

kenlunde · 2024-10-11T12:07:56Z

Cases (1), (2), (3), (4.1), and (4.2) date back to UAX #38 Revision 11 for Unicode Version 6.1 (2012). For the upcoming UTC meeting, the only action that I am comfortable with is to research this issue further and to work out a solution with the stakeholders.
