Characters described as own simplified and traditional variant #408

Open
paulmasson opened this issue Oct 5, 2024 · 3 comments

Comments

@paulmasson
Contributor

The Unihan database currently contains 431 instances of characters described as their own variants. This is logically inconsistent. The correct traditional variants should of course remain, but the logically incorrect entries need to be removed.

I have previously reported four of these instances - U+575B 坛, U+5978 奸, U+6784 构 and U+9759 静 - through the official channel. Rather than doing this more than four hundred more times, I instead generated a complete list of all the instances, which is attached:

simplified.txt
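For anyone who wants to reproduce a list like this, here is a minimal sketch of the check. It assumes input in the tab-separated layout of Unihan_Variants.txt (code point, field name, space-separated values); the function name `find_self_variants` and the sample lines are illustrative only, not the script used to produce the attachment.

```python
from collections import defaultdict

# The two provisional fields discussed in this issue.
FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")


def find_self_variants(lines):
    """Return {field: [code points that list themselves as a value]}."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        code_point, field, value = parts
        # Values are space-separated code points such as "U+540E U+5F8C".
        if field in FIELDS and code_point in value.split():
            hits[field].append(code_point)
    return dict(hits)


# Illustrative sample lines (assumed layout), based on the U+540E example.
SAMPLE = [
    "# sample data for illustration only",
    "U+540E\tkSimplifiedVariant\tU+540E",
    "U+540E\tkTraditionalVariant\tU+540E U+5F8C",
    "U+5F8C\tkSimplifiedVariant\tU+540E",
]

result = find_self_variants(SAMPLE)
```

To run it against a real copy of the data, pass an open file handle instead of `SAMPLE`, for example `find_self_variants(open(path, encoding="utf-8"))` with `path` pointing at a local copy of the variants file.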

How would you like to proceed on this issue? Since kSimplifiedVariant and kTraditionalVariant are provisional fields, we could work through files already here in this repository, once updated to the current Unicode version. For a mass update like this, however, it might be simpler for you to update the official copy of the database directly, then regenerate files here for a final check.

Let me know.

@kenlunde
Member

kenlunde commented Oct 5, 2024

These additions to the Unihan database were introduced in Unicode Version 15.0 (2022), and the late John Jenkins was maintaining these properties. I am unable to find a document for these additions, and because the properties are provisional, no document was technically necessary.

These cases fall under the complex simplified/traditional relationships documented in Section 3.7.1 of UAX #38, specifically the first sub-bullet of the fourth item in the first numbered list:

X may be mapped to itself or to another ideograph when converting between SC and TC. In this case, the ideograph is its own simplification as well as the simplification for other ideographs. An example would be U+540E 后, which is the simplification for itself and for U+5F8C 後. When mapping TC to SC, it is left alone, but when mapping SC to TC it may or may not be changed, depending on context. In this case, both kTraditionalVariant and kSimplifiedVariant properties are defined and X is included among the values for both.
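As a rough illustration of the behavior that passage describes, the sketch below models the two properties as small dictionaries for the U+540E example. The dictionary contents and function names are assumptions made for illustration; this is not how any particular converter is implemented.

```python
# Hypothetical in-memory view of the two properties for the example above.
K_TRADITIONAL = {"后": ["后", "後"]}            # kTraditionalVariant
K_SIMPLIFIED = {"后": ["后"], "後": ["后"]}      # kSimplifiedVariant


def simplified_candidates(ch):
    """TC to SC: characters without an entry are left alone."""
    return K_SIMPLIFIED.get(ch, [ch])


def traditional_candidates(ch):
    """SC to TC: more than one candidate means context must decide."""
    return K_TRADITIONAL.get(ch, [ch])


def needs_context(ch):
    """True when SC to TC conversion cannot be decided per character."""
    return len(traditional_candidates(ch)) > 1
```

Here `simplified_candidates("后")` yields only 后 itself, while `traditional_candidates("后")` yields both 后 and 後, which is the "may or may not be changed, depending on context" case.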

The only solid paper trail that I could find is document L2/22-255, which responds to the last four sections of document L2/22-226. The following passage from the top of page 2 seems to be key:

As is explained in UAX #38, a character may be listed as a simplified or traditional variant of itself. This is to satisfy the requirement that the variant fields define symmetric relationships. Should the UTC decide that the simplified-traditional variant relationships need not be symmetric, they could then be dropped.
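One way to read the symmetry requirement, sketched under the assumption that "symmetric" means every kTraditionalVariant link must have a matching kSimplifiedVariant link in the other direction and vice versa. The helper name `asymmetric_pairs` and the data are hypothetical.

```python
def asymmetric_pairs(simplified, traditional):
    """List (code point, field, missing value) entries that break symmetry."""
    missing = []
    # Every traditional variant t of s should list s as a simplified variant.
    for s, trads in traditional.items():
        for t in trads:
            if s not in simplified.get(t, []):
                missing.append((t, "kSimplifiedVariant", s))
    # Every simplified variant s of t should list t as a traditional variant.
    for t, simps in simplified.items():
        for s in simps:
            if t not in traditional.get(s, []):
                missing.append((s, "kTraditionalVariant", t))
    return missing


traditional = {"U+540E": ["U+540E", "U+5F8C"]}

# With the self-referential kSimplifiedVariant entry present.
with_self = {"U+540E": ["U+540E"], "U+5F8C": ["U+540E"]}

# With that entry removed.
without_self = {"U+5F8C": ["U+540E"]}

complete = asymmetric_pairs(with_self, traditional)
broken = asymmetric_pairs(without_self, traditional)
```

Under this reading, dropping the self-referential kSimplifiedVariant value for U+540E while keeping U+540E among its own kTraditionalVariant values produces exactly one asymmetric link, which appears to be what the self-listing exists to avoid.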

I also found some feedback in PRI #433 (Unicode 14.0.0 Beta), and I suspect that the additions for Unicode Version 15.0 may have been in response to that.

@paulmasson
Contributor Author

@kenlunde after some thought, here is my take on the description of these two database fields for the first set of numbered bullet points:

I have no problem whatsoever with cases (1), (2), (3) and (4.2). The first three are exactly how I would expect the database to work. I hadn't run across any instances of (4.2), but the logic is sound. I do have a problem with part of (4.1).

When a simplified character can represent two or more traditional characters, that is important information and needs to be in the database. In that context it is logical to label a character its own traditional variant, as long as at least one other traditional variant is given. Preferably there should be two entries in kDefinition to clarify that the character is used in two senses, unless the traditional variants are basically interchangeable.

On the other hand, calling a character its own simplified variant doesn't make much sense to me, since there is no simplification involved. That part of (4.1) should work the same way as case (1) and leave kSimplifiedVariant empty. If the character is listed as one of its own traditional variants, that already implies the character is used in both simplified and traditional environments. Calling it its own simplified variant adds no more information: that is implicit from the data in the kTraditionalVariant field.

Removing the entries labeling a character its own simplified variant would impact the stakeholder in the feedback you mentioned, but only temporarily, since it would be easy to code around such a change.
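A minimal sketch of what coding around the change might look like, assuming a consumer holds both fields as dictionaries keyed by code point. The function name `effective_simplified` is hypothetical; the idea is to infer the self-mapping from kTraditionalVariant when kSimplifiedVariant no longer states it.

```python
def effective_simplified(code_point, simplified, traditional):
    """Return simplified variants, inferring the implicit self-mapping.

    If a character lists itself in kTraditionalVariant, it is used in both
    simplified and traditional text, so it is treated as its own simplified
    form even when kSimplifiedVariant does not say so.
    """
    variants = list(simplified.get(code_point, []))
    lists_itself = code_point in traditional.get(code_point, [])
    if lists_itself and code_point not in variants:
        variants.insert(0, code_point)
    return variants


# Data as it might look after the proposed removal (illustrative only).
traditional = {"U+540E": ["U+540E", "U+5F8C"]}
simplified = {"U+5F8C": ["U+540E"]}

own = effective_simplified("U+540E", simplified, traditional)
other = effective_simplified("U+5F8C", simplified, traditional)
```

With this helper, a consumer sees the same simplified values for U+540E and U+5F8C as before the removal, so downstream conversion logic would not need to change.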

@kenlunde
Member

Cases (1), (2), (3), (4.1), and (4.2) date back to UAX #38 Revision 11 for Unicode Version 6.1 (2012). For the upcoming UTC meeting, the only action that I am comfortable with is to research this issue further and work out a solution with the stakeholders.
