unicode normalization #1047

Open
ThomasWaldmann opened this issue Jul 14, 2020 · 1 comment

Comments

@ThomasWaldmann
Member

ThomasWaldmann commented Jul 14, 2020

we usually have utf-8 content in text items (including links and transclusions by-name) and we usually also use utf-8 encoding for item names.

unicode content and names should get normalized before they are encoded to utf-8 and stored, otherwise stuff can get inconsistent (especially if people use apple devices).

i recently noticed this there:

while moin2 is not released and there is no "production" content in it, we can still prevent stored content from getting inconsistent.

i'd suggest we always normalize unicode text (names, text item content) to NFC form before storing it into the backend.

that way we avoid the case where an NFD link to an NFC-named item looks correct, but does not work.
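
a minimal sketch of that failure mode, assuming a plain dict as a stand-in for the item store (the dict and the example name are hypothetical, not moin2 code):

```python
import unicodedata

# Same visible name, two different code point sequences.
nfc_name = "M\u00e4rz"    # composed: U+00E4
nfd_name = "Ma\u0308rz"   # decomposed: 'a' + U+0308 COMBINING DIAERESIS

# Hypothetical item store keyed by (NFC) item name.
items = {nfc_name: "item content"}

print(nfc_name == nfd_name)   # False
print(nfd_name in items)      # False: the NFD link misses the NFC-named item

# Normalizing the link target before the lookup makes it resolve.
print(unicodedata.normalize("NFC", nfd_name) in items)   # True
```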

any text that comes from the user must go through that normalization (e.g. when entering stuff in form fields).
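
roughly, the "normalize, then encode, then store" order could look like this. `normalize_text` and `encode_for_storage` are hypothetical helper names for illustration, not existing moin2 functions:

```python
import unicodedata


def normalize_text(text):
    """Return text in NFC form (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


def encode_for_storage(text):
    """Normalize first, then encode to utf-8."""
    return normalize_text(text).encode("utf-8")


# e.g. a name typed into a form field that arrived in decomposed form
user_input = "Cafe\u0301"
print(encode_for_storage(user_input))   # b'Caf\xc3\xa9'
```

doing this in one place that every user-supplied string passes through would keep names, links and content consistent with each other.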

this is not a problem in English, because everything is plain ascii, but it is for a lot of other languages, like german, french, spanish, ...

For example, take the german a-umlaut (both print outputs look the same in a terminal, but not on github):

# NFC normalization (composed):
>>> print(b"\xc3\xa4".decode('utf8'))
ä
# NFD normalization (decomposed):
>>> print(b"a\xcc\x88".decode('utf8'))
ä
@ThomasWaldmann
Member Author

of course the importer from moin-1.9 also needs to normalize (page content, page names, attachment names).

about attachment content: i guess if it ends up being a text/*;charset=utf-8 item, we should also normalize the content to NFC form.

usually this should not change the encoded bytes, as NFC is the usual form; only apple does it differently when it comes to filesystem names.
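
a small sketch of that expectation, assuming the importer sees names as utf-8 bytes (`nfc_bytes` and the file names are made up for illustration):

```python
import unicodedata


def nfc_bytes(raw):
    """Decode utf-8 bytes, normalize to NFC, re-encode (hypothetical importer step)."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


already_nfc = "B\u00e4r.txt".encode("utf-8")   # composed, the usual case
decomposed = "Ba\u0308r.txt".encode("utf-8")   # decomposed, as an apple filesystem might hand it over

print(nfc_bytes(already_nfc) == already_nfc)   # True: bytes unchanged
print(nfc_bytes(decomposed) == decomposed)     # False: rewritten to composed form
print(nfc_bytes(decomposed) == already_nfc)    # True: both end up identical
```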
