Language localization #270

Open
wipfli opened this issue Jul 8, 2024 · 14 comments

@wipfli
Sponsor Contributor

wipfli commented Jul 8, 2024

Currently, the basemap does not have any language localization capabilities. Country, state, and place names are taken from the OSM name tag and contain information in the local language or languages. For example, the country label of Germany is "Deutschland" whereas the country label for Italy is "Italia". In this Issue I would like to propose a scheme for displaying names localized to a specific user language which should make the basemap more accessible to a wider audience.

Assumptions

Let us make the following assumptions:

  • A typical user speaks and reads primarily one language, their first language.
  • A typical user expects to see map labels in their first language.
  • We have a database with labels where
    • name contains the local name(s) using a single script.
    • name:<language-code> contains the name in a specific language.
    • For each <language-code> we know the script(s).
    • There is a defined list of supported <language-code>s.

Definitions

  • Language localization: display labels in the first or preferred language of a user.
  • Language fallback chain: If a label is not available in the target language, try another language which is similar to the first language. All languages in the fallback chain use the same script.
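
The single-script constraint on fallback chains can be checked mechanically. Below is a minimal Python sketch, assuming a `LANGUAGE_SCRIPT` table derived from the script grouping in this issue and a `FALLBACK_CHAINS` table whose entries are hypothetical (the issue does not define concrete chains):

```python
from typing import Mapping, Sequence

# Script per language code, following the grouping proposed in this issue.
LANGUAGE_SCRIPT: dict[str, str] = {
    "nn": "Latin", "no": "Latin", "da": "Latin",
    "sr": "Cyrillic", "sr-Latn": "Latin", "hr": "Latin",
}

# Hypothetical chains for illustration only; the issue does not define them.
FALLBACK_CHAINS: dict[str, list[str]] = {
    "nn": ["nn", "no", "da"],
    "sr-Latn": ["sr-Latn", "hr"],
}


def mixed_script_chains(
    chains: Mapping[str, Sequence[str]], scripts: Mapping[str, str]
) -> list[str]:
    """Return the target languages whose chain mixes more than one script."""
    return [
        target
        for target, chain in chains.items()
        if len({scripts[code] for code in chain}) > 1
    ]
```

An empty result means every chain satisfies the definition; a chain such as `["sr", "sr-Latn"]` would be flagged because it mixes Cyrillic and Latin.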

Proposed Supported Languages

Below is a proposed list of roughly 80 supported languages. The languages are grouped by script and some languages may use more than one script. Note that for some scripts such as Telugu or Khmer we need to create a positioned glyph font first.

The structure is:

Language: <language-code>, number of nodes/ways/relations in name:<language-code> in OSM's taginfo

Latin

  • AFRIKAANS: af, 10k
  • ALBANIAN: sq, 10k
  • AZERBAIJANI: az, 10k
  • AZERBAIJANI (Arabic script): az-Arab, 2k
  • BASQUE: eu, 70k
  • BOSNIAN: bs, 6k
  • CATALAN: ca, 600k
  • CROATIAN: hr, 20k
  • CZECH: cs, 50k
  • DANISH: da, 10k
  • DUTCH: nl, 80k
  • ENGLISH: en, 6M
  • ESTONIAN: et, 10k
  • FINNISH: fi, 400k
  • FILIPINO: fil, 600
  • FRENCH: fr, 600k
  • GALICIAN: gl, 10k
  • GERMAN: de, 500k
  • HUNGARIAN: hu, 60k
  • ICELANDIC: is, 3k
  • INDONESIAN: id, 10k
  • ITALIAN: it, 100k
  • LATVIAN: lv, 10k
  • LITHUANIAN: lt, 50k
  • MALAY (Latin script): ms, 70k
  • MALAY (Arabic script): ms-Arab, 3k
  • NORWEGIAN: no, 10k
  • NORWEGIAN NYNORSK: nn, 4k
  • POLISH: pl, 300k
  • PORTUGUESE: pt, 50k
  • ROMANIAN: ro, 50k
  • SLOVAK: sk, 20k
  • SLOVENIAN: sl, 10k
  • SPANISH: es, 100k
  • SWAHILI: sw, 20k
  • SWEDISH: sv, 100k
  • TURKISH: tr, 30k
  • UZBEK: uz, 10k
  • UZBEK (Latin script): uz-Latn, 1k
  • UZBEK (Cyrillic script): uz-Cyrl, 1k
  • UZBEK (Arabic script): uz-Arab, 900
  • VIETNAMESE: vi, 30k
  • ZULU: zu, 1k

Arabic

  • ARABIC: ar, 1M
  • FARSI: fa, 50k
  • URDU: ur, 80k

Cyrillic

  • BELARUSIAN: be, 400k
  • BULGARIAN: bg, 30k
  • KAZAKH: kk, 40k
  • KAZAKH (Latin script): kk-Latn, 1k
  • KAZAKH (Arabic script): kk-Arab, 8k
  • KAZAKH (Cyrillic script): kk-Cyrl, 1k
  • KYRGYZ: ky, 5k
  • MACEDONIAN: mk, 40k
  • RUSSIAN: ru, 1M
  • SERBIAN (Cyrillic script): sr, 300k
  • SERBIAN (Latin script): sr-Latn, 200k
  • UKRAINIAN: uk, 1M

Han

  • CHINESE: zh, 1M
  • CHINESE (SIMPLIFIED): zh-Hans, 100k
  • CHINESE (TRADITIONAL): zh-Hant, 300k

Devanagari

  • HINDI: hi, 60k
  • MARATHI: mr, 10k
  • NEPALI: ne, 10k

One Language Per Script

  • GUJARATI: gu, 4k
  • AMHARIC: am, 8k
  • ARMENIAN: hy, 40k
  • KOREAN: ko, 700k
  • KOREAN (Latin script): ko-Latn, 100k
  • KOREAN (Hanja script): ko-Hani, 50k
  • JAPANESE: ja, 1M
  • JAPANESE (Hiragana script): ja-Hira and ja_kana, 200k
  • JAPANESE (Latin script): ja_rm and ja-Latn, 100k
  • GEORGIAN: ka, 60k
  • GREEK: el, 100k
  • MONGOLIAN: mn, 10k
  • MONGOLIAN (Traditional script): mn-Mong, 1k
  • MONGOLIAN (Cyrillic script): mn-Cyrl, 1k
  • HEBREW: he, 100k
  • KANNADA: kn, 90k
  • BENGALI: bn, 10k
  • BURMESE: my, 40k
  • KHMER: km, 8k
  • LAO: lo, 3k
  • MALAYALAM: ml, 30k
  • PUNJABI: pa, 30k
  • SINHALESE: si, 2k
  • TAMIL: ta, 20k
  • TELUGU: te, 20k
  • THAI: th, 100k

Proposed Rules

  1. If the target language is not available, follow a language fallback chain. End in the name tag only if the script of the target language and the script of the name tag are the same.
  2. Display country labels only in the target language.
  3. Display state labels only in the target language.
  4. Display place labels in one or two lines.
    1. One line: The target language uses the same script as the name tag. In this case only show the label in the target language in a single line label.
    2. Two lines: The target language uses a different script than the name tag. In this case show two lines. First the target language, second the name.
  5. Street labels follow the same logic as place labels.
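
The rules above can be sketched in Python roughly as follows. This is only an illustration: the function names, the `kind` values, and the way scripts are passed in are assumptions, and the behaviour when no localized name exists and the scripts differ is not settled by the rules (here a place label then shows only the name, and a country or state label shows nothing).

```python
from typing import Mapping, Optional, Sequence


def first_available(tags: Mapping[str, str], chain: Sequence[str]) -> Optional[str]:
    """Return the first non-empty name:<code> value along the fallback chain."""
    for code in chain:
        value = tags.get(f"name:{code}")
        if value:
            return value
    return None


def label_lines(
    kind: str,
    tags: Mapping[str, str],
    chain: Sequence[str],
    target_script: str,
    name_script: str,
) -> list[str]:
    """Return the label as a list of lines, top line first."""
    localized = first_available(tags, chain)
    name = tags.get("name")
    same_script = target_script == name_script
    # Rule 1: end in the name tag only if the scripts match.
    if localized is None and same_script:
        localized = name
    # Rules 2 and 3: country and state labels are never stacked.
    # Rule 4.1: same script means a single line.
    if kind in ("country", "state") or same_script:
        return [localized] if localized else []
    # Rule 4.2 (and rule 5 for streets): target language first, then name.
    return [line for line in (localized, name) if line]
```

For example, `label_lines("place", {"name": "Αθήνα", "name:en": "Athens"}, ["en"], "Latin", "Greek")` yields the two lines "Athens" and "Αθήνα".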

Examples

Localized to English

Country example 1:

Switzerland

City example 1:

Geneva

Country example 2:

Greece

City example 2:

Athens
Αθήνα

Localized to Greek

Country example 1:

Ελβετία

City example 1:

Γενεύη
Genève

Country example 2:

Ελλάδα

City example 2:

Αθήνα
@bdon
Member

bdon commented Jul 11, 2024

These all look like good assumptions.

There may be some complication in name:<language-code> with zh-Hans and zh-Hant. I believe Tilezen had some special logic for this related to one or the other missing, to fill in the zh slot. It seems out of scope to perform any automated conversion between them. @nvkelso any lessons learned from Tilezen here?

@wipfli
Sponsor Contributor Author

wipfli commented Jul 11, 2024

Chinese is actually an interesting case, because there are quite a lot of entries in OSM:

@wipfli
Sponsor Contributor Author

wipfli commented Jul 11, 2024

Here is a list of all OSM name tag values that use more than one script: LINK (15 MB). It has something like 500k entries.

@wipfli
Sponsor Contributor Author

wipfli commented Jul 12, 2024

Update: The assumption that we have a database where the name tag always contains only one script is wrong. For example, names in Morocco often come in 3 scripts: Latin, Arabic, and Tifinagh.

@wipfli
Sponsor Contributor Author

wipfli commented Jul 12, 2024

@bdon
Member

bdon commented Jul 13, 2024

> Update: The assumption that we have a database where the name tag always contains only one script is wrong. For example, names in Morocco often come in 3 scripts: Latin, Arabic, and Tifinagh.

This should also be the case in Hong Kong and a few other places due to mapping conventions.

In these situations we could ignore name completely because the parse is unreliable (unless it very consistently breaks on / etc?)

If I look at Hong Kong in English -> the 2nd label should be Chinese
If I look at Hong Kong in Chinese -> the 2nd label should be English
What happens if I look at it in Russian, though? Should it show 3 labels?

@wipfli
Sponsor Contributor Author

wipfli commented Jul 13, 2024

Hong Kong is an interesting example because there are two languages (English, Chinese) and two scripts (Latin, Han).

Let us assume for a moment that we have a database where up to two local names of a city can be stored separately:

name_1 = Hong Kong
name_2 = 香港

The ordering of the names has a meaning, maybe number of people speaking the language or administrative/cultural use.

If we had this dataset of listed names, we could do the following rule:

  • Display place labels in one, two, or three lines.
    • One line: The target language uses the same script as the name_1 and name_2 tags. In this case, show only the target-language label on a single line.
    • Two lines: The target language uses a different script than the name_1 or the name_2 tag. In this case show two lines: first the target language, then name_1 if its script is different, else name_2.
    • Three lines: The target language uses a different script than the name_1 and the name_2 tag, and name_1 and name_2 use different scripts. In this case show three lines: first the target language, then name_1, then name_2.

With this rule we would get the following for Hong Kong:

English:

Hong Kong
香港

Chinese:

香港
Hong Kong

Russian:

Гонконг
Hong Kong
香港
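
The one/two/three-line rule can be written compactly as "show the target language first, then each local name whose script has not been shown yet". A minimal Python sketch, assuming the ordered local names arrive as hypothetical `(text, script)` pairs and the script of the target language is known:

```python
from typing import Sequence


def stacked_label(
    localized: str,
    target_script: str,
    local_names: Sequence[tuple[str, str]],
) -> list[str]:
    """Return label lines: target language first, then unseen-script local names."""
    lines = [localized]
    shown_scripts = {target_script}
    # local_names is ordered (name_1, name_2, ...); the order is preserved.
    for text, script in local_names:
        if script not in shown_scripts:
            lines.append(text)
            shown_scripts.add(script)
    return lines


# name_1 and name_2 for Hong Kong, as in the example above.
HONG_KONG = [("Hong Kong", "Latin"), ("香港", "Han")]
```

With this sketch, `stacked_label("Гонконг", "Cyrillic", HONG_KONG)` gives the three Russian lines shown above, and the English and Chinese cases give two lines each.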

@nvkelso
Collaborator

nvkelso commented Jul 15, 2024

Special cases for country and region (state) labels:

Display country labels only in the target language
Display state labels only in the target language

Do you mean to say the country and state labels would not be "stacked" by default, and the value of the "single line" label would follow the fallback chain in rule 1, like (modified):

If the target language is not available, follow a language fallback chain. End in the name tag only if the localized values in fallback chain are unavailable. (Possible variation to only take the 1st element in / separated multi-value?)

Or do you mean to say no name would be displayed at all if the localization data is unavailable?

Street labels:

For street labels, are you proposing the labels be stacked or delimited (concatenated)?

Multiple alphabet option languages:

Might be worth adding more details around the languages which can be represented in multiple alphabets. Does each writing system belong in a different fallback chain?

Chinese

For Chinese, Tilezen v1.9 added basic detection of Chinese simplified versus traditional for each name, and localized name key-value pairs and modified the tile output to better annotate that. The raw input data in OSM is sometimes quite messy.

Overall:

On the display side, our recommendations and best practices match and are exceeded by what @wipfli is already suggesting here, nice! The demo matches my expectations :)

@wipfli
Sponsor Contributor Author

wipfli commented Jul 19, 2024

Thanks for your questions @nvkelso.

Country labels should appear only in the target language. For example, if the target language is French, then the country labels should be "Allemagne", "Suisse", or "Autriche". So no stacking is needed here. In my experience, country labels have great language coverage, so we should not need a fallback chain at all. My preference would be to define a set of supported languages and then make sure that for each supported language we have a name:<language-code> for the country labels.

State labels are currently used too much in the Protomaps basemap in my opinion. In the US it might make sense to have them on the map because the country is huge and most states are huge too, but in smaller countries the state labels are not needed. I will open a separate issue at some point to propose to only have state/province labels for these countries: US, Canada, Mexico, Brazil, China, India, Australia. For now I propose to ignore the problem of state labels and treat them like city labels or like country labels, we can see which one works better.

Street labels I honestly have not thought much about yet.

Some languages, such as Kazakh or Uzbek, use multiple scripts. However, OSM coverage for the different scripts is very limited (only around 1k names), so I don't see the value at the moment of adding special logic for these languages. Japanese might have larger coverage and also uses multiple scripts, but so far my impression is that the sample code works quite well in Japan. I want to reach out to some Japanese friends and ask them for input.

@nvkelso
Collaborator

nvkelso commented Jul 22, 2024

@wipfli For the tiles, is any more work needed besides what was already merged in #254 (already in a tagged release)? It seems like the remaining work in this issue is mostly about display business logic, is that right?

@wipfli
Sponsor Contributor Author

wipfli commented Jul 23, 2024

Thanks for asking! With the current tiles we have information in the pmap:script tag about the script used in the name tag. Here are some examples:

  • Athens:
    • name = Αθήνα
    • pmap:script = Greek
  • Zürich:
    • name = Zürich
    • pmap:script is absent and that implies the script is Latin
  • Hong Kong:
    • name = Hong Kong 香港
    • pmap:script = Mixed

So you see that in the case of Zürich and Athens, where only one script is used in the name tag, we can build everything we want with MapLibre style expressions.
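
As an illustration of the single-script case, here is a Python sketch that builds a MapLibre `text-field` expression for a Latin-script target language and serializes it to JSON. The function name is hypothetical and this is not the basemap's actual style. It uses only the standard `case`, `has`, `get`, `coalesce`, `concat`, `all`, `!`, and `!=` expression operators, and it leaves pmap:script = Mixed features showing the plain name value.

```python
import json


def text_field_for_latin_target(code: str) -> list:
    """Build a MapLibre text-field expression for a Latin-script target language."""
    localized = ["get", f"name:{code}"]
    name = ["get", "name"]
    return [
        "case",
        # pmap:script absent: the name is Latin, so one line is enough.
        ["!", ["has", "pmap:script"]],
        ["coalesce", localized, name],
        # Single non-Latin script with a localized name: stack two lines.
        ["all", ["!=", ["get", "pmap:script"], "Mixed"], ["has", f"name:{code}"]],
        ["concat", localized, "\n", name],
        # Mixed script, or no localized name available: show the name as-is.
        name,
    ]


if __name__ == "__main__":
    print(json.dumps(text_field_for_latin_target("en"), ensure_ascii=False, indent=2))
```

The printed JSON could be pasted into the `text-field` layout property of a symbol layer. A target language with a non-Latin script would need the first condition to compare pmap:script against that script instead.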

However, if the name tag contains more than one script, as in Hong Kong, then we are in a somewhat tricky situation.

One option when we have pmap:script = Mixed is to display only the name in the target language. For example, if the target language is English we just show name:en. It would look like this:

Hong Kong

Another option when we have pmap:script = Mixed is to display the target language and the name value. However, it can then happen that the name is duplicated in a label which looks bad. For example {name:en}\n{name} would look like this:

Hong Kong
Hong Kong 香港

Yet another option when we have pmap:script = Mixed would be to ignore the target language altogether and just show the name value. This is what the basemap currently does. For Hong Kong, it would look like this:

Hong Kong 香港

Tiles modification

Question to @nvkelso and @bdon: Do you think any of the above 3 options is good enough for now?

If yes, I can start implementing the frontend styles.

If no, I suggest we do a bit more thinking around how mixed-script name tags can be broken up at tile generation time.

I am leaning towards the second, i.e., breaking up Hong Kong 香港 to Hong Kong and 香港 when we generate the tiles because the map just looks so much better. What do you think?

@wipfli
Sponsor Contributor Author

wipfli commented Jul 23, 2024

I did some Java prototyping for splitting the name tag into segments with different scripts.

Here is the result (1.6 MB): https://github.com/wipfli/multi-script-names/blob/main/list.txt

Overall I am quite happy with this segmentation. The data is place=* from OSM with name containing more than one script, e.g., Latin and Greek would be included, but a Latin-only name would not be included.

We have to deal with some typos coming from confusion between similar looking letters in Cyrillic, Latin, and Greek. Also, sometimes Latin letters are used for numbering purposes so there we should not segment.
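
For readers who want to experiment, the basic idea of script segmentation can be approximated in Python. This is not the Java prototype linked above; it is a rough sketch that guesses the script from the first word of each character's Unicode name (Python's standard library has no script property), and it handles neither the look-alike typos nor the Latin numbering cases mentioned above. It would also split Japanese names that mix kana and kanji.

```python
import unicodedata
from typing import Optional


def char_script(ch: str) -> Optional[str]:
    """Approximate the script of a letter; None for spaces, digits, punctuation."""
    if not ch.isalpha():
        return None
    first_word = unicodedata.name(ch, "").split(" ")[0]
    if first_word == "CJK":
        return "HAN"
    return first_word or None


def split_by_script(name: str) -> list[tuple[str, str]]:
    """Split a name into (script, text) segments at script changes."""
    segments: list[list[str]] = []
    pending = ""  # neutral characters seen before the first letter
    for ch in name:
        script = char_script(ch)
        if script is None:
            # Neutral characters stay with the segment they follow.
            if segments:
                segments[-1][1] += ch
            else:
                pending += ch
        elif segments and segments[-1][0] == script:
            segments[-1][1] += ch
        else:
            segments.append([script, pending + ch])
            pending = ""
    # Trim common separators left at the segment edges.
    return [(script, text.strip(" /-()")) for script, text in segments]
```

For example, `split_by_script("Hong Kong 香港")` returns one LATIN segment "Hong Kong" and one HAN segment "香港", while a single-script name such as "Zürich" stays in one segment.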

Some numbers:

  • number of segments=count
  • 2=13620
  • 3=2222
  • 4=1

@wipfli
Sponsor Contributor Author

wipfli commented Jul 23, 2024

Regarding the entries that use 3 scripts, we have

  • TIFINAGH: 2029
  • MONGOLIAN: 98
  • ETHIOPIC: 84

Now, do we want to support the languages that use these scripts? If we don't, we can get away with 2 segments for the name tag; otherwise we will need 3.

@wipfli
Sponsor Contributor Author

wipfli commented Jul 25, 2024

I made some tiles with segmented name tags. Here is a demo using MapLibre GL JS v4.5.0 with a style localized to Arabic:

Morocco

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=8.87/33.7469/-7.1911

image

Note how Arabic is in the top line because it is localized to Arabic. In OSM, I think Arabic is mostly the last entry in Morocco.

Hong Kong

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=9.22/22.3113/114.2289

image

Athens

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=10.35/37.9577/23.7035

image

Note how it falls back to name:en if no Arabic name is available.

Cairo

https://pub-cf7f11e26ace447db8f7215b61ac0eae.r2.dev/segment.html#map=10.4/30.0417/31.2211

image

If the name is Arabic, only show the name.
