Proposal: global configuration option to select stored languages #867

Open
orangejulius opened this issue May 19, 2020 · 7 comments · May be fixed by pelias/model#137

Comments

@orangejulius
Member

We currently import names for records in many different languages (mostly from OSM and WOF). For example, the document we end up storing in Elasticsearch for the Who's on First San Francisco record has the following names:

{
  "default": [
    "San Francisco",
    "S Francisco",
    "S. Francisco",
    "SFO",
    "Sanfran",
    "Sanfrancisco",
    "Frisco"
  ],
  "am": "ሳን ፍራንሲስኰ",
  "ar": "سان فرانسيسكو",
  "az": "San-Fransisko",
  "ba": "Сан-Франциско",
  "be": [
    "Горад Сан-Францыска",
    "Сан-Францыска"
  ],
  "bn": "সান ফ্রান্সিস্কো",
  "bo": "སན་ཧྥུ་རན་སིས་ཁོ",
  "bg": "Сан Франциско",
  "ce": "Сан-Франциско",
  "zh": [
    "舊 金山",
    "旧金山"
  ],
  "cv": "Сан-Франциско",
  "el": "Σαν Φρανσίσκο",
  "en": [
    "S Francisco",
    "S. Francisco",
    "SFO",
    "Sanfran",
    "Sanfrancisco",
    "Frisco"
  ],
  "eo": [
    "San-Francisko",
    "Sanfrancisko"
  ],
  "eu": "San Frantzisko",
  "fa": "سان فرانسیسکو",
  "fi": [
    "San Franciscon",
    "San Franciscoon",
    "San Franciscossa",
    "San Franciscosta"
  ],
  "fy": "San Fransisko",
  "gu": "સેનફ્રાન્સિસ્કો",
  "he": "סן פרנסיסקו",
  "hi": "सैन फ्रांसिस्को",
  "hy": "Սան Ֆրանցիսկո",
  "ja": "サンフランシスコ",
  "kn": "ಸ್ಯಾನ್ ಫ್ರಾನ್ಸಿಸ್ಕೋ",
  "ka": "სან-ფრანცისკო",
  "kk": "Сан Фран сиско",
  "ky": "Сан-Франциско",
  "ko": [
    "샌프란시스코",
    "샌프란"
  ],
  "la": "Franciscopolis",
  "lv": "Sanfrancisko",
  "lt": "San Fransiskas",
  "ml": "സാൻ ഫ്രാൻസിസ്കോ",
  "mr": "सॅन फ्रान्स िस्को",
  "mk": "Сан Франциско",
  "mn": "Сан-Франциско",
  "my": "ဆန်ဖရန်စစ္စကိုမြို့",
  "nv": "Naʼníʼá Hóneezí",
  "ne": "सान फ्रान्सिस्को",
  "os": "Сан-Франциско",
  "pa": [
    "ਸੈਨ ਫਰਾਂਸਿਸਕੋ",
    "ਸੈਨ ਫ਼ਰਾਂ ਸਿਸਕੋ"
  ],
  "pt": "São Francisco",
  "ro": "Сан Франциско",
  "ru": "Сан-Франциско",
  "si": "සැන් ෆ්‍රැන්සිස්කෝ",
  "so": "San Fransisko",
  "es": "Ciudad de San Francisco",
  "sr": "Сан Франци ско",
  "ta": "சான் பிரான்சிஸ்கோ",
  "tt": "Сан-Франциско",
  "te": "శాన్ ఫ్రాన్సిస్కో",
  "tl": "Lungsod ng San Francisco",
  "th": "ซานฟรานซิสโก",
  "tk": "San-Fransisko",
  "ug": "San Fransisko",
  "uk": [
    "Сан-Франціско",
    "Сан-Франциско"
  ],
  "ur": "سان فرانسسکو",
  "uz": "San Fransisko",
  "yi": "סאן פראנציסקא"
}

I don't doubt that all these values are important to someone, but they are probably not all important to everyone. These name records are both stored and indexed, so there is certainly a cost to keeping them all.

It might make sense to add a configuration option, probably in pelias.json, that would control what languages are allowed to be imported across all Pelias services. Placeholder, for example, stores similar sets of languages.

This would allow different Pelias installations to be better tuned for only the data they need.
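As a rough illustration of the idea, here is a minimal sketch of filtering a name object down to a configured set of languages. The `filterNames` helper and its `allowedLangs` parameter are hypothetical and not part of any Pelias package. Always keeping `default` is an assumption.

```javascript
// Hypothetical helper: keep only the allowed language keys of a name object.
// Assumption: `default` is always kept, since other code likely relies on it.
function filterNames(names, allowedLangs) {
  const allowed = new Set(['default', ...allowedLangs]);
  const filtered = {};
  for (const [lang, value] of Object.entries(names)) {
    if (allowed.has(lang)) {
      filtered[lang] = value;
    }
  }
  return filtered;
}

// A trimmed-down version of the San Francisco names shown above.
const sanFrancisco = {
  default: ['San Francisco', 'SFO'],
  en: ['SFO', 'Frisco'],
  es: 'Ciudad de San Francisco',
  ja: 'サンフランシスコ',
  la: 'Franciscopolis'
};

const result = filterNames(sanFrancisco, ['en', 'es']);
console.log(Object.keys(result)); // [ 'default', 'en', 'es' ]
```

Using a `Set` keeps the membership check cheap even when the same filter runs over every record in an import.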

@Joxit
Member

Joxit commented May 19, 2020

This would be interesting for Pelias instances that need few languages.

Could we also change the default language? For example, to have Londres instead of London for a French instance?

@orangejulius
Member Author

Another reason to support this sort of thing: currently the list of sub-properties of the name field is enormous, with 193 entries.

Here's the full list from one of our planet indices, see if you can spot the interesting ones:

"aa"
"ab"
"ae"
"af"
"ak"
"am"
"an"
"ar"
"as"
"av"
"ay"
"az"
"ba"
"be"
"bg"
"bh"
"bi"
"bm"
"bn"
"bo"
"br"
"bs"
"ca"
"ce"
"ch"
"co"
"cr"
"cs"
"cu"
"cv"
"cy"
"da"
"de"
"default"
"dv"
"dz"
"ee"
"el"
"en"
"eo"
"es"
"et"
"eu"
"fa"
"ff"
"fi"
"fj"
"fo"
"fr"
"function Object() { [native code] }"
"fy"
"ga"
"gd"
"gl"
"gn"
"gu"
"gv"
"ha"
"he"
"hi"
"ho"
"hr"
"ht"
"hu"
"hy"
"hz"
"ia"
"id"
"ie"
"ig"
"ii"
"ik"
"international"
"io"
"is"
"it"
"iu"
"ja"
"jv"
"ka"
"kg"
"ki"
"kj"
"kk"
"kl"
"km"
"kn"
"ko"
"kr"
"ks"
"ku"
"kv"
"kw"
"ky"
"la"
"lb"
"lg"
"li"
"ln"
"lo"
"lt"
"lu"
"lv"
"mg"
"mh"
"mi"
"mk"
"ml"
"mn"
"mr"
"ms"
"mt"
"my"
"na"
"national"
"nb"
"nd"
"ne"
"ng"
"nl"
"nn"
"no"
"nr"
"nv"
"ny"
"oc"
"official"
"oj"
"old"
"om"
"or"
"os"
"pa"
"pi"
"pl"
"ps"
"pt"
"qu"
"regional"
"rm"
"rn"
"ro"
"ru"
"rw"
"sa"
"sc"
"sd"
"se"
"sg"
"sh"
"si"
"sk"
"sl"
"sm"
"sn"
"so"
"sorting"
"sq"
"sr"
"ss"
"st"
"su"
"sv"
"sw"
"ta"
"te"
"tg"
"th"
"ti"
"tk"
"tl"
"tn"
"to"
"tr"
"ts"
"tt"
"tw"
"ty"
"ug"
"uk"
"ur"
"uz"
"ve"
"vi"
"vo"
"wa"
"wo"
"xh"
"yi"
"yo"
"za"
"zh"
"zu"

With values like old, international, and especially function Object() { [native code] } present, it's clear some level of whitelisting is required.
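The thread does not say how that key got into the index. One plausible path, shown here purely as an assumption, is a lookup on a plain JavaScript object using an inherited property name such as `constructor`, with the result then used as a key. The `langLookup` table and `isAllowed` helper below are hypothetical.

```javascript
// Hypothetical lookup table from a tag suffix to a language key.
const langLookup = { en: 'en', fr: 'fr' };

// 'constructor' is not an own property of langLookup, but the lookup still
// succeeds because it is inherited from Object.prototype.
const inherited = langLookup['constructor'];

const names = {};
names[inherited] = 'some value'; // the function is coerced to a string key

console.log(Object.keys(names)); // [ 'function Object() { [native code] }' ]

// An explicit allowlist checked with Set#has ignores inherited properties.
const allowedLangs = new Set(['en', 'fr']);
function isAllowed(lang) {
  return allowedLangs.has(lang);
}

console.log(isAllowed('fr'));          // true
console.log(isAllowed('constructor')); // false
```

Whatever the real cause, an allowlist of this kind would also drop keys like `old` and `international` unless they are listed explicitly.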

@Joxit
Member

Joxit commented Aug 18, 2020

Since this should be done at import time, I suggest something like

{
  "imports": {
    "langs": ["en", "fr", "it", "nl"]
  }
}

With an extension for default language:

{
  "imports": {
    "langs": {
      "keep": ["en", "fr", "it", "nl"],
      "default": "en"
    }
  }
}
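To make the `keep` / `default` idea concrete, here is a small sketch of how such a config might be applied to a name object. The `applyLangConfig` function is hypothetical. It assumes "default language" means the configured language's names replace `default` when the record has them. The thread does not settle that behaviour.

```javascript
// Hypothetical: apply an `imports.langs` config of the shape { keep, default }.
function applyLangConfig(names, langs) {
  const keep = new Set(langs.keep);
  const out = {};

  for (const [lang, value] of Object.entries(names)) {
    if (lang === 'default' || keep.has(lang)) {
      out[lang] = value;
    }
  }

  // Assumption: the configured default language overrides `default` when the
  // record has names in it; otherwise `default` is left untouched.
  // An own-property check avoids picking up inherited keys.
  if (Object.prototype.hasOwnProperty.call(names, langs.default)) {
    out.default = names[langs.default];
  }
  return out;
}

const config = { imports: { langs: { keep: ['en', 'fr'], default: 'fr' } } };

// Illustrative records, not real index data.
const london = { default: 'London', en: 'London', fr: 'Londres', de: 'London' };
const paris = { default: 'Paris', de: 'Paris' };

console.log(applyLangConfig(london, config.imports.langs));
// { default: 'Londres', en: 'London', fr: 'Londres' }
console.log(applyLangConfig(paris, config.imports.langs));
// { default: 'Paris' }
```

Falling back to the existing `default` when the preferred language is missing means records without a French name still have something to display.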

function Object() { [native code] } 😅

@orangejulius
Member Author

I realize I never replied to the default language part. It 100% sounds good to me 👍

@missinglink
Member

Could the language filtering be performed in pelias/model, in a post-processing script? That way the code lives in one place and doesn't have to be implemented by every importer.

@Joxit
Member

Joxit commented Jan 28, 2021

Yes, this sounds good to me 👍

@missinglink
Member

I want a t-shirt that says function Object() { [native code] }

3 participants