Language Code for languages not listed in ISO 639-1 #2338
Replies: 3 comments
-
Thanks for this issue – those are definitely valid points. I'll also go ahead and merge the Norwegian issue #2339 and the older issue #1308, because they're all related to the same considerations. We should probably consider extending the two-letter language codes and allow using hyphenated forms like […]
-
Quick update, copying over my reply from #4187:
-
This issue is a bit old, but I just stumbled upon it. If you haven't yet committed to a language code format, might I suggest using a standard instead of defining your own? BCP 47 is the format already used in computing (e.g., as the value of an […]).

One big benefit is that you only need to be as specific as necessary: if a language has an ISO 639-1 alpha-2 code and you don't care about the region, script, etc., you can just use the 2-letter code, which means it would be backward compatible with spaCy's existing codes.

There may be some edge cases: […]

There are also good libraries for working with BCP 47; langcodes is one with a compatible license (MIT).
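To make the backward-compatibility point concrete, here is a minimal sketch using only the Python standard library. It handles just the `language[-Script][-REGION]` subset of BCP 47 (no variants, extensions, or private-use subtags), and the name `split_tag` is hypothetical, not part of spaCy or langcodes. A dedicated library such as langcodes would be the better choice for the full grammar.

```python
import re

# Simplified subset of BCP 47: language[-Script][-REGION].
# Variants, extensions, and private-use subtags are deliberately NOT handled.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag):
    """Split a simple language tag into its subtags (hypothetical helper)."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return {
        # Conventional casing: language lowercase, script Titlecase, region UPPERCASE.
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }
```

Under this sketch, a bare two-letter code such as `en` or `nb` is already a valid tag with no script or region, so existing codes would keep working, while something like `pt-BR` only adds detail where it is needed.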
-
Currently, spaCy uses the two-letter ISO 639-1 language codes. However, that list does not cover all languages.
As more and more language models are added, we may come across situations where a language does not have a two-letter code, or where the model is trained on a specific dialect. So I guess we should either switch to ISO 639-3 language codes or find some way to use both?
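One possible shape for "using both", sketched under assumptions: prefer the two-letter code when one exists and fall back to the three-letter ISO 639-3 code otherwise. The tables below are tiny illustrative samples, and `normalize_code` is a hypothetical name, not an existing spaCy function.

```python
# Illustrative sample only: ISO 639-1 code -> ISO 639-3 code.
ISO_639_1_TO_3 = {"en": "eng", "de": "deu", "nb": "nob", "nn": "nno"}

# Illustrative sample of languages that have no two-letter ISO 639-1 code.
ISO_639_3_ONLY = {"grc", "hsb"}  # Ancient Greek, Upper Sorbian

# Reverse lookup so three-letter input can be mapped back to two letters.
_ISO_639_3_TO_1 = {three: two for two, three in ISO_639_1_TO_3.items()}


def normalize_code(code):
    """Return the two-letter code if one exists, else the three-letter code."""
    code = code.lower()
    if code in ISO_639_1_TO_3:
        return code  # already a known two-letter code
    if code in _ISO_639_3_TO_1:
        return _ISO_639_3_TO_1[code]  # shorten to keep existing codes stable
    if code in ISO_639_3_ONLY:
        return code  # no two-letter code available
    raise ValueError(f"unknown language code: {code!r}")
```

With this approach `eng` and `en` both resolve to `en`, so existing model names would not change, while a language such as Upper Sorbian would simply be addressed as `hsb`.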