Kyrgyz language support #1344

Open · wants to merge 2 commits into base: master

@alexeyev commented Dec 5, 2024

Hello, thank you for your fantastic work.

Please add support for the Kyrgyz language. How can I help?

In this pull request I provide a list of characters and a list of words built from the two corpora from here, using this hacky script:

import re

paths = [
    # "data/kir_community_2017/kir_community_2017-words.txt",  # not used
    "data/kir_newscrawl_2016_1M/kir_newscrawl_2016_1M-words.txt",
    "data/kir_wikipedia_2021_300K/kir_wikipedia_2021_300K-words.txt",
]

tokens = []
removable = re.compile(r"(.*[′…ЇЈЎ&')¤/´˅(\"A-Za-z0-9Α-Ωα-ω.úƒƖ½ö+ЄІ,:;?!>< ]+.*|Ё.*|\w-\w+)", re.UNICODE)

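for path in paths:
    # Plain `with open(...)` (no wrapping parentheses) also runs on Python < 3.10.
    with open(path, "r", encoding="utf-8") as rf:
        for line in rf:
            line = line.strip()
            if line:
                # Tab-separated columns: index 1 is the token, index 2 is its count.
                split_line = line.split("\t")
                count = int(split_line[2])
                # Skip rare tokens (fewer than 6 occurrences).
                if count < 6:
                    continue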
                token = split_line[1].strip() \
                    .replace("ɵ", "ө") \
                    .replace("ϴ", "Ө") \
                    .replace("ʏ", "ү")
                token = token.strip("​•₣‰ʿ°—­‘»²¬/µ«£:;“”„'()´`$%–№.,-")
                if len(token) > 2 and not removable.match(token):
                    tokens.append(token)

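# Deduplicate and sort the tokens.
tokens = sorted(set(tokens))
tokens_clipped_tail = []

# Stop at the first "өөө" token: it and everything sorted after it are dropped.
for token in tokens:
    if token == "өөө":
        break
    tokens_clipped_tail.append(token)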

with open("ky.txt", "w", encoding="utf-8") as wf:
    wf.write("\n".join(tokens_clipped_tail))

print(f"A total of {len(tokens_clipped_tail)} tokens.")
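
The script above only produces the word list. As a rough sketch, the character list could be derived from that word list along the following lines. This is an assumption for illustration, not necessarily how the character list in this pull request was made, and the names `build_charset` and `ky_char.txt` are hypothetical.

```python
from collections import Counter


def build_charset(words):
    """Return the distinct characters used in `words`, most frequent first."""
    counts = Counter(ch for word in words for ch in word)
    # Sort by descending frequency, then by code point so ties are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, _ in ordered]


def write_charset(words_path, out_path):
    """Read one word per line from `words_path`, write one character per line."""
    with open(words_path, encoding="utf-8") as rf:
        words = [line.strip() for line in rf if line.strip()]
    with open(out_path, "w", encoding="utf-8") as wf:
        wf.write("\n".join(build_charset(words)))
```

Calling `write_charset("ky.txt", "ky_char.txt")` after running the script would write one character per line. Ordering by frequency makes stray characters easy to spot at the end of the file.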

Best regards,
Anton.

@Hellomik2002

Can you help me with the Kazakh language? Please write to me: https://t.me/hellomik
