Hello, thank you for the library.

I've written a free program for learning languages called Lute (https://github.com/LuteOrg/lute-v3), and it would be nice to add Thai support. This library looks great, but I'm not sure what the "best" parameters are when using it. Since I don't speak Thai, I can't judge whether the sentence splitting is accurate for Thai learners.

I did some testing at https://github.com/jzohrab/lute_thai_testing -- can you suggest the most accurate settings for splitting Thai texts into sentences for learners?

Cheers and regards!
Word tokenizer: Deepcut is a state-of-the-art deep-learning word tokenizer for Thai, but it is slow and uses a lot of compute. Alternatively, you can use newmm, which is dictionary-based, maximum matching, constrained by Thai Character Cluster (TCC) boundaries, with improved TCC rules. If you want to improve newmm, you can use Deepcut to update the dictionary from your data and add the new words to newmm's dictionary. See more: https://pythainlp.org/tutorials/notebooks/pythainlp_get_started.html#Word
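For reference, here is a minimal sketch of how those options can be tried from Python. It assumes the library is PyThaiNLP (per the linked tutorial); the engine names, the `sent_tokenize`/`word_tokenize` functions, and the separate `deepcut` install step are taken from my reading of the PyThaiNLP docs, so treat this as an illustration rather than the maintainer's exact recommendation.

```python
# Sketch: comparing sentence splitting and the suggested word tokenizers with PyThaiNLP.
# Assumptions: PyThaiNLP >= 2.x is installed; the "deepcut" engine additionally
# requires `pip install deepcut`.
from pythainlp.tokenize import sent_tokenize, word_tokenize

text = "สวัสดีครับ วันนี้อากาศดีมาก"  # example Thai text

# Sentence splitting (what Lute needs); uses PyThaiNLP's default sentence engine.
sentences = sent_tokenize(text)
print(sentences)

# Word tokenization with newmm: fast, dictionary-based, maximum matching.
words_newmm = word_tokenize(text, engine="newmm")
print(words_newmm)

# Word tokenization with deepcut: generally more accurate, but slower and heavier.
words_deepcut = word_tokenize(text, engine="deepcut")
print(words_deepcut)
```

For Lute's use case, `sent_tokenize` would presumably handle the sentence-level splitting, while the word tokenizers above would drive word-level segmentation; updating newmm's dictionary with words found by Deepcut, as suggested, is covered in the linked "Word" section of the PyThaiNLP tutorial.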