tokenizer_exceptions problem with Persian #1772
mosynaq started this conversation in Language Support
Replies: 1 comment
-
It's true that we don't really have a good solution to this for Persian, Arabic and other similar languages. I'm still not positive what the best strategy should be.
-
I am trying to train spaCy on Persian. One of the problems is in `tokenizer_exceptions.py`: spaCy expects the concatenation of the orths to form the word itself, like do + n't = don't, but for Persian this expectation does not hold in some cases. For example, the verb "بر نخواهد گشت" (= s/he will not return) is made up of "بر" + "نـ" + "خواهد گشت".
("نـ" negates a Persian verb. Most of the time the negation comes at the beginning, but in some cases, like this one, it comes in between.)
As you can see, you cannot simply concatenate the orths to form the full form.
Should spaCy's expectation be changed? Or should I do something?
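To make the mismatch concrete, here is a minimal sketch using spaCy's public `Tokenizer.add_special_case` API. The Persian forms are the ones from the post above; the blank `fa` pipeline, the `NORM` value, and the printed output are my own illustration of current spaCy behaviour, not spaCy's actual Persian data:

```python
import spacy
from spacy.symbols import ORTH, NORM

# Blank Persian pipeline: just the tokenizer, no trained components.
nlp = spacy.blank("fa")

# English-style exception: the ORTH values concatenate back to the
# surface string, which is exactly what tokenizer_exceptions.py expects.
nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}])

# Within one whitespace-delimited token the rule still holds:
# "ن" + "خواهد" == "نخواهد", so this exception is accepted.
nlp.tokenizer.add_special_case("نخواهد", [{ORTH: "ن"}, {ORTH: "خواهد"}])

print([t.text for t in nlp("بر نخواهد گشت")])
# expected: ['بر', 'ن', 'خواهد', 'گشت']

# The verb as a whole cannot be expressed this way, though: the tokenizer
# splits on whitespace before exceptions are applied, so a multi-word key
# like "بر نخواهد گشت" can never match, and spaCy rejects any exception
# whose ORTH values do not concatenate exactly to its key.
```

That concatenation invariant is presumably what keeps the exceptions cheap to apply at tokenization time, which may be why relaxing it for languages like Persian is not a small change.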