tokenizer_exceptions problem with Persian #1772
mosynaq started this conversation in Language Support
Replies: 1 comment
-
It's true that we don't really have a good solution to this for Persian, Arabic and other similar languages. I'm still not positive what the best strategy should be.
-
I am trying to train spaCy on Persian. One of the problems is in `tokenizer_exceptions.py`: spaCy expects the concatenation of the orths to form the word itself, like do + n't = don't, but for Persian this expectation does not hold in some cases. For example, the verb "بر نخواهد گشت" (= s/he will not return) is made up of "بر" + "نـ" + "خواهد گشت".
("نـ" negates a Persian verb. Most of the time the negation comes at the beginning, but in some cases, like this one, it comes in between.)
As you can see, you cannot simply concatenate the orths to form the full form.
Should spaCy's expectation be changed? Or should I do something?
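To make the mismatch concrete, here is a minimal sketch using spaCy's public `Tokenizer.add_special_case` API. The Persian forms are the ones from the post above; the blank `fa` pipeline, the `NORM` value, and the printed output are my own illustration of current spaCy behaviour, not spaCy's actual Persian data:

```python
import spacy
from spacy.symbols import ORTH, NORM

# Blank Persian pipeline: just the tokenizer, no trained components.
nlp = spacy.blank("fa")

# English-style exception: the ORTH values concatenate back to the
# surface string, which is exactly what tokenizer_exceptions.py expects.
nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}])

# Within one whitespace-delimited token the rule still holds:
# "ن" + "خواهد" == "نخواهد", so this exception is accepted.
nlp.tokenizer.add_special_case("نخواهد", [{ORTH: "ن"}, {ORTH: "خواهد"}])

print([t.text for t in nlp("بر نخواهد گشت")])
# expected: ['بر', 'ن', 'خواهد', 'گشت']

# The verb as a whole cannot be expressed this way, though: the tokenizer
# splits on whitespace before exceptions are applied, so a multi-word key
# like "بر نخواهد گشت" can never match, and spaCy rejects any exception
# whose ORTH values do not concatenate exactly to its key.
```

That concatenation invariant is presumably what keeps the exceptions cheap to apply at tokenization time, which may be why relaxing it for languages like Persian is not a small change.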