Add an additional Chinese tokenizer #6304
-
We'd be happy to accept a PR that adds this option to the Chinese tokenizer. If the models are all loaded from external sources/packages (similar to `spacy/lang/zh/__init__.py` lines 92 to 101 at commit 2c98040), this should be relatively straightforward, as long as the tool doesn't do any unexpected normalization that makes it hard to align the output with the original text. If you want to add the POS tags, that's also possible; the Japanese tokenizer does something similar. If you want the model to be packaged with the spaCy model itself (as we do for our custom pkuseg models), that's more involved.
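For reference, a minimal sketch of what the external-loading route could look like: a custom tokenizer that wraps CkipTagger's word segmenter and builds a spaCy `Doc` aligned with the original text. This assumes `ckiptagger` is installed and its pretrained data has been downloaded to `./data`; the `CkipTokenizer` class name and `MODEL_DIR` path are illustrative, not part of spaCy's or CkipTagger's API.

```python
import spacy
from spacy.tokens import Doc
from ckiptagger import WS  # CkipTagger word segmentation model

MODEL_DIR = "./data"  # assumed location of the downloaded CkipTagger data

class CkipTokenizer:
    """Illustrative custom tokenizer: CkipTagger segmentation -> spaCy Doc."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.ws = WS(MODEL_DIR)

    def __call__(self, text):
        # WS takes a batch of texts and returns a list of token lists.
        words = self.ws([text])[0]
        # Recover whitespace so the Doc aligns with the original text.
        # This assumes the segmenter returns surface forms verbatim
        # (no normalization); otherwise text.index() would fail, which
        # is exactly the alignment caveat mentioned above.
        spaces = []
        offset = 0
        for word in words:
            offset = text.index(word, offset) + len(word)
            spaces.append(offset < len(text) and text[offset] == " ")
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("zh")
nlp.tokenizer = CkipTokenizer(nlp.vocab)
doc = nlp("傅達仁今將執行安樂死")
print([t.text for t in doc])
```

Replacing `nlp.tokenizer` with any callable that returns a `Doc` is the standard spaCy customization point; a real PR would instead register this in the tokenizer config alongside the existing segmenter options.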
-
Thanks for the quick reply. I'll try the easy way you suggested and see how it works out!
-
Feature description
This feature would integrate CkipTagger, a tokenizer/POS/NER tagger trained on Traditional Chinese, into the spaCy ecosystem. I noticed that spaCy recently added pkuseg, so maybe it's also possible to add CkipTagger? In my experience, CkipTagger typically produces better results than pkuseg or jieba on texts originally written in Traditional Chinese.
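For context, a brief sketch of CkipTagger's standalone API (following its README), showing the three capabilities mentioned above. It assumes the pretrained model data has already been downloaded to `./data`; that path and the sample sentence are illustrative.

```python
from ckiptagger import WS, POS, NER

# Load the segmentation, POS, and NER models from the downloaded data.
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")

sentences = ["傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺。"]
word_lists = ws(sentences)                  # list of token lists per sentence
pos_lists = pos(word_lists)                 # list of CKIP POS tag lists
entity_lists = ner(word_lists, pos_lists)   # list of {(start, end, type, text), ...}

for words, tags in zip(word_lists, pos_lists):
    print(list(zip(words, tags)))
print(entity_lists[0])
```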