Add an additional Chinese tokenizer #6304
-
We'd be happy to accept a PR that adds this option to the Chinese tokenizer. If the models are all loaded from external sources/packages (similar to `spacy/lang/zh/__init__.py` lines 92 to 101 at commit 2c98040), this should be relatively straightforward, as long as the tool doesn't do any unexpected normalization that makes it hard to align the output with the original text. If you want to add the POS tags, that's also possible; the Japanese tokenizer does something similar. If you want the model to be packaged with the spaCy model itself (as we do for our custom pkuseg models), that's more involved.
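For reference, a minimal sketch of what the external-loading route could look like: a custom tokenizer that wraps CkipTagger's word segmenter and builds a spaCy `Doc` aligned with the original text. This assumes `ckiptagger` is installed and its pretrained data has been downloaded to `./data`; the `CkipTokenizer` class name and `MODEL_DIR` path are illustrative, not part of spaCy's or CkipTagger's API.

```python
import spacy
from spacy.tokens import Doc
from ckiptagger import WS  # CkipTagger word segmentation model

MODEL_DIR = "./data"  # assumed location of the downloaded CkipTagger data

class CkipTokenizer:
    """Illustrative custom tokenizer: CkipTagger segmentation -> spaCy Doc."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.ws = WS(MODEL_DIR)

    def __call__(self, text):
        # WS takes a batch of texts and returns a list of token lists.
        words = self.ws([text])[0]
        # Recover whitespace so the Doc aligns with the original text.
        # This assumes the segmenter returns surface forms verbatim
        # (no normalization); otherwise text.index() would fail, which
        # is exactly the alignment caveat mentioned above.
        spaces = []
        offset = 0
        for word in words:
            offset = text.index(word, offset) + len(word)
            spaces.append(offset < len(text) and text[offset] == " ")
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("zh")
nlp.tokenizer = CkipTokenizer(nlp.vocab)
doc = nlp("傅達仁今將執行安樂死")
print([t.text for t in doc])
```

Replacing `nlp.tokenizer` with any callable that returns a `Doc` is the standard spaCy customization point; a real PR would instead register this in the tokenizer config alongside the existing segmenter options.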
-
Thanks for the quick reply. I'll try the easy way you suggested and see how it works out!
-
Feature description
This feature would integrate CkipTagger, a tokenizer/POS/NER tagger trained on Traditional Chinese, into the spaCy ecosystem. I noticed that spaCy recently added pkuseg, so maybe it's also possible to add CkipTagger? In my experience, CkipTagger typically produces better results than pkuseg or jieba on texts originally written in Traditional Chinese.
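For context, a brief sketch of CkipTagger's standalone API (following its README), showing the three capabilities mentioned above. It assumes the pretrained model data has already been downloaded to `./data`; that path and the sample sentence are illustrative.

```python
from ckiptagger import WS, POS, NER

# Load the segmentation, POS, and NER models from the downloaded data.
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")

sentences = ["傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺。"]
word_lists = ws(sentences)                  # list of token lists per sentence
pos_lists = pos(word_lists)                 # list of CKIP POS tag lists
entity_lists = ner(word_lists, pos_lists)   # list of {(start, end, type, text), ...}

for words, tags in zip(word_lists, pos_lists):
    print(list(zip(words, tags)))
print(entity_lists[0])
```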