Chinese word segmentation model for spaCy #12984
Replies: 1 comment
Hi @PythonCancer,
Yes, features.msgpack stores the features used by the segmentation model.
spaCy itself doesn't provide specialized components for word segmentation (as it does for tokenization, lemmatization, dependency parsing, etc.). If you want to train your own word segmentation model and it outperforms the ones integrated in spaCy w.r.t. accuracy or speed, we're happy to consider integrating it.
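For experimenting locally, one way to try a custom segmenter inside a spaCy pipeline is to wrap it as a custom tokenizer, since a tokenizer is just a callable that turns text into a `Doc`. Here is a minimal sketch, assuming a hypothetical `my_segment()` function standing in for your own word segmentation model:

```python
import spacy
from spacy.tokens import Doc


def my_segment(text):
    # Hypothetical stand-in for your own word segmentation model;
    # here it just splits the text into single characters for illustration.
    return [char for char in text if not char.isspace()]


class CustomSegmenter:
    """A spaCy tokenizer is any callable that maps a text to a Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = my_segment(text)
        return Doc(self.vocab, words=words)


nlp = spacy.blank("zh")                    # blank Chinese pipeline
nlp.tokenizer = CustomSegmenter(nlp.vocab) # replace the default tokenizer

doc = nlp("我喜欢自然语言处理")
print([token.text for token in doc])
```

A tokenizer swapped in this way replaces the whole segmentation step, so any downstream components would see your word boundaries directly; treat it as a way to test your model's output in a pipeline rather than as the packaging format used by the shipped zh pipelines.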
The Chinese word segmentation model zh_core_web_sm-3.5.0 in spaCy has two files. One is weights.npz, which contains the dimensions and the model's weight values; that part I understand. The other file is features.msgpack: what is this file for? Does it store the features? I want to train my own word segmentation model and embed it into spaCy, so could you explain what this file contains?
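In case it helps while digging into this, both files can be opened directly to see what they hold. A minimal sketch, assuming features.msgpack is an ordinary msgpack file (readable with srsly) and using a made-up model_dir path that you would replace with the real location of the two files:

```python
import numpy
import srsly  # spaCy's serialization helper; reads standard msgpack files

# Hypothetical location of the two files; adjust to wherever the installed
# zh_core_web_sm-3.5.0 package keeps them on your system.
model_dir = "path/to/zh_core_web_sm-3.5.0/pkuseg_model"

# weights.npz holds the numeric parameters as named arrays.
weights = numpy.load(f"{model_dir}/weights.npz")
for name in weights.files:
    print(name, weights[name].shape, weights[name].dtype)

# features.msgpack: assuming it is a standard msgpack file, this loads the
# feature data (likely a mapping from feature strings to indices).
features = srsly.read_msgpack(f"{model_dir}/features.msgpack")
print(type(features))
```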