Losing POS Tagging & Other Token Attributes when Segmenting with Jieba or Pkuseg #12846
creolio
started this conversation in
Language Support
Replies: 1 comment 1 reply
-
Hi @creolio, in the first example (with the default segmenter) you're using the In your other two snippets, using |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
I'm trying to ensure that I have accurate word segmentation/tokenization for Chinese while retaining access to token attributes such as part of speech, but it seems that when I switch segmenters from the default, I lose most of the token attribute data. I'm not training any custom models or anything like that.
My base jupyter notebook code looks like this:
With the above, I'm able to get both segmentation and token attributes, but am confused because I thought the default segmenter was "char". I'm using this as my solution for now, but would like to be able to play with other segmenters:
When I change:
nlp = spacy.load("zh_core_web_sm")
To:
I get this output:
Or if I use the following instead:
Or:
I get this output:
Using:
python3.10
spacy3.6
Beta Was this translation helpful? Give feedback.
All reactions