How to make the tokenizer handle two-word country names? #2793
For example, given "south korea is a country where...", the tokenizer usually returns two tokens, "south" and "korea". What I want is a single token, "south korea", which works better for my problem. Thanks for your attention.
Replies: 1 comment
If you want "South Korea" to be one token, the best approach would be to find and then merge the tokens afterwards. You can do this by adding a custom component to your pipeline.
This example shows a pretty similar use case: https://github.com/explosion/spaCy/blob/master/examples/pipeline/custom_component_countries_api.py
Given a list of countries, it uses the PhraseMatcher to find them in the Doc and merges them into one token. Optionally, you can also set entity labels or custom attributes on the merged spans.
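A minimal sketch of that find-then-merge approach, in case it helps. It assumes the spaCy v3 API (`Language.component`, `nlp.add_pipe` with a string name, `doc.retokenize`), which is newer than the linked example, and it uses a blank English pipeline so no trained model is needed. The `COUNTRIES` list and the component name `merge_countries` are made up for illustration and are not taken from the linked script.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Illustrative list only; in practice this would come from your own data.
COUNTRIES = ["South Korea", "North Korea", "New Zealand"]

# Blank pipeline: tokenizer only, no trained components required.
nlp = spacy.blank("en")

# Match on the lowercase form so "south korea" and "South Korea" both hit.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])


@Language.component("merge_countries")  # hypothetical component name
def merge_countries(doc):
    # Collect every matched span, then drop overlaps before merging.
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc


nlp.add_pipe("merge_countries")

doc = nlp("south korea is a country in East Asia")
tokens = [token.text for token in doc]
print(tokens)
```

The tokenizer itself still splits on whitespace; the merge happens afterwards in the pipeline, so every later component sees "south korea" as a single token. In older spaCy versions the component registration and merge calls differ, so treat this as a starting point rather than a drop-in replacement for the linked example.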