How to make the tokenizer handle two-word country names? #2793

ToSev7en · 2018-09-25T03:41:42Z

For example, given "south korea is a country where...", the tokenizer returns two tokens: "south" and "korea".

What I want is a single token, "south korea", which would work better for my problem.

Thanks for your attention.

Info about spaCy

spaCy version: 2.0.11
Platform: Darwin-17.7.0-x86_64-i386-64bit
Python version: 3.6.2
Models: en

ines (Maintainer) · 2018-09-25T09:59:14Z

If you want "South Korea" to be one token, the best approach would be to find and then merge the tokens afterwards. You can do this by adding a custom component to your pipeline.

This example shows a pretty similar use case: https://github.com/explosion/spaCy/blob/master/examples/pipeline/custom_component_countries_api.py

Given a list of countries, it uses the PhraseMatcher to find them in the Doc and merges them into one token. Optionally, you can also set entity labels or custom attributes on the merged spans.
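For illustration, here is a minimal sketch of that approach. It is not the linked example itself, and it rests on a few assumptions: it targets a recent spaCy release (v3.x), where `doc.retokenize()` is used for merging instead of the `Span.merge()` call available in 2.0.x; it uses a blank English pipeline so no model download is needed; and the `COUNTRIES` list and the component name `merge_countries` are made up for the example.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Hypothetical country list; in practice this could come from a file or an API.
COUNTRIES = ["South Korea", "North Korea", "New Zealand"]

# Blank pipeline: tokenizer only (assumption, to keep the sketch self-contained).
nlp = spacy.blank("en")

# Match on the lowercase form so "south korea" and "South Korea" both hit.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])


@Language.component("merge_countries")
def merge_countries(doc):
    """Find country names in the Doc and merge each one into a single token."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        # filter_spans drops overlapping matches, which cannot be merged together.
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc


nlp.add_pipe("merge_countries")

doc = nlp("south korea is a country in East Asia")
tokens = [token.text for token in doc]
print(tokens)
```

With this component in the pipeline, "south korea" should come out as one token rather than two. On spaCy 2.0.x the registration and merge calls differ, so treat this as a starting point and check it against the version you have installed.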
