Uncommon hyphen doesn't get recognized #4384

KristiyanVachev · 2019-10-06T12:18:26Z

How to reproduce the behaviour

Enough said.

The hyphen in question is '–' (an en dash) in '4–15 kg'. The hyphen that's recognized is the regular ASCII '-'.

The text comes from the SQuAD v1 dataset.

Your Environment

Operating System: Windows-10-10.0.18362-SP0
Python Version Used: Python 3.7.2
spaCy Version Used: 2.1.8

Edit: spaCy version 2.2.1 also has the bug.

Answered by adrianeboyd

adrianeboyd · 2019-10-07T07:42:33Z

Hi, tokenization around hyphens can be tricky and, as you noticed, the current default tokenizer only splits on an ASCII hyphen between numbers. It is probably a good idea to add the en dash to the defaults between numbers, but we'd need to be careful to consider what other kinds of cases this might affect.

In the meantime, it's fairly easy to customize the tokenizer. Here's the line that matches an ASCII hyphen between numbers as an infix; all we need to do is add a pattern with an en dash in the same context:

spaCy/spacy/lang/punctuation.py, line 41 in 573e543:

r"(?<=[0-9])[+\-\*^](?=[0-9-])",

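import spacy

nlp = spacy.blank("en")  # assumption: a blank English pipeline; a loaded model works the same way
# list(...) keeps this working whether Defaults.infixes is a tuple or a list
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])–(?=[0-9-])"]
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
print([token.text for token in nlp('4–15 kg')])
# ['4', '–', '15', 'kg']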
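
To see why the default rule misses the en dash, the two patterns can be tried outside spaCy with Python's re module. This is only a sketch: it applies each pattern on its own, whereas the tokenizer compiles all infix patterns into a single regex, and the variable names are made up for illustration.

```python
import re

# Default rule quoted above from spacy/lang/punctuation.py: ASCII operators between digits.
default_rule = re.compile(r"(?<=[0-9])[+\-\*^](?=[0-9-])")
# Added rule: same lookbehind/lookahead context, but matching an en dash (U+2013).
en_dash_rule = re.compile(r"(?<=[0-9])–(?=[0-9-])")

ascii_text = "4-15 kg"
en_dash_text = "4\u201315 kg"  # the same text written with an en dash

default_on_ascii = default_rule.search(ascii_text)
default_on_en_dash = default_rule.search(en_dash_text)
added_on_en_dash = en_dash_rule.search(en_dash_text)

print(default_on_ascii is not None)    # True: '-' is in the character class
print(default_on_en_dash is not None)  # False: '–' is not in the character class
print(added_on_en_dash.span())         # (1, 2): the en dash between '4' and '15'
```

The en dash is simply absent from the default character class, so adding a pattern for it in the same digit context is enough for the tokenizer to treat it as an infix.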

Here is an example with a bit more detail in the docs: https://spacy.io/usage/linguistic-features#native-tokenizer-additions

If you're using 2.1.8, be sure you modify the tokenizer before processing any texts or you might run into a caching bug, which is fixed in 2.2.
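
On the question of which other cases an en dash rule between numbers might affect, one rough way to explore it is to run the candidate pattern over a few sample strings. The samples below are invented for illustration and are not taken from spaCy's tests; the check uses plain re rather than the tokenizer itself.

```python
import re

# Candidate infix rule from the workaround above: en dash with a digit before it
# and a digit or ASCII hyphen after it.
en_dash_rule = re.compile(r"(?<=[0-9])–(?=[0-9-])")

samples = [
    "4–15 kg",       # numeric range with a unit
    "1939–1945",     # year range
    "2019–10–06",    # date written with en dashes
    "London–Paris",  # en dash between words
    "10– 12",        # en dash followed by a space
]

# Keep only the samples where the rule would find an infix to split on.
affected = [s for s in samples if en_dash_rule.search(s)]
print(affected)
# ['4–15 kg', '1939–1945', '2019–10–06']
```

Ranges and dates written with en dashes would be split by this rule, while an en dash between words or before whitespace is left alone by this particular pattern.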
