Uncommon hyphen doesn't get recognized #4384
How to reproduce the behaviour
Enough said. The hyphen in question is
The text comes from the SQuAD v1 dataset.
Your Environment
Edit: spaCy version 2.2.1 also includes the bug.
Replies: 1 comment
Hi, tokenization around hyphens can be tricky and, as you noticed, the current default tokenizer only splits on an ASCII hyphen between numbers. It is probably a good idea to add the en dash to the defaults between numbers, but we'd need to be careful to consider what other kinds of cases this might affect.
In the meantime, it's fairly easy to customize the tokenizer. Here's the line that matches an ASCII hyphen between numbers as an infix; all we need to do is add a pattern with an en dash in the same context:
spaCy/spacy/lang/punctuation.py, line 41 in 573e543
Here is an example with a bit more detail in the docs: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
If you're using 2.1.8, be sure you modify the tokenizer before processing any texts or you might run into a caching bug, which is fixed in 2.2.
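To make the suggestion concrete, here is a minimal sketch of adding an en dash infix rule, along the lines of the docs section linked above. The names `EN_DASH_BETWEEN_DIGITS` and `add_en_dash_infix` are made up for illustration, and the pattern is only an assumption about what "the same context" should mean (a digit on each side); `compile_infix_regex`, `nlp.Defaults.infixes` and `tokenizer.infix_finditer` are the spaCy pieces the docs use for this kind of customization.

```python
import re

EN_DASH = "\u2013"

# Hypothetical extra infix rule: an en dash with a digit directly on each side.
EN_DASH_BETWEEN_DIGITS = r"(?<=[0-9])" + EN_DASH + r"(?=[0-9])"


def add_en_dash_infix(nlp):
    """Make nlp's tokenizer also split on an en dash between digits."""
    # Imported lazily so the pattern check below runs without spaCy installed.
    from spacy.util import compile_infix_regex

    # Defaults.infixes may be a tuple or a list depending on the spaCy version.
    infixes = list(nlp.Defaults.infixes) + [EN_DASH_BETWEEN_DIGITS]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    return nlp


# Standard-library check of the pattern itself: it should match only the dash.
sample = "1980\u20131990"
print([m.span() for m in re.finditer(EN_DASH_BETWEEN_DIGITS, sample)])  # [(4, 5)]

# With spaCy installed, usage would look roughly like this:
#   import spacy
#   nlp = add_en_dash_infix(spacy.blank("en"))
#   print([t.text for t in nlp(sample)])
```

Because the rule requires digits on both sides, an en dash between letters or surrounded by spaces is left to the existing rules, which keeps the change narrow. On 2.1.8, call the helper before processing any text, for the caching reason mentioned above.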