Hi, tokenization around hyphens can be tricky and, as you noticed, the current default tokenizer only splits on an ASCII hyphen between numbers. It is probably a good idea to add the en dash to the default infixes between numbers, but we'd need to be careful to consider what other kinds of cases this might affect.

In the meantime, it's fairly easy to customize the tokenizer. Here's the line that matches an ASCII hyphen between numbers as an infix; all we need to do is add a pattern with an en dash in the same context:

r"(?<=[0-9])[+\-\*^](?=[0-9-])",

infixes = nlp.Defaults.infixes + (r"(?<=[0-9])–(?=[0-9-])",)
infix_regex = spacy.util.co…
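
Since the snippet above is cut off, here is a minimal, self-contained sketch of how an extended infix list could be wired into the tokenizer. It assumes a blank English pipeline (`spacy.blank("en")`) and uses `spacy.util.compile_infix_regex` together with the tokenizer's `infix_finditer` attribute, as described in spaCy's tokenizer customization docs. It is not necessarily how the original snippet continued, and the sample text is made up for illustration.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline is enough here. It carries the
# default tokenizer rules and needs no trained model download.
nlp = spacy.blank("en")

# Default infix patterns plus one for an en dash between digits,
# mirroring the lookbehind/lookahead of the ASCII-hyphen pattern.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])–(?=[0-9-])"]
infix_regex = compile_infix_regex(infixes)

# Swap the tokenizer's infix matcher for the extended one.
nlp.tokenizer.infix_finditer = infix_regex.finditer

# Hypothetical sample text containing an en dash between two numbers.
tokens = [token.text for token in nlp("pages 10–20")]
print(tokens)
```

Wrapping the defaults in `list()` keeps the concatenation working whether `nlp.Defaults.infixes` is exposed as a tuple or as a list. The lookbehind and lookahead restrict the split to en dashes that sit directly between digits, so en dashes in other contexts are left to the existing rules.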

Answer selected by ines
Labels
feat / tokenizer Feature: Tokenizer
This discussion was converted from issue #4384 on December 11, 2020 00:03.