Tokenizer Incorrectly Splitting "M1M" #13360

jasondalycanpk · 2024-02-27T18:30:26Z

jasondalycanpk
Feb 27, 2024

The tokenizer is incorrectly splitting the token M1M into M1 and M tokens. See the following:

How to reproduce the behaviour

Run the following code:

import spacy


nlp = spacy.load("en_core_web_trf")
doc = nlp("hello there M1M 1M1")

for sent in doc.sents:
    for token in sent:
        print(token)

This gives the following output:

hello
there
M1
M
1M1

Your Environment

Operating System: Debian Bookworm (12)
Python Version Used: 3.11.x
spaCy Version Used: spacy-3.5.4

svlandeg · 2024-02-29T14:39:03Z

svlandeg
Feb 29, 2024
Maintainer

Hi!

The tokenizer applies some heuristics, and in this case it's seeing "M" as a unit. You'd have the same behaviour when appending K or mg. The suffix rule responsible for this behaviour is here:

r"(?<=[0-9])(?:{u})".format(u=UNITS)

Like I said - these are heuristics that often help correctly tokenize texts where spaces are missing, but they might also result in false positive hits from time to time, as might be the case in your data. You could consider customizing the tokenizer for your use-case.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tokenizer Incorrectly Splitting "M1M" #13360

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Tokenizer Incorrectly Splitting "M1M" #13360

jasondalycanpk Feb 27, 2024

How to reproduce the behaviour

Your Environment

Replies: 1 comment

svlandeg Feb 29, 2024 Maintainer

jasondalycanpk
Feb 27, 2024

svlandeg
Feb 29, 2024
Maintainer