Tokenizer Incorrectly Splitting "M1M" #13360
Unanswered
jasondalycanpk
asked this question in Help: Other Questions
Replies: 1 comment
-
Hi! The tokenizer applies some heuristics, and in this case it's seeing "M" as a unit. You'd have the same behaviour when appending other unit-like suffixes.
As I said, these are heuristics that often help correctly tokenize texts where spaces are missing, but they can also produce false positives from time to time, as may be the case in your data. You could consider customizing the tokenizer for your use case.
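One way to customize the tokenizer, as suggested above, is to register a special case so the exact string is never split. This is a minimal sketch, assuming a blank English pipeline (the original poster's pipeline is not shown); for a whole class of such tokens you would instead adjust the tokenizer's suffix rules rather than enumerate special cases:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Register an exact-match special case so "M1M" stays one token.
nlp.tokenizer.add_special_case("M1M", [{ORTH: "M1M"}])

print([t.text for t in nlp("M1M")])  # ["M1M"]
```

Note that special cases match the exact string only; they won't cover other strings the unit heuristic splits the same way.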
-
The tokenizer is incorrectly splitting the token "M1M" into "M1" and "M". See the following:
How to reproduce the behaviour
Run the following code:
This gives the following output:
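The original code snippet and output were not preserved in this copy of the thread. A minimal reproduction, assuming a blank English pipeline, would look like this:

```python
import spacy

nlp = spacy.blank("en")

# The default suffix rules treat "M" after a digit as a unit,
# so "M1M" is split into "M1" and "M".
print([t.text for t in nlp("M1M")])  # ["M1", "M"]
```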
Your Environment