[Tokenizers] WhiteSpace pretokenizer not only splitting on whitespace #7183

Open
luisquintanilla opened this issue Jun 26, 2024 · 2 comments
luisquintanilla commented Jun 26, 2024

Given the following code:

using Microsoft.ML.Tokenizers;

ReadOnlySpan<char> sentence = "[CLS] I love AI [SEP]";

var pretokenizer = new WhiteSpace();

var tokens = pretokenizer.PreTokenize(sentence);

// Actual
Console.WriteLine("Actual");
foreach(var result in tokens)
{
    var substr = sentence.Slice(result.Offset, result.Length);
    Console.WriteLine(substr.ToString());
}

Console.WriteLine("----");

// Expected
Console.WriteLine("Expected");
foreach(var expected in sentence.ToString().Split(' '))
{
    Console.WriteLine(expected);
}

The WhiteSpace pretokenizer is splitting not only on whitespace but also at boundaries between word characters and non-word characters such as brackets.

Output:

Actual:
[
CLS
]
I
love
AI
[
SEP
]
----
Expected:
[CLS]
I
love
AI
[SEP]

I would expect it to only split on whitespace based on the name.
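
For reference, the expected behavior can be sketched with offsets rather than substrings, since `PreTokenize` reports `(Offset, Length)` pairs while `Split(' ')` does not. This is only a rough sketch: `SplitOnWhitespace` is a hypothetical helper written for this illustration and is not part of Microsoft.ML.Tokenizers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (an assumption for illustration, not a library API):
// returns (Offset, Length) for each maximal run of non-whitespace characters.
static List<(int Offset, int Length)> SplitOnWhitespace(string text)
{
    var result = new List<(int Offset, int Length)>();
    int start = -1; // -1 means "not currently inside a token"

    // Iterate one position past the end so a trailing token is flushed.
    for (int i = 0; i <= text.Length; i++)
    {
        bool atBoundary = i == text.Length || char.IsWhiteSpace(text[i]);
        if (!atBoundary && start < 0)
        {
            start = i; // a new token begins here
        }
        else if (atBoundary && start >= 0)
        {
            result.Add((start, i - start)); // the current token ends here
            start = -1;
        }
    }

    return result;
}

string sentence = "[CLS] I love AI [SEP]";
foreach (var (offset, length) in SplitOnWhitespace(sentence))
{
    Console.WriteLine($"{offset},{length}: {sentence.Substring(offset, length)}");
}
```

With the sentence above, this yields five spans, with `[CLS]` and `[SEP]` kept intact.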

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Jun 26, 2024

tarekgh commented Jun 26, 2024

@luisquintanilla we are doing the same as what Hugging Face does with its Whitespace pre-tokenizer.

private const string PretokenizePattern = /*lang=regex*/ @"\w+|[^\w\s]+";

https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/pre_tokenizers/whitespace.rs#L21

Do you still want to change that?
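
To make the difference concrete, here is a minimal sketch that applies the pattern quoted above with plain `System.Text.RegularExpressions`, next to a whitespace-only pattern. `WhitespaceOnlyPattern` and the `Split` helper are assumptions made for this illustration; they are not part of Microsoft.ML.Tokenizers, and the sketch does not call the library itself.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Pattern quoted in the comment above: a run of word characters, or a run of
// characters that are neither word characters nor whitespace.
const string PretokenizePattern = @"\w+|[^\w\s]+";

// Hypothetical alternative (an assumption, not a library constant):
// a run of non-whitespace characters.
const string WhitespaceOnlyPattern = @"\S+";

// Illustrative helper: returns every match of the pattern, in order.
static string[] Split(string pattern, string text) =>
    Regex.Matches(text, pattern).Select(m => m.Value).ToArray();

string sentence = "[CLS] I love AI [SEP]";

// Prints: [ | CLS | ] | I | love | AI | [ | SEP | ]
Console.WriteLine(string.Join(" | ", Split(PretokenizePattern, sentence)));

// Prints: [CLS] | I | love | AI | [SEP]
Console.WriteLine(string.Join(" | ", Split(WhitespaceOnlyPattern, sentence)));
```

The first pattern has two alternatives, so `[` and `]` match the punctuation branch and are emitted separately from `CLS`, which is what produces the "Actual" output reported in the issue.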

@tarekgh tarekgh added needs-author-action and removed untriaged New issue has not been triaged labels Jun 26, 2024

This issue has been marked needs-author-action and may be missing some important information.
