[Tokenizers] WhiteSpace pretokenizer not only splitting on whitespace #7183

luisquintanilla · 2024-06-26T02:38:38Z

Given the following code:

using Microsoft.ML.Tokenizers;

ReadOnlySpan<char> sentence = "[CLS] I love AI [SEP]";

var pretokenizer = new WhiteSpace();

var tokens = pretokenizer.PreTokenize(sentence);

// Actual
Console.WriteLine("Actual");
foreach(var result in tokens)
{
    var substr = sentence.Slice(result.Offset, result.Length);
    Console.WriteLine(substr.ToString());
}

Console.WriteLine("----");

// Expected
Console.WriteLine("Expected");
foreach(var expected in sentence.ToString().Split(' '))
{
    Console.WriteLine(expected);
}

The WhiteSpace pretokenizer splits on whitespace, but it also breaks runs of non-alphanumeric characters (such as [ and ]) into separate tokens.

Output:

Actual:
[
CLS
]
I
love
AI
[
SEP
]
----
Expected:
[CLS]
I
love
AI
[SEP]

Based on the name, I would expect it to split only on whitespace.
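
To make the expectation concrete, here is a minimal sketch of a whitespace-only split that keeps the same (Offset, Length) shape as the pretokenizer output. `SplitOnWhitespace` is a hypothetical helper written for this illustration, not part of Microsoft.ML.Tokenizers, and the `\S+` pattern is an assumption about what "whitespace only" should mean.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not a Microsoft.ML.Tokenizers API):
// returns (Offset, Length) for every run of non-whitespace characters.
static List<(int Offset, int Length)> SplitOnWhitespace(string text)
{
    var spans = new List<(int Offset, int Length)>();
    foreach (Match m in Regex.Matches(text, @"\S+"))
    {
        spans.Add((m.Index, m.Length));
    }
    return spans;
}

string sentence = "[CLS] I love AI [SEP]";
var spans = SplitOnWhitespace(sentence);

// Print each piece by slicing the original sentence with its offsets.
foreach (var (offset, length) in spans)
{
    Console.WriteLine(sentence.Substring(offset, length));
}
```

Unlike `Split(' ')`, matching `\S+` does not produce empty entries for consecutive spaces and also treats tabs and newlines as separators.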

tarekgh · 2024-06-26T16:22:36Z

@luisquintanilla we are doing the same thing Hugging Face does with its whitespace pre-tokenizer.

machinelearning/src/Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs

Line 22 in 8e3f72d

private const string PretokenizePattern = /*lang=regex*/ @"\w+|[^\w\s]+";

https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/pre_tokenizers/whitespace.rs#L21

Do you still want to change that?
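
For anyone reading along, the effect of that pattern can be reproduced with plain `System.Text.RegularExpressions`, without the tokenizer library. This is only a sketch: `MatchAll` is a hypothetical helper, and it applies the quoted regex directly rather than calling the actual `PreTokenize` implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Pattern quoted above: a run of word characters, OR a run of characters
// that are neither word characters nor whitespace.
const string PretokenizePattern = @"\w+|[^\w\s]+";

// Hypothetical helper (not a Microsoft.ML.Tokenizers API):
// collects every regex match in order of appearance.
static List<string> MatchAll(string pattern, string text)
{
    var result = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
    {
        result.Add(m.Value);
    }
    return result;
}

List<string> pieces = MatchAll(PretokenizePattern, "[CLS] I love AI [SEP]");
Console.WriteLine(string.Join("|", pieces));
// prints: [|CLS|]|I|love|AI|[|SEP|]
```

The brackets fall into the second alternative (`[^\w\s]+`) and the letters into the first (`\w+`), which is why `[CLS]` comes out as three pieces. As far as I know, the Hugging Face library also has a separate `WhitespaceSplit` pre-tokenizer that splits on whitespace only, which is closer to the behavior requested in this issue.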

dotnet-policy-service · 2024-06-26T16:23:57Z

This issue has been marked needs-author-action and may be missing some important information.

dotnet-policy-service bot added the untriaged New issue has not been triaged label Jun 26, 2024

luisquintanilla assigned tarekgh Jun 26, 2024

tarekgh added needs-author-action and removed untriaged New issue has not been triaged labels Jun 26, 2024
