Given the following code:

```csharp
using Microsoft.ML.Tokenizers;

ReadOnlySpan<char> sentence = "[CLS] I love AI [SEP]";
var pretokenizer = new WhiteSpace();
var tokens = pretokenizer.PreTokenize(sentence);

// Actual
Console.WriteLine("Actual");
foreach (var result in tokens)
{
    var substr = sentence.Slice(result.Offset, result.Length);
    Console.WriteLine(substr.ToString());
}

Console.WriteLine("----");

// Expected
Console.WriteLine("Expected");
foreach (var expected in sentence.ToString().Split(' '))
{
    Console.WriteLine(expected);
}
```
The WhiteSpace tokenizer is splitting based on non-alphanumeric and whitespace characters.
Output:
Actual:
[
CLS
]
I
love
AI
[
SEP
]
----
Expected:
[CLS]
I
love
AI
[SEP]
I would expect it to only split on whitespace based on the name.
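For illustration, the difference between the two behaviors can be reproduced with plain regular expressions. This is only a sketch: the pattern `\w+|[^\w\s]+` is my assumption about what produces the observed splits (it is consistent with the output above, but I have not confirmed it against the library source), and `Split` is a hypothetical helper, not part of Microsoft.ML.Tokenizers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every regex match in the text, in order.
static List<string> Split(string text, string pattern)
{
    var pieces = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
    {
        pieces.Add(m.Value);
    }
    return pieces;
}

string input = "[CLS] I love AI [SEP]";

// Assumed pattern: runs of word characters, or runs of characters that are
// neither word characters nor whitespace. This separates the brackets.
Console.WriteLine(string.Join(" | ", Split(input, @"\w+|[^\w\s]+")));
// [ | CLS | ] | I | love | AI | [ | SEP | ]

// Whitespace-only splitting: every run of non-whitespace characters is one piece.
Console.WriteLine(string.Join(" | ", Split(input, @"\S+")));
// [CLS] | I | love | AI | [SEP]
```

The second pattern gives the splits I expected from a pre-tokenizer named `WhiteSpace`.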