tok is a Byte Pair Encoding (BPE) tokenizer that splits text into tokens, which can then be encoded into ids.
- Custom string normalization
- Easy-to-use API for text tokenization
- Serializable vocabulary and merge rules
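To make "merge rules" concrete, here is a minimal sketch of how a BPE tokenizer could apply an ordered list of merges to a single word. It is illustrative only and is not tok's actual implementation: the helper name apply_merges, its signature, and the sample rules are all hypothetical, and the input is assumed to be ASCII.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, NOT part of tok: applies ordered BPE merge rules to a word.
// Each rule (left, right) fuses every adjacent occurrence of left followed by right.
std::vector<std::string> apply_merges(
    const std::string& word,
    const std::vector<std::pair<std::string, std::string>>& merges) {
    // Start from single-character symbols (assumes ASCII input for simplicity).
    std::vector<std::string> symbols;
    for (char c : word) symbols.push_back(std::string(1, c));

    // Apply each rule in the order it is listed, i.e. highest priority first.
    for (const auto& rule : merges) {
        std::vector<std::string> merged;
        std::size_t i = 0;
        while (i < symbols.size()) {
            if (i + 1 < symbols.size() &&
                symbols[i] == rule.first && symbols[i + 1] == rule.second) {
                merged.push_back(rule.first + rule.second);  // fuse the pair
                i += 2;
            } else {
                merged.push_back(symbols[i]);  // keep the symbol unchanged
                i += 1;
            }
        }
        symbols = std::move(merged);
    }
    return symbols;
}
```

With the made-up rules (m, e), (me, l), (o, n), the word melon goes from m, e, l, o, n to the two tokens mel and on.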
tok uses utf8proc to normalize strings and cereal for serialization. You can install both through your package manager.
On a Debian-based distro:
apt install libutf8proc-dev libcereal-dev
- Clone the repository with
git clone https://github.com/M3nny/tok
- Run make inside the cloned repository; this creates a build directory containing the static library
- Include it in your project (you also have to link utf8proc)
g++ -std=c++11 -c program.cpp -o program.o
g++ -std=c++11 program.o -o program -L path_to/tok/build -l tok -l utf8proc
#include <string>
#include <vector>
#include "tok.hpp"

int main() {
    tok tokenizer;
    // Load a pretrained vocabulary and its merge rules
    tokenizer.load("pretrained/eng_adjectives_adverbs_30k.bin");
    std::string str = "i've just bought a melon!";
    std::vector<std::string> tokenized_str = tokenizer.tokenize(str);
    // ["i", "'", "ve", "Ķjust", "Ķbought", "Ķa", "Ķmel", "on", "!", "<|eot|>"]
    return 0;
}
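Once text has been split into tokens, encoding them into ids amounts to a vocabulary lookup. The sketch below shows that idea in isolation. It does not use tok's API (which is documented in tok.hpp): the helper tokens_to_ids, the vocabulary map, and the unk_id fallback are all assumptions made for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, NOT part of tok: maps token strings to integer ids.
// Tokens missing from the vocabulary are mapped to unk_id.
std::vector<int> tokens_to_ids(
    const std::vector<std::string>& tokens,
    const std::unordered_map<std::string, int>& vocab,
    int unk_id) {
    std::vector<int> ids;
    ids.reserve(tokens.size());
    for (const auto& token : tokens) {
        auto it = vocab.find(token);
        ids.push_back(it != vocab.end() ? it->second : unk_id);
    }
    return ids;
}
```

For instance, with a made-up vocabulary that maps mel to 0, on to 1 and ! to 2, the tokens mel, on, ?, ! would be encoded as 0, 1, unk_id, 2.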
Important
The API documentation can be found in tok.hpp, and some examples are listed inside the examples folder.
You can find pretrained vocabularies inside the pretrained folder.