🍫 tok

tok is a Byte Pair Encoding (BPE) tokenizer that splits text into tokens, which can then be encoded into ids.
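To give a rough idea of what Byte Pair Encoding does, here is a minimal sketch of how learned merge rules turn a word into tokens. It is a simplified illustration, not tok's actual implementation: the helpers `apply_merge` and `bpe_word` are hypothetical names, the merge rules are made up, and normalization and special tokens are ignored.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of tok: applies one merge rule to a token
// sequence, replacing each adjacent (left, right) pair with left + right.
std::vector<std::string> apply_merge(const std::vector<std::string>& tokens,
                                     const std::string& left,
                                     const std::string& right) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i] == left && tokens[i + 1] == right) {
            out.push_back(left + right);
            ++i; // skip the right half of the merged pair
        } else {
            out.push_back(tokens[i]);
        }
    }
    return out;
}

// Hypothetical helper, not part of tok: splits a word into single bytes and
// applies the merge rules in the order they are given.
std::vector<std::string> bpe_word(
    const std::string& word,
    const std::vector<std::pair<std::string, std::string>>& merges) {
    std::vector<std::string> tokens;
    for (char c : word) {
        tokens.push_back(std::string(1, c));
    }
    for (const auto& rule : merges) {
        tokens = apply_merge(tokens, rule.first, rule.second);
    }
    return tokens;
}
```

With the made-up rules ("m", "e"), ("me", "l") and ("o", "n"), calling `bpe_word("melon", merges)` yields the two tokens "mel" and "on".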

Features

Custom string normalization
Easy-to-use API for text tokenization
Serializable vocabulary and merge rules

Installation

tok uses utf8proc for normalizing strings and cereal for serialization. Both can be installed through your package manager.

On a Debian-based distro:

apt install libutf8proc-dev libcereal-dev

Usage

Clone the repository with git clone https://github.com/M3nny/tok
Run make inside the cloned repository. This creates a build directory containing the static library
Include it in your project (you also have to link utf8proc)

g++ -std=c++11 -c program.cpp -o program.o
g++ -std=c++11 program.o -o program -L path_to/tok/build -l tok -l utf8proc

Brief example

#include <string>
#include <vector>
#include "tok.hpp"

int main() {
    tok tokenizer;
    tokenizer.load("pretrained/eng_adjectives_adverbs_30k.bin");
    std::string str = "i've just bought a melon!";

    std::vector<std::string> tokenized_str = tokenizer.tokenize(str);
    // ["i", "'", "ve", "Ķjust", "Ķbought", "Ķa", "Ķmel", "on", "!", "<|eot|>"]

    return 0;
}
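Once text has been split into tokens, each token can be mapped to an id. The sketch below only illustrates that idea and does not use tok's own interface, which is documented in tok.hpp. The helper `tokens_to_ids`, the vocabulary contents and the `unk_id` fallback are all assumptions made for this example.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of tok: looks up each token in a vocabulary
// and falls back to unk_id for tokens that are not present.
std::vector<int> tokens_to_ids(const std::vector<std::string>& tokens,
                               const std::unordered_map<std::string, int>& vocab,
                               int unk_id) {
    std::vector<int> ids;
    ids.reserve(tokens.size());
    for (const auto& token : tokens) {
        auto it = vocab.find(token);
        ids.push_back(it != vocab.end() ? it->second : unk_id);
    }
    return ids;
}
```

For instance, with a made-up vocabulary that maps "mel" to 0 and "on" to 1, and `unk_id` set to -1, the tokens "mel", "on", "!" map to 0, 1, -1.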

Important

The API documentation can be found in tok.hpp and some examples are listed inside the examples folder.

You can find pretrained vocabularies inside pretrained.
