go-sentencepiece

This is a pure Go implementation of encoding and decoding text with the SentencePiece tokenizer.

"Encoding" splits text into tokens using a trained tokenizer model. "Decoding" is the reverse process: it converts a list of tokens back into the original text.

SentencePiece is a general family of tokenizers that is configured by a protobuf configuration file. This repository currently focuses on implementing just the functionality required to reproduce the tokenization of Gemma models (the same tokenizer is used for Google's proprietary Gemini family of models). Specifically, it only implements BPE tokenization since this is what Gemma uses.

Current status

This package is ready to use for encoding text into tokens with the Gemma tokenizer. It has been reasonably optimized and extensively tested against the SentencePiece Python bindings (see system_test.go in this repository).

If you find any problems or discrepancies, please open an issue.

Tokenizer configuration

The configuration file for the tokenizer is a protobuf (structured data, serialized in the protocol buffer format) that describes a trained tokenizer model; it includes the complete learned vocabulary used for tokenization, as well as other configuration information.

The configuration file is not part of this repository. Please fetch it from the official Gemma implementation repository. The NewProcessor* constructors expect to read this file.

Developing

A protobuf is used to configure the tokenizer. The structure of the protobuf is described by the internal/model/sentencepiece_model.proto file, which is vendored from https://github.com/google/sentencepiece

To re-generate the *.pb.go file from it:

$ cd internal/model
$ ./gen.sh

The configuration protobuf itself is obtained as described in the Tokenizer configuration section. All tests require the MODELPATH environment variable to point to a local copy of the tokenizer configuration file.

Online demo

To see an in-browser demo of this tokenizer in action, visit https://eliben.github.io/go-sentencepiece/

The Go code is compiled to WebAssembly and loaded from a small JS program to allow interactive encoding of text.
