This repository provides word segmentation models available in the fast tokenizer Vaporetto, as well as a set of programs for creating each model.
Create the `resources` directory directly under the repository root, copy the `*.xml` files contained in the BCCWJ `M-XML` directory and `lex_3_1.csv` contained in UniDic 3.1.1 into it, and then run `build.sh` in the `models` directory.
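
The staging step above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the `stage_resources` helper and its three arguments are hypothetical names introduced here for illustration, and where BCCWJ and UniDic 3.1.1 are unpacked depends on your own setup.

```shell
#!/bin/sh
# Sketch only: stage_resources is a hypothetical helper, not shipped with this repository.

# Usage: stage_resources REPO_ROOT BCCWJ_MXML_DIR UNIDIC_DIR
# Creates REPO_ROOT/resources and copies the required input files into it.
stage_resources() {
    repo_root=$1
    bccwj_mxml_dir=$2   # directory holding the BCCWJ M-XML *.xml files
    unidic_dir=$3       # directory holding lex_3_1.csv from UniDic 3.1.1

    # Create the resources directory directly under the repository root.
    mkdir -p "$repo_root/resources" || return 1

    # Copy every M-XML file, then the UniDic lexicon.
    cp "$bccwj_mxml_dir"/*.xml "$repo_root/resources/" || return 1
    cp "$unidic_dir/lex_3_1.csv" "$repo_root/resources/" || return 1
}
```

After calling the helper from the repository root, for example `stage_resources . /path/to/M-XML /path/to/unidic` (both paths are placeholders), change into the `models` directory and run `build.sh` there. Whether the script is invoked as `./build.sh` or through an explicit shell depends on how it is checked out on your system.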
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
See the guidelines.