Releases: daac-tools/vibrato
v0.5.1
v0.5.0
Main changes
- Add a Wasm demo #115
- Handle locale on the Wasm demo #119
- Add bi-gram feature info generator for MeCab models #121
- Embed a magic number into a model #129
Precompiled model files
We provide precompiled models for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release. The licenses are contained in each file.
All models were compiled and modified as described in compile.md and map.md. We trained the mappings of connection ids using CORE data in BCCWJ v1.1 (except the PN category).
Note that all the models are compressed in zstd format. You can pass them directly to the Vibrato CLIs, but if you use the vibrato APIs, you need to decompress them outside the APIs (see README).
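Because decompression is left to the caller on the API side, it can help to check whether a downloaded file is still zstd-compressed before handing it to the library. The following is a minimal sketch using only the Rust standard library. It relies on the fact that a standard zstd frame starts with the magic number 0xFD2FB528 (stored little-endian). The function names `is_zstd_frame` and `file_is_zstd` are hypothetical helpers written for this illustration and are not part of Vibrato.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Magic number that starts a standard zstd frame (0xFD2FB528, little-endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Returns true if `bytes` begins with the zstd frame magic number.
/// (Hypothetical helper, not a Vibrato API.)
pub fn is_zstd_frame(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZSTD_MAGIC)
}

/// Reads up to the first four bytes of the file at `path` and checks them.
/// Files shorter than four bytes are reported as not compressed.
pub fn file_is_zstd<P: AsRef<Path>>(path: P) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let mut head = [0u8; 4];
    let mut filled = 0;
    // `read` may return fewer bytes than requested, so loop until full or EOF.
    while filled < head.len() {
        let n = file.read(&mut head[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    Ok(is_zstd_frame(&head[..filled]))
}
```

If `file_is_zstd` returns `true`, the file should be decompressed (for example with a zstd decoder, as the README describes) before the bytes reach the vibrato APIs.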
Models trained using Vibrato
The following variants were trained using BCCWJ v1.1 (except the PN category) and UniDic v3.1.1.
- bccwj-suw+unidic-cwj-3_1_1: A standard version.
- bccwj-suw+unidic-cwj-3_1_1+compact: A smaller (but slower) version that compresses the connection matrix in the manner described in small-dic.md.
- bccwj-suw+unidic-cwj-3_1_1+compact-dual: An intermediate version between the above two, built with the dual-connector technique.
- bccwj-suw+unidic-cwj-3_1_1-extracted+compact: An even smaller version that contains only POS and pronunciation features.
- bccwj-suw+unidic-cwj-3_1_1-extracted+compact-dual: The dual-connector version of the extracted model.
These models were trained with L1-regularization.
Models converted from publicly-available resources
- ipadic-mecab-2_7_0 from IPADIC v2.7.0
- jumandic-mecab-7_0 from mecab-jumandic-utf8 v7.0
- naist-jdic-mecab-0_6_3b from NAIST Japanese Dictionary v0.6.3b
- unidic-mecab-2_1_2 from UniDic v2.1.2
- unidic-cwj-3_1_1 from UniDic v3.1.1
- unidic-cwj-3_1_1+compact from UniDic v3.1.1, whose connection matrix is compressed in the manner of mecab_smalldic.
- unidic-cwj-3_1_1+compact-dual from UniDic v3.1.1, which is the dual-connector version.
Statistics for compressed UniDic models
The following table shows the UniDic model sizes without and with +compact or +compact-dual (not in zstd format).
Models | Standard | Compact | Compact-dual |
---|---|---|---|
bccwj-suw+unidic-cwj-3_1_1 | 618 MB | 248 MB | 275 MB |
unidic-cwj-3_1_1 | 717 MB | 252 MB | 300 MB |
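To make the savings easier to compare, the sizes above can be expressed as a percentage of each standard model. The sketch below is plain arithmetic on the table values. The names `percent_of_standard`, `MODEL_SIZES_MB`, and `report` are introduced here for illustration only.

```rust
/// Size of a variant as a percentage of the standard model's size.
pub fn percent_of_standard(standard_mb: f64, variant_mb: f64) -> f64 {
    variant_mb / standard_mb * 100.0
}

/// Rows copied from the table: (model, standard, compact, compact-dual), in MB.
pub const MODEL_SIZES_MB: [(&str, f64, f64, f64); 2] = [
    ("bccwj-suw+unidic-cwj-3_1_1", 618.0, 248.0, 275.0),
    ("unidic-cwj-3_1_1", 717.0, 252.0, 300.0),
];

/// Builds one summary line per model, rounded to one decimal place.
pub fn report() -> Vec<String> {
    MODEL_SIZES_MB
        .iter()
        .map(|&(name, standard, compact, dual)| {
            format!(
                "{}: compact {:.1}%, compact-dual {:.1}%",
                name,
                percent_of_standard(standard, compact),
                percent_of_standard(standard, dual)
            )
        })
        .collect()
}
```

By this calculation the compact variants are roughly 35 to 40% of the standard size, and the compact-dual variants roughly 42 to 45%.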
v0.4.0
Main changes
- Handle zstd-compressed dictionaries in all CLIs #112
Precompiled dictionary files
We provide precompiled dictionaries for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release.
The following variants are distributed:
- ipadic-mecab-2_7_0/system.dic.zst from IPADIC v2.7.0
- ipadic-mecab-2_7_0-small/system.dic.zst from IPADIC v2.7.0
  - A smaller version that contains only the features 品詞-品詞細分類1 (POS and POS subcategory 1) and 発音 (pronunciation).
- jumandic-mecab-7_0/system.dic.zst from mecab-jumandic-utf8 v7.0
- naist-jdic-mecab-0_6_3b/system.dic.zst from NAIST Japanese Dictionary v0.6.3b
- unidic-mecab-2_1_2/system.dic.zst from UniDic v2.1.2
- unidic-cwj-3_1_1/system.dic.zst from UniDic v3.1.1
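As an illustration of what the smaller IPADIC variant retains, the sketch below picks selected fields out of a comma-separated feature string. It assumes the usual nine-column IPADIC feature layout, with 品詞 at position 0, 品詞細分類1 at position 1, and 発音 at position 8. The function `select_features` is a hypothetical helper, not the procedure actually used to build the distributed files, and it does not handle quoted commas.

```rust
/// Keeps only the comma-separated feature fields at the given positions.
/// Positions beyond the end of the feature string are skipped.
/// (Hypothetical helper for illustration; plain split, no CSV quoting.)
pub fn select_features<'a>(feature: &'a str, indices: &[usize]) -> Vec<&'a str> {
    let fields: Vec<&str> = feature.split(',').collect();
    indices
        .iter()
        .filter_map(|&i| fields.get(i).copied())
        .collect()
}
```

For example, calling `select_features("名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー", &[0, 1, 8])` keeps the fields 名詞, 固有名詞, and トーキョー.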
These system dictionaries were compiled and modified as described in compile.md and map.md. We trained the mappings of connection ids using license-expired data obtained from Aozora Bunko, following the guideline.
The licenses are contained in each file.
v0.3.3
v0.3.2
Main changes
- Add train feature flag #93
- Publish WordIdx and Dictionary::word_feature() #101
- Separate lifetime parameter in Worker and Tokenizer #102
Precompiled dictionary files
You can use those distributed in Release v0.3.1.
v0.3.1
Main changes
- Remove preparation scripts and distribute precompiled binaries #87
- Add DualConnector, a faster and smaller dictionary format #86
Precompiled dictionary files
We provide precompiled dictionaries for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release.
The following variants are distributed:
- ipadic-mecab-2_7_0/system.dic from IPADIC v2.7.0
- jumandic-mecab-7_0/system.dic from mecab-jumandic-utf8 v7.0
- naist-jdic-mecab-0_6_3b/system.dic from NAIST Japanese Dictionary v0.6.3b
- unidic-mecab-2_1_2/system.dic from UniDic v2.1.2
- unidic-cwj-3_1_1/system.dic from UniDic v3.1.1
These system dictionaries were compiled and modified as described in compile.md and map.md. We trained the mappings of connection ids using license-expired data obtained from Aozora Bunko, following the guideline.
The licenses are contained in each file.
v0.3.0
v0.2.0
v0.1.2
v0.1.1
The initial release!