Releases: daac-tools/vibrato

v0.5.1

12 May 02:54
2df1de2

Main changes

  • Update MSRV to 1.65 #138
  • Update bincode version to 2.0.0-rc.3 #140

Precompiled model files

You can use those distributed in https://github.com/daac-tools/vibrato/releases/tag/v0.5.0

v0.5.0

22 Feb 04:52
a295d9d

Main changes

  • Add a Wasm demo #115
  • Handle locale on the Wasm demo #119
  • Add bi-gram feature info generator for MeCab models #121
  • Embed a magic number into a model #129

Precompiled model files

We provide precompiled models for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release. The licenses are contained in each file.

All models were compiled and modified as described in compile.md and map.md. We trained the mappings of connection IDs using the CORE data in BCCWJ v1.1 (except the PN category).

Note that all models are compressed in zstd format. The Vibrato CLIs accept them directly, but if you use the Vibrato APIs, you need to decompress them yourself before passing them to the APIs (see README).
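Since the APIs expect decompressed data, it can help to check whether a downloaded file is still zstd-compressed before loading it. The following is a minimal sketch using only the standard library; it is not part of Vibrato, and the function names `looks_like_zstd` and `file_looks_like_zstd` are hypothetical. It relies only on the fact that a standard zstd frame begins with the magic number 0xFD2FB528, stored little-endian.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Magic number that starts a standard zstd frame (0xFD2FB528, little-endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Returns true if `bytes` begins with the zstd frame magic number.
/// (Hypothetical helper; files that begin with a skippable frame are not detected.)
fn looks_like_zstd(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZSTD_MAGIC)
}

/// Reads at most the first four bytes of the file at `path` and checks them
/// against the zstd magic number.
fn file_looks_like_zstd(path: &Path) -> io::Result<bool> {
    let mut head = Vec::with_capacity(4);
    File::open(path)?.take(4).read_to_end(&mut head)?;
    Ok(looks_like_zstd(&head))
}
```

If the check returns true, decompress the file (for example with the `zstd` command-line tool or a zstd library) and pass the decompressed data to the APIs as described in the README.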

Models trained using Vibrato

The following five variants were trained using BCCWJ v1.1 (except the PN category) and UniDic v3.1.1.

  • bccwj-suw+unidic-cwj-3_1_1: A standard version.
  • bccwj-suw+unidic-cwj-3_1_1+compact: A smaller (but slower) version that compresses the connection matrix in the manner described in small-dic.md.
  • bccwj-suw+unidic-cwj-3_1_1+compact-dual: An intermediate version between the two above, built with the dual-connector technique.
  • bccwj-suw+unidic-cwj-3_1_1-extracted+compact: An even smaller version that contains only POS and pronunciation features.
  • bccwj-suw+unidic-cwj-3_1_1-extracted+compact-dual: The dual-connector version of the extracted model.

These models were trained with L1-regularization.

Models converted from publicly-available resources

Statistics for compressed UniDic models

The following table shows the sizes of the UniDic models (decompressed, not in zstd format) in the standard version and with +compact or +compact-dual.

| Model | Standard | Compact | Compact-dual |
| --- | --- | --- | --- |
| bccwj-suw+unidic-cwj-3_1_1 | 618 MB | 248 MB | 275 MB |
| unidic-cwj-3_1_1 | 717 MB | 252 MB | 300 MB |
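
To make the size reduction easier to read, the figures in the table can be expressed relative to the standard model. The sketch below is only illustrative arithmetic on the numbers listed above; the names `SIZES_MB`, `relative_size_percent`, and `report` are hypothetical and not part of Vibrato.

```rust
/// Model sizes in MB, copied from the table above:
/// (model, standard, compact, compact-dual).
const SIZES_MB: [(&str, f64, f64, f64); 2] = [
    ("bccwj-suw+unidic-cwj-3_1_1", 618.0, 248.0, 275.0),
    ("unidic-cwj-3_1_1", 717.0, 252.0, 300.0),
];

/// Size of a variant as a percentage of the standard model's size.
fn relative_size_percent(standard_mb: f64, variant_mb: f64) -> f64 {
    variant_mb / standard_mb * 100.0
}

/// Builds one summary line per model, rounded to one decimal place.
fn report() -> Vec<String> {
    SIZES_MB
        .iter()
        .map(|&(name, standard, compact, dual)| {
            format!(
                "{name}: compact {:.1}%, compact-dual {:.1}%",
                relative_size_percent(standard, compact),
                relative_size_percent(standard, dual)
            )
        })
        .collect()
}
```

By this calculation, the compact and compact-dual variants come to about 40% and 44.5% of the standard size for bccwj-suw+unidic-cwj-3_1_1, and about 35% and 42% for unidic-cwj-3_1_1.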

v0.4.0

03 Feb 04:23
8887f5b

Main changes

  • Handle zstd-compressed dictionaries in all CLIs #112

Precompiled dictionary files

We provide precompiled dictionaries for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release.

The following variants are distributed:

These system dictionaries were compiled and modified as described in compile.md and map.md. We trained the mappings of connection IDs using license-expired data obtained from Aozora Bunko, following the guideline.

The licenses are contained in each file.

v0.3.3

14 Dec 04:26
fa34ba5

Main changes

  • Publish members of WordIdx #104
  • Add const variable VERSION #105

Precompiled dictionary files

You can use those distributed in Release v0.3.1.

v0.3.2

13 Dec 07:08
84ef956

Main changes

  • Add train feature flag #93
  • Publish WordIdx and Dictionary::word_feature() #101
  • Separate lifetime parameter in Worker and Tokenizer #102

Precompiled dictionary files

You can use those distributed in Release v0.3.1.

v0.3.1

26 Oct 05:45
689ee41

Main changes

  • Remove preparation scripts and distribute precompiled binaries #87
  • Add DualConnector, a faster and smaller dictionary format #86

Precompiled dictionary files

We provide precompiled dictionaries for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release.

The following three variants are distributed:

These system dictionaries were compiled and modified as described in compile.md and map.md. We trained the mappings of connection IDs using license-expired data obtained from Aozora Bunko, following the guideline.

The licenses are contained in each file.

v0.3.0

19 Oct 03:59
6ba9fdf

Main changes

  • Reorganize builder modules #74, #77
  • Reorganize workspaces #80 and their docs #71
  • Add accuracy evaluator #57
  • Add smaller dictionary option #63
  • Support longer input sentences #72
  • Speed up the tokenize command when stdout is not TTY #59

v0.2.0

23 Sep 01:08
a303803

Main updates

Minor updates

  • The command line arguments of prepare/system are changed. #54
  • The version in the unidic-cwj installer is updated to v3.1.1. #48

v0.1.2

14 Sep 05:30
de25020
  • Modify the calculation of left/right-id access probabilities (#40)

v0.1.1

23 Aug 07:33
77f0e5d

The initial release!