bert_create_pretraining

This crate provides a Rust port of the original create_pretraining_data.py script from the Google BERT repository.

Installation

Cargo

$ cargo install bert_create_pretraining
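
The usage example in the next section calls the binary through a TARGET_DIR variable. As a minimal sketch, the snippet below shows one way to fill that variable after installation. It relies on the fact that cargo install places binaries in $CARGO_HOME/bin (by default ~/.cargo/bin). The helper name locate_bert_bin_dir is hypothetical and not part of this crate.

```shell
# Hypothetical helper (not part of the crate): print the directory that holds
# the bert_create_pretraining binary. It checks PATH first, then Cargo's
# default install location ($CARGO_HOME/bin, falling back to ~/.cargo/bin).
locate_bert_bin_dir() {
  if bin_path="$(command -v bert_create_pretraining 2>/dev/null)"; then
    dirname "$bin_path"
  elif [ -x "${CARGO_HOME:-$HOME/.cargo}/bin/bert_create_pretraining" ]; then
    printf '%s\n' "${CARGO_HOME:-$HOME/.cargo}/bin"
  else
    echo "bert_create_pretraining not found; is it installed?" >&2
    return 1
  fi
}

# TARGET_DIR stays empty if the binary cannot be found.
TARGET_DIR="$(locate_bert_bin_dir || true)"
echo "TARGET_DIR=${TARGET_DIR}"
```

If you built the crate from source with cargo build --release instead, the binary normally sits in target/release inside the checkout, and TARGET_DIR can point there directly.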

Usage

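The bert_create_pretraining binary creates the pretraining data for BERT. It processes one input file per invocation, so you can run it in parallel with xargs. The following example starts up to NUM_PROC processes at a time, one for each .txt file in DATA_DIR: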

$ find "${DATA_DIR}" -name "*.txt" | xargs -I% -P "${NUM_PROC}" \
  basename % | xargs -I% -P "${NUM_PROC}" \
  "${TARGET_DIR}/bert_create_pretraining" \
  --input-file="${DATA_DIR}/%" \
  --output-file="${OUTPUT_DIR}/%.tfrecord" \
  --vocab-file="${VOCAB_DIR}/vocab.txt" \
  --max-seq-length=512 \
  --max-predictions-per-seq=75 \
  --masked-lm-prob=0.15 \
  --random-seed=12345 \
  --dupe-factor=5
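
Before launching a long run, it can help to preview which output file each input will produce. The sketch below prints that mapping without invoking the binary. It follows the naming used in the command above (the input file name with .tfrecord appended). The helper name plan_outputs and the default paths are assumptions for illustration only.

```shell
# Hypothetical helper (not part of the crate): print "input -> output" for
# every .txt file found under data_dir, using the <name>.tfrecord naming
# from the command above.
plan_outputs() {
  data_dir="$1"
  output_dir="$2"
  find "$data_dir" -name "*.txt" -exec basename {} \; | sort |
    while IFS= read -r name; do
      printf '%s -> %s\n' "$data_dir/$name" "$output_dir/$name.tfrecord"
    done
}

# Example values; these paths are assumptions, adjust them to your layout.
DATA_DIR="${DATA_DIR:-./data}"
OUTPUT_DIR="${OUTPUT_DIR:-./output}"
# Default to the number of online CPUs, or 1 if it cannot be determined.
NUM_PROC="${NUM_PROC:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}"

# Precaution: create the output directory up front.
mkdir -p "$OUTPUT_DIR"

# Preview the mapping only when the input directory exists.
if [ -d "$DATA_DIR" ]; then
  plan_outputs "$DATA_DIR" "$OUTPUT_DIR"
fi
```

Like the command above, this sketch reduces each path to its base name, so it assumes the .txt files sit directly in DATA_DIR rather than in subdirectories.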

You can check the full list of options with the following command:

$ bert_create_pretraining --help

License

Licensed under the MIT license. See the LICENSE file for the full text.
