This crate provides a Rust port of the original create_pretraining_data.py script from the Google BERT repository.
$ cargo install bert_create_pretraining
You can use the bert_create_pretraining binary to create the pretraining data for BERT in parallel. The example below processes every .txt file in ${DATA_DIR}, passing the relevant arguments to the binary:
$ find "${DATA_DIR}" -name "*.txt" | xargs -I% -P ${NUM_PROC} -n 1 \
basename % | xargs -I% -P ${NUM_PROC} -n 1 \
"${TARGET_DIR}/bert_create_pretraining" \
--input-file="${DATA_DIR}/%" \
--output-file="${OUTPUT_DIR}/%.tfrecord" \
--vocab-file="${VOCAB_DIR}/vocab.txt" \
--max-seq-length=512 \
--max-predictions-per-seq=75 \
--masked-lm-prob=0.15 \
--random-seed=12345 \
--dupe-factor=5
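The DATA_DIR, OUTPUT_DIR, VOCAB_DIR, TARGET_DIR and NUM_PROC variables above are ordinary shell variables that you set beforehand (corpus directory, output directory, vocabulary directory, location of the installed binary, and number of parallel processes, respectively). For a quick test on a single file you can also call the binary directly, without the xargs pipeline. The sketch below reuses only the flags shown in the example above; the paths are placeholders, not defaults:

$ bert_create_pretraining \
--input-file=./data/shard-000.txt \
--output-file=./output/shard-000.tfrecord \
--vocab-file=./vocab/vocab.txt \
--max-seq-length=512 \
--max-predictions-per-seq=75 \
--masked-lm-prob=0.15 \
--random-seed=12345 \
--dupe-factor=5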
You can check the full list of options with the following command:
$ bert_create_pretraining --help
MIT license. See the LICENSE file for the full license text.