ELECTRA for Japanese text.

This is the repository for a Japanese ELECTRA model.

We trained an ELECTRA model based on the Google/ELECTRA code and followed the pipeline from yoheikikuta to switch the tokenizer to SentencePiece. Big thanks to the Google team and yoheikikuta for their efforts.

This README combines documentation from Google and yoheikikuta with our own explanations.

Pretrained models

We provide a pretrained ELECTRA model and a trained SentencePiece model for Japanese text. The training data is the Japanese Wikipedia corpus from Wikimedia Downloads.
Please download all of the ELECTRA files from the linked Google Drive folder into the data/models/electra_small/ directory, then move the vocab file to data/.
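As a quick check that the downloaded files are in place, the SentencePiece model can be loaded and used to tokenize a sentence. This is a minimal sketch, assuming the model file ends up at data/wiki-ja.model as in the data-preparation step below; the sample sentence is arbitrary:

python
import sentencepiece as spm

# Load the trained SentencePiece model downloaded to data/.
sp = spm.SentencePieceProcessor()
sp.load("data/wiki-ja.model")

# Tokenize an arbitrary Japanese sentence into subword pieces and ids.
text = "日本語のテキストを分割します。"  # "Segment Japanese text."
print(sp.encode_as_pieces(text))
print(sp.encode_as_ids(text))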

The training loss curve is shown below:

[figure: electra-training-loss]

Pre-train a small ELECTRA model

These instructions pre-train a small ELECTRA model (12 layers, hidden size 256). Pre-training ELECTRA-small took 6 days for 1M steps on a single Radeon VII 16GB GPU. All scripts for pretraining from scratch are provided; follow the instructions below.

Training the SentencePiece model

  • Download the jawiki data from this link and extract the archive. This is also the dataset we used to pretrain ELECTRA.
  • Run python extract_wiki_data.py to extract the text from the dump.
  • Run python pretrain/train-sentencepiece.py to train the SentencePiece model (see the sketch after this list for roughly what this step does).
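For orientation, a minimal SentencePiece training call looks like the following. This is a sketch, not the script's exact settings: the input path and all parameters except the 32k vocabulary size (which matches vocab_size in the pretraining hparams below) are assumptions; see pretrain/train-sentencepiece.py for the actual configuration.

python
import sentencepiece as spm

# Minimal sketch: train a 32k-vocab SentencePiece model on the extracted
# wiki text. The input path, coverage, and model type are assumptions.
spm.SentencePieceTrainer.train(
    input="data/wiki_corpus.txt",   # hypothetical path to the extracted text
    model_prefix="data/wiki-ja",    # produces wiki-ja.model and wiki-ja.vocab
    vocab_size=32000,               # matches vocab_size in the hparams below
    character_coverage=0.9995,      # common choice for Japanese
    model_type="unigram",
)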

Data preparation

  • Place the vocab file and SentencePiece model at data/wiki-ja.vocab and data/wiki-ja.model.
  • Run python build_japanesewiki_pretrain_data.py --data-dir data/ --model-dir data/ --num-processes 4.
    This pre-processes and tokenizes the data and writes the examples as TFRecord files under data/pretrain_tfrecords (a quick sanity check follows this list).
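To sanity-check the output before pretraining, you can count the serialized examples in the generated shards. A minimal sketch, assuming TensorFlow 2 eager mode and that the shards sit directly under the output directory named above:

python
import tensorflow as tf

# Count the serialized examples written by build_japanesewiki_pretrain_data.py.
# The glob pattern is an assumption based on the output directory above.
shards = tf.io.gfile.glob("data/pretrain_tfrecords/*")
count = sum(1 for _ in tf.data.TFRecordDataset(shards))
print(f"{len(shards)} shard(s), {count} examples")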

Pretraining

This step is exactly the same as in the Google/ELECTRA documentation. You can use the script below:

python run_pretraining.py \
    --data-dir data \
    --model-name electra_small_japanese \
    --hparams '{"debug": false, "do_train": true, "do_eval": false, "vocab_file": "data/vocab.txt", "model_sentencepiece_path": "model_sentence_piece/wiki-ja.model", "model_size": "small", "vocab_size": 32000, "max_seq_length": 512, "num_train_steps": 1000000, "train_batch_size": 64}'
