This is a Transformer-based neural machine translation (NMT) model 🤗.
The dataset comes from the WMT 2018 Chinese-English track (news domain only).
- Tool: sentencepiece
- Preprocess: Run `./data/get_corpus.py` to extract the bilingual data used to build the training, dev, and test sets. The data is saved in `corpus.en` and `corpus.ch`, with one sentence per line.
- Word segmentation model training: Run `./tokenizer/tokenize.py`, which calls the `sentencepiece.SentencePieceTrainer.Train()` method to train the word segmentation models. After training, `chn.model`, `chn.vocab`, `eng.model`, and `eng.vocab` are saved in `./tokenizer`. Each `.model` file is a word segmentation model, and each `.vocab` file is the corresponding vocabulary.
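The tokenizer training step could look roughly like the sketch below. It is not the repo's actual `tokenize.py`: the helper names `build_train_args` and `train_tokenizer`, the vocabulary size of 32000, the `bpe` model type, and the character coverage values are assumptions chosen for illustration. Only `SentencePieceTrainer.Train()` and the file names come from the description above.

```python
def build_train_args(input_path, model_prefix, vocab_size,
                     character_coverage, model_type="bpe"):
    """Build the flag string accepted by SentencePieceTrainer.Train().

    All values passed in are illustrative assumptions, not the repo's settings.
    """
    return (
        f"--input={input_path} "
        f"--model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} "
        f"--model_type={model_type} "
        f"--character_coverage={character_coverage}"
    )


def train_tokenizer(input_path, model_prefix, vocab_size, character_coverage):
    """Train one model; writes <model_prefix>.model and <model_prefix>.vocab."""
    # Imported lazily so the argument builder can be used without sentencepiece.
    import sentencepiece as spm

    spm.SentencePieceTrainer.Train(
        build_train_args(input_path, model_prefix, vocab_size, character_coverage)
    )


# Assumed settings: full character coverage for English, slightly less for
# Chinese, which has a much larger character inventory.
eng_args = build_train_args("corpus.en", "eng", 32000, 1.0)
chn_args = build_train_args("corpus.ch", "chn", 32000, 0.9995)

# To actually train (requires the corpus files to exist):
# train_tokenizer("corpus.en", "eng", 32000, 1.0)
# train_tokenizer("corpus.ch", "chn", 32000, 0.9995)
```

Training one model per language keeps the English and Chinese vocabularies separate, which matches the four output files listed above.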
We use the open-source transformer-pytorch code developed by Harvard.
This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:
- tqdm
- pytorch >= 1.5.1
- sacrebleu >= 1.4.14
- sentencepiece >= 0.1.94
To get the environment settled quickly, run:
pip install -r requirements.txt
Hyperparameters can be modified in `config.py`.
- This code supports multi-GPU training. Modify the `device_id` list in `config.py` and `os.environ['CUDA_VISIBLE_DEVICES']` in `main.py` to use your own GPUs.
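As a rough illustration of how these two settings relate, the sketch below restricts the process to a chosen set of physical GPUs and derives the matching logical device list. The helper `select_gpus` and the GPU indices 2 and 3 are hypothetical; the repo sets these values directly in `main.py` and `config.py`.

```python
import os


def select_gpus(physical_ids):
    """Expose only the given physical GPUs and return their logical indices.

    CUDA renumbers the visible devices from 0, so physical GPUs [2, 3]
    appear inside the process as devices [0, 1].
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    return list(range(len(physical_ids)))


# Hypothetical choice: use physical GPUs 2 and 3 on a shared machine.
device_id = select_gpus([2, 3])
```

The environment variable only takes effect if it is set before CUDA is initialized, so it should be assigned before the first CUDA call in the program.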
To start training, please run:
python main.py
The training log is saved in `./experiment/train.log`, and the translation results for the test set are saved in `./experiment/output.txt`.
Training on 2 GeForce GTX 1080 Ti GPUs takes 1 hour per epoch.
Model | NoamOpt | LabelSmoothing | Best Dev Bleu | Test Bleu |
---|---|---|---|---|
1 | No | No | 24.07 | 24.03 |
2 | Yes | No | 26.08 | 25.94 |
3 | No | Yes | 23.92 | 23.84 |
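The NoamOpt column presumably refers to the warmup learning-rate schedule from the original Transformer paper, where the rate grows linearly during warmup and then decays with the inverse square root of the step. A minimal sketch of that formula follows; the function name `noam_rate` and the defaults (`d_model=512`, `warmup=4000`, `factor=1.0`) are assumptions taken from the paper, not values confirmed from this repo's `config.py`.

```python
def noam_rate(step, d_model=512, factor=1.0, warmup=4000):
    """Learning rate at a given optimizer step (steps start at 1).

    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises until `warmup` steps, peaks there, then decays.
rates = {step: noam_rate(step) for step in (1, 2000, 4000, 8000, 16000)}
```

The two arguments of `min` are equal exactly at `step == warmup`, which is where the schedule reaches its peak.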
Email me if you need the pretrained model (Model 2, the best-performing model) 😊. I will send you a Google Drive download link.
The testing results of Model 2 with beam search:
Beam_size | 2 | 3 | 4 | 5 |
---|---|---|---|---|
Test Bleu | 26.59 | 26.80 | 26.84 | 26.86 |
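To show why a larger beam can improve the score, here is a toy beam search over a hand-written probability table. It is a generic sketch, not the repo's decoder: `beam_search`, `toy_step`, and the token names are hypothetical, and the real implementation may add details such as length normalization or batching.

```python
import math


def beam_search(step_fn, bos, eos, beam_size, max_len):
    """Return the highest-scoring (tokens, log_prob) hypothesis.

    step_fn(tokens) must return a dict mapping next token -> probability.
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, prob in step_fn(tokens).items():
                if prob > 0:
                    candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def toy_step(tokens):
    """Toy model: 'a' looks best at first but leads to weaker continuations."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"z": 1.0},
    }
    return table.get(tokens[-1], {"</s>": 1.0})


greedy, _ = beam_search(toy_step, "<s>", "</s>", beam_size=1, max_len=5)
wider, _ = beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=5)
```

With a beam of 1 the search commits to `a` and ends with total probability 0.3, while a beam of 2 keeps `b` alive and finds the sequence with probability 0.4.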
Name the pretrained model (or your own trained model) `model.pth` and save it in `./experiment`. Then run the `translate_example` method in `main.py` to get the translation of a single sentence.
Example English input sentence:
The near-term policy remedies are clear: raise the minimum wage to a level that will keep a fully employed worker and his or her family out of poverty, and extend the earned-income tax credit to childless workers.
Ground truth:
近期的政策对策很明确:把最低工资提升到足以一个全职工人及其家庭免于贫困的水平,扩大对无子女劳动者的工资所得税减免。
Translation result with beam size = 3:
短期政策方案很清楚:把最低工资提高到充分就业的水平,并扩大向无薪工人发放所得的税收信用。
The code released in this repository has only been tested successfully on Linux. If you want to try it on Windows, the steps below, mentioned in issue 2, may be useful:
- Add the UTF-8 encoding argument when opening files:
  In lines 16 and 19 of `get_corpus.py`:
  with open(ch_path, "w", encoding="utf-8") as fch:
  with open(en_path, "w", encoding="utf-8") as fen:
  In line 165 of `train.py`:
  with open(config.output_path, "w", encoding="utf-8") as fp:
- Use conda to install sacrebleu if Anaconda is used to build your virtual environment:
  conda install -c conda-forge sacrebleu
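The encoding fix matters because `open()` without an `encoding` argument uses the platform's locale encoding, which on Windows is often not UTF-8, so Chinese text can fail to write or be decoded incorrectly. The sketch below shows the explicit form round-tripping mixed Chinese and English lines; `write_lines`, `read_lines`, and the temporary file name are hypothetical helpers for illustration only.

```python
import os
import tempfile


def write_lines(path, lines):
    """Write one sentence per line, always as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def read_lines(path):
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


sentences = ["近期的政策对策很明确", "The near-term policy remedies are clear"]
tmp_path = os.path.join(tempfile.mkdtemp(), "corpus_demo.txt")
write_lines(tmp_path, sentences)
round_trip = read_lines(tmp_path)
```

Passing the same encoding on both the write and the read side keeps the behavior identical on Linux and Windows.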
If you run into any other problems in your own project, feel free to open an issue or email me 😊~