English|简体中文

Reminder: this repo has been refactored. For paper reproduction or backward compatibility, please check out the repro branch.

ERNIE 2.0 is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning. ERNIE 2.0 provides a strong foundation for nearly every NLP task: text classification, ranking, NER, machine reading comprehension, text generation, and so on.

News

May.20.2020:
- Try ERNIE in "dygraph", with:
  - Pretrain and finetune ERNIE with PaddlePaddle v1.8.
  - Eager execution with paddle.fluid.dygraph.
  - Distributed training.
  - Easy deployment.
  - Learn NLP in Aistudio tutorials.
  - Backward compatibility for old-style checkpoints.
- ERNIE-GEN is available now! (link here)
  - A state-of-the-art pre-trained model for generation tasks, accepted by IJCAI-2020.
    - A novel span-by-span generation pre-training task.
    - An infilling generation mechanism and a noise-aware generation method.
    - Implemented by a carefully designed Multi-Flow Attention architecture.
  - You are able to download all models including base/large/large-430G.
Apr.30.2020: Released ERNIESage, a novel Graph Neural Network model using ERNIE as its aggregator. It is implemented through PGL.
Mar.27.2020: Champion on 5 SemEval2020 subtasks
Dec.26.2019: 1st place on GLUE leaderboard
Nov.6.2019: Introducing ERNIE-tiny
Jul.7.2019: Introducing ERNIE2.0
Mar.16.2019: Introducing ERNIE1.0

Quick Tour

import numpy as np
import paddle.fluid.dygraph as D
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

D.guard().__enter__() # activate paddle `dygraph` mode

model = ErnieModel.from_pretrained('ernie-1.0')    # downloads the pretrained model from the server; requires a network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = D.to_variable(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)                 # eager execution
print(pooled.numpy())                        # convert results to numpy

Tutorials

No GPU? Try ERNIE in AIStudio! (Please choose the latest version and apply for a GPU environment.)

ERNIE for beginners
Semantic analysis
Cloze test
Knowledge distillation
Ask ERNIE
Loading old-style checkpoints

Setup

1. install PaddlePaddle

This repo requires PaddlePaddle 1.7.0+; please see here for installation instructions.

2. install ernie

pip install paddle-ernie

or

git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
cd ERNIE
pip install -r requirements.txt
pip install -e .

3. download pretrained models (optional)

Model	Description	Abbreviation
ERNIE 1.0 Base for Chinese	L12H768A12	ernie-1.0
ERNIE Tiny	L3H1024A16	ernie-tiny
ERNIE 2.0 Base for English	L12H768A12	ernie-2.0-en
ERNIE 2.0 Large for English	L24H1024A16	ernie-2.0-large-en
ERNIE Gen base for English	L12H768A12	ernie-gen-base-en
ERNIE Gen Large for English	L24H1024A16	ernie-gen-large-en
ERNIE Gen Large 430G for English	Layer:24, Hidden:1024, Heads:16 + 430G pretrain corpus	ernie-gen-large-430g-en
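
The Description column uses a compact shorthand. A minimal sketch of how it can be read, assuming that `L`, `H` and `A` stand for the number of layers, the hidden size and the number of attention heads (an interpretation consistent with the "Layer:24, Hidden:1024, Heads:16" wording in the table). The helper name `parse_model_spec` is hypothetical and is not part of the `ernie` package:

```python
import re

# Assumption: the shorthand is "L<layers>H<hidden>A<heads>".
_SPEC = re.compile(r"^L(\d+)H(\d+)A(\d+)$")


def parse_model_spec(spec):
    """Split a spec such as 'L12H768A12' into its three integer fields."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError("unrecognized spec: %r" % spec)
    layers, hidden, heads = (int(group) for group in match.groups())
    return {"layers": layers, "hidden": hidden, "heads": heads}


print(parse_model_spec("L12H768A12"))
# {'layers': 12, 'hidden': 768, 'heads': 12}
```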

4. download datasets

English Datasets

Download the GLUE datasets by running this script.

The --data_dir option in the following sections assumes a directory tree like this:

data/xnli
├── dev
│   └── 1
├── test
│   └── 1
└── train
    └── 1

See demo data for the MNLI task.
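
Before launching a finetuning run, it can help to confirm that the directory passed as --data_dir matches the tree above. The following is only a sketch: the function name `missing_splits` is hypothetical, and it assumes that each of the `train`, `dev` and `test` sub-directories must exist and contain at least one file.

```python
import os

# Assumption: the split names mirror the directory tree shown above.
EXPECTED_SPLITS = ("train", "dev", "test")


def missing_splits(data_dir):
    """Return the split sub-directories that are absent or empty."""
    missing = []
    for split in EXPECTED_SPLITS:
        path = os.path.join(data_dir, split)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(split)
    return missing


# An empty list means the layout looks as expected.
print(missing_splits("./data/xnli"))
```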

Chinese Datasets

Datasets	Description
XNLI	XNLI is a natural language inference dataset in 15 languages, jointly built by Facebook and New York University. We use the Chinese data of XNLI to evaluate the language understanding ability of our model. url
ChnSentiCorp	ChnSentiCorp is a sentiment analysis dataset consisting of online shopping reviews of hotels, notebooks and books.
MSRA-NER	The MSRA-NER (SIGHAN2006) dataset was released by MSRA for recognizing the names of people, locations and organizations in text.
NLPCC2016-DBQA	NLPCC2016-DBQA is a sub-task of the NLPCC-ICCPOL 2016 Shared Task, hosted by NLPCC (Natural Language Processing and Chinese Computing). The task is to select, from a set of candidates, the documents that answer a given question. [url: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf]
CMRC2018	CMRC2018 is an evaluation of Chinese extractive reading comprehension hosted by the Chinese Information Processing Society of China (CIPS-CL). url

Fine-tuning

Try eager execution with the dygraph model:

python3 ./ernie_d/demo/finetune_classifier_dygraph.py \
       --from_pretrained ernie-1.0 \
       --data_dir ./data/xnli

Distributed finetune

paddle.distributed.launch is a process manager; we use it to launch a Python process on each available GPU device.

In distributed training, max_steps is used as the stopping criterion instead of the number of epochs, to prevent deadlock. You can calculate max_steps as EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH. Also note that the training data is sharded by device id to prevent overfitting.
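
A minimal sketch of the max_steps arithmetic and of per-device sharding. Several things here are assumptions rather than the repo's actual code: the helper names are hypothetical, TOTAL_BATCH is taken to be the per-device batch size multiplied by the number of devices, the result is rounded up to a whole step, and sharding is shown as a simple strided slice.

```python
import math


def compute_max_steps(epochs, num_train_examples, total_batch):
    """EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH, rounded up to a whole step."""
    return math.ceil(epochs * num_train_examples / total_batch)


def shard(examples, rank, world_size):
    """Keep every world_size-th example, starting at this device's rank."""
    return examples[rank::world_size]


# Hypothetical numbers, chosen only to illustrate the arithmetic.
per_device_batch = 32
num_devices = 4
total_batch = per_device_batch * num_devices  # assumption: summed over devices

print(compute_max_steps(3, 10000, total_batch))       # 235
print(shard(list(range(10)), rank=1, world_size=4))   # [1, 5, 9]
```

The value printed first is what would be passed as --max_steps under these assumptions.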

Demo (make sure you have more than 2 GPUs; online model download does not work under paddle.distributed.launch, so first run single-card finetuning to fetch the pretrained model, or download and extract one manually from here):

python3 -m paddle.distributed.launch \
./demo/finetune_classifier_dygraph_distributed.py \
    --data_dir data/mnli \
    --max_steps 10000 \
    --from_pretrained ernie-2.0-en

Many other demo Python scripts:

Sentiment Analysis
Semantic Similarity
Named Entity Recognition (NER)
Machine Reading Comprehension
Text generation

Recommended hyperparameters:

Task	Batch size	Learning rate
CoLA	32 / 64 (base)	3e-5
SST-2	64 / 256 (base)	2e-5
STS-B	128	5e-5
QQP	256	3e-5(base)/5e-5(large)
MNLI	256 / 512 (base)	3e-5
QNLI	256	2e-5
RTE	16 / 4 (base)	2e-5(base)/3e-5(large)
MRPC	16 / 32 (base)	3e-5
WNLI	8	2e-5
XNLI	512	1e-4(base)/4e-5(large)
CMRC2018	64	3e-5
DRCD	64	5e-5(base)/3e-5(large)
MSRA-NER(SIGHAN2006)	16	5e-5(base)/1e-5(large)
ChnSentiCorp	24	5e-5(base)/1e-5(large)
LCQMC	32	2e-5(base)/5e-6(large)
NLPCC2016-DBQA	64	2e-5(base)/1e-5(large)

Pretraining with ERNIE 1.0

see here

Online inference

If --inference_model_dir is passed to finetune_classifier_dygraph.py, a deployable model will be generated at the end of finetuning and your model is ready to serve.

For details about online inference, see the C++ inference API, or start a multi-GPU inference server with a few lines of code:

python -m propeller.tools.start_server -m /path/to/saved/inference_model  -p 8881

and call the server just like calling a local function (Python 3 only):

import numpy as np
from propeller.service.client import InferenceClient
from ernie.tokenizing_ernie import ErnieTokenizer

client = InferenceClient('tcp://localhost:8881')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, sids = tokenizer.encode('hello world')
ids = np.expand_dims(ids, 0)
sids = np.expand_dims(sids, 0)
result = client(ids, sids)

A pre-made inference model for ernie-1.0 can be downloaded here. It can be used for feature-based finetuning or feature extraction.

Distillation

Knowledge distillation is a good way to compress and accelerate ERNIE.

For details about distillation, see here.

Citation

ERNIE 1.0

@article{sun2019ernie,
  title={Ernie: Enhanced representation through knowledge integration},
  author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Chen, Xuyi and Zhang, Han and Tian, Xin and Zhu, Danxiang and Tian, Hao and Wu, Hua},
  journal={arXiv preprint arXiv:1904.09223},
  year={2019}
}

ERNIE 2.0

@article{sun2019ernie20,
  title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
  author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:1907.12412},
  year={2019}
}

ERNIE-GEN

@article{xiao2020ernie-gen,
  title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
  author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2001.11314},
  year={2020}
}

For full reproduction of paper results, please check out the repro branch of this repo.

Communication

Github Issues: bug reports, feature requests, install issues, usage issues, etc.
QQ discussion group: 760439550 (ERNIE discussion group).
QQ discussion group: 958422639 (ERNIE discussion group-v2).
Forums: discuss implementations, research, etc.
