Chinese Machine Translation

Background

Machine translation (MT) converts text from one language to another. Here, we focus on translation into and out of Chinese.

Example

Input:

美中两国可能很快达成一个贸易协议。

Output:

The United States and China may soon reach a trade agreement.

Standard Metrics

  • Direct assessment (human judgment). Amazon Mechanical Turk workers are shown a system translation and a human reference, and are asked: “How accurately does the candidate text convey the original semantics of the reference text?”
  • BLEU score (Papineni et al 02). A minimal sketch of the computation follows this list.
    • BLEU-n4r4: word {1,2,3,4}-gram matches, scored against four human reference translations.
      • A brevity penalty additionally punishes translations that are shorter than the reference(s).
      • Standard BLEU scripts tokenize translations and references before computing n-gram matches.
      • If Chinese is the target language, character {1,2,3,4}-gram matches are used instead.
      • BLEU-n4r1 is used when only one human reference is available.
    • There are important variations:
      • Case-sensitive vs. case-insensitive scoring.
      • The brevity penalty may kick in when the system translation is shorter than the shortest reference, or shorter than the “closest” reference in length.
  • NIST. A variation of BLEU that gives higher weight to rare n-grams.
  • TER (Translation Edit Rate). Automatically calculates the number of edits required to make a system translation identical to a human reference.
  • BLEU-SBP ([Chiang et al 08](http://aclweb.org/anthology/D08-1064)). Addresses decomposability problems with BLEU, proposing a cross between BLEU and word error rate.
  • HTER. Counts the edits performed by a human post-editor to bring an automatic translation into good shape.
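
To make the BLEU definition above concrete, here is a minimal, illustrative sentence-level sketch in Python. It is not a replacement for the standard scripts (mteval, multi-bleu, sacrebleu), which score at the corpus level and handle tokenization; the function names and example sentences are ours.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Illustrative sentence-level BLEU-n4 with a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        overlap = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty, computed here against the closest reference length.
    closest = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat",
           ["the cat sat on the mat", "a cat was sitting on the mat"]))  # 1.0
```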

ZH-EN

WMT.

The Second Conference on Machine Translation (WMT17) has a Chinese/English MT component, done in cooperation with CWMT 2017.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| WMT17 parallel English/Chinese test set | 2,001 | News |

Note that the Conference on Machine Translation (WMT19) is announced here.

Metrics

Direct assessment (average standardized human score, “Ave z”) and BLEU, as reported in the results tables below.

Results

Chinese to English (WMT17)

| System | Direct assessment (Ave z) | BLEU |
| --- | --- | --- |
| [Hany et al 18] | | 27.4 |
| [Wang et al 17] | 0.209 | 26.4 |
| [Sennrich et al 17] | 0.208 | 25.7 |
| [Tan et al 17] | 0.184 | 26.0 |

English - Chinese (WMT17)

| System | Direct assessment (Ave z) | BLEU |
| --- | --- | --- |
| [Wang et al 17] | 0.208 | |
| [Sennrich et al 17] | 0.178 | 36.3 |
| [Tan et al 17] | 0.165 | 35.8 |

Resources

There are many parallel English/Chinese text resources to train MT systems on. These are publicly available:

| Dataset | Size (words on English side) | Genre |
| --- | --- | --- |
| UN | 327m | Political |
| News Commentary v12 | 5m | News opinions |
| CWMT | 154m | Web, movies, thesaurus, government, news conversation, novels, technical documents |
| AI_Challenger | 120m | Movie subtitles, English learning, etc. |
| WMT 2017 dev | 54k | News |

The Linguistic Data Consortium has additional resources, such as FBIS and NIST test sets.

NIST.

NIST has a long history of supporting Chinese-English translation, creating test sets and running annual NIST OpenMT evaluations during the 2000s. Many sites have reported results on NIST test sets.

Test sets contain Chinese sentences, each with four distinct (human reference) English translations. Having four references makes NIST an unusually strong evaluation set.

Variations in training and evaluation conditions can make it difficult to compare systems.

  • Evaluation script. Some papers evaluate with mteval-v13a, while others use the multi-bleu script from Moses. Case sensitivity can also be an issue, and with multiple references there are variations in how to compute BLEU’s brevity penalty. (A sketch of pinning these choices down in code follows this list.)
  • Training data. Papers vary in the number of training sentence pairs they restrict themselves to.
  • Development data. Some papers use NIST 02 for development/tuning, while others use it as test data.
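
One way to keep scores comparable across papers is to make the whole evaluation configuration explicit in code. A minimal sketch using the sacrebleu package; the file names below are hypothetical placeholders:

```python
# Minimal sketch, assuming `pip install sacrebleu`; file names are hypothetical.
import sacrebleu

with open("system_output.en") as f:
    hyps = [line.strip() for line in f]

# One list per reference stream: NIST test sets ship four references per sentence.
refs = []
for i in range(4):
    with open(f"nist03.ref{i}.en") as f:
        refs.append([line.strip() for line in f])

# Make case handling and tokenization explicit so the score is reproducible.
bleu = sacrebleu.corpus_bleu(hyps, refs, lowercase=True, tokenize="13a")
print(f"BLEU = {bleu.score:.2f}")
```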

Note that this paper proposes a standard corpus and methodology around NIST sets, while reporting high BLEU scores (code on GitHub).

| Test set | Size (sentence pairs) | Genre |
| --- | --- | --- |
| NIST 02 | 878 | News |
| NIST 03 | 919 | News |
| NIST 04 | 1,788 | News |
| NIST 05 | 1,082 | News |
| NIST 06 | 1,664 | Newswire, broadcast news, broadcast conversations, web newsgroups |
| NIST 08 | 1,357 | Newswire, broadcast news, broadcast conversations, web newsgroups |

Metrics

BLEU

Results

| System | Training sentence pairs | Eval script | NIST 02 | NIST 03 | NIST 04 | NIST 05 | NIST 06 | NIST 08 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [Zhang et al 2019] | 1.25m | mteval-v11b | 48.31 | 49.40 | 48.72 | 48.45 | | | 48.72 |
| [Hadiwinoto & Ng, 2018] | 7.65m | mteval-v13a | 46.94 | 47.58 | 49.13 | 47.78 | 49.37 | 41.48 | 47.05 |
| [Yang et al, 2020] | 1.2m | unspecified | | 46.56 | 46.04 | | | 37.53 | |
| [Meng et al 2019] | 1.25m | unspecified | 40.56 (dev) | 39.93 | 41.54 | 38.01 | 37.45 | 29.07 | 37.76 |
| [Ma et al 2018c] | 1.25m | unspecified | 39.77 (dev) | 38.91 | 40.02 | 36.82 | 35.93 | 27.61 | 36.51 |
| [Chen et al 2017] | 1.6m | multi-bleu | | 36.57 | 35.64 | 36.63 | 34.35 | 30.57 | |

Resources

The Linguistic Data Consortium provides training materials typically used for NIST OpenMT tasks.

IWSLT 2015.

| Test set | Size (sentences) | # of talks | Genre |
| --- | --- | --- | --- |
| tst2014 | 1,068 | 12 | TED talks |
| tst2015 | 1,080 | 12 | TED talks |

Metrics

  • Automatic metrics: BLEU, NIST, TER (a simplified TER sketch follows this list).
  • Manual metrics: HTER.
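
TER counts the edits needed to turn a system output into a reference, normalized by reference length. Below is a simplified illustrative sketch; real TER (Snover et al 06) also allows block shifts at a cost of one edit, so this Levenshtein-only version slightly overestimates the true score.

```python
def ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # insertion
                           cur[j - 1] + 1,           # deletion
                           prev[j - 1] + (h != r)))  # substitution (free if equal)
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(ter("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words ≈ 0.167
```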

Results

Chinese to English (tst2015)

| System | BLEU | NIST | TER |
| --- | --- | --- | --- |
| MITLL-AFRL | 16.86 | 5.2565 | 67.31 |

English to Chinese (tst2015)

| System | BLEU | NIST | TER |
| --- | --- | --- | --- |
| Univ. Edinburgh | 25.39 | 6.3985 | 60.83 |
| MITLL-AFRL | 24.31 | 6.4136 | 59.00 |

Resources

| Dataset | Size (sentences) | # of talks | Genre |
| --- | --- | --- | --- |
| Train | 210k | 1,718 | TED talks |

TED corpus.

This site contains an up-to-date multi-way corpus of TED talks used for machine translation research. It also contains a leaderboard maintained by Kevin Duh.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| Chinese/English test | 1,982 | TED talks |
| Chinese/English dev | 1,958 | TED talks |

This site contains more languages but a different train/test split.

Results

Chinese to English

| System | BLEU |
| --- | --- |
| Kevin Duh, 6-layer transformer (Sockeye) | 16.63 |

English to Chinese

| System | BLEU |
| --- | --- |
| Kevin Duh (coming) | Not yet available |

Resources

The Multitarget TED Talks Task (MTTT)

ZH-JA

Workshop on Asian Translation.

The Workshop on Asian Translation has run since 2014. Here, we include the 2018 Chinese/Japanese evaluations.

Metrics

  • BLEU.
  • RIBES. A method developed at NTT based on rank correlation coefficients (link); a rough sketch of its core idea follows this list.
  • Scoring code:
    • for evaluating Chinese outputs: link
    • for evaluating Japanese outputs: link
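
RIBES rewards hypotheses that preserve the reference word order. Reported numbers should come from the official NTT scorer linked above; the sketch below only illustrates the normalized Kendall's-tau idea at its core, aligning words that occur exactly once in both sentences.

```python
from itertools import combinations

def normalized_kendalls_tau(hypothesis: str, reference: str) -> float:
    """Rough sketch of the rank-correlation core of RIBES (not the official scorer)."""
    hyp, ref = hypothesis.split(), reference.split()
    # Reference positions of hypothesis words, restricted to unambiguous words.
    order = [ref.index(w) for w in hyp if hyp.count(w) == 1 and ref.count(w) == 1]
    if len(order) < 2:
        return 0.0
    pairs = list(combinations(order, 2))
    concordant = sum(a < b for a, b in pairs)  # pairs kept in reference order
    tau = 2 * concordant / len(pairs) - 1      # Kendall's tau in [-1, 1]
    return (tau + 1) / 2                       # normalized to [0, 1]

print(normalized_kendalls_tau("the cat sat on the mat", "on the mat the cat sat"))
```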

ASPEC Chinese-Japanese

Participants must get data from here.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| ASPEC Chinese-Japanese | 2,107 | Scientific abstracts |
| ASPEC Japanese-Chinese | 2,107 | Scientific abstracts |

JPO Patent Corpus 2

Participants must get data from here.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| JPCN Chinese-Japanese | 5,204 | Patents |
| JPCN Japanese-Chinese | 5,204 | Patents |
| JPCN1 Chinese-Japanese | 2,000 | Patents |
| JPCN1 Japanese-Chinese | 2,000 | Patents |
| JPCN2 Chinese-Japanese | 3,000 | Patents |
| JPCN2 Japanese-Chinese | 3,000 | Patents |
| JPCN3 Chinese-Japanese | 204 | Patents |
| JPCN3 Japanese-Chinese | 204 | Patents |
| JPSEP Chinese-Japanese | 1,151 | Patent expression patterns |

Results

  • See here
  • Note that others’ results should be cited only in anonymized form.

Resources

| Dataset | Size (sentences) | Genre |
| --- | --- | --- |
| Japanese-Chinese train | 250,000 | Patents |
| Japanese-Chinese dev | 2,000 | Patents |
| Japanese-Chinese devtest | 2,000 | Patents |

IWSLT2020 ZH-JA Open Domain Translation.

The shared task promotes research on translation between Asian languages, the exploitation of noisy parallel web corpora for MT, and smart processing of data and provenance.

Metrics

  • 4-gram character BLEU (a character-tokenization sketch follows below).
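
Character-level BLEU sidesteps word segmentation for Chinese and Japanese. A minimal sketch, assuming sacrebleu is available; it approximates 4-gram character BLEU by splitting strings into characters before scoring (the example sentences are illustrative):

```python
# Approximate 4-gram character BLEU: insert spaces between characters,
# then score with default 4-gram BLEU and no further tokenization.
import sacrebleu

hyps = ["美中两国可能很快达成一个贸易协议。"]
refs = [["中美两国可能即将达成一项贸易协议。"]]  # one list per reference stream

spaced_hyps = [" ".join(h) for h in hyps]
spaced_refs = [[" ".join(r) for r in stream] for stream in refs]

score = sacrebleu.corpus_bleu(spaced_hyps, spaced_refs, tokenize="none")
print(f"character BLEU = {score.score:.1f}")
```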

A (secret) mixed-genre test set was intended to cover a variety of topics. The test data was selected from high-quality (human-translated) parallel web content authored between January and March 2020.

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| Chinese-Japanese | 875 | Mixed-genre |
| Japanese-Chinese | 875 | Mixed-genre |

Results

Chinese to Japanese

| System | BLEU |
| --- | --- |
| CASIA* | 43.0 |
| Xiaomi | 34.3 |
| TSUKUBA | 33.0 |

Japanese to Chinese

| System | BLEU |
| --- | --- |
| CASIA* | 55.8 |
| Samsung Research China | 34.0 |
| OPPO | 32.9 |

* indicates a system that collected external parallel training data which inadvertently overlapped with the blind test set.

Resources

| Dataset | Size (sentences) | Genre |
| --- | --- | --- |
| Web crawled | 18,966,595 | Mixed-genre |
| Existing parallel sources | 1,963,238 | Mixed-genre |

Others

CWMT.

CWMT (China Workshop on Machine Translation) 2017 and 2018 feature six tasks:

| Test set | Size (sentences) | Genre |
| --- | --- | --- |
| CWMT Chinese-English news | 1,000 | News |
| CWMT English-Chinese news | 1,000 | News |
| Mongolian-Chinese | 1,001 | Daily expressions |
| Tibetan-Chinese | 729 | Government documents |
| Uyghur-Chinese | 1,000 | News |
| Japanese-Chinese | 1,000 | Patents |

In 2019, CWMT became CCMT (China Conference on Machine Translation).

Metrics

BLEU-SBP is the primary metric. Other metrics include BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER, and ICT.

Results

Still being compiled.

Resources

Detailed here.

Other Resources

OPUS is an excellent site for open parallel corpora, with a nice language-pair search function.


Suggestions? Changes? Please send email to [email protected]