Chinese Word Embeddings

Background

A word embedding model ingests a large corpus of text and outputs, for each word type, an n-dimensional vector of real numbers. This vector captures syntactic and semantic information about the word and can be used to solve various NLP tasks. In Chinese, the unit of encoding may be a character or a sub-character unit rather than a word.

Example

Input:

Large corpus of text

Output:

“查询”, vec(W) = [-0.059569, 0.126913, 0.273161, 0.225467, -0.185914, 0.018743, -0.18434, 0.083859, -0.115781, -0.216993, 0.063437, -0.005511, 0.276968,…, 0.254486]
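To make the example concrete, pretrained Chinese vectors (e.g., the fastText vectors of Grave et al., 2018 listed under Other Resources) can be loaded and queried with gensim. The sketch below is only illustrative: the file name cc.zh.300.vec refers to the standard fastText release, and the vocabulary limit is just to keep the demo light.

```python
# Minimal sketch: load pretrained Chinese word vectors in word2vec text format
# (e.g., fastText's cc.zh.300.vec) and look up the vector for a word.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.zh.300.vec", limit=50000)

print(vectors["查询"][:5])                    # first 5 dimensions of vec(查询)
print(vectors.most_similar("查询", topn=3))   # nearest neighbors by cosine similarity
```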

Standard Metrics

Word vectors can be evaluated intrinsically (e.g., whether similar words get similar vectors) or extrinsically (e.g., to what extent word vectors can improve a sentiment analyzer).

Intrinsic evaluation looks at:

  • Word relatedness: Spearman correlation (ρ) between human-labeled scores and scores generated by the embeddings on the Chinese word similarity datasets wordsim-240 and wordsim-296 (translations of English resources). A minimal scoring sketch is given after this list.
  • Word analogy: accuracy on the word analogy task (e.g., “男人 (man) : 女人 (woman) :: 父亲 (father) : X”, where X is chosen by cosine similarity). The task covers three types of analogies: (1) capitals of countries, (2) states/provinces of cities, (3) family words.
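Word relatedness can be scored by correlating embedding-based cosine similarities with the human judgments. A minimal sketch, assuming a gensim KeyedVectors object (as in the loading sketch above) and a tab-separated test file whose name and format are assumptions:

```python
# Minimal sketch: Spearman correlation on a word-similarity list.
# Assumes each line of the test file is "word1<TAB>word2<TAB>human_score".
from scipy.stats import spearmanr

def evaluate_similarity(vectors, path):
    human, model = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            if w1 in vectors and w2 in vectors:            # skip out-of-vocabulary pairs
                human.append(float(score))
                model.append(vectors.similarity(w1, w2))   # cosine similarity
    rho, _ = spearmanr(human, model)
    return rho

# print(evaluate_similarity(vectors, "wordsim-240.txt"))   # hypothetical file name
```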

Extrinsic evaluation:

  • Accuracy on Chinese sentiment analysis task
  • F1 score on Chinese named entity recognition task
  • Accuracy on part-of-speech tagging task

See, e.g., Torregrossa et al. (2020) for a more detailed comparison of metrics.

Chinese word similarity lists.

| Test set | # word pairs with human similarity judgments |
| --- | --- |
| wordsim-240 | 240 |
| wordsim-296 | 297 |

Metrics

Spearman correlation (ρ) between embedding-based similarity scores and human judgments.

Results

  • The SoTA system, VCWE (published at NAACL 2019), combines intra-character compositionality (computed via a convolutional neural network) and inter-character compositionality (computed via a recurrent neural network with self-attention) to compute the word embeddings.

| System | wordsim-240 (ρ) | wordsim-296 (ρ) |
| --- | --- | --- |
| Sun et al. (2019) (VCWE) | 57.81 | 61.29 |
| Yu et al. (2017) (JWE) | 51.92 | 59.84 |

Chinese word analogy lists.

Given “France : Paris :: China : ?”, a system should come up with the answer “Beijing”.
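Analogy questions are usually answered with the 3CosAdd rule: return the vocabulary word whose vector is most similar to vec(b) − vec(a) + vec(c). A minimal sketch, again assuming a gensim KeyedVectors object as in the earlier sketches:

```python
# Minimal sketch of the 3CosAdd analogy rule.
def solve_analogy(vectors, a, b, c):
    """Answer "a : b :: c : X" by maximizing cosine similarity to b - a + c."""
    candidates = vectors.most_similar(positive=[b, c], negative=[a], topn=5)
    # Exclude the query words themselves before returning the top candidate.
    answers = [w for w, _ in candidates if w not in {a, b, c}]
    return answers[0] if answers else None

# Expected to return "北京" (Beijing) with reasonable vectors:
# print(solve_analogy(vectors, "法国", "巴黎", "中国"))
```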

| Test set | # analogies |
| --- | --- |
| Capitals of countries | 687 |
| States/provinces of cities | 175 |
| Family relationships | 240 |

Metrics

Accuracy: the fraction of analogy questions for which the top-ranked candidate matches the expected answer.

Results

| System | Accuracy (capital) | Accuracy (state) | Accuracy (family) | Accuracy (total) |
| --- | --- | --- | --- | --- |
| Yu et al. (2017) (JWE) | 0.91 | 0.93 | 0.62 | 0.85 |
| Yin et al. (2016) (MGE) | 0.89 | 0.88 | 0.39 | 0.76 |
| CBOW (baseline) | 0.84 | 0.88 | 0.60 | 0.79 |

Chinese sentiment analysis.

  • This test measures how much the sentiment analysis task benefits from different word vectors (a minimal classifier sketch follows the table below).
  • There is no agreed-upon baseline (e.g., sentiment classifier code), so it is difficult to compare across papers.
  • Sentiment dataset available at http://sentic.net/chinese-review-datasets.zip (Peng et al. (2018))
    • Consists of Chinese reviews in four domains: notebook, car, camera, and phone
    • Binary classification task: reviews are either positive or negative
    • Does not have a train/dev/test split.
| Test set | # positive reviews | # negative reviews |
| --- | --- | --- |
| Notebook | 417 | 206 |
| Car | 886 | 286 |
| Camera | 1,558 | 673 |
| Phone | 1,713 | 843 |
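A common evaluation protocol (not prescribed by the dataset) is to represent each review by the average of its word vectors and train a simple classifier. A minimal sketch, assuming pre-tokenized reviews with 0/1 labels and a gensim KeyedVectors object; the 80/20 split and the logistic-regression classifier are assumptions, since the dataset ships without an official split or reference classifier:

```python
# Minimal sketch: evaluate word vectors extrinsically on binary sentiment classification.
# `reviews` is a list of token lists, `labels` a list of 0/1 sentiment labels, and
# `vectors` a gensim KeyedVectors object. The 80/20 split is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def average_vector(tokens, vectors):
    """Represent a review by the mean of its in-vocabulary word vectors."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def evaluate_sentiment(reviews, labels, vectors):
    X = np.stack([average_vector(toks, vectors) for toks in reviews])
    X_train, X_test, y_train, y_test = train_test_split(
        X, np.array(labels), test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```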

Results

| System | Accuracy (notebook) | Accuracy (car) | Accuracy (camera) | Accuracy (phone) | Accuracy (overall) |
| --- | --- | --- | --- | --- | --- |
| Sun et al. (2019) (VCWE) | 80.95 | 85.59 | 83.93 | 84.38 | 88.92 |
| Yu et al. (2017) (JWE) | 77.78 | 78.81 | 81.70 | 81.64 | 85.13 |
| Baseline (skip-gram) | 69.84 | 77.12 | 80.80 | 81.25 | 86.65 |

Chinese name tagging.

  • This test measures how much the name tagging task benefits from different word vectors (a minimal F1 computation sketch follows the table below).
  • There is no agreed-upon baseline (e.g., name tagging code), so it is difficult to compare across papers.
  • Entity taggers are evaluated on three types of entities: Person (PER), Location (LOC), and Organization (ORG) (Levow, 2006).
| Test set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 100,000 | Newswire, Broadcast News, Weblog |
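Name tagging is scored with entity-level F1. A minimal sketch of the computation, assuming gold and predicted entities are given as per-sentence sets of (start, end, type) spans (the span format is an assumption, not part of the dataset release):

```python
# Minimal sketch of entity-level precision/recall/F1.
def entity_f1(gold_spans, pred_spans):
    """gold_spans, pred_spans: lists of per-sentence sets of (start, end, type) tuples."""
    tp = sum(len(g & p) for g, p in zip(gold_spans, pred_spans))   # exact-span matches
    n_gold = sum(len(g) for g in gold_spans)
    n_pred = sum(len(p) for p in pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: one sentence, two gold entities, one found correctly -> F1 ≈ 0.667
# print(entity_f1([{(0, 2, "PER"), (5, 7, "LOC")}], [{(0, 2, "PER")}]))
```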

Results

| System | F1 score |
| --- | --- |
| Sun et al. (2019) (VCWE) | 85.77 |
| Yu et al. (2017) (JWE) | 85.30 |

Resources

| Train set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 1.3M | Newswire, Broadcast News, Weblog |

Other Resources

Various word embeddings

| Name | Additional features | Training corpus size | Source |
| --- | --- | --- | --- |
| FastText | - | 374M characters | Grave et al., 2018 |
| Mimick | Interpolates between similar characters to improve rare words; multilingual | | Pinter et al., 2017 |
| Glyph2vec | Uses character bitmaps and Cangjie codes to address the OOV problem | 10M chars | Chen et al., 2020 |

Text corpora

| Corpus | Size (words) | Size (vocabulary) | Genre |
| --- | --- | --- | --- |
| Wikipedia dump | 153,278,000 | 66,856 | General |
| People’s Daily | 31,000,000 | 105,000 | News |

Suggestions? Changes? Please send email to [email protected]