A word embedding model ingests a large corpus of text and outputs, for each word type, an n-dimensional vector of real numbers. This vector captures syntactic and semantic information about the word that can be used to solve various NLP tasks. In Chinese, the unit of encoding may be a character or a sub-character unit rather than a word.
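As a concrete illustration, pretrained vectors behave like a lookup table from word types to fixed-length arrays. Below is a minimal sketch using gensim, assuming a word2vec-format vector file (the filename is a placeholder, not a specific release):

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (path is a placeholder; any word2vec-format
# file of Chinese word or character vectors works the same way).
vectors = KeyedVectors.load_word2vec_format("zh_vectors.txt", binary=False)

# Each word type maps to one n-dimensional real-valued vector.
vec = vectors["男人"]  # shape: (n,)

# Nearby vectors should correspond to syntactically/semantically similar words.
print(vectors.most_similar("男人", topn=5))
```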
Word vectors can be evaluated intrinsically (e.g., whether similar words get similar vectors) or extrinsically (e.g., to what extent word vectors can improve a sentiment analyzer).
Intrinsic evaluation looks at:

- Word relatedness: Spearman correlation (ρ) between human-labeled scores and scores generated by the embeddings on the Chinese word similarity datasets wordsim-240 and wordsim-296 (translations of English resources).
- Word analogy: accuracy on the word analogy task (e.g., “男人 (man) : 女人 (woman) :: 父亲 (father) : X”, where X is chosen by cosine similarity). The task covers three categories: (1) capitals of countries, (2) states/provinces of cities, and (3) family words. A sketch of both evaluations follows this list.
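Both metrics are simple to compute once the vectors are in memory. The sketch below assumes `vectors` is a plain dict mapping each word to a NumPy array, that similarity pairs arrive as `(word1, word2, human_score)` tuples, and that analogy questions arrive as `(a, b, c, d)` tuples; the official datasets' file formats are not reproduced here. Analogies are answered with the standard 3CosAdd rule (nearest neighbor of b − a + c):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_relatedness(vectors, pairs):
    """Spearman rho between human scores and embedding cosine similarities."""
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    rho, _ = spearmanr(human, model)
    return rho

def analogy_accuracy(vectors, questions):
    """Accuracy on a:b :: c:? questions, answered by 3CosAdd (b - a + c)."""
    words = list(vectors)
    index = {w: i for i, w in enumerate(words)}
    matrix = np.stack([vectors[w] for w in words])
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in index for w in (a, b, c, d)):
            continue  # standard convention: score only fully in-vocabulary questions
        target = vectors[b] - vectors[a] + vectors[c]
        scores = matrix @ (target / np.linalg.norm(target))
        for w in (a, b, c):  # exclude the question words from the candidates
            scores[index[w]] = -np.inf
        total += 1
        correct += words[int(np.argmax(scores))] == d
    return correct / total
```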
The SoTA system (VCWE), published at NAACL 2019, combines intra-character compositionality (computed via a convolutional neural network) and inter-character compositionality (computed via a recurrent neural network with self-attention) to compute the word embeddings.
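The following PyTorch sketch only illustrates the two-level composition idea; it is not the authors' released implementation, all layer sizes and module names are arbitrary assumptions, and the attention here is a simplified additive pooling rather than the paper's exact self-attention mechanism. Each character's glyph bitmap is composed into a character vector by a small CNN, and the sequence of character vectors is composed into a word vector by a bidirectional LSTM with attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VCWESketch(nn.Module):
    """Illustrative two-level composition (not the released VCWE model)."""

    def __init__(self, img_size=32, char_dim=64, word_dim=100):
        super().__init__()
        # Intra-character compositionality: CNN over glyph bitmaps.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (img_size // 4) ** 2, char_dim),
        )
        # Inter-character compositionality: BiLSTM + additive attention pooling.
        self.lstm = nn.LSTM(char_dim, word_dim // 2, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(word_dim, 1)

    def forward(self, glyphs):
        # glyphs: (batch, num_chars, img_size, img_size) grayscale bitmaps
        batch, num_chars, h, w = glyphs.shape
        chars = self.cnn(glyphs.view(batch * num_chars, 1, h, w))
        chars = chars.view(batch, num_chars, -1)       # (batch, chars, char_dim)
        hidden, _ = self.lstm(chars)                   # (batch, chars, word_dim)
        weights = F.softmax(self.attn(hidden), dim=1)  # attention over characters
        return (weights * hidden).sum(dim=1)           # (batch, word_dim)

# A two-character word as a batch of random 32x32 glyph images:
word_vec = VCWESketch()(torch.rand(1, 2, 32, 32))
print(word_vec.shape)  # torch.Size([1, 100])
```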