Simple algorithm to tokenize Chinese texts into words using CC-CEDICT. You can try it out at the demo page. The code for the demo page can be found in the gh-pages
branch of this repository.
This tokenizer uses a simple greedy algorithm: at each position, it looks for the longest word in the CC-CEDICT dictionary that matches the remaining input, emits it as a token, and repeats until the whole text is consumed.
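A minimal sketch of this greedy longest-match strategy (illustrative only; the variable names and the plain object standing in for the CC-CEDICT index are not the library's internals):

// Greedy longest-match tokenization sketch (not the library's actual code).
// `dict` maps dictionary headwords to entries, `maxWordLength` is the longest headword length.
function greedyTokenize(text, dict, maxWordLength) {
  const tokens = []
  let i = 0

  while (i < text.length) {
    let match = null

    // Try the longest possible slice first, then shrink until a dictionary word matches.
    for (let len = Math.min(maxWordLength, text.length - i); len > 0; len--) {
      const slice = text.slice(i, i + len)
      if (dict[slice]) {
        match = slice
        break
      }
    }

    // Characters with no dictionary entry (e.g. punctuation) become single-character tokens.
    const word = match || text[i]
    tokens.push(word)
    i += word.length
  }

  return tokens
}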
Use npm to install:
npm install chinese-tokenizer --save
Make sure to provide the CC-CEDICT data.
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')
console.log(JSON.stringify(tokenize('我是中国人。'), null, ' '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, ' '))
Output:
[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]
loadFile(path) reads the CC-CEDICT file from the given path and returns a tokenize function based on the dictionary.
There is also a variant that parses CC-CEDICT data passed directly as a string (the content argument) and likewise returns a tokenize function based on the dictionary.
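If the dictionary text is already in memory, something like the following works. Note that the name of the string-parsing export (load below) is an assumption for illustration; check the package's exports for the actual name.

const fs = require('fs')
const chineseTokenizer = require('chinese-tokenizer')

// Read the CC-CEDICT data yourself, then build a tokenizer from the string.
// `load` is an assumed export name; verify it against the package before using.
const content = fs.readFileSync('./cedict_ts.u8', 'utf8')
const tokenize = chineseTokenizer.load(content)

console.log(tokenize('我是中国人。').map(token => token.text))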
The returned tokenize function takes a text string and returns an array of tokens of the following form:
{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}
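For example, these fields can be used to print a quick gloss of a sentence. This small sketch only uses fields shown in the output above; tokens with no dictionary entry, such as punctuation, simply have an empty matches array.

const tokens = tokenize('我是中國人。')

// Print each token with the pinyin and English of its first dictionary match, if any.
for (const token of tokens) {
  const [firstMatch] = token.matches

  if (firstMatch) {
    console.log(`${token.text}\t${firstMatch.pinyinPretty}\t${firstMatch.english}`)
  } else {
    console.log(token.text)
  }
}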