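This repository provides a cutting-edge, detailed, and complete guide to building Chinese natural language understanding (NLU) systems.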
TODO
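- Provide Chinese corpora
- Provide corpus conversion tools to help users migrate their corpus data
- Provide multiple Chinese language processing pipelines based on RASA NLU
- Provide model evaluation tools to help select and tune models automatically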
Python 3 (Python 2 may also work, but it has not been well tested)
See workflow.md for details.
- jieba provides Chinese word segmentation
- MITIE handles intent classification and slot filling
pip install git+https://github.com/mit-nlp/MITIE.git
pip install jieba
MITIE requires a model file. Download total_word_feature_extractor.dat.tar.gz from the releases of my other project, MITIE_Chinese_Wikipedia_corpus, extract it, and place total_word_feature_extractor.dat in the data directory.
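The extraction step can also be scripted. The following is a minimal sketch, assuming the archive has already been downloaded locally and contains a member whose base name is total_word_feature_extractor.dat. The function name `extract_feature_extractor` and its parameters are hypothetical and not part of this repository.

```python
import os
import shutil
import tarfile

MODEL_NAME = "total_word_feature_extractor.dat"


def extract_feature_extractor(archive_path, dest_dir="data"):
    """Copy the MITIE model file out of a .tar.gz archive into dest_dir.

    Only the member whose base name matches MODEL_NAME is read, so no other
    archive content is written to disk. Returns the path of the extracted file.
    """
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, MODEL_NAME)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and os.path.basename(member.name) == MODEL_NAME:
                with archive.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError(MODEL_NAME + " not found in " + archive_path)
```

Calling `extract_feature_extractor("total_word_feature_extractor.dat.tar.gz")` would then leave the model at data/total_word_feature_extractor.dat, the path the pipeline configuration expects.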
language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
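To make the nesting explicit (the `model` option belongs to the `nlp_mitie` component), the same configuration can be written as a Python data structure and rendered as YAML text. This is a minimal sketch using only the standard library; `render_config` is a hypothetical helper, not part of RASA NLU or this repository.

```python
PIPELINE = [
    {"name": "nlp_mitie", "model": "data/total_word_feature_extractor.dat"},
    {"name": "tokenizer_jieba"},
    {"name": "ner_mitie"},
    {"name": "ner_synonyms"},
    {"name": "intent_featurizer_mitie"},
    {"name": "intent_classifier_sklearn"},
]


def render_config(language, pipeline):
    """Render a language code and a list of component dicts as YAML text."""
    lines = ['language: "{}"'.format(language), "pipeline:"]
    for component in pipeline:
        # The component name opens the list item; remaining keys are its options.
        lines.append('- name: "{}"'.format(component["name"]))
        for key, value in component.items():
            if key != "name":
                lines.append('  {}: "{}"'.format(key, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_config("zh", PIPELINE), end="")
```

Keeping the pipeline as a list preserves component order, which matters because later components consume what earlier ones produce.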
trainer/MITIE+jieba.bash
cross_validation/MITIE+jieba.bash
- jieba provides Chinese word segmentation
- tensorflow_embedding handles intent classification
- MITIE handles slot filling
pip install git+https://github.com/mit-nlp/MITIE.git
pip install jieba
pip install tensorflow
MITIE requires a model file. Download total_word_feature_extractor.dat.tar.gz from the releases of my other project, MITIE_Chinese_Wikipedia_corpus, extract it, and place total_word_feature_extractor.dat in the data directory.
language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor.dat"
- name: "tokenizer_jieba"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
- name: "ner_mitie"
- name: "ner_synonyms"
trainer/tensorflow_embedding.bash
cross_validation/tensorflow_embedding.bash
- Chinese_models_for_SpaCy handles intent classification and slot filling
pip install https://github.com/howl-anderson/Chinese_models_for_SpaCy/releases/download/v2.0.3/zh_core_web_sm-2.0.3.tar.gz
./spacy_model_link.bash
language: "zh"
pipeline:
- name: "nlp_spacy"
  model: "zh"
- name: "tokenizer_spacy"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_spacy"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_classifier_sklearn"
trainer/spacy.bash
cross_validation/spacy.bash
No | Intent train ACC | Intent train F1 | Intent train PRC | Intent test ACC | Intent test F1 | Intent test PRC | Entity train ACC | Entity train F1 | Entity train PRC | Entity test ACC | Entity test F1 | Entity test PRC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.986 | 0.986 | 0.986 | 0.665 | 0.631 | 0.648 | 0.987 | 0.987 | 0.988 | 0.967 | 0.968 | 0.973 |
2 | 0.990 | 0.990 | 0.990 | 0.434 | 0.406 | 0.432 | 0.987 | 0.987 | 0.988 | 0.968 | 0.970 | 0.975 |
3 | 0.992 | 0.992 | 0.992 | 0.657 | 0.598 | 0.587 | 0.987 | 0.987 | 0.988 | 0.939 | 0.934 | 0.947 |

ACC: Accuracy; F1: F1-score; PRC: Precision
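
For reference, the three metrics can be computed from gold and predicted labels as follows. This is a minimal sketch using macro averaging over classes, which is an assumption: the evaluation scripts in this repository may average differently (for example, weighting classes by support), so the results need not match the table. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(gold, predicted):
    """Return (accuracy, macro F1, macro precision) for two label sequences."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    precisions, f1_scores = [], []
    for label in sorted(set(gold) | set(predicted)):
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    n = len(precisions)
    return accuracy, sum(f1_scores) / n, sum(precisions) / n
```
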
No | Pipeline | Configuration |
---|---|---|
1 | MITIE+jieba | Uses total_word_feature_extractor.dat provided by the MITIE_Chinese_Wikipedia_corpus project |
2 | tensorflow_embedding | Uses total_word_feature_extractor.dat provided by the MITIE_Chinese_Wikipedia_corpus project |
3 | spacy | Uses the Chinese SpaCy model provided by the Chinese_models_for_SpaCy project |
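
The test-set scores above can be used to pick a pipeline. The sketch below copies the test F1 values from the results table and ranks pipelines by the unweighted mean of intent and entity F1. This ranking criterion and the helper name `rank_pipelines` are assumptions for illustration, not the repository's selection tool.

```python
# Test-set F1 scores copied from the results table: (intent F1, entity F1).
TEST_F1 = {
    "MITIE+jieba": (0.631, 0.968),
    "tensorflow_embedding": (0.406, 0.970),
    "spacy": (0.598, 0.934),
}


def rank_pipelines(scores):
    """Sort pipeline names by mean of intent and entity F1, best first."""
    return sorted(
        scores,
        key=lambda name: sum(scores[name]) / len(scores[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_pipelines(TEST_F1):
        intent_f1, entity_f1 = TEST_F1[name]
        print("{:<22} intent F1={:.3f} entity F1={:.3f}".format(name, intent_f1, entity_f1))
```

Under this criterion MITIE+jieba ranks first, mainly because of its higher intent F1 on the test set; the entity F1 scores of pipelines 1 and 2 are nearly identical.
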
Please read CONTRIBUTING.md, then submit pull requests to us.
We use SemVer for versioning. See the tags for all available versions.
- Xiaoquan Kong - Initial work - howl-anderson
For more information about contributors, see contributors.
MIT License - see LICENSE.md for details
- TODO