Open Sentiment Training Data

Training data, dictionaries, and stopwords that we collected ourselves, bundled into a package so that nobody has to reinvent the wheel.

Usage

Install: pip install udicOpenData

Load the lab dictionary (import dictionary):

from udicOpenData.dictionary import *

Remove stopwords. Note: removing stopwords also loads the lab dictionary from the previous step automatically.

from udicOpenData.stopwords import *

# default
rmsw(input_string, flag=False)

# return the segmentation together with part-of-speech tags
rmsw(input_string, flag=True)

demo:

zh:

>>> doc = '首先，對區塊鏈需要的第一個理解是，它是一種「將資料寫錄的技術」。'
>>> list(rmsw(doc, flag=True))
[('區塊鏈', 'n'), ('需要', 'n'), ('第一個', 'm'), ('理解', 'n'), ('一種', 'm'), ('資料', 'n'), ('寫錄', 'v'), ('技術', 'n')]

en:

>>> doc = 'The City of New York, often called New York City (NYC) or simply New York, is the most populous city in the United States.'
>>> list(rmsw_en(doc))
['City', 'New York', 'called', 'New York City', 'NYC', 'simply', 'New York', 'populous', 'city', 'United States']

>>> list(rmsw_en(doc, flag=True))
[('City', 'NNP'), ('New York', 'NNP/NNP'), ('called', 'VBN'), ('New York City', 'NNP/NNP/NNP'), ('NYC', 'NN'), ('simply', 'RB'), ('New York', 'NNP/NNP'), ('populous', 'JJ'), ('city', 'NN'), ('United States', 'NNP/NNS')]
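To make the effect of `flag` concrete, here is a minimal stand-in written in plain Python. It is not the package's implementation: `filter_stopwords`, the tiny `STOPWORDS` set, and the pre-tagged input are all hypothetical, and real segmentation and tagging are skipped entirely. It only illustrates the idea of dropping stopwords and optionally keeping the part-of-speech tag.

```python
from typing import Iterable, Iterator, Set, Tuple, Union

# Hypothetical stopword set, for illustration only (not the package's list).
STOPWORDS: Set[str] = {"the", "of", "is", "in", "or"}


def filter_stopwords(
    tagged_tokens: Iterable[Tuple[str, str]],
    stopwords: Set[str],
    flag: bool = False,
) -> Iterator[Union[str, Tuple[str, str]]]:
    """Yield tokens that are not stopwords.

    With flag=False only the word is yielded; with flag=True the
    (word, part-of-speech) pair is yielded.
    """
    for word, pos in tagged_tokens:
        if word.lower() in stopwords:
            continue  # drop the stopword
        yield (word, pos) if flag else word


# Input that is assumed to be segmented and tagged already.
tagged = [("The", "DT"), ("City", "NNP"), ("of", "IN"), ("New York", "NNP/NNP")]

words_only = list(filter_stopwords(tagged, STOPWORDS))
with_pos = list(filter_stopwords(tagged, STOPWORDS, flag=True))
```

Like the demo calls above, the sketch returns a generator, so the result is wrapped in `list(...)` before use.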

For elasticsearch

Running dump2es.py generates two files with different filename extensions.

Move both files into the elasticsearch plugin folder.

ik: dump2es.py ik

stopword: ext_stopword.dic
dictionary: mydict.dic

巨蛋
遠雄
趙藤雄
蔡英文
陳水扁
立法院
蔡正元
頂新
食安
柯p
...
...
...

jieba: dump2es.py jieba

stopword: ext_stopword.txt
dictionary: mydict.dict

巨蛋 99
遠雄 99
趙藤雄 99
蔡英文 99
陳水扁 99
立法院 99
蔡正元 99
頂新 99
食安 99
柯p 99
...
...
...
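The two dictionary layouts shown above differ only in that the jieba file carries a number after each word (99 in the listing), while the ik file has one bare word per line. A rough sketch of writing both layouts from the same word list follows. It is not the code of dump2es.py: the helper names `write_ik_dict` and `write_jieba_dict` are hypothetical, and reading the number as a word frequency is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_ik_dict(words: Iterable[str], path: Path) -> None:
    """Write one word per line (the ik layout shown above)."""
    path.write_text("".join(f"{w}\n" for w in words), encoding="utf-8")


def write_jieba_dict(words: Iterable[str], path: Path, freq: int = 99) -> None:
    """Write 'word number' per line (the jieba layout shown above).

    The default of 99 mirrors the listing; treating it as a frequency
    is an assumption.
    """
    path.write_text("".join(f"{w} {freq}\n" for w in words), encoding="utf-8")


words = ["巨蛋", "遠雄", "柯p"]
out_dir = Path(tempfile.mkdtemp())  # scratch folder for this example

write_ik_dict(words, out_dir / "mydict.dic")
write_jieba_dict(words, out_dir / "mydict.dict")
```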

stopword:

上 
上來 
上去 
將不 
為
www 
http 
https 
.com 
– 
● 
○ 
～ 
...
...
...

Total corpus size:

Positive sentiment: about 309,163 entries, 44M
Negative sentiment: about 320,456 entries, 15M

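Trained models:

Politics-board model:

Composition: articles whose title contains [公告] (announcement) are filtered out of all the data below.
- pos.txt, the positive-sentiment model: shuffled content from the following boards:
  - adulation board: title + body
  - dreams-wish board: title + body
  - happy board: title + body
  - kindness board: title + the text of the "good deed" section of the body
  - lucky board: title
- neg.txt, the negative-sentiment model, containing:
  - HatePolitics board: title + body (only posts that contain 黑特 and do not contain RE are included)
Size:
- pos.txt: 13,222 entries
- neg.txt: 13,222 entries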
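The preparation described above (drop announcement posts, shuffle, and end up with equally sized positive and negative sets) could look roughly like the sketch below. The post structure (`title` and `text` keys), the function name `build_balanced_sets`, and the choice to balance by truncating to the smaller set are all assumptions; the source only states the filter, the shuffle, and the final counts.

```python
import random
from typing import Dict, List, Tuple

Post = Dict[str, str]  # assumed shape: {"title": ..., "text": ...}


def build_balanced_sets(
    pos_posts: List[Post], neg_posts: List[Post], seed: int = 0
) -> Tuple[List[Post], List[Post]]:
    """Filter out announcements, shuffle, and truncate to equal size."""
    rng = random.Random(seed)

    def keep(post: Post) -> bool:
        # Announcement posts are excluded from both sets.
        return "[公告]" not in post["title"]

    pos = [p for p in pos_posts if keep(p)]
    neg = [p for p in neg_posts if keep(p)]
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Assumption: balance the classes by cutting both to the smaller size.
    size = min(len(pos), len(neg))
    return pos[:size], neg[:size]


pos_posts = [
    {"title": "[公告] 版規", "text": "..."},
    {"title": "好事", "text": "今天很開心"},
    {"title": "感謝", "text": "謝謝大家"},
]
neg_posts = [{"title": "[黑特] 政策", "text": "很失望"}]

pos, neg = build_balanced_sets(pos_posts, neg_posts)
```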

Generating data for `Swinger`

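python text2json.py <positive input file> positive.json True/False

- first argument: name of the positive-sentiment text file, with one sentence per line
- positive.json: name of the output file
- True/False: if True, stopwords are filtered out

The script automatically segments the line-by-line corpus into words and writes the result as input data for Swinger.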
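The conversion step can be pictured with the sketch below. It is not text2json.py itself: the function `lines_to_token_lists` is hypothetical, whitespace splitting stands in for a real word segmenter, and the JSON layout (a list of token lists) is an assumption, since the format Swinger expects is not described here.

```python
import json
from typing import Callable, Collection, Iterable, List


def lines_to_token_lists(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
    stopwords: Collection[str] = frozenset(),
    remove_stopwords: bool = False,
) -> List[List[str]]:
    """Segment each non-empty line and optionally drop stopwords."""
    result: List[List[str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tokens = tokenize(line)
        if remove_stopwords:
            tokens = [t for t in tokens if t not in stopwords]
        result.append(tokens)
    return result


# Placeholder input: sentences that are already separated by spaces.
sentences = ["今天 天氣 很 好", "", "我 很 開心"]

token_lists = lines_to_token_lists(
    sentences, str.split, stopwords={"很"}, remove_stopwords=True
)
json_text = json.dumps(token_lists, ensure_ascii=False)
```

Passing the tokenizer in as a parameter keeps the sketch independent of any particular segmenter; a real one would replace `str.split`.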
