Use HanLP or Jieba to create word_cloud_cn #430

Closed · wants to merge 14 commits

Changes from 5 commits
Binary file modified examples/wc_cn/LuXun.jpg
Binary file removed examples/wc_cn/LuXun_black.jpg
Binary file removed examples/wc_cn/LuXun_black_colored.jpg
Binary file modified examples/wc_cn/LuXun_colored.jpg
87 changes: 69 additions & 18 deletions examples/wordcloud_cn.py
@@ -6,25 +6,27 @@
Wordcloud is a very good tool, but if you want to create a
Chinese wordcloud, wordcloud alone is not enough. This file
shows how to use wordcloud with Chinese. First, you need a
-Chinese word segmentation library: jieba. jieba is currently the
-most elegant and most popular Chinese word segmentation tool in Python;
-you can install it with 'pip install jieba'. As you can see,
-using wordcloud together with jieba is very convenient.
+Chinese word segmentation library: jieba or HanLP. You can
+install them with 'pip install jieba' or 'pip install pyhanlp'.
+pyhanlp provides a Python interface for HanLP, currently the
+best-performing open-source Chinese natural language processing
+library. Because HanLP is implemented in Java, jpype is used to
+call its Java classes, so you must also run 'pip install jpype1'.
+As you can see, using wordcloud together with jieba or pyhanlp
+is very convenient. jieba is lighter, while HanLP requires larger
+downloads but is more powerful: its named entity recognition and
+word segmentation are better than jieba's, and it offers more
+ways to tune them. It will save you a lot of time.
"""

-import jieba
-jieba.enable_parallel(4)
-# Set up 4 parallel processes (not supported on Windows)
-from os import path
+from os import path, getcwd
from scipy.misc import imread
import matplotlib.pyplot as plt
-import os
-# jieba.load_userdict("txt\userdict.txt")
-# add a user dictionary with load_userdict()
from wordcloud import WordCloud, ImageColorGenerator

# get data directory (using getcwd() is needed to support running example in generated IPython notebook)
-d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
+d = path.dirname(__file__) if "__file__" in locals() else getcwd()

stopwords_path = d + '/wc_cn/stopwords_cn_en.txt'
# Chinese fonts must be set
@@ -39,10 +41,16 @@
# Read the whole text.
text = open(path.join(d, d + '/wc_cn/CalltoArms.txt')).read()

-# if you want to use wordCloud, you need it
-# add a user dictionary with add_word()
+# If you use jieba, add a user dictionary with add_word().
+# If you use HanLP, you may not need it.
userdict_list = ['阿Q', '孔乙己', '单四嫂子']

+isUseJieba = True
+
+# HanLP only: apply its stop-word feature to improve results, or disable it for speed.
+isUseStopwordsByHanLP = True


# The function for processing text with Jieba
def jieba_processing_txt(text):
@@ -63,11 +71,54 @@ def jieba_processing_txt(text):
    return ' '.join(mywordlist)


-wc = WordCloud(font_path=font_path, background_color="white", max_words=2000, mask=back_coloring,
-               max_font_size=100, random_state=42, width=1000, height=860, margin=2,)
+# The function for processing text with HanLP
+def pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=True):
+    CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
Owner:

JClass is not defined?

Contributor Author:

If you use HanLP, you must install jpype1. I forgot about it...
"pip install jpype1"...

Contributor Author:

pyhanlp provides a Python interface for HanLP. HanLP is currently the best-performing open-source Chinese natural language processing library, but it is implemented in Java, so we must use jpype to call Java classes.

Owner:

Ok, but you must also import it, right?

Owner:

You need to make sure the continuous integration passes.

Contributor Author (@TianFengshou, Sep 25, 2018):

I don't need to import it. pyhanlp imports jpype in its __init__.py:

from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM

Contributor Author:

If you use pyhanlp, line 114 has "from pyhanlp import *"; if you use jieba, line 106 has "import jieba". This increases efficiency, and the code is feasible.

Owner:

If the tests don't pass (see the 8 failing checks below), there's an issue with the code.
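
For reference, a minimal sketch of the pattern discussed in this thread, assuming pyhanlp and jpype1 are installed (HanLP's model data is downloaded on first use):

# Not part of this PR: importing pyhanlp starts the JVM through jpype,
# so HanLP and JClass are usable without importing jpype yourself.
from pyhanlp import HanLP, JClass

# Basic segmentation; returns a Java list of terms.
print(HanLP.segment("你好,欢迎在Python中调用HanLP的API"))

# Any HanLP Java class is reachable via JClass, as the PR does for
# CustomDictionary.
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
CustomDictionary.add("单四嫂子")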

+    for word in userdict_list:
+        CustomDictionary.add(word)
+
+    mywordlist = []
+    HanLP.Config.ShowTermNature = False
+    CRFnewSegment = HanLP.newSegment("viterbi")
+
+    if isUseStopwordsOfHanLP:
+        CoreStopWordDictionary = JClass("com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary")
+        text_list = CRFnewSegment.seg(text)
+        CoreStopWordDictionary.apply(text_list)
+        finalText = [i.word for i in text_list]
+    else:
+        finalText = [i.word for i in CRFnewSegment.seg(text)]
+    liststr = "/ ".join(finalText)
+
+    with open(stopwords_path, encoding='utf-8') as f_stop:
+        f_stop_text = f_stop.read()
+        f_stop_seg_list = f_stop_text.splitlines()
+
+    for myword in liststr.split('/'):
+        if myword.strip() not in f_stop_seg_list and len(myword.strip()) > 1:
+            mywordlist.append(myword)
+    return ' '.join(mywordlist)


+result_text = ''
+if isUseJieba:
+    import jieba
+
+    jieba.enable_parallel(4)
+    # Set up 4 parallel processes (not supported on Windows)
+    # add a user dictionary with load_userdict()
+    # jieba.load_userdict("txt\userdict.txt")
+    result_text = jieba_processing_txt(text)
+else:
+    from pyhanlp import *
+
+    result_text = pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=isUseStopwordsByHanLP)

+wc = WordCloud(font_path=font_path, background_color="white", max_words=2000, mask=back_coloring,
+               max_font_size=100, random_state=42, width=1000, height=860, margin=2)

-wc.generate(jieba_processing_txt(text))
+wc.generate(result_text)

# create coloring from image
image_colors_default = ImageColorGenerator(back_coloring)
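
Putting the jieba branch of the example together, a minimal self-contained sketch; the text file, font file, and output name below are placeholders, not necessarily the repo's exact paths:

# A minimal sketch of the jieba path only; install with 'pip install jieba wordcloud'.
import jieba
from wordcloud import WordCloud

text = open("wc_cn/CalltoArms.txt", encoding="utf-8").read()
# Space-separate the tokens so WordCloud's tokenizer can split them.
segmented = " ".join(jieba.cut(text))

# A font containing CJK glyphs must be supplied, or Chinese renders as boxes.
wc = WordCloud(font_path="SourceHanSerifK-Light.otf", background_color="white",
               max_words=2000, width=1000, height=860).generate(segmented)
wc.to_file("wordcloud_cn.png")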