Use HanLP or jieba to create word_cloud_cn #430
Conversation
pyhanlp is one of the most powerful Chinese natural language processing libraries available today, and it's extremely easy to use. You can install it with 'pip install pyhanlp', just like jieba. Its named-entity recognition and word segmentation are better than jieba's, and it offers more ways to do them. You'll save a lot of time when you use it. And thanks to its excellent performance, we don't have to use user-defined dictionaries when handling large amounts of Chinese text.
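For illustration, a minimal sketch of pyhanlp segmentation (assuming 'pip install pyhanlp' has been run, which also pulls in jpype1; the sample sentence is made up):

from pyhanlp import HanLP

# HanLP.segment returns a list of Term objects; each carries the token
# text in .word and its part-of-speech / entity tag in .nature
for term in HanLP.segment("北京大学的张教授在研究自然语言处理"):
    print(term.word, term.nature)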
Thanks for the contribution. Would it be possible to squash the commits together and fix the styling issues? To run the style check locally, you could run:
Wordcloud is a very good tool, but if you want to create a Chinese word cloud, wordcloud alone is not enough. This file shows how to use wordcloud with Chinese. First, you need a Chinese word segmentation library, either jieba or HanLP. You can install them with 'pip install jieba' or 'pip install pyhanlp'. As you can see, using wordcloud together with jieba or HanLP is very convenient. jieba is lighter, while HanLP requires larger downloads but is more powerful: its named-entity recognition and word segmentation are better than jieba's, and it offers more ways to do them. You'll save a lot of time when you use it.
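As a rough sketch of the jieba route (the font file name and sample text here are placeholders; any font with CJK glyphs works):

import jieba
from wordcloud import WordCloud

text = "小明硕士毕业于中国科学院计算所，后在日本京都大学深造。"
# jieba.cut yields tokens; wordcloud itself splits on whitespace,
# so we re-join the tokens with spaces before generating
wc = WordCloud(font_path="SourceHanSerifK-Light.otf",
               width=1000, height=860, margin=2)
wc.generate(" ".join(jieba.cut(text)))
wc.to_file("wordcloud_cn.png")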
What do I need to do now?
examples/wordcloud_cn.py
max_font_size=100, random_state=42, width=1000, height=860, margin=2,)

# The function for processing text with HanLP
def pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=True):
    CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
JClass is not defined?
If you use HanLP, you must install jpype1. I forgot about it: "pip install jpype1".
pyhanlp provides a Python interface for HanLP. HanLP is currently the best-performing open-source Chinese natural language processing library, but it is implemented in Java, so we must use JPype to call the Java classes.
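A minimal sketch of that bridge (assuming pyhanlp has already started the JVM in its __init__.py; the custom word is just an example):

from pyhanlp import JClass

# JClass resolves a Java class by its fully qualified name
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
CustomDictionary.add("词云")  # register a custom word before segmenting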
OK, but you must also import it, right?
You need to make sure the continuous integration passes.
I don't need to import it. pyhanlp imports jpype1 in its __init__.py:
from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM
If you use pyhanlp, line 114 has "from pyhanlp import *"; if you use jieba, line 106 has "import jieba". Importing only the library that is actually used keeps things efficient, and the code works.
If the tests don't pass (see the 8 failing checks below), there's an issue with the code.
Now you can use pyhanlp to create a word cloud from Chinese text.
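For example, a sketch along these lines (assuming "pip install pyhanlp" has been run and a CJK-capable font is available; the names are illustrative):

from pyhanlp import HanLP
from wordcloud import WordCloud

text = "我们用 pyhanlp 分词来生成一个中文词云。"
# drop whitespace-only tokens left over from segmentation
words = [t.word for t in HanLP.segment(text) if t.word.strip()]
wc = WordCloud(font_path="SourceHanSerifK-Light.otf",
               max_font_size=100, random_state=42)
wc.generate(" ".join(words))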
The tests still don't pass, meaning there are issues with your implementation. Are you intending to fix those?
I tried it many times, but it always stops at the class library called in pyhanlp. I don't quite understand why this is happening.
Oh, it's actually just flake8 failing now, it looks like.
Indeed, the errors are the following:
To run flake8 locally:
Simplified part of the code. Removed some code that references the class library. Modified some code that references the class library.
I downloaded flake8 and made some changes to the code. It may be OK now.
With the passage of time, Chinese natural language processing technology has become very mature. But jieba is still the fastest native Python library; other libraries need to call external code or use deep learning models. So I personally think that, as an example, jieba may be the best choice. The question of optimal performance should be left to the user to solve. So it's almost time to close this pull request.
Delete conflicting code to merge the source code
So are you saying you'd leave the example as it was in #329 and close this pull request?
I'm sorry for closing this merge request years later. I didn't merge it at the time due to concerns: it was just a demonstration, not an engineering application. HanLP is still the best statistics-based Chinese model to this day, and it even has certain advantages over deep learning models in some fields. However, for a demonstration, its model is still too large.