
use HanLP or Jieba create word_cloud_cn #430

Closed
wants to merge 14 commits into from

Conversation

TianFengshou
Contributor

pyhanlp is one of the most powerful Chinese natural language processing libraries available today, and it is extremely easy to use: you can run 'pip install pyhanlp' to install it, just like jieba.

Its named entity recognition and word segmentation are better than jieba's, and it offers more ways to tune them, so you'll save a lot of time when you use it.

And thanks to its excellent performance, we don't have to use user-defined dictionaries when handling large amounts of Chinese text.

TianFengshou and others added 2 commits September 10, 2018 22:04
@jcfr
Collaborator

jcfr commented Sep 10, 2018

Thanks for the contribution.

Would it be possible to squash the commits together and fix the styling issues?

To run the style check locally, you could run flake8 from the source directory.

Wordcloud is a very good tool, but wordcloud alone is not enough to create a Chinese word cloud. The file shows how to use wordcloud with Chinese text. First, you need a Chinese word segmentation library such as jieba or HanLP; you can run 'pip install jieba' or 'pip install pyhanlp' to install one. As you can see, using wordcloud together with jieba or HanLP is very convenient. Jieba is lighter, while HanLP requires a larger download but is more powerful: HanLP's named entity recognition and word segmentation are better than jieba's, and it offers more ways to tune them. You'll save a lot of time when you use it.
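The pattern the file describes boils down to: segment the Chinese text into tokens, then join them with spaces so wordcloud can split them again. A minimal sketch of that flow, where the toy greedy dictionary segmenter is a hypothetical stand-in for jieba.cut or HanLP's segmenter (a real run would install jieba or pyhanlp and pass the joined string to WordCloud):

```python
# Sketch of the segment-then-join pattern used by the example.
# `segment` is a toy greedy longest-match segmenter standing in for
# jieba.cut / HanLP.segment; real code would use one of those libraries.

def segment(text, vocab, max_len=4):
    """Greedy longest-match segmentation over a small dictionary."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary match first, fall back to one char.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"自然语言", "处理", "有趣"}
tokens = segment("自然语言处理很有趣", vocab)
text_for_wordcloud = " ".join(tokens)
# WordCloud splits input on whitespace, so in the real example this
# joined string is what gets passed to WordCloud().generate(...).
```

The space-joining step is the whole trick: wordcloud's default tokenizer assumes whitespace-separated words, which Chinese text does not have until a segmenter supplies the boundaries.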
@TianFengshou TianFengshou changed the title use HanLP replace Jieba use HanLP or Jieba create word_cloud_cn Sep 19, 2018
@TianFengshou
Contributor Author

What do I need to do now?

max_font_size=100, random_state=42, width=1000, height=860, margin=2,)
# The function for processing text with HanLP
def pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=True):
    CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
Owner

JClass is not defined?

Contributor Author

If you use HanLP, you must also install jpype1. I forgot about it:

'pip install jpype1'

Contributor Author

Pyhanlp provides a Python interface for HanLP. HanLP is currently the best-performing open-source Chinese natural language processing library, but it is implemented in Java, so we must use jpype to call its Java classes.

Owner

Ok but you must also import it, right?

Owner

You need to make sure the continuous integration passes.

Contributor Author

@TianFengshou TianFengshou Sep 25, 2018

I don't need to import it; pyhanlp imports it from jpype in its __init__.py:

from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM

Contributor Author

If you use pyhanlp, line 114 has a line like 'from pyhanlp import *'.
If you use jieba, line 106 has a line like 'import jieba'. Importing only the chosen backend improves efficiency, and the code works.
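A deferred import like the one described keeps the unused backend from loading at all. A minimal sketch of that pattern (the function name and the whitespace fallback are hypothetical, not from the pull request):

```python
def segment_text(text, backend="jieba"):
    """Segment `text`, importing the chosen backend only when needed."""
    if backend == "jieba":
        import jieba  # only loaded if this backend is actually chosen
        return list(jieba.cut(text))
    if backend == "pyhanlp":
        from pyhanlp import HanLP  # requires jpype1 and a JVM
        return [str(term.word) for term in HanLP.segment(text)]
    # Fallback: treat the input as already space-separated.
    return text.split()
```

Since the import runs inside the function body, choosing one backend never pays the startup cost of the other.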

Owner

If the tests don't pass (see the 8 failing checks below) there's an issue with the code.

@amueller
Owner

The tests still don't pass, meaning there are issues with your implementation. Are you intending to fix those?

@TianFengshou
Contributor Author

I tried many times, but it always fails in the class library called by pyhanlp. I don't quite understand why this is happening.

@amueller
Owner

Oh it's actually just flake8 failing now, it looks like.

@jcfr
Collaborator

jcfr commented Dec 13, 2018

Indeed, errors are the following:

./examples/wordcloud_cn.py:81: [F405] 'HanLP' may be undefined, or defined from star imports: pyhanlp
    HanLP.Config.ShowTermNature = False
    ^
./examples/wordcloud_cn.py:82: [F405] 'HanLP' may be undefined, or defined from star imports: pyhanlp
    CRFnewSegment = HanLP.newSegment("viterbi")
                    ^
./examples/wordcloud_cn.py:85: [E712] comparison to True should be 'if cond is True:' or 'if cond:'
    if isUseStopwordsByHanLP == True:
                             ^
./examples/wordcloud_cn.py:114: [F403] 'from pyhanlp import *' used; unable to detect undefined names
    from pyhanlp import *
    ^
./examples/wordcloud_cn.py:115: [F401] 'jpype.startJVM' imported but unused
    from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM
    ^
./examples/wordcloud_cn.py:115: [F401] 'jpype.getDefaultJVMPath' imported but unused
    from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM
    ^
./examples/wordcloud_cn.py:115: [F401] 'jpype.isThreadAttachedToJVM' imported but unused
    from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM
    ^
./examples/wordcloud_cn.py:115: [F401] 'jpype.attachThreadToJVM' imported but unused
    from jpype import JClass, startJVM, getDefaultJVMPath, isThreadAttachedToJVM, attachThreadToJVM
    ^

To run flake8 locally:

  1. Activate your environment
  2. Go to source directory
  3. Make sure development requirements are installed
  4. Execute flake8
workon wordcloud # Or similar command to activate your python environment
cd /path/to/src/word_cloud
pip install -r requirements-dev.txt
flake8

Simplified part of the code
Removed some code that references the class library
Modified some code that references the class library
@TianFengshou
Contributor Author

I downloaded flake8 and made some changes to the code. It may be OK now.

@TianFengshou
Contributor Author

As time has passed, Chinese natural language processing technology has matured considerably. But jieba is still the fastest native Python library; other libraries need to call external code or use deep learning models, so I personally think that, for an example, jieba may be the best choice. The question of optimal performance should be left to the user. So it's almost time to close this pull request.

Delete conflicting code to merge the source code
@amueller
Owner

So are you saying you'd leave the example as it was in #329 and close this pull request?

@TianFengshou
Contributor Author

I'm sorry for closing this merge request years later. I didn't merge it at the time due to concerns: it was just a demonstration, not an engineering application. HanLP is still the best statistics-based Chinese model to this day, and in some fields it even has certain advantages over deep learning models. However, for a demonstration, its model is still too large.

@TianFengshou TianFengshou deleted the master branch October 30, 2024 07:02