There is nothing better than better documentation #277
@KEINOS Thanks for the suggestion. Would it be better to put the details in the wiki and link to it from the README? The wiki of this repository is open and you are free to add to it. |
@ikawaha (cc: @CaptainDario)
Thank you!
I agree. I would like to start with the "keywords per page" of the Wiki. For example, start with " |
I have no idea 😇 , so let's start with "wakati". |
@ikawaha (cc: @CaptainDario) I have finally started editing the Wiki. But I think it is premature to link from the README.md, as I am just copying and pasting the issues. Ideally, we would like to translate the official Japanese documentation into English. However, for the time being, it would be realistic to add topics one by one to the Wiki and later create a separate repository for Also, I was looking at the official documentation and thought that enriching |
Thank you so much! It's great! 🙏 Even at this point, we have a few Example tests, and it's a great idea to enrich e.g. Example test for the word filter Lines 167 to 187 in a16f933
|
Yes, indeed. Go Playground is a no-go for However, as long as How about creating an
|
It sounds good 👍. I created |
@KEINOS I am currently playing around with the different dictionaries. While doing this I figured out that, when using unidic processing: Results in Notice: ワタクシ, ニン However, when running with ipadic the result is [名詞, 代名詞, 一般, *, *, *, 私, ワタシ, ワタシ], Notice: ワタシ, ニッポンジン I think the results from using ipadic are clearly better. |
As you point out, the size of a dictionary affects its speed, but size is not proportional to accuracy. In my personal experience, I believe that they can be classified as follows:
This is because each dictionary is created for a different purpose and requires different precision.
tl; dr

In summary, IPADIC is typically used for grammatical analysis and UNIDIC for retrieval analysis. IPADIC is lightweight and accurate in most use cases, and UNIDIC is good at splitting words for search purposes.

IPADIC is recommended when part of speech (PoS) is important, for example when PoS is used as an information vector for analysis, machine learning, etc. NEologd is a kind of IPADIC plus user dictionary: it has been extended by the community to cover new vocabulary missing from IPADIC. However, it is huge.

UNIDIC, on the other hand, is recommended when a sentence needs to be split into smaller units for retrieval, as in search engines. Examples are measuring the distance between the divided units (Levenshtein distance or cosine similarity) or using each unit ID (word ID? token?) as a discrete feature value for machine learning.

Depending on what and how you are analyzing, I would recommend using IPADIC plus a home-made user dictionary.

ts; dr

Disadvantage of UNIDIC

As you may have already experienced, the difference in segmentation accuracy and speed can be uncomfortable. Compared to IPADIC, UNIDIC seems to be less accurate despite its larger amount of information (larger dictionary size).

$ # IPA DICT
$ time echo "私は日本人です。" | kagome -sysdict ipa
私 名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。 記号,句点,*,*,*,*,。,。,。
EOS
real 0m1.021s
user 0m1.114s
sys 0m0.090s
$ # UNI DICT
$ time echo "私は日本人です。" | kagome -sysdict uni
私 代名詞,*,*,*,*,*,ワタクシ,私-代名詞,私,ワタクシ,私,ワタクシ,和,*,*,*,*
は 助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
人 接尾辞,名詞的,一般,*,*,*,ニン,人,人,ニン,人,ニン,漢,*,*,*,*
です 助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
real 0m4.807s
user 0m5.303s
sys 0m0.273s

The problem here is the difference between " UNIDIC is a dictionary based on "short units" (
These "short units" are known to be too short a division for syntactic and semantic analysis in natural language processing. Thus, in most use cases, IPADIC is faster and more convenient. This is why my recommendation is to use IPADIC with a custom user dictionary.

Advantage and use cases of UNIDIC

An advantage of UNIDIC is the "consistency" in word segmentation. The difference between the two dictionaries,
Both are correct and mean the same thing, such as "I drank apple juice". But, sensibly, " And both dictionaries include the word "

$ # IPA DICT
$ echo "りんご" | kagome -sysdict ipa
りんご 名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ
EOS
$ echo "リンゴ" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
EOS
$ # UNI DICT
$ echo "りんご" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
EOS
$ echo "リンゴ" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
EOS

And here comes the problem.

$ # IPA DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict ipa
りん 副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご 接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # UNI DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
ジュース 名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん 動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Note the difference between " IPADIC recognized "

The simplest solution, apart from registering a user dictionary, is to use katakana notation.

$ # IPADICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # UNIDICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
ジュース 名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん 動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

The difference is that IPADIC attempted to interpret them grammatically, while UNIDIC interpreted them in short units.
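To make the segmentation difference concrete, the surface forms can be pulled out of the CLI output. What follows is only a minimal sketch in Go: it assumes each token line has the form `surface<whitespace>features` and that a line containing only `EOS` ends the sentence. `surfaces` is a hypothetical helper, not part of the Kagome API, and the two sample strings are copied from the IPADIC runs above.

```go
package main

import (
	"fmt"
	"strings"
)

// surfaces extracts the surface form (first column) of every token line.
// Assumption: each line is "surface<tab or space>features" and the sentence
// ends with a line containing only "EOS".
func surfaces(output string) []string {
	var out []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || line == "EOS" {
			continue
		}
		// The surface is everything before the first tab or space.
		if i := strings.IndexAny(line, "\t "); i >= 0 {
			line = line[:i]
		}
		out = append(out, line)
	}
	return out
}

// IPADIC output for the hiragana input "りんごジュースを飲んだ。".
const ipaHiragana = `りん 副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご 接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS`

// IPADIC output for the katakana input "リンゴジュースを飲んだ。".
const ipaKatakana = `リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS`

func main() {
	// Hiragana input: "りんご" is split into "りん" and "ご" (7 tokens).
	fmt.Println(strings.Join(surfaces(ipaHiragana), " | "))
	// Katakana input: "リンゴ" stays one token (6 tokens).
	fmt.Println(strings.Join(surfaces(ipaKatakana), " | "))
}
```

With IPADIC, the same sentence yields a different number of units depending on the notation, which is the inconsistency a search index would have to work around.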
In both cases, the latter segmentation is divided into units suitable for search engines and the like. This means that "short units" are effective in unifying the units of "search examples" in search engines and other information retrieval systems. Thus, UNIDIC has the advantage for word search purposes.

Are you convinced by this explanation?

@CaptainDario Let me know so I can fix it and add it to the Wiki. |
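The "consistency" point can also be sketched in Go. This is a rough illustration, not Kagome's API: `searchKey` and `lemmaIndex` are hypothetical names, and the position of the lemma (the eighth comma-separated feature, e.g. 林檎) is an assumption read off the UNIDIC output quoted above. The sketch shows that the hiragana and katakana spellings collapse to the same key when the lemma is used for indexing.

```go
package main

import (
	"fmt"
	"strings"
)

// lemmaIndex is the assumed position of the lemma in the UNIDIC feature list,
// judging only from the output above ("...,リンゴ,林檎,りんご,...").
const lemmaIndex = 7

// searchKey returns a normalized index key for one UNIDIC token line of the
// form "surface<tab or space>features". It falls back to the surface when the
// lemma field is missing, empty, or "*". It is not meant for IPADIC output,
// whose feature list has a different layout.
func searchKey(line string) string {
	line = strings.TrimSpace(line)
	i := strings.IndexAny(line, "\t ")
	if i < 0 {
		return line
	}
	surface := line[:i]
	features := strings.Split(strings.TrimSpace(line[i:]), ",")
	if len(features) > lemmaIndex {
		if lemma := features[lemmaIndex]; lemma != "" && lemma != "*" {
			return lemma
		}
	}
	return surface
}

func main() {
	// Token lines copied from the UNIDIC output above.
	hiragana := "りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*"
	katakana := "リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*"

	fmt.Println(searchKey(hiragana))                        // 林檎
	fmt.Println(searchKey(hiragana) == searchKey(katakana)) // true
}
```

Keying a search index on the lemma rather than the surface is one way to benefit from the short-unit segmentation; whether it suits a given application depends on how much spelling variation should be merged.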
@KEINOS well first of all thank you for this very detailed explanation. It really helped me a lot! In your opinion, is neologd worth it over standard ipadic for Japanese NLP? |
I'm glad to hear that! So far so good. 👍
Neologd is a great dictionary. However, for my current usage, I choose IPADIC. If speed is not important, it is worth using Neologd, which is just an extension of IPADIC. Actually, there is a Japanese text linter implemented in JavaScript, but because of its speed issues and the need to install Node.js separately, I have been quietly trying to implement one in Go with Kagome. However, the dictionary lookup part seems to be the bottleneck, and even in a simple test implementation using Neologd, the speed is not as good as the original Textlint. So I'm currently losing motivation to build a text linter in Go. I wish I could help speed up Kagome, but I only started learning Go in earnest after the COVID-19 pandemic began, so I can't keep up with its technology yet. 😭 Documenting is the only thing I can contribute for now. |
@CaptainDario (cc: @ikawaha ) FYI, I added the FAQ and a document about it to the wiki. Feel free to fix them! |
JFYI. I added the below article to the Wiki.
|
@KEINOS Thank you for the very clear explanation. I appreciate your contribution. |
KEINOS Thank you very much!
Maybe this is obvious stuff and one is expected to know this, but I think it would be nice to include something like your comment in the README.
Originally posted by @CaptainDario in #274 (comment)