There is nothing better than better documentation #277

Open
ikawaha opened this issue Aug 4, 2022 · 14 comments

@ikawaha
Owner

ikawaha commented Aug 4, 2022

@KEINOS Thank you very much!
Maybe this is obvious stuff and one is expected to know this, but I think it would be nice to include something like your comment in the README.

Originally posted by @CaptainDario in #274 (comment)

@ikawaha
Owner Author

ikawaha commented Aug 4, 2022

@CaptainDario Indeed. There is nothing better than better documentation!

ikawaha, if the above explanation is ok, I would like to PR somewhere, where should I write? In the Wiki, maybe?

@KEINOS Thanks for the suggestion.

Would it be better to put the details in the wiki and link to it from the README? The wiki of this repository is open and you are free to add to it.

@KEINOS
Contributor

KEINOS commented Aug 5, 2022

@ikawaha (cc: @CaptainDario)

The wiki of this repository is open and you are free to add to it.

Thank you!

Would it be better to put the details in the wiki and link to it from the README?

I agree. I would like to start with a "one keyword per page" structure for the Wiki. For example, start with "wakati". We can rethink the structure once we have more keywords, can't we?

@ikawaha
Owner Author

ikawaha commented Aug 6, 2022

I have no idea 😇 , so let's start with "wakati".
There is extensive documentation on janome, which may be helpful.

@KEINOS
Contributor

KEINOS commented Dec 6, 2022

@ikawaha (cc: @CaptainDario)

I have finally started editing the Wiki. But I think it is premature to link from the README.md, as I am just copying and pasting the issues.

Ideally, we would translate the official Japanese documentation into English. For the time being, however, it is more realistic to add topics to the Wiki one by one and later create a separate repository for kagome-doc.

Also, while looking at the official documentation, I thought that enriching ExampleXXX and godoc would be the idiomatic Go approach.

@ikawaha
Owner Author

ikawaha commented Dec 6, 2022

Thank you so much!

It's great! 🙏

Even at this point, we have a few Example tests, and it's a great idea to enrich ExampleXXX and godoc.
( '-`).oO( But, they may not work with go-playground because the build timed out 😇.

e.g. Example test for the word filter

kagome/filter/word_test.go

Lines 167 to 187 in a16f933

func ExampleWordFilter() {
	d, err := dict.LoadDictFile(testDictPath)
	if err != nil {
		panic(err)
	}
	t, err := tokenizer.New(d, tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	stopWords := filter.NewWordFilter([]string{"私", "は", "が", "の", "。"})
	tokens := t.Tokenize("私の猫の名前はアプロです。")
	stopWords.Drop(&tokens)
	for _, v := range tokens {
		fmt.Println(v.Surface)
	}
	// Output:
	// 猫
	// 名前
	// アプロ
	// です
}
)

@KEINOS
Contributor

KEINOS commented Dec 7, 2022

But, they may not work with go-playground because the build timed out 😇.

Yes, indeed. Go Playground is a no-go for kagome for now 😭

However, as long as godoc can run ExampleXXX, they are worth including whenever possible.

How about creating an _example directory and putting some working examples there? Along with Wiki and godoc improvements, of course.

@ikawaha
Owner Author

ikawaha commented Dec 7, 2022

How about creating an _example directory and putting some working examples there? Along with Wiki and godoc improvements, of course.

It sounds good 👍.

I created the ./sample/_example folder for adding working examples in PR #296.

@CaptainDario

@KEINOS I am currently playing around with the different dictionaries. While doing this I noticed that, with unidic, processing
私は日本人です。

results in
[代名詞, *, *, *, *, *, ワタクシ, 私-代名詞, 私, ワタクシ, 私, ワタクシ, 和, *, *, *, *],
[助詞-係助詞, 係助詞, *, *, *, *, ハ, は, は, , は, ワ, 和, *, *, *, *],
[名詞-固有名詞-地名-国, 固有名詞, 地名, 国, *, *, ニッポン, 日本, 日本, ニッポン, 日本, ニッポン, 固, *, *, *, *],
[接尾辞-名詞的-一般, 名詞的, 一般, *, *, *, ニン, 人, 人, ニン, 人, ニン, 漢, *, *, *, *],
[助動詞, *, *, *, 助動詞-デス, 終止形-一般, デス, です, です, , です, デス, 和, *, *, *, *],
[補助記号-句点, 句点, *, *, *, *, , 。, 。, , 。, , 記号, *, *, *, *]

Notice: ワタクシ, ニン

However, when running with ipadic the result is

[名詞, 代名詞, 一般, *, *, *, 私, ワタシ, ワタシ],
[助詞, 係助詞, *, *, *, *, は, ハ, ワ],
[名詞, 一般, *, *, *, *, 日本人, ニッポンジン, ニッポンジン],
[助動詞, *, *, *, 特殊・デス, 基本形, です, デス, デス],
[記号, 句点, *, *, *, *, 。, 。, 。]

Notice: ワタシ, ニッポンジン

I think the results from using ipadic are clearly better.
While I really appreciate your previous answer (and creating the wiki), could I ask you to elaborate a bit more on the advantages and disadvantages of the different dictionaries?
I thought:
Accuracy of results: ipadic < unidic < neologd
Size / speed: neologd < unidic < ipadic
But that does not really seem to hold.

@KEINOS
Contributor

KEINOS commented Jan 21, 2023

@CaptainDario


I thought:
Accuracy of results: ipadic < unidic < neologd
Size / speed: neologd < unidic < ipadic
But that does not really seem to hold.

As you point out, the size of a dictionary is related to its speed (the larger the dictionary, the slower it is), but size is not proportional to accuracy.

In my personal experience, I believe that they can be classified as follows:

  • Size: ipadic < unidic < neologd
  • Speed: neologd < unidic < ipadic
  • Accuracy:
    • grammar analysis: unidic < ipadic < neologd
    • word splitting of proper nouns: ipadic < unidic < neologd
    • general-purpose word splitting: neologd < ipadic < unidic

This is because each dictionary is created for a different purpose and requires different precision.


what the disadvantages/advantages of the different dictionaries are?

tl; dr

In summary, IPADIC is typically used for grammatical analysis and UNIDIC for retrieval analysis. IPADIC is lightweight and accurate in most use cases and UNIDIC is good for word-splitting for word search purposes.

IPADIC is recommended when part of speech (PoS) is important.

This is the case, for example, when PoS is used as a feature for analysis or machine learning. NEologd is essentially IPADIC plus a user dictionary: it has been extended by the community to cover new vocabulary missing from IPADIC. However, it is huge.

UNIDIC, on the other hand, is recommended when a sentence needs to be split into smaller units for retrieval, as in search engines.

For example, when a search engine needs to measure the distance between the divided units (Levenshtein distance or cosine similarity, for instance), or when each unit's ID (word ID? token?) is used as a discrete feature value for machine learning.
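
The unit-overlap idea can be sketched in Go. This is a minimal illustration, not kagome code: the token lists are hard-coded from the segmentations of 私は日本人です。 reported in this thread, the query 日本 is assumed to be a single unit under both dictionaries, and `termCounts` and `cosine` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// termCounts builds a bag-of-words vector (token -> frequency).
func termCounts(tokens []string) map[string]float64 {
	counts := make(map[string]float64, len(tokens))
	for _, t := range tokens {
		counts[t]++
	}
	return counts
}

// cosine returns the cosine similarity between two token lists,
// or 0 when either list is empty.
func cosine(a, b []string) float64 {
	va, vb := termCounts(a), termCounts(b)
	var dot, na, nb float64
	for t, x := range va {
		dot += x * vb[t] // missing keys read as 0
		na += x * x
	}
	for _, y := range vb {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Surfaces as segmented by each dictionary for 私は日本人です。
	ipa := []string{"私", "は", "日本人", "です", "。"}
	uni := []string{"私", "は", "日本", "人", "です", "。"}
	// Assumption: the query is one unit under both dictionaries.
	query := []string{"日本"}

	fmt.Printf("ipadic: %.3f\n", cosine(ipa, query))
	fmt.Printf("unidic: %.3f\n", cosine(uni, query))
}
```

With these inputs the IPADIC units share nothing with the query (0.000), while the UNIDIC units score about 0.408, because 日本 exists as a unit of its own.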

Depending on what and how you are analyzing, I would generally recommend using IPADIC plus a home-made user dictionary.

ts; dr

Disadvantage of UNIDIC

As you may have already noticed, the differences in segmentation accuracy and speed can be unsettling. Compared to IPADIC, UNIDIC appears less accurate despite carrying more information (a larger dictionary).

$ # IPA DICT
$ time echo "私は日本人です。" | kagome -sysdict ipa
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
日本人	名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS

real	0m1.021s
user	0m1.114s
sys	0m0.090s

$ # UNI DICT
$ time echo "私は日本人です。" | kagome -sysdict uni
私	代名詞,*,*,*,*,*,ワタクシ,私-代名詞,私,ワタクシ,私,ワタクシ,和,*,*,*,*
は	助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
日本	名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
人	接尾辞,名詞的,一般,*,*,*,ニン,人,人,ニン,人,ニン,漢,*,*,*,*
です	助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
。	補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

real	0m4.807s
user	0m5.303s
sys	0m0.273s

The problem here is the difference between "日本人" and "日本, 人".

UNIDIC is a dictionary based on "short units" (短単位, たんたんい) defined by NINJAL to facilitate the collection of examples for the BCCWJ.

  • NINJAL (National Institute of Japanese Language and Linguistics)
  • BCCWJ (Balanced Corpus of Contemporary Written Japanese)

These "short units" are known to be too fine-grained for the syntactic and semantic analysis done in natural language processing.

Thus, in most use cases, IPADIC is faster and more convenient. This is why my recommendation is to use IPADIC with a custom user dictionary.

Advantage and use cases of UNIDIC

An advantage of UNIDIC is the "consistency" in word segmentation.

The difference between the two dictionaries, IPA and UNI, is illustrated by a well-known example.

"りんごジュースを飲んだ。" vs "リンゴジュースを飲んだ。"

Both are correct and mean the same thing: "I drank apple juice".

However, to a human reader, "りんごジュース" is easier to read than "リンゴジュース" because the words are visually separated (hiragana-katakana mixture vs. all katakana).

Both dictionaries include the words "りんご" and "リンゴ" as nouns (名詞).

$ # IPA DICT
$ echo "りんご" | kagome -sysdict ipa
りんご	名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ
EOS

$ echo "リンゴ" | kagome -sysdict ipa
リンゴ	名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
EOS

$ # UNI DICT
$ echo "りんご" | kagome -sysdict uni
りんご	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
EOS

$ echo "リンゴ" | kagome -sysdict uni
リンゴ	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
EOS

And here comes the problem.

$ # IPA DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict ipa
りん	副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご	接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース	名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん	動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ	助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。	記号,句点,*,*,*,*,。,。,。
EOS

$ # UNI DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict uni
りんご	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
ジュース	名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を	助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん	動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ	助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。	補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Note the difference between "りん, ご" and "りんご".

IPADIC recognized "りんご" as an adverb + prefix (副詞 + 接頭詞) combination, while UNIDIC recognized it as a noun (名詞).

The simplest solution, apart from registering a user dictionary, is to use katakana notation.

$ # IPA DICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict ipa
リンゴ	名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース	名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん	動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ	助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。	記号,句点,*,*,*,*,。,。,。
EOS

$ # UNI DICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict uni
リンゴ	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
ジュース	名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を	助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん	動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ	助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。	補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
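
If the input text is under your control, the katakana workaround could be applied as a preprocessing step before tokenization. The following Go sketch is only an illustration and not part of kagome: `toKatakana` and `katakanize` are hypothetical helpers and the word list is made up. Only listed words are rewritten, since converting every hiragana character would also rewrite particles such as を.

```go
package main

import (
	"fmt"
	"strings"
)

// toKatakana maps each hiragana rune (U+3041..U+3096) to the katakana
// rune 0x60 code points above it; all other runes pass through unchanged.
func toKatakana(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x3041 && r <= 0x3096 {
			return r + 0x60
		}
		return r
	}, s)
}

// katakanize rewrites only the listed words, leaving particles and
// inflections in hiragana so the tokenizer can still parse them.
func katakanize(text string, words []string) string {
	pairs := make([]string, 0, len(words)*2)
	for _, w := range words {
		pairs = append(pairs, w, toKatakana(w))
	}
	return strings.NewReplacer(pairs...).Replace(text)
}

func main() {
	// Hypothetical list of words known to be mis-segmented in hiragana.
	words := []string{"りんご"}
	fmt.Println(katakanize("りんごジュースを飲んだ。", words))
}
```

Plain substring replacement can also hit a listed word inside a longer word, so a real word list would need some care.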

The difference is that IPADIC attempted to interpret them grammatically, while UNIDIC interpreted them in short units.

  1. "日本人" (noun) vs "日本, 人" (noun + suffix)
  2. "りん, ご, ジュース" (adverb + prefix + noun) vs "りんご, ジュース" (noun + noun)

In both cases, the latter segmentation yields units suitable for search engines and similar systems.

This means that "short units" are effective in unifying the units of "search examples" in search engines and other information retrieval systems.

Thus, UNIDIC has the advantage for word-search purposes.
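
One way to benefit from this consistency, sketched here in Go, is to key a search index on a normalized column of the UNIDIC features instead of the surface form. In the outputs above, column 7 (zero-based) is 林檎 for both りんご and リンゴ; treating that column as the lemma is an assumption, and `searchKeys` is a hypothetical helper that parses the CLI output format shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// unidicLemmaField is the zero-based feature column that holds 林檎 for
// both りんご and リンゴ. Assumption: this column is the lemma.
const unidicLemmaField = 7

// searchKeys parses kagome CLI output ("surface<TAB>features" per line,
// terminated by EOS) and returns one search key per token.
func searchKeys(output string) []string {
	var keys []string
	for _, line := range strings.Split(output, "\n") {
		parts := strings.SplitN(line, "\t", 2)
		if len(parts) != 2 {
			continue // "EOS" and blank lines carry no features
		}
		key := parts[0] // fall back to the surface form
		cols := strings.Split(parts[1], ",")
		if len(cols) > unidicLemmaField {
			if lemma := cols[unidicLemmaField]; lemma != "" && lemma != "*" {
				key = lemma
			}
		}
		keys = append(keys, key)
	}
	return keys
}

func main() {
	// Lines copied from the UNIDIC outputs in this thread.
	hira := "りんご\t名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*\nEOS\n"
	kata := "リンゴ\t名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*\nEOS\n"
	fmt.Println(searchKeys(hira), searchKeys(kata)) // both spellings map to 林檎
}
```

IPADIC output has fewer feature columns, so with this sketch it would fall back to the surface form.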


Are you convinced by this explanation? > @CaptainDario
Am I on the right track in my explanation? > @ikawaha

Let me know so I can fix it and add it to the Wiki.

@CaptainDario

CaptainDario commented Jan 25, 2023

@KEINOS well first of all thank you for this very detailed explanation. It really helped me a lot!
I think this should definitely be added to the wiki, for starters this is gold.

In your opinion, is neologd worth it over standard ipadic for Japanese NLP?

@KEINOS
Contributor

KEINOS commented Jan 25, 2023

It really helped me a lot!
I think this should definitely be added to the wiki, for starters this is gold.

I'm glad to hear that! So far so good. 👍

In your opinion, is neologd worth it over standard ipadic for Japanese NLP?

Neologd is a great dictionary. However, for my current usage, I choose IPADIC. If speed is not important, it is worth using Neologd, which is just an extension of IPADIC.

Actually, there is a Japanese text linter implemented in JavaScript, but because of speed issues and the need to install Node.js separately, I have been quietly trying to implement one in Go with Kagome.

However, the dictionary lookup part seems to be the bottleneck: even in a simple test implementation using NEologd, it is not as fast as the original Textlint. So I'm currently losing motivation to build a text linter in Go.

I wish I could help speed up Kagome, but I only started learning Go in earnest after the COVID-19 pandemic began, so I can't keep up with its internals yet. 😭 Documentation is the only thing I can contribute for now.

@KEINOS
Contributor

KEINOS commented Jan 30, 2023

@CaptainDario (cc: @ikawaha )

FYI, I added the FAQ and a document about it to the wiki. Feel free to fix them!

@KEINOS
Contributor

KEINOS commented Apr 14, 2024

JFYI. I added the below article to the Wiki.

@ikawaha
Owner Author

ikawaha commented Apr 14, 2024

@KEINOS Thank you for the very clear explanation. I appreciate your contribution.
