Improve detection accuracy for CJK text #121
Open
Hello, after running into issue #84 again, I decided to share my attempt to fix it (or rather, to improve the situation a bit).
The approach proposed by this PR leverages the delta between the `cmn` score and the `jpn` one. The issue in #84 is caused by the fact that `cmn` doesn't match kana (Japanese-only characters), but `jpn` matches (many) Chinese characters, so it will end up with a higher score than `cmn`.
In particular, the example sentence mentioned in the issue has a `0.86` score on `jpn` and a `0.74` score on `cmn`, due to the presence of 5 katakana characters out of a total of 42 characters. This means that the delta is around 12% ((0.86 - 0.74) * 100).
This change requires the `jpn` score to be at least `0.15` higher; otherwise `cmn` gets priority. This seems reasonable, as we can consider anything above 15% (around 1 in every 6 characters) "a fair amount of kana".
With this new approach, the example that I had originally raised as "this should be detected as Japanese" in #77 would fail and be detected as Mandarin instead, because it contains just 1 kana out of a total of 11 characters. However, that example was pretty far-fetched, and such a kanji-dense sentence is unlikely to appear in regular Japanese text. And as usual, this disclaimer always applies...
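To make the rule concrete, here is a minimal sketch of the comparison described above. It is not the code changed by this PR: the function name `pickJpnOrCmn`, its parameters, and the use of `>=` for the threshold are assumptions made for illustration, and it only covers the case where both `jpn` and `cmn` are candidates.

```javascript
// Sketch only: hypothetical helper, not the implementation in this PR.
// Both scores are assumed to be in the 0..1 range.
const MIN_JPN_LEAD = 0.15

function pickJpnOrCmn(jpnScore, cmnScore, minLead = MIN_JPN_LEAD) {
  // `jpn` wins only when it leads `cmn` by at least `minLead`;
  // in every other case `cmn` gets priority.
  return jpnScore - cmnScore >= minLead ? 'jpn' : 'cmn'
}
```

With the scores from the #84 example, `pickJpnOrCmn(0.86, 0.74)` returns `'cmn'`, since the delta of about 0.12 is below the 0.15 threshold.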
This approach is still fragile compared to what machine translators (like Google Translate) do, but it was the best solution I could think of without resorting to grammar checks (which is likely what Google Translate does), since grammar is what kana are mostly used for in Japanese.
Also, this PR is missing a similar check for Korean vs. Mandarin. Unfortunately, I do not know Korean, so I cannot add that check myself.
I'm open to suggestions/opinions on the proposed approach, especially from people involved in the original discussion (if they are still around and interested in the topic). @wooorm @kewang @niftylettuce
Fixes #84.