Highlight longest overlapping token #546

Open: wants to merge 8 commits into base: master

stevennic
Contributor

@stevennic stevennic commented Jul 15, 2019

The highlighter's Formatter did not handle overlapping tokens correctly. In Chinese, a word can be a substring of another word, so unlike in English, multiple tokens can overlap. For example, in "中国化", both 中国 and 中国化 are valid tokens, each with a different meaning. Because the Formatter did not account for this case, it highlighted the first matching token and then moved on to the next character, which caused it to skip the second token. Furthermore, because the shorter token was detected before the longer one, it was always the shorter one that got highlighted.

However, among overlapping tokens the longest one is always the most accurate, because it is the most specific. This PR therefore sorts the tokens by match position first and then by descending token length before formatting, so that the longest overlapping token is the one highlighted.

Fixes #532.
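The ordering described above can be illustrated with a minimal, self-contained sketch. This is not the code in `src/whoosh/highlight.py`: `Match`, `pick_longest_overlapping`, and `highlight` are hypothetical names made up for illustration, and the `<b>` tags are an assumed output format. It only shows the idea of sorting by start position, then by descending length, and skipping any token that overlaps one already chosen.

```python
from collections import namedtuple

# Hypothetical stand-in for a matched token (NOT the whoosh Token class).
# startchar/endchar are character offsets into the text, end exclusive.
Match = namedtuple("Match", ["text", "startchar", "endchar"])


def pick_longest_overlapping(matches):
    """Return non-overlapping matches, preferring the longest at each start."""
    # Sort by start position first, then by descending token length.
    ordered = sorted(
        matches,
        key=lambda m: (m.startchar, -(m.endchar - m.startchar)),
    )
    chosen = []
    last_end = 0
    for m in ordered:
        # Skip any token that overlaps a token already selected.
        if m.startchar < last_end:
            continue
        chosen.append(m)
        last_end = m.endchar
    return chosen


def highlight(text, matches, pre="<b>", post="</b>"):
    """Wrap each selected match in pre/post markers (assumed format)."""
    out = []
    pos = 0
    for m in pick_longest_overlapping(matches):
        out.append(text[pos:m.startchar])
        out.append(pre + text[m.startchar:m.endchar] + post)
        pos = m.endchar
    out.append(text[pos:])
    return "".join(out)
```

With the example from the description, `highlight("中国化", [Match("中国", 0, 2), Match("中国化", 0, 3)])` returns `<b>中国化</b>`, because the longer token sorts ahead of the shorter one that starts at the same position.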

@stevennic stevennic changed the title from "#532: Highlight longest overlapping token" to "Highlight longest overlapping token" Jul 15, 2019
@codecov

codecov bot commented Jul 17, 2019

Codecov Report

Merging #546 into master will increase coverage by 0.02%.
The diff coverage is 90.9%.


@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
+ Coverage   83.01%   83.03%   +0.02%     
==========================================
  Files         132      132              
  Lines       29641    29650       +9     
==========================================
+ Hits        24605    24620      +15     
+ Misses       5036     5030       -6

| Impacted Files | Coverage Δ |
|---|---|
| tests/test_highlighting.py | 100% <100%> (ø) ⬆️ |
| src/whoosh/highlight.py | 82.91% <50%> (ø) ⬆️ |
| src/whoosh/collectors.py | 92.85% <0%> (+0.21%) ⬆️ |
| src/whoosh/index.py | 76.01% <0%> (+1.55%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 26153e2...cbb9f77.

Successfully merging this pull request may close these issues.

Strange Excerpt