Highlight longest overlapping token #546
Open
The highlighter Formatter did not handle overlapping tokens correctly. In Chinese, a word can be a substring of another word, so, unlike in English, multiple tokens can overlap. For example, in "中国化", both 中国 and 中国化 are valid tokens, each with a different meaning. Because the Formatter did not account for this, it highlighted the first matching token and then moved on to the next character, skipping the token that overlapped it. And because the shorter token was detected before the longer one, it was always the shorter one that got highlighted.
The longest token among overlapping ones, however, is always the most accurate, because it is the most specific. The fix proposed in this PR therefore sorts the tokens by match position first, then by descending token length, before formatting, so that the longest overlapping token is the one highlighted.
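For reviewers who want to see the ordering in isolation, here is a minimal sketch of the idea in Python. It is not the code in this PR: `Token`, `highlight`, and the `<em>` markers are hypothetical names chosen for illustration, and the sketch assumes each token carries a start (inclusive) and end (exclusive) character offset.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical match record: character offsets into the text."""
    start: int  # inclusive
    end: int    # exclusive


def highlight(text: str, tokens: list[Token],
              pre: str = "<em>", post: str = "</em>") -> str:
    # Sort by match position first, then by descending length, so that
    # the longest token starting at a given position is visited first.
    ordered = sorted(tokens, key=lambda t: (t.start, -(t.end - t.start)))

    out = []
    cursor = 0  # first character not yet emitted
    for tok in ordered:
        if tok.start < cursor:
            # Overlaps a token that has already been highlighted; skip it.
            continue
        out.append(text[cursor:tok.start])                # plain text before the match
        out.append(pre + text[tok.start:tok.end] + post)  # highlighted match
        cursor = tok.end
    out.append(text[cursor:])                             # trailing plain text
    return "".join(out)


# Both 中国 (0..2) and 中国化 (0..3) match; the longer one is highlighted.
print(highlight("中国化", [Token(0, 2), Token(0, 3)]))
```

With this ordering the shorter token 中国 is skipped because it starts inside the span already covered by 中国化. Without the sort, the outcome would depend on the order in which the tokens were detected.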
Fixed #532.