Strange Excerpt #532

ghost · 2019-01-20T02:49:25Z

I noticed that in some cases, when I search for two terms and a document contains both of them, the excerpt highlights only the first term, even though both are present in the document.

How is this justified? Shouldn't the excerpt show at least one match for the other term as well, since it also matches?

stevennic · 2019-03-12T23:38:07Z

It would be helpful if you can post an example or reproduction.

gary02 · 2019-07-09T08:28:50Z

In my case:

# query: 马克思

text = "两次历史性飞跃与马克思主义中国化"
sa = jieba.analyse.ChineseAnalyzer()
terms = ["马克", "马克思"]
output = highlight(
    text,
    terms,
    sa,
    WholeFragmenter(),
    formatter
)

output:

两次历史性飞跃与<em>马克</em>思主义中国化

but I expect:

两次历史性飞跃与<em>马克思</em>主义中国化

stevennic · 2019-07-09T08:41:28Z

Thanks. Could you also include a sample document with indexing code?

gary02 · 2019-07-10T07:16:20Z

> Thanks. Could you also include a sample document with indexing code?

Is this what you mean? Sorry if I misunderstood; my English is poor.

# Python 3.6.8
# jieba==0.39
# Whoosh==2.7.4

import jieba.analyse
from whoosh.highlight import (
    WholeFragmenter, highlight, HtmlFormatter
)

query_string = '马克思'
text = "两次历史性飞跃与马克思主义中国化"
sa = jieba.analyse.ChineseAnalyzer()
formatter = HtmlFormatter()

terms = [token.text for token in sa(query_string)]

assert terms == ['马克', '马克思']

output = highlight(
        text,
        terms,
        sa,
        WholeFragmenter(),
        formatter
)
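
# Actual behavior: this assertion passes (only the shorter token 马克 is highlighted)
assert output == '两次历史性飞跃与<strong class="match term0">马克</strong>思主义中国化'

# Expected behavior: this assertion fails (the full query term 马克思 should be highlighted)
assert output == '两次历史性飞跃与<strong class="match term0">马克思</strong>主义中国化'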

stevennic added a commit to stevennic/whoosh that referenced this issue Jul 15, 2019

whoosh-community#532: Highlight longest overlapping token

07db88d

stevennic linked a pull request Jul 15, 2019 that will close this issue

Highlight longest overlapping token #546

Open
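
For readers following along, the idea named in the pull request title can be illustrated without Whoosh or jieba. The sketch below is not the code from that pull request: it is a minimal, standalone illustration that assumes matched tokens are given as character-offset pairs, and the helper names `longest_non_overlapping` and `highlight_spans` are hypothetical. It greedily keeps the longer of any two overlapping spans and then wraps the surviving spans in a tag.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def longest_non_overlapping(spans: List[Span]) -> List[Span]:
    """Greedily resolve overlaps, preferring the longer span."""
    kept: List[Span] = []
    # Sort by start offset; for equal starts, the longest span comes first.
    for start, end in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if kept and start < kept[-1][1]:
            # Overlaps the last kept span: keep whichever of the two is longer.
            prev_start, prev_end = kept[-1]
            if end - start > prev_end - prev_start:
                kept[-1] = (start, end)
            continue
        kept.append((start, end))
    return kept


def highlight_spans(text: str, spans: List[Span], tag: str = "em") -> str:
    """Wrap each surviving span of `text` in <tag>...</tag>."""
    out = []
    pos = 0
    for start, end in longest_non_overlapping(spans):
        out.append(text[pos:start])
        out.append(f"<{tag}>{text[start:end]}</{tag}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = "两次历史性飞跃与马克思主义中国化"
terms = ["马克", "马克思"]
# Compute the offsets of the two overlapping tokens from the text itself.
spans = [(text.find(t), text.find(t) + len(t)) for t in terms]
print(highlight_spans(text, spans))
# prints: 两次历史性飞跃与<em>马克思</em>主义中国化
```

With the two overlapping tokens from the report, only the longer one (马克思) survives, which gives the output gary02 expected.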

stevennic added a commit to stevennic/whoosh that referenced this issue Jul 17, 2019

whoosh-community#532: Add jieba to Travis

4b02612
