Strange Excerpt #532

Open
ghost opened this issue Jan 20, 2019 · 4 comments · May be fixed by #546

ghost commented Jan 20, 2019

I've noticed that in some cases, when I search for two terms, a document that contains both shows highlighted fragments in the excerpt for only the first term, even though both terms are present in the document.

How is this justified? Shouldn't the excerpt show at least one fragment for the other term as well, since it also matches?
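
To help narrow down a case like this, here is a minimal sketch of a check that lists which query terms never appear highlighted in an excerpt. The helper name `unhighlighted_terms` and the sample excerpt are made up for illustration and are not part of Whoosh. It assumes the excerpt wraps matches in `<strong ...>` tags, which is the markup `HtmlFormatter()` produces later in this thread.

```python
import re


def unhighlighted_terms(excerpt_html, terms, tag="strong"):
    """Return the terms that are never wrapped in <tag ...>...</tag>."""
    pattern = re.compile(rf"<{tag}\b[^>]*>(.*?)</{tag}>", re.DOTALL)
    # Collect the text of every highlighted match, lower-cased for comparison.
    highlighted = {match.lower() for match in pattern.findall(excerpt_html)}
    return [term for term in terms if term.lower() not in highlighted]


# Hypothetical excerpt: only the first of the two terms is highlighted.
excerpt = 'the <strong class="match term0">quick</strong> brown fox ...'
missing = unhighlighted_terms(excerpt, ["quick", "lazy"])
print(missing)
```

Any term this returns is present in the query but absent from the highlighted spans, which makes it a good candidate for a small reproduction.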

@stevennic
Contributor

It would be helpful if you can post an example or reproduction.


gary02 commented Jul 9, 2019

In my case:

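# query: 马克思
import jieba.analyse
from whoosh.highlight import WholeFragmenter, highlight

text = "两次历史性飞跃与马克思主义中国化"
sa = jieba.analyse.ChineseAnalyzer()
# The analyzer splits the query into two overlapping tokens.
terms = ["马克", "马克思"]
# `formatter` is a whoosh.highlight formatter instance
# (HtmlFormatter() in the full script later in this thread).
output = highlight(
    text,
    terms,
    sa,
    WholeFragmenter(),
    formatter,
)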

output:

两次历史性飞跃与<em>马克</em>思主义中国化

but I expect:

两次历史性飞跃与<em>马克思</em>主义中国化

@stevennic
Contributor

Thanks. Could you also include a sample document with indexing code?


gary02 commented Jul 10, 2019

> Thanks. Could you also include a sample document with indexing code?

Do you mean this? If I've misunderstood, sorry about my poor English.

# python3.6.8
# jieba==0.39
# Whoosh==2.7.4 

import jieba.analyse
from whoosh.highlight import (
    WholeFragmenter, highlight, HtmlFormatter
)

query_string = '马克思'
text = "两次历史性飞跃与马克思主义中国化"
sa = jieba.analyse.ChineseAnalyzer()
formatter = HtmlFormatter()

terms = [token.text for token in sa(query_string)]

assert terms == ['马克', '马克思']

output = highlight(
        text,
        terms,
        sa,
        WholeFragmenter(),
        formatter
)


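# Passes (actual behavior): only the shorter token 马克 is highlighted.
assert output == '两次历史性飞跃与<strong class="match term0">马克</strong>思主义中国化'

# Fails (expected behavior): the full query term 马克思 should be highlighted.
assert output == '两次历史性飞跃与<strong class="match term0">马克思</strong>主义中国化'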
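
One way to picture the expected result, offered as a sketch and not as a description of Whoosh's internals: the two tokens `马克` and `马克思` start at the same character offset, so their spans overlap. If overlapping spans were merged before being wrapped in tags, the longer match would win. The helpers `merge_spans` and `highlight_spans` are hypothetical names for this example, not Whoosh APIs, and the token offsets are written by hand instead of coming from the analyzer.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) character spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def highlight_spans(text, spans, tag="em"):
    """Wrap each merged span of `text` in <tag>...</tag>."""
    out = []
    pos = 0
    for start, end in merge_spans(spans):
        out.append(text[pos:start])
        out.append(f"<{tag}>{text[start:end]}</{tag}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = "两次历史性飞跃与马克思主义中国化"
# Hand-written offsets for the two overlapping tokens 马克 and 马克思.
start = text.index("马克")
spans = [(start, start + 2), (start, start + 3)]
result = highlight_spans(text, spans)
print(result)
```

With the spans merged, the whole of `马克思` is wrapped in a single tag, which matches the output expected above.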

stevennic added a commit to stevennic/whoosh that referenced this issue Jul 15, 2019
@stevennic stevennic linked a pull request Jul 15, 2019 that will close this issue
stevennic added a commit to stevennic/whoosh that referenced this issue Jul 17, 2019