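Description
ik_smart (v7.10.0) does not tokenize an Arabic numeral followed by a Chinese measure word as expected. What I especially cannot understand is why the 7 in "7天" is typed as CN_WORD.
(P.S. In the same environment with ES 7.10.2 + IK 7.10.2, a numeral followed by a Chinese measure word is tagged TYPE_CQUAN.)
Steps to reproduce
POST /_analyze
'{"field":"content","analyzer":"ik_smart","text":"7天 44天 55天"}'
Expected behavior
"7天" should be a single TYPE_CQUAN token:
{
  "token": "7天",
  "start_offset": 0,
  "end_offset": 2,
  "type": "TYPE_CQUAN",
  "position": 0
}
Actual behavior
"7天" is split into two tokens:
{
  "token": "7",
  "start_offset": 0,
  "end_offset": 1,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "天",
  "start_offset": 1,
  "end_offset": 2,
  "type": "CN_CHAR",
  "position": 1
}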
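For anyone who wants to check a given ES + IK combination for this behavior, here is a minimal sketch that scans the `tokens` array of an `_analyze` response for a numeral immediately followed by a single-character token. The helper name `find_split_quantifiers` and the heuristic itself are illustrative assumptions, not part of IK; the sample data is copied from the expected and actual output in this report.

```python
import json


def find_split_quantifiers(tokens):
    """Return (number, next_char) pairs emitted as two adjacent tokens.

    Hypothetical heuristic: a token made only of ASCII digits that is
    directly followed (no gap in offsets) by a one-character token is
    treated as a numeral + measure word that was split apart.
    """
    pairs = []
    for current, following in zip(tokens, tokens[1:]):
        text = current["token"]
        is_number = text.isascii() and text.isdigit()
        is_adjacent = current["end_offset"] == following["start_offset"]
        is_single_char = len(following["token"]) == 1
        if is_number and is_adjacent and is_single_char:
            pairs.append((text, following["token"]))
    return pairs


# Actual output from this report (first two tokens only).
actual = json.loads("""
{"tokens": [
  {"token": "7", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 0},
  {"token": "天", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1}
]}
""")

# Expected output from this report.
expected = json.loads("""
{"tokens": [
  {"token": "7天", "start_offset": 0, "end_offset": 2, "type": "TYPE_CQUAN", "position": 0}
]}
""")

print(find_split_quantifiers(actual["tokens"]))
print(find_split_quantifiers(expected["tokens"]))
```

On the actual output the helper reports the ("7", "天") pair; on the expected output it reports nothing.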
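I expected "7天" to be a single token rather than being split into "7" and "天". It looks like "7" is classified as a CN_WORD and is therefore not combined with "天". Can anyone explain this?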
@medcl
Yes, this area does need improvement. The IK project will be sorting out its backlog of bugs soon, and we will keep improving it.