English numerals are tagged as CN_WORD #1046

stormyi · 2024-02-27T04:17:44Z

Description

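ik_smart (v7.10.0) does not tokenize an English numeral followed by a Chinese measure word the way I expect. What I especially cannot understand is why the 7 in "7天" is tagged as CN_WORD.
(PS: in the same environment with ES 7.10.2 + IK 7.10.2, an English numeral followed by a Chinese measure word is tagged as TYPE_CQUAN.)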

Steps to reproduce

POST /_analyze
'{"field":"content","analyzer":"ik_smart","text":"7天 44天 55天"}'
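For anyone who wants to script the reproduction, here is a minimal Python sketch that builds the same request. The base URL `http://localhost:9200` and the helper names `build_analyze_request` and `fetch_tokens` are assumptions for illustration, not part of the original report; the body fields are copied from the step above.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_smart", field="content",
                          base_url="http://localhost:9200"):
    """Build (but do not send) the POST /_analyze request from the report.

    base_url is an assumed local node address; adjust it for your cluster.
    """
    body = {"field": field, "analyzer": analyzer, "text": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request):
    """Send the request and return the list under the response's "tokens" key.

    Needs a reachable Elasticsearch node with the IK plugin installed.
    """
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["tokens"]


req = build_analyze_request("7天 44天 55天")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling `fetch_tokens(req)` against a node with the plugin installed should return the token list to compare with the expected and actual output below.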

Expected behavior

"7天" should be a single TYPE_CQUAN token:
{
"token": "7天",
"start_offset": 0,
"end_offset": 2,
"type": "TYPE_CQUAN",
"position": 0
}

Actual behavior

{
"token": "7",
"start_offset": 0,
"end_offset": 1,
"type": "CN_WORD",
"position": 0
},
{
"token": "天",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
}
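To check a longer token list for the same symptom, a small Python sketch like the one below may help. It flags tokens that consist only of ASCII digits but carry the type CN_WORD. The function name `digit_tokens_tagged_cn_word` is hypothetical, and the sample data is the actual output quoted above.

```python
def digit_tokens_tagged_cn_word(tokens):
    """Return tokens made only of ASCII digits whose type is CN_WORD."""
    return [
        t for t in tokens
        if t["token"].isascii() and t["token"].isdigit() and t["type"] == "CN_WORD"
    ]


# Actual output from this report (the two tokens produced for "7天").
actual_tokens = [
    {"token": "7", "start_offset": 0, "end_offset": 1,
     "type": "CN_WORD", "position": 0},
    {"token": "天", "start_offset": 1, "end_offset": 2,
     "type": "CN_CHAR", "position": 1},
]

for tok in digit_tokens_tagged_cn_word(actual_tokens):
    print("digit token tagged CN_WORD:", tok["token"],
          "at offset", tok["start_offset"])
```

This only detects the symptom; it does not explain why the analyzer assigns that type.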

Environment

Versions: Elasticsearch 7.10.0

stormyi · 2024-02-28T02:37:30Z

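As I see it, "7天" should be a single token rather than being split into "7" and "天". It looks like "7" is treated as a CN_WORD and is therefore not combined with "天". Can anyone explain why?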

stormyi · 2024-02-28T12:37:11Z

@medcl

medcl · 2024-02-29T01:03:43Z

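Yes, this area does need improvement. The IK project will be sorting out its legacy bugs in the near future, and we will keep refining it.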
