
How can rare characters in ancient Chinese texts be tokenized? Some CJK characters in the Unicode extension blocks get filtered out #1068

Open
gwisdomroof opened this issue Aug 7, 2024 · 1 comment


gwisdomroof commented Aug 7, 2024

Description

When using the IK analyzer to process ancient Chinese texts, I found that it silently filters out rare characters belonging to the Unicode CJK extension blocks. How can this be resolved?

Steps to reproduce

Take the string “习𮊸𨻸𰄊𰶃” as an example, as shown below:
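The original screenshot is lost, but the behavior can be reproduced by calling the `_analyze` API with the plugin's `ik_max_word` analyzer. A minimal sketch, assuming a local Elasticsearch 7.17.9 node reachable at `localhost:9200` with security disabled (the host and port are assumptions for illustration):

```python
import json
import urllib.request

# Assumed local Elasticsearch node with the IK analyzer plugin installed.
ES_URL = "http://localhost:9200/_analyze"

payload = {
    "analyzer": "ik_max_word",
    # 习 is in the BMP; the remaining four characters live in
    # CJK Unified Ideographs extension blocks outside the BMP.
    "text": "习𮊸𨻸𰄊𰶃",
}

req = urllib.request.Request(
    ES_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Before the fix, only the BMP character 习 shows up in the output;
# the extension-block characters are silently dropped.
for token in result["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])
```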

Expected behavior

All of these characters should be tokenized correctly.

Environment

Versions: Elasticsearch 7.17.9 (Docker)

@yangzhongke
Contributor

A new PR has resolved this problem; please update.
#1071
Please verify and then close this issue.
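After upgrading the plugin to a build containing #1071, the same `_analyze` request can be re-run to verify the fix. A hedged check, reusing `result` from the sketch above and assuming the fixed analyzer emits each out-of-vocabulary extension-block character as its own single-character token:

```python
# Every character of the test string, BMP and extension blocks alike,
# should now appear among the emitted tokens.
expected = set("习𮊸𨻸𰄊𰶃")
tokens = {t["token"] for t in result["tokens"]}
missing = expected - tokens
assert not missing, f"still dropped: {missing}"
print("all extension-block characters survived tokenization")
```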
