No Result - Search for unicode character in fuzzy search when using use_fast_fuzzy #80

Open
keywan-ghadami-oxid opened this issue Jan 31, 2022 · 4 comments

Comments

@keywan-ghadami-oxid

A fuzzy search for a word containing a special character does not return any results.

how to reproduce

I put two documents into the index (together with a lot of others): one containing the German word "Fußbodenheizung", which contains the special character 'ß', and another with a slightly wrong spelling, "Fusbodenheizung".

When the index was created with use_fast_fuzzy enabled, a fuzzy query does not give the expected result:

{
  "query": {
    "fuzzy": { "ctx": "Fußbodenheizung" }
  }
}

-> no hits

Searching without the special character finds one hit instead of two:

{
  "query": {
    "fuzzy": { "ctx": "Fubodenheizung" }
  }
}

-> one hit "Fusbodenheizung"

expected

When the index was created with use_fast_fuzzy=false, the behavior is as expected:

{
  "query": {
    "fuzzy": { "ctx": "Fußbodenheizung" }
  }
}

-> two hits "Fußbodenheizung" and "Fusbodenheizung"

and for the query

{
  "query": {
    "fuzzy": { "ctx": "Fubodenheizung" }
  }
}

-> two hits "Fußbodenheizung" and "Fusbodenheizung"

A normal query finds one hit, as expected:

{
  "query": {
    "normal": { "ctx": "Fußbodenheizung" }
  }
}

-> one hit "Fußbodenheizung"

@ChillFish8
Collaborator

Are you able to hit the /indexes/:index/hint endpoint with the payload:

{
  "query": "Fußbodenheizung"
}

and send the result back?
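
For anyone reproducing this, the request above could be sent with a small script along these lines. This is only a sketch: the base URL `http://localhost:8000`, the index name `my_index`, the use of POST, and the helper names `build_hint_request` and `fetch_hint` are all assumptions for illustration. Only the `/indexes/:index/hint` path and the payload shape come from this thread.

```python
import json
import urllib.parse
import urllib.request


def build_hint_request(base_url: str, index: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a request for the /indexes/:index/hint endpoint.

    Hypothetical helper: the HTTP method (POST) is an assumption.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    path = f"/indexes/{urllib.parse.quote(index, safe='')}/hint"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_hint(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Assumed server address and index name; adjust to your setup.
req = build_hint_request("http://localhost:8000", "my_index", "Fußbodenheizung")
# print(fetch_hint(req))  # uncomment when a server is reachable
```

Building the request separately from sending it makes it easy to inspect the exact URL and body before anything goes over the wire.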

@keywan-ghadami-oxid
Author

{"status":200,"data":{"hint":"fussbodenheizung"}}

@ChillFish8
Collaborator

Hmm, looks like something weird is going on with the Unicode normalisation and/or the handling of the document data.

Are you able to share an example dataset?

@ChillFish8
Collaborator

Don't worry about sharing the dataset; I see what's happening.

This seems to be an interesting behaviour of our Unicode decoder, which gets used by the fast-fuzzy system.
It is essentially translating Fußbodenheizung into Fussbodenheizung, which IIRC is its ASCII equivalent (at least it thinks it is). What is interesting, though, is that the system then doesn't match on the values when it should.
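
To illustrate why a match would be expected here, the following Python sketch folds the query with `str.casefold()` (which maps 'ß' to 'ss', giving the same string as the hint response above) and computes plain Levenshtein distances to the two indexed words. It is not the project's actual decoder or fuzzy matcher; the `fold` and `edit_distance` helpers are hypothetical stand-ins, and the assumption is simply that both query and indexed terms pass through the same folding step.

```python
def fold(term: str) -> str:
    """Stand-in for the Unicode-to-ASCII step: casefold maps 'ß' to 'ss'."""
    return term.casefold()


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


query = fold("Fußbodenheizung")

# Indexed terms that were lowercased but NOT folded.
raw_terms = ["fußbodenheizung", "fusbodenheizung"]
# The same terms after the same folding step as the query.
folded_terms = [fold(t) for t in raw_terms]

distances_raw = {t: edit_distance(query, t) for t in raw_terms}
distances_folded = {t: edit_distance(query, t) for t in folded_terms}

print(query)
print(distances_raw)
print(distances_folded)
```

If the indexed terms are folded the same way as the query, the two words sit at distances 0 and 1 from it, so a fuzzy match on both would be expected; against unfolded terms, the 'ß' spelling drifts to distance 2.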
