Distance with special characters #114

Closed
erikradisch opened this issue Apr 8, 2018 · 3 comments

Comments

@erikradisch

It seems to me that edlib does not calculate the right distance when the input contains special characters (with diacritics).
For example:
übund - ubung should have a distance of 1, but I end up with 3. Is this a bug, or is it intended?

@Martinsos
Owner

Hi @erikradisch, thanks for reaching out!
Please check these similar issues: #109, #104, #79, #89. Each of them should hold an answer to your question, with a more detailed explanation and suggestions from my side.
In short: unfortunately, edlib does not support multibyte characters for now. ü gets represented as two chars, and therefore the edit distance is not what you would expect. It is not a bug; it is expected behaviour at the moment. However, I do plan to add a feature to support multibyte strings, and actually any type of sequence, very soon.
By the way, would you mind answering a question (I am trying to understand better how people use Edlib): are you using edlib as a Python package or as a C/C++ library? What is the main purpose you are using it for? Thanks!
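
To illustrate the multibyte effect described above, here is a minimal pure-Python sketch. It does not call edlib; `levenshtein` is a hypothetical helper written only for this illustration, and the assumption is that the strings are compared as UTF-8 bytes, where ü (U+00FC) takes two bytes. Under that assumption, the byte-level distance for the strings from the report comes out as 3, matching what was observed.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance over any two sequences.

    Hypothetical helper for illustration only; this is not edlib's algorithm.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# "\u00fc" is the precomposed character ü; it encodes to two bytes in UTF-8.
query, target = "\u00fcbund", "ubung"

by_chars = levenshtein(query, target)                                  # 2
by_bytes = levenshtein(query.encode("utf-8"), target.encode("utf-8"))  # 3

print(by_chars, by_bytes)
```

Compared character by character, the two strings exactly as typed differ in two positions (ü/u and d/g), so the character-level distance is 2; for "übung" versus "ubung" it would be 1. The extra unit at the byte level comes from the second byte of ü.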

@erikradisch
Author

Sure! I use the Python package. I use it to align historical place names to a gazetteer. Your algorithm has two huge pluses. First, it can be aborted if the Levenshtein distance reaches a limit. Second, you can define additional equalities, which is very important, because historical place names contain a lot of predictable differences that are in fact equalities (c instead of k, for example).
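
The two features mentioned above can be sketched in plain Python. This is not edlib's implementation or API: `bounded_distance`, its parameters `k` and `equalities`, and the `-1` sentinel are assumptions chosen for this illustration, and the place names are made-up examples.

```python
def bounded_distance(a, b, k=-1, equalities=()):
    """Edit distance with extra equalities and an optional cutoff k.

    Hypothetical helper for illustration only. Returns -1 as soon as the
    distance is known to exceed k (k < 0 means no limit).
    """
    # Treat each listed pair as equal in both directions.
    equal = set()
    for x, y in equalities:
        equal.add((x, y))
        equal.add((y, x))

    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            same = x == y or (x, y) in equal
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (0 if same else 1)))
        # Row minima never decrease, so once a whole row exceeds k
        # the final distance must exceed k as well.
        if k >= 0 and min(curr) > k:
            return -1
        prev = curr
    return prev[-1] if k < 0 or prev[-1] <= k else -1


plain = bounded_distance("coburg", "koburg")                             # 1
with_eq = bounded_distance("coburg", "koburg", equalities=[("c", "k")])  # 0
cut_off = bounded_distance("coburg", "hamburg", k=1)                     # -1

print(plain, with_eq, cut_off)
```

With the c/k pair declared equal, the two spellings match exactly, and the cutoff lets a clearly unrelated candidate be rejected without filling in the whole table.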

@Martinsos
Owner

Awesome, thanks for sharing, and it is nice to hear those features are useful :). Once we do #90 and #77, edlib should support your use case even better.
