change project description to reflect edlib only aligns bytes #123
Comments
Hi @bertsky, I did the following:
While this means there is still no support for multi-byte character sequences in the C++ version, we now have a solution from @jbaiter that handles such sequences in Python (as long as the total number of unique values is <= 256). That should cover most cases, and if it does not, we throw an error, which is also good. I believe this should be enough! Also, if you wish to contribute by adding support for multi-byte sequences in C++, please let me know and I can give some guidelines. I will close this issue for now, but please let me know if anything is wrong and we can continue the discussion or reopen the issue.
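To illustrate the idea of mapping an arbitrary alphabet onto byte values, here is a minimal sketch. It is not the code contributed by @jbaiter; the function name `map_to_bytes` and the exact error type are assumptions made for this example only.

```python
def map_to_bytes(query, target):
    """Map every distinct character of both strings to one byte value.

    Hypothetical helper: fails loudly when more than 256 distinct
    characters occur, because a single byte cannot represent them.
    """
    alphabet = sorted(set(query) | set(target))
    if len(alphabet) > 256:
        raise ValueError(
            "more than 256 distinct characters; cannot map to single bytes"
        )
    table = {ch: i for i, ch in enumerate(alphabet)}
    return (
        bytes(table[ch] for ch in query),
        bytes(table[ch] for ch in target),
    )


# Each character, ASCII or not, now occupies exactly one byte.
q, t = map_to_bytes("Bär", "Bar")
print(len(q), len(t))
```

After the mapping, both sequences have one byte per character, so a byte-wise aligner sees the same positions a character-wise aligner would. The limit of 256 distinct values remains.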
@Martinsos, I fully understand – this is probably the most natural gotcha when providing Python bindings for
instead of
I am not so convinced by the approach of automatic alphabet mapping in the Python version, though. Having an implicit restriction on the alphabet size, however large it may be (and 256 is not particularly large for practical purposes in many scripts, including Latin-based ones), only hides the problem for a while: instead of forcing me to choose another library right away, it has me implement something based on edlib, and wait for the error to appear when I begin testing (if I am lucky). I would rather recommend making the Python version overtly based on
Hi @Martinsos,
I have run into the same problem as #79, #89, #104, #109, #114. That is: if the inputs contain non-ASCII characters, then the alignment (as represented by the CIGAR string) will be wrong.
The same goes for additionalEqualities, which merely takes the first byte and ignores the rest of the character.
Of course I could encode my strings into byte sequences and pass those instead. But this really is not the same problem – I do need minimal character/codepoint alignments (and character equalities cannot be expressed with isolated bytes at all). And it is definitely not what the module is announced to be doing.
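The difference between byte-level and codepoint-level alignment can be shown without edlib at all. The sketch below uses a plain dynamic-programming Levenshtein distance (a stand-in written for this example, not edlib's implementation) on the same pair of words, once as strings and once as UTF-8 bytes.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str or bytes), row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # match or substitution
            ))
        prev = curr
    return prev[-1]


s1, s2 = "Bar", "Bär"

# Codepoints: one substitution (a -> ä).
by_codepoint = levenshtein(s1, s2)

# UTF-8 bytes: ä is two bytes, so the same change costs two edits.
by_byte = levenshtein(s1.encode("utf-8"), s2.encode("utf-8"))

print(by_codepoint, by_byte)
```

The byte-level result counts a substitution plus an insertion for what is a single character substitution, which is why a CIGAR string computed over bytes does not describe the character alignment.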
All issues mentioned above have been closed already, but as long as the code has not been generalised to work on arbitrary characters (as you said you will), this is either an open bug or a misrepresentation of the module's capabilities.
Please change the (Python) module description (and docstring) and README.md to reflect that, or others will keep running into this as well.
Otherwise: great project, thanks!