Make sequence more abstract, so it can be anything, not just array of chars. #90

Martinsos · 2017-08-21T20:57:52Z

Multiple people were asking about support for multibyte characters (Unicode).
One way to provide that, and even more, is to make a sequence an array of objects that have an equality operator defined over them, instead of an array of chars.

What would the impact on speed be in this case? I think it would not be big, since the elements are only used to calculate Peq anyway, and after that only Peq is used.
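To make the Peq argument concrete, here is a minimal sketch, assuming Peq is the per-symbol match mask used by the bit-vector algorithm (bit i of the mask for symbol s is set when query[i] equals s). The names `indexOf` and `buildPeq` are hypothetical and not taken from the Edlib sources, and the sketch is limited to queries of at most 64 elements. It only illustrates that element comparison happens while the masks are built; afterwards only integers are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps an element to a small integer index using only operator==.
// Linear search keeps the requirement on Element minimal
// (no hashing, no ordering). Unseen elements are appended.
template <class Element>
std::size_t indexOf(std::vector<Element>& alphabet, const Element& e) {
    for (std::size_t i = 0; i < alphabet.size(); ++i) {
        if (alphabet[i] == e) return i;
    }
    alphabet.push_back(e);
    return alphabet.size() - 1;
}

// Builds one 64-bit match mask per alphabet symbol:
// bit i of peq[s] is set when query[i] equals symbol s.
// Assumption for this sketch: the query has at most 64 elements.
template <class Element>
std::vector<std::uint64_t> buildPeq(const std::vector<Element>& query,
                                    std::vector<Element>& alphabet) {
    assert(query.size() <= 64);
    // First pass: this is the only place where elements are compared.
    std::vector<std::size_t> indices;
    indices.reserve(query.size());
    for (const Element& e : query) {
        indices.push_back(indexOf(alphabet, e));
    }
    // Second pass: pure integer work.
    std::vector<std::uint64_t> peq(alphabet.size(), 0);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        peq[indices[i]] |= std::uint64_t{1} << i;
    }
    return peq;
}
```

The linear search costs time proportional to the alphabet size per element, so for large alphabets a hash map would be the obvious alternative, at the price of also requiring a hash function for the element type.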

Would it make it harder to use edlib for the usual cases? Would it become too general and hard to use for strings? How could we make sure it is still easy to use while offering flexibility?

Finally, this might be easier to implement if I first decide to go with just a C++ interface, so I should think about that first.

Martinsos · 2018-02-03T13:18:37Z

So far there have been 3 issues asking for multibyte support, so I assigned the important label to this feature, as it seems to be important to users.


This is a workaround until #90 is implemented. If either query or target contains non-ASCII values, they are mapped into an ASCII alphabet and the resulting byte sequences are used for doing the alignment. This works only if the whole alphabet does not have more than 256 characters.
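The remapping idea can be sketched as follows. This is an illustration in C++ and not the code of the workaround itself; `ByteRemapper` is a hypothetical name, and the sketch simply assigns byte values 0 to 255 in order of first appearance.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: assigns each distinct code point the next free
// byte value, so that equal symbols always get equal bytes.
class ByteRemapper {
public:
    std::vector<unsigned char> remap(const std::u32string& seq) {
        std::vector<unsigned char> out;
        out.reserve(seq.size());
        for (char32_t cp : seq) {
            auto it = codes_.find(cp);
            if (it == codes_.end()) {
                // A byte can only distinguish 256 symbols.
                if (codes_.size() >= 256) {
                    throw std::length_error("more than 256 distinct symbols");
                }
                it = codes_.emplace(
                    cp, static_cast<unsigned char>(codes_.size())).first;
            }
            out.push_back(it->second);
        }
        assert(codes_.size() <= 256);
        return out;
    }

private:
    std::map<char32_t, unsigned char> codes_;
};
```

Query and target have to go through the same remapper instance, otherwise the same symbol could end up with different byte values in the two sequences. The resulting byte buffers can then be handed to a char-based aligner.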

Martinsos · 2019-05-04T22:13:15Z

With @jbaiter's addition to the Python version of Edlib this issue is less pressing, but it should still be the next one to do.

Martinsos · 2020-01-15T08:53:17Z

This is also linked to #141 (Unicode support in python edlib).

Martinsos · 2020-09-26T13:24:55Z

@masri2019 has been working on this for some time now with a little bit of my guidance, so I will document here what has been done and what is yet to be done to call this feature complete!

Replacing edlib.h and edlib.cpp with edlib.hpp and edlib.tpp. Additional equalities do not work yet, tests and aligner are not updated, and C interface does not exist anymore. Python binding is also not updated. DONE WITH Supporting Generic Sequences #148 .
Updating tests and aligner. DONE WITH Modifying tests and aligner-app #150 .
Update additional equalities to work again. DONE WITH Enabling additional equalities for generic sequences (using sets) #154 . CPP codebase is now fully working. Performance seems to keep up.
Update documentation for C/CPP (README.md).
Update python binding so it works with new CPP implementation, and also update its documentation. Check https://www.benjack.io/2018/02/02/python-cpp-revisited.html, might be helpful.
Consider if we should add C wrappers or not. If yes, we add cedlib.h and cedlib.cpp files, which will be small and short. Possibly we add a test or two, since it is merely using the CPP interface and most of the stuff is checked at compile time. We don't do performance testing, since it is not needed. We also add docs for it. If we don't do this now, we can always do it later. Check Edlib and C (and C++) #80 also, I was pondering more about this there.
Do final polishing, check that CI is passing, possibly run some final performance checks, and release new version (both cpp/c and python), with version bumped to 2.0.0 due to the new interface.
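As a rough illustration of how thin such a C wrapper could be, here is a sketch under stated assumptions: `genericDistance` is a hypothetical stand-in for the templated C++ entry point (a plain two-row Levenshtein, not Edlib's algorithm), and `cedlibDistanceChars` is an invented name. The wrapper only instantiates the template for `char` and keeps C++ exceptions from crossing the C boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for the templated C++ interface (hypothetical, simplified).
template <class Element>
int genericDistance(const Element* query, int queryLength,
                    const Element* target, int targetLength) {
    assert(queryLength >= 0 && targetLength >= 0);
    std::vector<int> prev(targetLength + 1), curr(targetLength + 1);
    for (int j = 0; j <= targetLength; ++j) prev[j] = j;
    for (int i = 1; i <= queryLength; ++i) {
        curr[0] = i;
        for (int j = 1; j <= targetLength; ++j) {
            int subst = prev[j - 1] + (query[i - 1] == target[j - 1] ? 0 : 1);
            curr[j] = std::min(subst, std::min(prev[j], curr[j - 1]) + 1);
        }
        prev.swap(curr);
    }
    return prev[targetLength];
}

// What a function in a cedlib.cpp-style file might look like: C linkage,
// fixed element type, errors reported as -1 instead of exceptions.
extern "C" int cedlibDistanceChars(const char* query, int queryLength,
                                   const char* target, int targetLength) {
    if (queryLength < 0 || targetLength < 0) return -1;
    try {
        return genericDistance<char>(query, queryLength, target, targetLength);
    } catch (...) {
        return -1;
    }
}
```

Since the wrapper contains almost no logic of its own, a test or two on top of the C++ test suite is probably enough, as noted in the checklist.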

We are using a "big" feature branch, gen-seqs, where we are collecting these changes, and we will merge it back into master once it is done.

Additional ideas/considerations:

Document requirements on templates (Element has to have == operator)?
Make sure CMAKE is in good shape (best to do this when rebasing on master?).
In some internal functions, consider using name Element instead of AlphabetIdx, if they don't need to know it is AlphabetIdx.
Consider making API more C++ish (vectors, strings, ...). We could overload edlibAlign method to take different types of parameters (cpp string, vector, ...). We could also make returned types and other structures that we use nicer. This all should probably be tackled as a separate issue due to the amount of changes.
Ensure there are no 'using namespace' in header files.
Try including edlib.hpp two times, from two different files, and see if we get a double definitions error! Make sure we have an automatic test for this and that it is run by CI. @masri2019 already implemented the test (https://github.com/Martinsos/edlib/tree/gen-seqs/test/testMultiDefinition), however I am not sure how to best make it part of CI; that is what needs further consideration.
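Two of the ideas above, documenting the `==` requirement on Element and offering more C++-ish overloads, could look roughly like the sketch below. All names (`HasEquality`, `editDistance`) are hypothetical, and the core routine is a plain Levenshtein stand-in, not Edlib's bit-vector implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Detects whether `a == b` compiles for T and converts to bool.
template <class T, class = void>
struct HasEquality : std::false_type {};

template <class T>
struct HasEquality<T, typename std::enable_if<std::is_convertible<
    decltype(std::declval<const T&>() == std::declval<const T&>()),
    bool>::value>::type> : std::true_type {};

// Pointer-based core routine; the static_assert turns a missing
// operator== into a readable compile-time message.
template <class Element>
int editDistance(const Element* query, std::size_t queryLength,
                 const Element* target, std::size_t targetLength) {
    static_assert(HasEquality<Element>::value,
                  "Element must provide operator==");
    assert(query != nullptr || queryLength == 0);
    assert(target != nullptr || targetLength == 0);
    std::vector<int> prev(targetLength + 1), curr(targetLength + 1);
    for (std::size_t j = 0; j <= targetLength; ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= queryLength; ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= targetLength; ++j) {
            int subst = prev[j - 1] + (query[i - 1] == target[j - 1] ? 0 : 1);
            curr[j] = std::min(subst, std::min(prev[j], curr[j - 1]) + 1);
        }
        prev.swap(curr);
    }
    return prev[targetLength];
}

// Convenience overload for vectors of any element type.
template <class Element>
int editDistance(const std::vector<Element>& query,
                 const std::vector<Element>& target) {
    return editDistance(query.data(), query.size(),
                        target.data(), target.size());
}

// Convenience overload for std::string, keeping the usual case simple.
inline int editDistance(const std::string& query, const std::string& target) {
    return editDistance(query.data(), query.size(),
                        target.data(), target.size());
}
```

With overloads like these, the common string case stays a one-liner while the pointer-based form remains available for arbitrary element types.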

Martinsos · 2021-08-31T18:31:47Z

Hey @masri2019, how are you doing? We made great progress with this one and then stopped -> are you still interested in possibly continuing with it, how are you with time?

mobinasri · 2021-09-03T00:35:36Z

Hi Martin! Thanks for asking. Yes I'm definitely interested in finishing what we have started. I have been busy doing some other projects but I can plan to dedicate some time to edlib. Based on what you sent, the next step is updating the readme. I'll create a pull request for that.


-Mobin


Martinsos · 2021-09-03T13:01:16Z

@masri2019 that is awesome :)!! I will also do my best to help you, I believe the two of us can finish it together. If needed I can involve myself more, I should also be able to carve out some time.

Yes, the next step is README based on the checklist I created above (which I am now really happy I made because I would have no idea where we stopped otherwise :D). And then python bindings. I am sure we can get both of those done.

Next will be discussion about C wrapper, that might be a bit harder, but ok that is also doable. And then final polishing!

Altogether, it sounds like we (you) did the hardest part already, so I am really looking forward to this. Although, you know how they say: the last 20% takes 80% of the time. But let's hope in this case the percentages will be gentle to us.

Martinsos · 2021-09-03T13:02:24Z

@masri2019 I am guessing it might be a bit hard getting back into it after so much time, so I would advise you do what you can and if you get stuck somewhere no worries, make a draft PR and I can jump in, we will figure it out together. I also forget a lot of things but I am sure we will remember it relatively quickly, since we were writing pretty nice code.

Martinsos self-assigned this Aug 21, 2017

Martinsos added enhancement feature request labels Aug 21, 2017

Martinsos added the important label Feb 3, 2018

Martinsos mentioned this issue Mar 31, 2018

Failure to align strings represented in Unicode (misleading results) #109

Closed

Martinsos mentioned this issue Apr 11, 2018

Distance with special characters #114

Closed

jbaiter mentioned this issue Mar 10, 2019

Python: Add support for arbitrary sequences of hashable objects #128

Merged

Martinsos assigned Martinsos and unassigned Martinsos Sep 26, 2020

Martinsos mentioned this issue Sep 26, 2020

Enabling additional equalities for generic sequences (using sets) #154

Merged

Martinsos pinned this issue Aug 31, 2021

Martinsos added the help wanted label Aug 31, 2021
