-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling additional equalities for generic sequences (using sets) #154
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @masri2019 ! I think we are there, we just need a little bit of reorganization of the code, the main comment on it is here #154 (comment) and there are also other supporting comments.
edlib/src/edlib.tpp
Outdated
makeEqualitySet<Element, AlphabetIdx>(elementToAlphabetIdx, | ||
config.additionalEqualities, | ||
config.additionalEqualitiesLength); | ||
EqualityDefinition<AlphabetIdx> equalityDefinition(equalitySet, hasAdditionalEquality); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at all this, I think the approach / algorithm is good, however it feels (as mentioned in some comments above) that we could encapsulate some things better, so let's see how to do that.
We have config.additionalEqualities array, we have its size, and we have elementToAlphabetIdx map. If I am correct, that is all that we need to construct the EqualityDefinition object, that will then allow us to call .areEqual(a, b) to check if two alphabet indexes are equal, right?
If that is so, then in edlibAlign function, we would ideally have only the following logic: passing inputs (config.additionalEqualities + their size, elementToAlphabetIdx, maybe smth else I missed) to a function, and receiving as a result EqualityDefinition object! Nothing more, no internals exposed. Then, we use that object where needed.
Now, the function that generates EqualityDefinition -> it could either be constructor of EqualityDefinition, in which case most of the logic will go into constructor, or it could be standalone function and constructor of EqualityDefinition will be as simple as it is now. I personally prefer doing stuff in constructor, but I also heard opposite opinions so I am not 100% sure at the moment what is better approach, probably it is not very important.
Also, in this approach, important thing is that EqualityDefinition would worry about allocating and deallocating memory it needs, reducing chance of memory leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Martinsos Let me explain why I put the function that generates EqualityDefinition
out of its constructor. As you mentioned the EqualityDefinition
has just one method and that is areEqual
. When we call .areEqual
we have to pass two variables of type AlphabetIdx
to be compared so knowing the custom type of AlphabetIdx
is enough and there is no need to carry Element
. areEqual
has nothing to do with Element
.
On the other hand, the pairs in additionalEqualities
are of type Element
and generating EqualityDefintion
needs mapping those pairs from the type Element
to AlphabetIdx
. If we put this mapping procedure in the constructor we have to add a template argument to EqualityDefinition
for Element
, which is needed just for generating the object.
I decided to avoid adding this template argument. Maybe you have better idea? (or you think we can have it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on what you said I put all the procedure for generating EqualityDefintion
in a single function.
This is the new commit -> 4e6c4dd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, good point with the template argument, I missed that!
Makes sense then, I agree with the approach of external constructor. Although it maybe somewhat complicates the situation with allocating the memory hm, since I think now we might have some copying happening with vectors/sets, although I am not 100% sure. While if it was happening in constructor, it would be easier, because vector/set would get allocated as object is constructed and you would just fill them and that is it.
Hm here is a thought: if we first transformed additionalEqualities
from EdlibEqualityPair<Element>*
into vector<EdlibEqualityPair<AlphabetIdx>>
, which is very easy to do, we could just pass that one vector to constructor of EqualityDefinition
and that is it! It would take that vector by reference, so no copying, and then it would just fill the set and vector it already has as private members.
Hm this sounds great to me, what do you think? So to summarize:
- Very simple function that takes
EdlibEqualityPair<Element>*
and itslength
andelementToAlphabetIdx
and returnsvector<EdlibEqualityPair<AlphabetIdx>>
. So basically just transforms array of elements to vector of alphabet indexes. Or maybe we can even inline this code, whatever you like better. - Constructor in
EqualityDefinition
that takes onlyvector<EdlibEqualityPair<AlphabetIdx>>
by reference and constructsequalitySet
andhasAdditionalEqualities
. It doesn't even have to call that template argumentAlphabetIdx
, it can call it anything at this point, justT
, since it is really in no way connected toAlphabetIdx
per se any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All agreed!
edlib/src/edlib.tpp
Outdated
bool areEqual(AlphabetIdx a, AlphabetIdx b) const { | ||
return a == b; | ||
return (a == b) || (hasAdditionalEqualities[a] && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, and nice formatting!
One thing only: I prefer, and I believe I am doing it so in the rest of the codebase (at least I hope so), to put newline before the binary operator when I have to break the line into multiple lines. So let's do that? (Put &&
at start of the line, not the end.)
There is also good discussion here that I believe nicely demonstrates pros and cons of two approaches: https://softwareengineering.stackexchange.com/questions/160522/should-i-put-newlines-before-or-after-binary-operators .
edlib/src/edlib.tpp
Outdated
SetOfPairs<AlphabetIdx> equalitySet; | ||
std::vector<bool> hasAdditionalEqualities; | ||
public: | ||
EqualityDefinition(const SetOfPairs<AlphabetIdx>& equalitySet_, const std::vector<bool>& hasAdditionalEqualities_){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I am wondering here is, if you take these sets/vector by reference, what happens exactly? I believe if it was not reference, they could get completely copied, together with all the contents, and new memory would be allocated on heap, that would live as long as this object lives. However, in this case with reference hm, I am not sure what happens. Maybe they also get copied in that fashion when you do assignment below? Yes, that is probably so I think.
This means they will survive as long as they should, but we do have one copying that maybe could have been avoided. I am not sure what is the impact though, probably not very big.
edlib/src/edlib.tpp
Outdated
} | ||
}; | ||
|
||
/** | ||
* Takes additional equalities for alphabet elements, transforms them into index equalities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Takes additional equalities for alphabet elements, transforms them into index equalities | |
* Builds equality definition from additional equalities. |
I think this is enough, who wants to know more can check additional equalities, no need for us to explain here what they do.
Btw, we could even completely remove this description, it would still be clear from the name.
Although I might vote then to rename function to buildEqualityDefinition, because "build" indicates that something resource intensive might be happening, at least I perceive it so.
edlib/src/edlib.tpp
Outdated
} | ||
}; | ||
|
||
/** | ||
* Takes additional equalities for alphabet elements, transforms them into index equalities | ||
* @param [in] additionalEqualitiesLength |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add other params?
edlib/src/edlib.tpp
Outdated
makeEqualitySet<Element, AlphabetIdx>(elementToAlphabetIdx, | ||
config.additionalEqualities, | ||
config.additionalEqualitiesLength); | ||
EqualityDefinition<AlphabetIdx> equalityDefinition(equalitySet, hasAdditionalEquality); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, good point with the template argument, I missed that!
Makes sense then, I agree with the approach of external constructor. Although it maybe somewhat complicates the situation with allocating the memory hm, since I think now we might have some copying happening with vectors/sets, although I am not 100% sure. While if it was happening in constructor, it would be easier, because vector/set would get allocated as object is constructed and you would just fill them and that is it.
Hm here is a thought: if we first transformed additionalEqualities
from EdlibEqualityPair<Element>*
into vector<EdlibEqualityPair<AlphabetIdx>>
, which is very easy to do, we could just pass that one vector to constructor of EqualityDefinition
and that is it! It would take that vector by reference, so no copying, and then it would just fill the set and vector it already has as private members.
Hm this sounds great to me, what do you think? So to summarize:
- Very simple function that takes
EdlibEqualityPair<Element>*
and itslength
andelementToAlphabetIdx
and returnsvector<EdlibEqualityPair<AlphabetIdx>>
. So basically just transforms array of elements to vector of alphabet indexes. Or maybe we can even inline this code, whatever you like better. - Constructor in
EqualityDefinition
that takes onlyvector<EdlibEqualityPair<AlphabetIdx>>
by reference and constructsequalitySet
andhasAdditionalEqualities
. It doesn't even have to call that template argumentAlphabetIdx
, it can call it anything at this point, justT
, since it is really in no way connected toAlphabetIdx
per se any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@masri2019 Ok we are really close, but I just realized something based on your explanation of why you used external constructor, please check #154 (comment), I think this is the way to go at the end. Other comments are minor.
edlib/src/edlib.tpp
Outdated
* @return equalitySet as a set of all equality pairs where each pair contains the indexes of an additional equality | ||
*/ | ||
template<class Element, class AlphabetIdx> | ||
EqualityDefinition<AlphabetIdx> makeEqualityDefinition(const EdlibAlignConfig<Element> config, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat confused with the implementation here, you are taking both config and then additionalEqualities and their length! I would say you need either config or additionalEqualities and their length, not both at the same time, right?
ALso, below you have two for loops, why? Why not just have one for loop, merge them together?
Btw check this comments, I think this is the way to go: #154 (comment) .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great I think this is it, just couple of small comments, one more important (one with assert).
edlib/src/edlib.tpp
Outdated
elementToAlphabetIdx, | ||
config.additionalEqualities, | ||
config.additionalEqualitiesLength); | ||
EqualityDefinition<AlphabetIdx>(additionalEqualities, elementToAlphabetIdx.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you can just do inline: EqualityDefinition<AlphabetIdx>(additionalEqualities, elementToAlphabetIdx.size());
, no need for assignment, you are just doing redundant copying possibly.
for(int i = 0; i < additionalEqualitiesLength; i++){ | ||
Element a = additionalEqualities[i].first; | ||
Element b = additionalEqualities[i].second; | ||
if (elementToAlphabetIdx.count(a) && elementToAlphabetIdx.count(b)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If correct arguments are given, this check should not fail, right?
What I am worried about is if this check fails, we will never know -> it will be a silent fail!
If it fails, it means we made a mistake in our program, it is not users fault, in this case.
So the whole thing should crash. How about we put assert here, instead of this if-else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I did this was that when at least one of the elements in an equality is not present in the alphabet, we can remove that equality. I thought that maybe the purpose of test12
in runTests.cpp
is to ignore such equalities. Could you check that test? Maybe I have to change that test too?
Lines 427 to 442 in ba784e5
bool test12() { | |
EdlibEqualityPair additionalEqualities[24] = {{'R','A'},{'R','G'},{'M','A'},{'M','C'},{'W','A'},{'W','T'}, | |
{'S','C'},{'S','G'},{'Y','C'},{'Y','T'},{'K','G'},{'K','T'}, | |
{'V','A'},{'V','C'},{'V','G'},{'H','A'},{'H','C'},{'H','T'}, | |
{'D','A'},{'D','G'},{'D','T'},{'B','C'},{'B','G'},{'B','T'}}; | |
const char* query = "GCATATCAATAAGCGGAGGA"; | |
const char* target = "TAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTATCGAATAAACTTGATGGGTTGTCGCTGGCTTCTAGGAGCATGTGCACATCCGTCATTTTTATCCATCCACCTGTGCACCTTTTGTAGTCTTTGGAGGTAATAAGCGTGAATCTATCGAGGTCCTCTGGTCCTCGGAAAGAGGTGTTTGCCATATGGCTCGCCTTTGATACTCGCGAGTTACTCTAAGACTATGTCCTTTCATATACTACGAATGTAATAGAATGTATTCATTGGGCCTCAGTGCCTATAAAACATATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATTAAATTCTCAACCCCTTCCGGTTTTTTGACTGGCTTTGGGGCTTGGATGTGGGGGATTCATTTGCGGGCCTCTGTAGAGGTCGGCTCCCCTGAAATGCATTAGTGGAACCGTTTGCGGTTACCGTCGCTGGTGTGATAACTATCTATGCCAAAGACAAACTGCTCTCTGATAGTTCTGCTTCTAACCGTCCATTTATTGGACAACATTATTATGAACACTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACAAGGATTCCCCTAGTAACTGCGAGTGAAGCGGGAAAAGCTCAAATTTAAAATCTGGCGGTCTTTGGCCGTCCGAGTTGTAATCTAGAGAAGCGACACCCGCGCTGGACCGTGTACAAGTCTCCTGGAATGGAGCGTCATAGAGGGTGAGAATCCCGTCTCTGACACGGACTACCAGGGCTTTGTGGTGCGCTCTCAAAGAGTCGAGTTGTTTGGGAATGCAGCTCTAAATGGGTGGTAAATTCCATCTAAAGCTAAATATTGGCGAGAGACCGATAGCGAACAAGTACCGTGAGGGAAAGATGAAAAGAACTTTGGAAAGAGAGTTAAACAGTACGTGAAATTGCTGAAAGGGAAACGCTTGAAGTCAGTCGCGTTGGCCGGGGATCAGCCTCGCTTTTGCGTGGTGTATTTCCTGGTTGACGGGTCAGCATCAATTTTGACCGCTGGAAAAGGACTTGGGGAATGTGGCATCTTCGGATGTGTTATAGCCCTTTGTCGCATACGGCGGTTGGGATTGAGGAACTCAGCACGCCGCAAGGCCGGGTTTCGACCACGTTCGTGCTTAGGATGCTGGCATAATGGCTTTAATCGACCCGTCTTGAAACACGGACCAAGGAGTCTAACATGCCTGCGAGTGTTTGGGTGGAAAACCCGAGCGCGTAATGAAAGTGAAAGTTGAGATCCCTGTCGTGGGGAGCATCGACGCCCGGACCAGAACTTTTGGGACGGATCTGCGGTAGAGCATGTATGTTGGGACCCGAAAGATGGTGAACTATGCCTGAATAGGGTGAAGCCAGAGGAAACTCTGGTGGAGGCTCGTAGCGATTCTGACGTGCAAATCGATCGTCAAATTTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCTGCCGAAGTTTCCCTCAGGATAGCAGAAACTCATATCAGATTTATGTGGTAAAGCGAATGATTAGAGGCCTTGGGGTTGAAACAACCTTAACCTATTCTCAAACTTTAAATATGTAAGAACGAGCCGTTTCTTGATTGAACCGCTCGGCGATTGAGAGTTTCTAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGAACGCGAGGTTAAGGTGCCGGAATTCACGCTCATCAGACACCACAAAAGGTGTTAGTTCATCTAGACAGCAGGACGGTGGCCATGGAAGTCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATGAACTAGCCCTGAAAATGGATGGCGCTTAAGCGTGATACCCATACCTCGCCGTCAGCGTTGAAGTGACGCGCTGACGAGTAGGCAGGCGTGGAGGTCAGTGAAGAAGCCTTGGCAGTGATGCTGGGTGAAACGGCCTCC"; | |
EdlibAlignResult result = edlibAlign(query, static_cast<int>(std::strlen(query)), | |
target, static_cast<int>(std::strlen(target)), | |
edlibNewAlignConfig(-1, EDLIB_MODE_HW,EDLIB_TASK_LOC, additionalEqualities, 24)); | |
bool pass = result.status == EDLIB_STATUS_OK && result.editDistance == 0; | |
printf(pass ? "\x1B[32m""OK""\x1B[0m\n" : "\x1B[31m""FAIL""\x1B[0m\n"); | |
edlibFreeAlignResult(result); | |
return pass; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ufff you are right! I am being silly, yes let's keep it this way!
I somehow missed the changes you made following my comments! So, all is good I believe and we can merge, but I see that continuous integration on AppVeyor is failing, with error |
Or maybe we just shouldn't care about VS 10 and drop it from AppVeyor check hm! I like that better, if people complain we can see what to do, but I don't like not using modern features just to support very old systems. |
@masri2019 ok, I figured it out! I updated AppVeyor (our CI service for Windows) to use newer versions of Visual Studio, so hopefully that will help. I did that update on master, so you could cherry pick it here if you want or we will rebase later. This PR is ready to merge, but could you first squash the commits? Let's squash them into optimaly one (or multiple if there are distinct sets of changes) commit, and then I can merge it. |
Also, this change just happened on master: #158 . Since we did big changes, I wouldn't want us to miss that and override once we merge, so maybe we should do it now here also? Or we write it down and remember later when merging to keep an eye out for it. |
Defined hash functions for std::pair to be able to create set of pairs, implemented EqualityDefinition to support generic types and updated tests and aligner app accordingly
Hi @Martinsos, My apologies for being so late. Because of fire evacuation order in California and its conflict with some of my university deadlines I completely missed to finish this PR. I just added a commit based on #158 and re-based this branch. |
@masri2019 no problem, I didn't forget about it but assumed you are on vacation and did not want to bother you! Sorry to hear about you being affected by the situation in California, I hope all is well! Next step would be updating main README.md according to the changes you have made -> would you like to take that on? Let me know how you would like to continue, I would love it of course if you take on as much as possible :D, but I understand if you have time limits and similar. |
@Martinsos Thanks for your understanding. Great idea for tracking the progress. I'd love to take the next steps with your guidance. I start updating the README.md this week and send you the PR as soon as it is done. |
Awesome let's continue then :)! |
This PR contains the implementation of an unordered_set of pairs and using it to activate additional equalities for generic sequences.
Here is the list of main changes:
It can speed up the process because we expect that a large portion of comparisons fall into at least one of the above scenarios.