Failure to align strings represented in Unicode (misleading results) #109

xtknight · 2018-02-10T17:20:23Z

In Python:

edlib.align("고통스러워", "고통스럽다", task="path", mode="NW")
{'locations': [(0, 4)], 'cigar': '5=', 'alphabetLength': 5, 'editDistance': 0}

(I think that the first three characters should be kept the same and the last two should be changed, but the 'cigar' output implies all five characters are the same!)

Is there any possible way to get this working properly for characters only representable in Unicode, even if it's messy? :( Or, at least, if I input it as UTF-16 or UTF-32 can the algorithm just work on 2 or 4 byte intervals? I realize UTF-8 might be difficult because it's variable-length but...desperate to get something working just for research purposes only. Any pointers would be greatly appreciated.

The text was updated successfully, but these errors were encountered:

Martinsos · 2018-02-14T22:13:06Z

Hi @xtknight, sorry for late response: please check this issue, as it is exactly the same thing but with cyrilic characters: #104.

I would advise for now encoding the characters you have into bytes, as mentioned in the link I provided, if that is enough bits for all your characters.

I plan to add support for custom types of characters, that would support your case, however I have not yet got to it and it will probably take me some time to find to do it!

Martinsos · 2018-03-30T09:10:19Z

@xtknight did the answer help? Should we close this one?

xtknight · 2018-03-30T14:34:26Z

@Martinsos sorry for the lack of response. I suppose this would work fine, and eventually that's what I'll probably end up doing. But even in this case, I think it might break with certain UTF characters that require special sequences. And Unicode equivalence also would have no chance of working properly. In the ideal situation, the library would accept Unicode and handle Unicode equivalence, etc...but I realize that might be difficult to implement.

I actually went ahead and modified all the C++ source code to try to use UTF-16 a while ago and ended up with a mess. For some reason things weren't working as I desired. If it were something useful I certainly would have posted it here, and I'll have to use vague terms because I don't remember everything exactly, but the interface between Python and C++ also only transmitted the first byte of Unicode characters. On the C++ side there were just a bunch of problems because of the way things were hard-coded to use 8-bytes, such as the constant declarations exploding in size (changing sizes from 256(8-bit) to 65536(16-bit) basically) and causing extremely hard-to-debug segmentation faults. I'd love to contribute and help, but due to my lack of familiarity with the code base I don't think it would be practical and I'd probably just end up making bugs. I ended up with some extremely stripped down version that didn't perform the kind of alignment I was looking for and eventually couldn't figure out the alignment algorithm or why, and figured I wouldn't be able to receive support in that matter as I had already mangled the code quite a bit. Sorry for the vagueness.

I don't remember exactly where I left off, but anyways I think the only sensible solution to this issue is to accept an arbitrary type of any size and perform alignments based on that arbitrary type, rather than forcing to 8-bit unsigned integer input. And also we'd need an arbitrary equality operator function for the arbitrary types as well. Anyways, I'm sure you're aware this is probably the best solution but it's just a matter of implementation. If you already have a bug open for implementing such a feature, it would make sense to mark this as duplicate or close this. Otherwise, I suppose it's quite an important feature for the library to accept at least true Unicode from the Python side. Thanks~

Martinsos · 2018-03-31T16:38:08Z

Hi @xtknight , I completely agree, and there is actually an issue for that, and it will be one of the very next things to implement: #90 . I don't think it is actually that hard to implement, however it does take good understanding of how things are now implemented right now and will require changes on multiple places, plus updating both C/C++ and python interface, so it will take decent amount of work.
I appreciate your effort and I am sorry I can't help more right now, but I do hope we will have this enhancement soon :)! If you would like to take it on I would be completely fine with that, but it will require you to understand a few concepts and then propagate it through whole edlib. My thought was actually to maybe first switch from C/C++ to pure C++ interface, which will also allow this change to be more easier to do.
I am closing this issue then but feel free to comment further if you wish.

Martinsos closed this as completed Mar 31, 2018

Martinsos mentioned this issue Apr 9, 2018

Distance with special characters #114

Closed

bertsky mentioned this issue Oct 22, 2018

change project description to reflect edlib only aligns bytes #123

Closed

jbaiter mentioned this issue Mar 10, 2019

Python: Add support for arbitrary sequences of hashable objects #128

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to align strings represented in Unicode (misleading results) #109

Failure to align strings represented in Unicode (misleading results) #109

xtknight commented Feb 10, 2018 •

edited

Loading

Martinsos commented Feb 14, 2018

Martinsos commented Mar 30, 2018

xtknight commented Mar 30, 2018

Martinsos commented Mar 31, 2018

Failure to align strings represented in Unicode (misleading results) #109

Failure to align strings represented in Unicode (misleading results) #109

Comments

xtknight commented Feb 10, 2018 • edited Loading

Martinsos commented Feb 14, 2018

Martinsos commented Mar 30, 2018

xtknight commented Mar 30, 2018

Martinsos commented Mar 31, 2018

xtknight commented Feb 10, 2018 •

edited

Loading