Failure to align strings represented in Unicode (misleading results) #109
Comments
Hi @xtknight, sorry for the late response. Please check #104, which is exactly the same problem but with Cyrillic characters. For now I would advise encoding your characters into bytes, as described in that issue, provided a single byte offers enough distinct values for all of your characters. I plan to add support for custom character types, which would cover your case, but I have not gotten to it yet and it will probably take me a while to find the time to do it!
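The byte-encoding workaround suggested above can be sketched roughly as follows. This is only an illustration: `encode_as_bytes` is a hypothetical helper, not part of edlib, and the two Korean strings are made-up sample inputs. It assumes the two strings together contain at most 256 distinct characters and that the aligner is then given the resulting `bytes` objects.

```python
def encode_as_bytes(query, target):
    """Map every distinct character of both strings to one byte value.

    Hypothetical helper for illustration; not part of edlib.
    """
    alphabet = sorted(set(query) | set(target))
    if len(alphabet) > 256:
        raise ValueError("more than 256 distinct characters: one byte is not enough")
    # The mapping is one-to-one, so two symbols are equal after encoding
    # exactly when the original characters were equal.
    code = {ch: i for i, ch in enumerate(alphabet)}
    return bytes(code[c] for c in query), bytes(code[c] for c in target)


# Made-up sample strings: the first three characters match, the last two differ.
q, t = encode_as_bytes("가나다라마", "가나다바사")
# q and t now hold one byte per original character, so a byte-oriented
# aligner sees five symbols per string instead of fifteen UTF-8 bytes.
```

The mapping only preserves exact code point equality. Characters that are canonically equivalent but written with different code point sequences still get different bytes unless the strings are normalized first.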
@xtknight did the answer help? Should we close this one?
@Martinsos sorry for the lack of response. I suppose this would work fine, and eventually that's what I'll probably end up doing. But even in this case, I think it might break with certain Unicode characters that require special sequences, and Unicode equivalence would have no chance of working properly. In the ideal situation, the library would accept Unicode and handle Unicode equivalence and so on, but I realize that might be difficult to implement.

I actually went ahead and modified all the C++ source code to try to use UTF-16 a while ago and ended up with a mess. For some reason things weren't working as I wanted. If it had been something useful I certainly would have posted it here. I'll have to use vague terms because I don't remember everything exactly, but the interface between Python and C++ also only transmitted the first byte of each Unicode character. On the C++ side there were a bunch of problems because of the way things were hard-coded to use 8 bits, such as the constant declarations exploding in size (basically going from 256 entries for 8-bit to 65536 for 16-bit) and causing extremely hard-to-debug segmentation faults.

I'd love to contribute and help, but due to my lack of familiarity with the code base I don't think it would be practical, and I'd probably just end up introducing bugs. I ended up with an extremely stripped-down version that didn't perform the kind of alignment I was looking for. I couldn't figure out the alignment algorithm or why it behaved that way, and I figured I wouldn't be able to get support since I had already mangled the code quite a bit. Sorry for the vagueness.

I don't remember exactly where I left off, but I think the only sensible solution to this issue is to accept an arbitrary type of any size and perform alignments on that type, rather than forcing the input to 8-bit unsigned integers. We'd also need an arbitrary equality operator function for those types.

Anyway, I'm sure you're aware this is probably the best solution and it's just a matter of implementation. If you already have an issue open for implementing such a feature, it would make sense to mark this one as a duplicate or close it. Otherwise, I think it's quite an important feature for the library to accept at least true Unicode from the Python side. Thanks~
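To make the "arbitrary type plus arbitrary equality operator" idea concrete, here is a minimal sketch. It is a plain dynamic-programming edit distance, not the algorithm edlib actually implements, and `edit_distance` and `nfc` are hypothetical names chosen for illustration. It works on any sequence of symbols and takes the equality test as a parameter. Unicode equivalence is handled separately by normalizing with the standard `unicodedata` module first.

```python
import unicodedata


def edit_distance(a, b, equal=lambda x, y: x == y):
    """Levenshtein distance over arbitrary symbols with a pluggable equality test."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if equal(x, y) else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def nfc(s):
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", s)


# Python strings are sequences of code points, so no byte splitting occurs.
d_hangul = edit_distance("가나다라마", "가나다바사")
# "e" + combining acute accent versus the precomposed character.
d_raw = edit_distance("e\u0301", "\u00e9")
d_normalized = edit_distance(nfc("e\u0301"), nfc("\u00e9"))
# A custom equality function, here case-insensitive comparison.
d_caseless = edit_distance("ABC", "abc", equal=lambda x, y: x.lower() == y.lower())
```

This runs in time proportional to the product of the two lengths, so it is only meant to show the interface, not to be fast.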
Hi @xtknight, I completely agree, and there is actually an issue for that, which will be one of the very next things to implement: #90. I don't think it is actually that hard to implement, but it does require a good understanding of how things are implemented right now and changes in multiple places, plus updates to both the C/C++ and Python interfaces, so it will take a decent amount of work.
In Python:
(I think that the first three characters should be kept the same and the last two should be changed, but the 'cigar' output implies all five characters are the same!)
Is there any possible way to get this working properly for characters only representable in Unicode, even if it's messy? :( Or, at least, if I input it as UTF-16 or UTF-32 can the algorithm just work on 2 or 4 byte intervals? I realize UTF-8 might be difficult because it's variable-length but...desperate to get something working just for research purposes only. Any pointers would be greatly appreciated.
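For illustration, the difference between variable-length UTF-8 and fixed-width UTF-32 units can be seen with nothing but the standard library. The helper names `utf8_lengths` and `utf32_units` are hypothetical, and the sample strings are made up. This only shows how the text splits into units and does not call the aligner.

```python
import struct


def utf8_lengths(s):
    """Number of UTF-8 bytes used by each character of s."""
    return [len(c.encode("utf-8")) for c in s]


def utf32_units(s):
    """Split s into fixed-width 4-byte units, one integer per code point."""
    raw = s.encode("utf-32-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))


lengths = utf8_lengths("aé가😀")   # one to four bytes per character
units = utf32_units("가나다라마")  # always one unit per code point
```

Comparing whole 4-byte units is the same as comparing code points, which is why a fixed-width encoding is the easier case. A byte-wise comparison of UTF-8 data can instead match or mismatch parts of a character.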