WaveletMatrix #4
Comments
I never used or worked with those, but from skimming the paper, it seems a lot less painful than my attempts at implementing succinct trees. Adding a feature that actually has use cases and is easier to implement definitely sounds more fun.
Yeah, they are both implementing WaveletMatrices. I think the main insight for wavelet matrices comes from the fact that the total number of bits in each level stays constant for a balanced wavelet tree (this invariant doesn't hold for compressed versions like Huffman-shaped trees), which also makes them a lot easier to implement. I'd be happy to give an implementation a shot as well, but I'd probably already try to build it with zero-copy deserializability in mind (#5), if that's ok with you, because I need this for an on-disk db storage format.
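To illustrate the invariant mentioned above, here is a minimal sketch of a wavelet matrix in which every level holds exactly one bit per input symbol. It is not this crate's API: the type `WaveletMatrix`, its methods, and the plain `Vec<bool>` levels with a linear-scan rank are all assumptions made for illustration. A real implementation would presumably store each level in a succinct rank/select bit vector.

```rust
/// Naive wavelet matrix sketch (hypothetical type, not part of this crate).
/// Each level is a plain `Vec<bool>`; rank is a linear scan.
pub struct WaveletMatrix {
    /// `levels[l][i]` is the bit examined at level `l` (most significant first)
    /// for the symbol sitting at position `i` in that level's ordering.
    levels: Vec<Vec<bool>>,
    /// `zeros[l]` is the number of 0 bits in level `l`.
    zeros: Vec<usize>,
}

impl WaveletMatrix {
    /// Builds the matrix from the `bits` lowest bits of each symbol
    /// (assumes `bits <= 64`).
    pub fn new(symbols: &[u64], bits: u32) -> Self {
        let mut current = symbols.to_vec();
        let mut levels = Vec::new();
        let mut zeros = Vec::new();
        for shift in (0..bits).rev() {
            // One bit per symbol, so every level has exactly `symbols.len()` bits.
            let level: Vec<bool> = current.iter().map(|&s| (s >> shift) & 1 == 1).collect();
            // Stable partition: symbols with a 0 bit first, then those with a 1 bit.
            let (lo, hi): (Vec<u64>, Vec<u64>) = current
                .iter()
                .copied()
                .partition(|&s| (s >> shift) & 1 == 0);
            zeros.push(lo.len());
            levels.push(level);
            current = lo;
            current.extend(hi);
        }
        Self { levels, zeros }
    }

    /// Number of 1 bits in `level[..end]` (stand-in for a constant-time rank).
    fn rank1(level: &[bool], end: usize) -> usize {
        level[..end].iter().filter(|&&b| b).count()
    }

    /// Recovers the symbol originally stored at position `i`.
    pub fn access(&self, mut i: usize) -> u64 {
        let mut value = 0u64;
        for (level, &z) in self.levels.iter().zip(&self.zeros) {
            let bit = level[i];
            value = (value << 1) | bit as u64;
            let ones_before = Self::rank1(level, i);
            // 0 bits keep their relative order at the front of the next level,
            // 1 bits are placed after all `z` zeros.
            i = if bit { z + ones_before } else { i - ones_before };
        }
        value
    }

    /// Length of every level, exposed to check the constant-size invariant.
    pub fn level_lengths(&self) -> Vec<usize> {
        self.levels.iter().map(|l| l.len()).collect()
    }
}
```

Because each level has a fixed length, the whole structure could be laid out as `bits` equally sized bit vectors, which is part of why it seems simpler than a pointer-based wavelet tree.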
Yeah, feel free to. Zero-copy deserializability is a feature that I'd like, so I don't think there is a reason to keep the two issues disjoint.
Implementing it with zero-copy deserializability in mind is obviously possible, but not really necessary. While it is true that the data structures are immutable, they do not freely lend themselves to zero-copy (de)serialization. I started implementing a Zero copy versions of other data structures, as outlined in #5, need these traits too, inducing potentially breaking changes.
Thanks for pushing this! I carry the wavelet matrix paper around with me in my backpack, but there were other fires that popped up that need to be put out first. 😓 I've also recently encountered the endianness issue while working with candle and model weight storage formats like I would still want to make sure that that is the case and add something like:
#[cfg(not(target_endian = "little"))]
compile_error!("Zero copy is only supported on LE architectures.");
But after some philosophical wrangling with my inner perfectionist, I've concluded that this is a valid approach 😅. There are few BE-only systems supported by Rust, and they are all embedded systems that don't have SIMD and maybe don't even have a file system let alone the ability to Edit:
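To make the endianness concern concrete: a buffer written as little-endian 64-bit words only matches the in-memory layout on LE targets, which is why reinterpreting it in place needs a guard like the one above. For comparison, a portable but copying decoder might look like the following sketch. The function `decode_words_le` is hypothetical and not part of this crate.

```rust
/// Decodes little-endian `u64` words from a byte buffer by copying.
/// Hypothetical helper: it works on any target, at the cost of not being zero-copy.
pub fn decode_words_le(bytes: &[u8]) -> Option<Vec<u64>> {
    // Reject buffers that are not a whole number of 8-byte words.
    if bytes.len() % 8 != 0 {
        return None;
    }
    let words = bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            // Interprets the bytes as little-endian regardless of the host.
            u64::from_le_bytes(buf)
        })
        .collect();
    Some(words)
}
```

On an LE host this yields the same values as viewing the bytes as `u64` directly, so the compile-time guard trades portability for skipping this copy.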
Maybe it is easier when using I still strongly prefer to have this issue decoupled from zero-copying (despite what I said two months ago), but I'd be happy to see a proof of concept with |
Definitely fair. With |
That is unfortunately already true because of serde (I think). Changes that need to be made are collected in #9.
It would be nice if there were a WaveletMatrix implementation, as it would enable the most common applications of succinct data structures in text processing and databases.