Zero-Copy Serialization/Deserialization #5
rkyv looks good, and adding it as an optional dependency (because I am quite keen on keeping it zero-dependencies) might be an option. I'll look further into it, because it seems like a nice addition.
When adding a new serialization framework, it's worth thinking about the serialization-breaking change of reducing stack size by changing
Endianness is a good point. But I think rkyv handles it somewhat ungracefully, by using feature flags, and I'm not sure what happens when you use two libraries that transitively use both archive_be and archive_le.
No, using features is actually convenient for me, because I just disable all features, and let the downstream crate decide.
I think the same late binding could be achieved with generic parameters though? Without the conflict problem where different pieces of code want different endianness.
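Roughly what I have in mind, as a sketch only (the names here are invented, this isn't rkyv's API): the byte order becomes a type parameter, so each crate can pick its own without feature-flag conflicts.

```rust
use std::marker::PhantomData;

// Sketch: byte order chosen per instance via a type parameter,
// so two crates can pick different byte orders without conflicts.
trait Endianness {
    fn read_u64(bytes: [u8; 8]) -> u64;
}

struct BigEndian;
struct LittleEndian;

impl Endianness for BigEndian {
    fn read_u64(bytes: [u8; 8]) -> u64 {
        u64::from_be_bytes(bytes)
    }
}

impl Endianness for LittleEndian {
    fn read_u64(bytes: [u8; 8]) -> u64 {
        u64::from_le_bytes(bytes)
    }
}

// A hypothetical zero-copy word view, generic over the stored byte order.
struct WordView<'a, E: Endianness> {
    words: &'a [[u8; 8]],
    _endianness: PhantomData<E>,
}

impl<'a, E: Endianness> WordView<'a, E> {
    fn get(&self, i: usize) -> u64 {
        E::read_u64(self.words[i])
    }
}
```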
I mean, as long as you don't import serialized data from systems with opposite endianness, using only native endianness shouldn't create any issues, no?
Well, if you're using the (de)serialization as a way to create a data-exchange/file format (think .jpg, not application-instance-specific .dat), then that format will want to decide on some endianness. In my case it's a file format for knowledge graphs, with the added bonus that you can query it without having to build any indexes first, just mmap and go. So it's always going to be big-endian. The Stable Cross-Platform Database File section in the SQLite documentation is probably the best description of that use case. Avoiding breaking changes caused by the way rkyv stores things is also an argument for rolling our own framework-agnostic data layout.

Edit: Btw it's also completely fine if such a use-case doesn't align with the project goals 😄
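To sketch what I mean by "mmap and go" (the file name and header layout here are hypothetical, and this assumes the memmap2 crate): the file stays big-endian on every platform and words are decoded on access.

```rust
use std::fs::File;
use memmap2::Mmap;

/// Hypothetical layout: an 8-byte header followed by big-endian u64 words.
const HEADER_LEN: usize = 8;

fn read_word(bytes: &[u8], index: usize) -> u64 {
    let start = HEADER_LEN + index * 8;
    // Decode on access; the file itself is never rewritten to native endianness.
    u64::from_be_bytes(bytes[start..start + 8].try_into().unwrap())
}

fn main() -> std::io::Result<()> {
    let file = File::open("graph.idx")?; // placeholder path
    // Safety: the mapped file must not be modified concurrently.
    let mmap = unsafe { Mmap::map(&file)? };
    println!("first word: {}", read_word(&mmap, 0));
    Ok(())
}
```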
Okay, I see where you are coming from, but exporting into pre-defined file formats using zero-copy serialization seems difficult. For example, let's pretend you write a Wavelet Matrix that looks like the database file and can be directly serialized and deserialized to and from that format. The rank/select data structures still need to be written to file, if zero-copy deserialization is a goal, which will interleave the data format with junk data.
I'm not sure I'm following. Do you mean undefined data caused by padding / uninitialised memory? Rkyv, for example, zeroes these to get determinism and also to avoid accidental memory data leakage. Rkyv has an unsafe way to deserialize without any checks btw, but the default is having a validation/sanity-check step on read. So it's not just transmute and go.

On a more philosophical level, and ignoring the more difficult mutable/growable succinct data structures for now, I feel that the static/write-once-ness of most succinct data structures makes them an interesting special case. I'm not certain myself where I would pinpoint "serialization" in their lifecycle; is it at the point they are constructed from their mutable "builder" precursor, or is it at the point where one actually calls

Similarly, having sanitization performed only at serialization might be worthwhile, so that people who don't care about serialization don't have to pay for it, but on the other hand it might actually be cheaper to initialize the data structure in a sanitized state, e.g. by calling
No, for example

```rust
pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}
```

And I assume your data format only wants the

There are actually even more problems to this approach:
So all in all, this is unfortunately not an easy issue: a lot of code needs to be written to support rkyv with a big-endian serializer, and the operations have to be manually implemented in a way that lets you call them on
I would question that assumption and ask: why shouldn't they be included? They are an essential component of the data structure and enable the nice time complexities. The only pitfall is that the support data should be deterministic.

Your points are related to what I tried to express with my previous musings, wondering at which moment the serialization happens. With that line of thought, I would make

I suspect this isn't as expensive as it sounds:
And yes, this would probably be a major rewrite of
Perhaps I still don't understand what you are doing, but it sounded like you want the data structure to serialize into an existing format (you mentioned jpg as an example for a specced format, and I assume you mean a database index format). Obviously, the support data structures are proprietary and thus do not adhere to any existing format specification, hence my concerns. But now it seems you don't want to do that.
This is not the reason why
Finally, the largest hurdle that

If you want to store the vector as big endian and then

Edit: I am not saying it is impossible btw, I am just saying it involves major refactoring of the entire code base, and a lot of effort to keep the efficiency (because having more indirections would suck)
No, I'm just trying to (de)serialize a bunch of wavelet matrices, but for my own to-be-specced database index format similar to HDT.
Fair point, but I feel like I would have that problem with any implementation, even if I managed to find the most textbook

I figured, given that I'll have this problem regardless, that I'm just gonna go with the library made in Germany™️ (I'm from Bremen), then it'll at least look nice if we ever co-author a paper, and might give Rust more street-cred in Germany. 🤣
My point was that from an API/user perspective
Yeah when I said "munged together into a single contiguous allocation" I meant that the reworked implementation would get rid of the
I was assuming (based on a quick glance and the way other implementations work) that the implementation uses SIMD only to
Sure sure, my hunch is that it might be a zero-sum game: some perf improvements from removing vector bounds checks, some slowdown from having to do some on-the-fly endianness conversion. But I'll probably know more once I've properly read through the entire codebase. I'm aware that it would take some major rework. I'm also definitely not asking you to do it, I'd do it myself as part of the wavelet matrix. 😄
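To make that on-the-fly conversion concrete, a minimal sketch (names invented, not the actual rework): the decode is a no-op on big-endian targets and a single byte swap everywhere else.

```rust
/// Sketch: a u64 whose in-memory / on-disk representation is always big-endian.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct BeU64(u64);

impl BeU64 {
    #[inline]
    fn get(self) -> u64 {
        // No-op on big-endian targets, a single bswap everywhere else.
        u64::from_be(self.0)
    }
}

/// A rank-style word access: the conversion cost is one bswap per loaded word.
fn count_ones_in_word(words: &[BeU64], i: usize) -> u32 {
    words[i].get().count_ones()
}
```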
I pushed some changes to a new branch

All functionality of

As it stands, this constitutes a breaking change, because you now have to import the trait to access methods on
Awesome, I'll check it out asap!
I just started some work on a minimalist approach to this, using a combination of techniques from

This would still require some form of archive format (i.e. a header with length and layer count), but most of the data (data/blocks/...) can just be sliced from the binary data source.

The main reason I'm writing this comment though is that I've noticed that

```rust
pub struct WaveletMatrix {
    data: Box<[RsVec]>,
    bits_per_element: u16,
}
```

appears to have the information about the number of layers redundantly. Is this an atavism from a version in which data was only a single

Really awesome implementation btw, I'm super impressed. It smells like you were able to justify working on it as part of your PhD 😆
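If it really is redundant (an assumption on my part, valid only as long as one RsVec is stored per layer), it could become a derived accessor instead of stored state, something like:

```rust
impl WaveletMatrix {
    /// Sketch: derive the bit width from the number of stored layers
    /// instead of keeping a separate field (assumes one RsVec per layer).
    fn bits_per_element(&self) -> u16 {
        self.data.len() as u16
    }
}
```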
Oh damn, didn't even realize that. I was too occupied not putting
Really? I thought about doing that, but it seemed counterproductive to me, as it would complicate the rank-query patterns with no real benefit. The layers are still so far apart that cache efficiency isn't increased, and the small memory gains due to packed RS-support structs are negligible.
No, I handed in my master thesis, and therefore had some free time. This ends Monday, though, so let's see if I can find that justification eventually.
You're right, I just checked again, and I remembered wrong 😅, it's actually only the terminusdb implementation that does it like that. I think the reason they do it is that it makes loading from memory a bit more straightforward, but I agree that that is probably not a worthy tradeoff.
Congratulations! ✨🎉 🍻🎉✨
Whoops, sorry for fat-finger closing the issue 🙇♂️.
*but use specific sizes in the structs 😆

Could we fixate the

```rust
const SUPER_BLOCK_SIZE: usize = 1 << 13;
```

The

The

Lastly the fields

Btw this should read

/// The indices do not point into the bit-vector, but into the ~~select~~super-block vector.

right?
No, the counter isn't reset between super-blocks; otherwise you'd need to compute a prefix sum over all super-blocks before the query.
Yes. I'll add it to #9.
Yes
Can't you just add a value to the file header that tells the loading code how big the slice is?
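Something like this is what I have in mind (field names invented for illustration): the header records the lengths explicitly, so the loader knows how much to take from the byte source even if internal constants change between versions.

```rust
/// Hypothetical archive header: records the slice lengths explicitly so the
/// loader doesn't have to rely on compile-time constants like SUPER_BLOCK_SIZE.
#[repr(C)]
struct RsVecHeader {
    bit_len: u64,           // number of bits in the vector
    num_blocks: u64,        // length of the blocks slice
    num_super_blocks: u64,  // length of the super_blocks slice
    num_select_blocks: u64, // length of the select_blocks slice
}
```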
Ah gotcha, sorry
I'll create a PR
True, it's just something that needs to be stored per layer, so

Btw which memory layout do you prefer? Storing the Block/SuperBlock/...s so that each level is stored consecutively
When you load a WaveletMatrix struct, the Box and Vec members contain their sizes. How do the zerocopy crates load a vec? More specifically, how is the vec pointer recovered? Or is the archived vec contiguous with its header, without indirection?
Maybe this requires some preliminaries ^^'

I can only speak for the way I would do it with the

What I do try to do is store the data as flat and with as little "dynamic" data as possible, using

Note: What the

What I did there with the

So:

Edit:
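I can't speak for the exact zerocopy/anybytes calls, but the core idea, sketched here with bytemuck purely for illustration, is to view slices straight out of the byte source instead of recovering any pointers:

```rust
use bytemuck::try_cast_slice;

/// Sketch: view a region of the archive as &[u64] without copying.
/// `offset` and `len_words` would come from the header; try_cast_slice
/// fails (rather than copying) if the buffer isn't suitably aligned.
fn view_words(archive: &[u8], offset: usize, len_words: usize) -> Option<&[u64]> {
    let bytes = archive.get(offset..offset + len_words * 8)?;
    try_cast_slice(bytes).ok()
}
```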
I just misread your question 😅. I don't plan on adding slab allocation or anything, because this isn't a dynamic data structure, and thus it shouldn't be worth messing with allocation here.

The only thing you should keep in mind is that I do plan on merging

I also don't think changing the layout here gets any improvements in locality, because for that, multiple block-lists would have to share cache lines, which shouldn't happen for non-trivial vectors. So for simplicity,
Gotcha 😄! I do see how that's potentially nice for
Good point, that needs to be evaluated during implementation
I did some thinking. Maybe it's not a good idea to try to provide the zero-copy serialization/deserialization as a pre-packaged one-size-fits-all solution. Someone designing an on-disk file format will have thoughts and insights about the layout themselves. For example, if you store multiple wavelet matrices with the same alphabet but different orderings, you can share that information between them, whereas an implementation provided by us would have to replicate it into every serialised instance.

So I think it might be better to make the internals part of the public interface, and expose constructors that can take something like the generic read-only

I mean, the cool thing about succinct vectors is that they are somewhat simple data structures; they are somewhat canonical by construction (modulo hyperparameters like superblock ranges).
so you suggest just a from-raw-parts layer, maybe some conversions with common libs, and then let downstream crates handle it?
Yeah exactly. Focusing on making the raw parts stable and
Yeah, that actually sounds reasonable, and it gets rid of a lot of inelegant decision-making. And the raw parts are pseudo-stable anyway, since I don't want to break serde compatibility between versions. This also means that it's probably possible to implement the necessary parts of this issue without a major version bump; they'll just break alongside everything else when I do one.
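For illustration, such a raw-parts layer could look roughly like this, sketched against the current field set (not a committed signature):

```rust
impl RsVec {
    /// Sketch of a raw-parts constructor: a downstream crate that slices
    /// these parts out of its own archive format rebuilds the vector here;
    /// any validation of the support data would live in this function
    /// (or in an `_unchecked` twin for trusted sources).
    pub fn from_raw_parts(
        data: Vec<u64>,
        len: usize,
        blocks: Vec<BlockDescriptor>,
        super_blocks: Vec<SuperBlockDescriptor>,
        select_blocks: Vec<SelectSuperBlockDescriptor>,
        rank0: usize,
        rank1: usize,
    ) -> Self {
        RsVec { data, len, blocks, super_blocks, select_blocks, rank0, rank1 }
    }
}
```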
Maybe.

```rust
pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}
```

Becomes

```rust
pub struct RsVec {
    len: usize,
    pub(crate) rank0: usize,
    pub(crate) rank1: usize,
    data: PackedSlice<u64>,
    blocks: PackedSlice<BlockDescriptor>,
    super_blocks: PackedSlice<SuperBlockDescriptor>,
    select_blocks: PackedSlice<SelectSuperBlockDescriptor>,
}
```

Or any other

I think it should serialise similarly with Serde, but it would probably require a custom (de)serializer implementation that creates a byte-owning
It also just occurred to me that
I think it would be best if zerocopy/anybytes were an optional feature that would replace the struct. This way everything stays normal and optimized for everyone who doesn't need this feature, and we only lose compatibility between the serde version and the packed version.
Yeah, I think that's fair. I did some preliminary benchmarks and saw no impact on performance (the

One interesting side effect, wrt. performance, of using the

I also just realized that there is a third option of me implementing

I've created a draft pull request to make discussions about the code easier at #14.

ARM SIMD support is also on my wishlist, but that's a different issue 😉
Wouldn't that induce
Yes, definitely, but the way

Since
I just stumbled upon

I've looked at their code and it's less flexible and more cumbersome than what we have imho, so it's also nice that we can make some improvements in this space.

From Friday on I'll be in Sicily for two weeks to harvest some olives, so I might find the time to push this a bit 😁
Good thing we had this convo! It caused me to take a closer look at the code I inherited from
The pointer-free nature of succinct data-structures makes them very amenable to (de)serialization by simply casting their memory to/from a bunch of bytes.
Not only would this remove most (de)serialization costs, it could also enable very fast and simple on-disk storage when combined with mmap.
One might want to implement this via rkyv, but simply providing a safe transmute to and from bytes::Bytes (with a potential rkyv implementation on top of that) might be the simpler, more agnostic solution.