Access to serialized data without parsing/unpacking (e.g. using flatbuffers) #44

vladar · 2021-04-21T21:16:14Z

vladar
Apr 21, 2021

@kriszyp This is probably off-topic, but I am curious to hear your thoughts on using a binary format to access data directly in an LMDB value without deserializing or copying it to the Node heap, for instance flatbuffers or capnproto.

The use-case: imagine you have a deeply nested object with large strings in some fields, but you only need to read one top-level field of this object at runtime. With serialization/deserialization there will be significant overhead (memory, parsing, etc.). So a binary format that is directly accessible without deserialization sounds like a good idea (at first glance).
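To make the use-case concrete, here is a minimal sketch of reading one top-level field straight out of a buffer. The layout (a field count, then a table of offset/length pairs, then the field bytes) is made up for illustration and is not the flatbuffers or capnproto wire format; `encodeRecord` and `readField` are hypothetical names.

```javascript
// Hypothetical layout: [u32 fieldCount][(u32 offset, u32 length) per field][field bytes...]
function encodeRecord(fields) {
  const bodies = fields.map((value) => Buffer.from(value, 'utf8'));
  const headerSize = 4 + bodies.length * 8;
  const total = headerSize + bodies.reduce((sum, body) => sum + body.length, 0);
  const out = Buffer.alloc(total);
  out.writeUInt32LE(bodies.length, 0);
  let offset = headerSize;
  bodies.forEach((body, i) => {
    out.writeUInt32LE(offset, 4 + i * 8);      // where field i starts
    out.writeUInt32LE(body.length, 8 + i * 8); // how long field i is
    body.copy(out, offset);
    offset += body.length;
  });
  return out;
}

// Decode only field `index`; the bytes of the other fields are never read.
function readField(record, index) {
  const count = record.readUInt32LE(0);
  if (index < 0 || index >= count) throw new RangeError('no such field');
  const offset = record.readUInt32LE(4 + index * 8);
  const length = record.readUInt32LE(8 + index * 8);
  return record.toString('utf8', offset, offset + length);
}

const record = encodeRecord(['Some title', 'x'.repeat(1_000_000)]);
const title = readField(record, 0); // does not decode the 1 MB string
```

The cost of `readField` depends on the size of the requested field rather than on the size of the whole record, which is the property the question is after.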

I've seen your cool DPack project that sounds like a pretty similar idea (with blocks, lazy parsing, etc).

But it looks like you've decided to focus on MessagePack/CBOR instead. Did you discover some pitfalls with the approach taken in DPack? Also, did you consider existing encodings like flatbuffers?

kriszyp · 2021-04-23T03:45:52Z

kriszyp
Apr 23, 2021
Maintainer

Yes, I have wrestled with different approaches to deferring parsing and parsing on-demand, and have implemented such approaches in dpack and looked. In our applications, we deal with storing a lot of study data, and there are situations where we want to access some top-level information about a study, like authors and publication data, without having to load all the study result data (often a few MBs of structured data).

There are a few considerations that have driven my current approach. First, it is certainly easiest to deal with plain JS objects, where all properties are accessed as simple properties. It is also a lot easier in our applications to not necessarily have to define a rigid schema/structure upfront. We have a lot of dynamic properties, and it would be challenging to have to define schemas for everything. Likewise, with lmdb-store, I have tried to reduce barriers to use.

With dpack, I did implement lazy/on-demand parsing of data using JS Proxies. And this certainly is nice from an API perspective; objects can still be treated as plain objects with properties. Property access can be intercepted and data can be parsed on-demand. However, the drawback of this is that proxies are much slower than plain objects. So while there is a lot of efficiency/performance to be gained by deferring or avoiding parsing of some embedded data, there is a lot lost in plain access to data that has been parsed. And this has a fair bit of complexity involved, especially as you start dealing with modifying objects.
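The Proxy-based approach can be sketched roughly as follows. This is not dpack's actual implementation: JSON strings stand in for encoded sub-blocks, and `lazyObject` is a hypothetical helper.

```javascript
// `encodedFields` maps property names to still-encoded values (JSON here).
function lazyObject(encodedFields) {
  const hasField = (key) =>
    typeof key === 'string' &&
    Object.prototype.hasOwnProperty.call(encodedFields, key);
  const decoded = new Map(); // cache of the properties parsed so far
  return new Proxy({}, {
    get(_target, key) {
      if (!hasField(key)) return undefined;
      if (!decoded.has(key)) {
        decoded.set(key, JSON.parse(encodedFields[key])); // parse on first access only
      }
      return decoded.get(key);
    },
    has(_target, key) {
      return hasField(key);
    },
  });
}

const study = lazyObject({
  authors: JSON.stringify(['A. Author']),
  results: JSON.stringify({ rows: new Array(1000).fill(0) }),
});
const authors = study.authors; // parses `authors` only; `results` stays encoded
```

Every property read goes through the `get` trap, including reads of already-parsed properties, which is where the slowdown relative to plain objects comes from.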

So it is certainly possible to get improved performance with careful structuring of objects, where we define which objects or properties should be lazily parsed and which shouldn't, but in the end, it seemed like it required a lot of careful optimizations and definitions of structures to out-perform plain objects. And with msgpackr and plain objects with V8's amazingly fast performance, the performance is just so good that it is a tough bar to beat.

And in the current iteration of our application, where I really need to defer parsing of certain related objects, I simply store them in a separate table (db in LMDB) and load them separately. With LMDB, get()s are so ridiculously fast that making everything I really want to parse separately into separately addressable entities in the database has seemed like the best approach, while keeping the overall serialization/deserialization process simple and easy to understand.
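The partitioning pattern might look like the sketch below. Two `Map`s stand in for two LMDB dbs so that the example runs on its own, JSON stands in for msgpackr, and the function and key names are invented for illustration.

```javascript
// Stand-ins for two separate dbs; values are stored serialized.
const studies = new Map();      // small, frequently read metadata
const studyResults = new Map(); // large result payloads, loaded only when needed

function putStudy(id, study) {
  const { results, ...summary } = study;
  studies.set(id, JSON.stringify(summary));
  studyResults.set(id, JSON.stringify(results));
}

// Cheap: only the small summary record is deserialized.
function getStudySummary(id) {
  const raw = studies.get(id);
  return raw === undefined ? undefined : JSON.parse(raw);
}

// Expensive: call only when the result data is actually required.
function getStudyResults(id) {
  const raw = studyResults.get(id);
  return raw === undefined ? undefined : JSON.parse(raw);
}

putStudy('study-1', {
  authors: ['A. Author'],
  published: '2021-04-23',
  results: { rows: new Array(10000).fill(1) },
});
const summary = getStudySummary('study-1'); // never touches the results payload
```

The trade-off is that the split has to be decided when writing, which works when the heavy part of each record is known in advance.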

But there certainly is a place for schemes that involve more randomly accessible data, with on-demand parsing. I just haven't felt like it has ended up being worth the effort in our application.

Also, relatedly, dpack was very optimized for browser usage, to work well in conjunction with the Huffman encoding used by compression algorithms. This is still kind of an issue with MessagePack: it doesn't compress well, and so we actually just use JSON for communicating with the browser.

3 replies

vladar Apr 23, 2021
Author

Yeah, makes perfect sense to me. Thanks for sharing this.

Our situation is a bit different. We in gatsby basically have a build system for static sites. Some of those sites can be relatively big - millions of data objects and hundreds of thousands of pages.

Data is schemaless, user-defined, and can be nested so we don't know upfront which part of the object is "heavy" to put it in a different table (unless we traverse data objects before saving - and maybe we'll have to do that eventually).
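A traversal of that kind could be as small as the following sketch, which only reports where the large strings are. The `findHeavyPaths` name, the byte threshold, and the sample node are assumptions made for illustration.

```javascript
// Collect the paths of string values larger than `limit` bytes (UTF-8).
function findHeavyPaths(value, limit, path = [], found = []) {
  if (typeof value === 'string') {
    if (Buffer.byteLength(value, 'utf8') > limit) found.push(path.join('.'));
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      findHeavyPaths(child, limit, [...path, key], found);
    }
  }
  return found;
}

const node = {
  id: 'post-1',
  frontmatter: { title: 'Hello' },
  body: { html: '<p>' + 'x'.repeat(50_000) + '</p>' },
};
const heavy = findHeavyPaths(node, 10_000); // paths that are candidates for a separate table
```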

We run the build in parallel in multiple Node processes, but all of them need to access the same data at almost the same time. One process may need just a small piece of data from an object, but we have to deserialize the object fully to access this small piece. So for us, deserialization means memory and CPU overhead multiplied by the number of CPU cores (and we end up having copies of the data in each Node process).

I am exploring ways to avoid that and leverage LMDB's values as shared buffers instead to minimize memory and CPU spikes during builds.

I did some research, and flatbuffers now also supports a schema-less version: flexbuffers. There is no documentation for the JS version, but they do have an implementation in TS and some tests.

Also found this interesting Rust project (that seems to be wasm-compatible) - NoProto

And DPack of course. Not sure if that will work - going to do some experiments but was intrigued to hear about your experience with DPack. So thanks again for sharing.

kriszyp Apr 24, 2021
Maintainer

Another aspect you might consider as you evaluate different approaches is how binary/buffer data is transferred from LMDB. Creating a new Node Buffer/ArrayBuffer is a surprisingly expensive operation (typically over one microsecond, IIRC). This may not be a big deal for large blocks of data where other costs are higher, but with smaller objects, it is an important consideration. Consequently, there are a few different approaches to how data is retrieved, with different performance and characteristics:

1. Create a new node buffer for each get operation, and memcpy the data from the LMDB memory map entries to the node buffer. This is the simplest and easiest to use, since you get a normal node buffer that is valid and stable until GCed. But this is the slowest since it requires the instantiation of a node buffer and a memcpy. But if you are doing any type of lazy deserialization, where more buffer data could be parsed at some undetermined point in the future, this is probably what you need. This is the type of buffer returned by lmdb-store when using encoding: 'binary'.
2. Create a new node buffer for each get operation that has its memory address pointed to the LMDB memory map entry (shared memory). This eliminates the memcpy operation and gives you direct access to the memory in the LMDB memory map. However, this comes with a number of hazards: the memory mapped data is only valid for the duration of the read transaction used to perform the get() (you can have long-lived read transactions, but that comes with other problems). Also, you must "detach" the buffer (destroy it) before another get() operation for the same entry (V8 throws an error if you attempt to create a new node buffer with the same address as a previous one). Because of these complications and the fact that it still requires an expensive instantiation of a node buffer/ArrayBuffer for each get, I have not used it at all in current lmdb-store (but I could provide an API for this if you would want it).
3. Use a single reusable buffer for all get operations, and memcpy the data from the LMDB memory map entries into the reusable buffer on each get operation. This eliminates the need for creating new buffers for each get() (it only creates a new one if we need to expand it), and can be significantly faster for small objects. memcpy is very fast, and I believe it is faster than creating new node buffers (2nd approach) even up to 10KB data entries. However, since each get() operation basically overwrites/destroys the binary/buffer data from the last get(), the binary data must be fully deserialized for each get. This is the technique used for the default MessagePack encoding (same as encoding: 'msgpack').
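The difference between the first and third approaches can be imitated in plain Node. In this sketch a `Buffer` named `mmap` stands in for the LMDB memory map, and `getCopy` / `getShared` are hypothetical helpers, not lmdb-store functions.

```javascript
// Stand-in for the memory map: two 4-byte entries stored back to back.
const mmap = Buffer.from('AAAABBBB', 'latin1');
const entries = { a: [0, 4], b: [4, 8] };

// Approach 1: allocate a new buffer per get and copy into it (stable until GC).
function getCopy(key) {
  const [start, end] = entries[key];
  return Buffer.from(mmap.subarray(start, end)); // Buffer.from(buffer) copies the bytes
}

// Approach 3: one reusable buffer, overwritten on every get.
let reusable = Buffer.alloc(4);
function getShared(key) {
  const [start, end] = entries[key];
  if (end - start > reusable.length) reusable = Buffer.alloc(end - start); // grow when needed
  mmap.copy(reusable, 0, start, end);
  return reusable.subarray(0, end - start);
}

const stableA = getCopy('a');
const sharedA = getShared('a');
const decodedA = sharedA.toString('latin1'); // must decode before the next get
getShared('b');                              // overwrites the bytes behind sharedA

console.log(stableA.toString('latin1')); // "AAAA"
console.log(sharedA.toString('latin1')); // "BBBB": the old view now shows entry b
console.log(decodedA);                   // "AAAA"
```

The second approach cannot be reproduced in pure JavaScript, since it needs a native binding to point a buffer at mapped memory.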

Another interacting aspect of this is compression. When using lmdb-store's compression, we perform the decompression directly from the shared memory map into the reusable buffer data that is used for deserializing, so the decompression replaces the memcpy operation, making a very efficient process without any unnecessary memory copying or extra node buffer instantiations. Of course, (de)compression always adds extra overhead, and for smaller databases it should probably be avoided, but I tend to be a fan of using compression. The LZ4 compression we use is extremely fast, and even though it will always lose in microbenchmarks, I tend to believe that it is often beneficial in the full context of large databases running alongside other databases/processes: memory usage is reduced, which means fewer page faults for the LMDB app in question and less memory pressure (and fewer page faults) for other processes. In our app, our databases total about 100GB of LZ4 compressed data.

Also found this interesting Rust project (that seems to be wasm-compatible) - NoProto

Perhaps this could help in targeting specific blocks to deserialize, but even with wasm, the biggest cost here is in the conversion between binary <-> JS, and wasm doesn't have any capabilities that C doesn't have for that.

vladar Apr 26, 2021
Author

I didn't realize that encoding: 'binary' copies memory from LMDB to the node buffer. I actually thought it points directly to LMDB memory map. That's good to know, thanks!

you can have long-lived read transactions, but that comes with other problems

I've read that it causes problems when you also do writes (the database grows quickly). And this probably also adds quite a bit of complexity to DB usage (e.g. explicitly closing read transactions, dealing with binary data, etc). Anything else I am missing?

Anyways, we'll try simple approaches first. Maybe memory consumption of 1st or 3rd option with GC will be just fine for us. The one task is to partition our values more efficiently - to avoid unnecessary deserialization of huge chunks of data (so essentially do what you do in your app). That's probably doable, just requires some exploration first.

However, since each get() operation basically overwrites/destroys the binary/buffer data from the last get(), the binary data must be fully deserialized for each get

So with this 3rd approach, we can't add a custom extension to msgpackr that will "defer" deserialization of some big value to a later moment? I.e. something along these lines:

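import { addExtension, Packr } from 'msgpackr';

// Assumed: a separate Packr used to encode/decode the string payload itself
const extPackr = new Packr();

// Declared before addExtension() so the class exists when it is referenced
class MyBigString {
  constructor(buffer, data) {
    this.myString = data
    this.myBuffer = buffer
  }
  toString() {
    // Decode lazily, on first access
    if (!this.myString) {
      this.myString = extPackr.unpack(this.myBuffer)
    }
    return this.myString
  }
}

addExtension({
  Class: MyBigString,
  type: 11,
  pack(instance) {
    return extPackr.pack(instance.toString());
  },
  unpack(buffer) {
    return new MyBigString(buffer)
  }
});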

so the decompression replaces the memcpy operation, making a very efficient process without any unnecessary memory copying or extra node buffer instantiations

This is cool! 😎

P.S. Again, thanks a ton for your reply. It helped me to understand different caveats without wasting hours figuring it out in the wild!

kriszyp · 2021-04-27T03:40:18Z

kriszyp
Apr 27, 2021
Maintainer

re: read transactions. Anything else I am missing?

No, I think you have a good idea of the issues (I thought the issue with having to detach buffers before next read to same location was the most onerous to track/deal with).

So with this 3rd approach, we can't add a custom extension to msgpackr that will "defer" deserialization of some big value to a later moment?

No, not as written, but you can certainly use this approach if you copy the provided buffer so you have a stable copy (again recognizing there is some cost to this):

  unpack(buffer) {
    return new MyBigString(Buffer.from(buffer))
  }

Or alternately you could use encoding: 'binary' to get copied buffers and manually call unpack so all the sub-buffers are stable.
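Why the copy matters can be shown without msgpackr or LMDB. In this sketch `scratch` plays the role of the reusable buffer handed to unpack, and `LazyString` is a hypothetical stand-in for MyBigString that decodes UTF-8 on first use.

```javascript
class LazyString {
  constructor(buffer) {
    this.buffer = buffer;
    this.value = undefined;
  }
  toString() {
    // Decode on first use, then cache the result.
    if (this.value === undefined) this.value = this.buffer.toString('utf8');
    return this.value;
  }
}

const scratch = Buffer.alloc(5); // reused for every "get"
scratch.write('hello');
const unsafe = new LazyString(scratch);            // keeps a reference to reused memory
const safe = new LazyString(Buffer.from(scratch)); // keeps its own copy

scratch.write('world'); // the next "get" overwrites the scratch buffer

console.log(String(unsafe)); // "world": the deferred decode sees the wrong bytes
console.log(String(safe));   // "hello"
```

The copy costs one allocation and one memcpy per deferred value, which is the cost referred to above.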

0 replies
