Overlapping Sync mode #131
kriszyp started this conversation in Show and tell
Version 2 of lmdb-js introduced an overlapping sync mode (enabled with the `overlappingSync` flag), and this is a description of how it works. It is actually an extension/modification of LMDB itself, although it heavily leverages existing mechanisms in LMDB. Overlapping sync mode is designed to let commits proceed concurrently with disk flushes while preserving crash safety.

Some background first: LMDB uses two "meta" pages that alternately point to the latest MVCC version of the database. When data is written to the database, it is written to new pages that do not alter existing data. Once the data (the updated B-tree pages) has been written to the filesystem, one of the meta pages is updated to point to the root of the updated B-tree. This technique is known as shadow paging. Normally, to ensure crash-proof writes, LMDB does not update the meta page until all the data in the new pages has been fully flushed to disk. Otherwise, if the meta page is updated after the filesystem write (this mode can be enabled with `noSync`) but before the data has actually been flushed to disk (it may still be sitting in the OS cache, queued to be written), a crash could leave the latest version of the database pointing to pages that were never actually persisted, which is a corrupt state. So LMDB normally waits for the flush to finish before updating the meta page; only then is the transaction committed, the write lock released, and other waiting write transactions allowed to proceed. However, flushes can be time-consuming, and waiting for them easily becomes a bottleneck with many write transactions.
The overlapping sync mode introduces a new technique that allows commits to "finish", and the write lock to be released, once the writes have been performed but before the flush completes. This is accomplished by introducing a third meta entry. The first two meta entries correspond to LMDB's existing entries; these are treated as "transient" meta entries, are updated after each transaction commits/writes, and are used at runtime to indicate the current/latest state of the database. The new third meta entry is the "durable" meta entry, and it is only updated to point to the latest transaction that has fully flushed all of its data.
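To make the write path concrete, here is a conceptual sketch (not LMDB or lmdb-js source; all names are invented) of when the transient and durable meta entries advance:

```js
// Conceptual model only: transient entries advance at commit time,
// the durable entry advances only after the flush completes.
const meta = {
  transient: [null, null], // models LMDB's two alternating meta pages
  durable: null,           // third "durable" entry, advanced after fsync
};

function flush(txn) {
  // Stand-in for fsync of the pages that txn wrote.
  return new Promise((resolve) => setTimeout(resolve, 10));
}

function commit(txn) {
  // The new B-tree pages have been written (not necessarily flushed);
  // advance a transient meta entry and release the write lock so the
  // next write transaction can proceed.
  meta.transient[txn.id % 2] = { txnId: txn.id, root: txn.root };
  return flush(txn).then(() => {
    // Only once the data is durably on disk does the durable entry advance.
    meta.durable = { txnId: txn.id, root: txn.root };
  });
}

commit({ id: 1, root: 'page-42' }).then(() => console.log(meta));
```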
In lmdb-js, when a transaction is executed, it is committed (written to disk, but not yet flushed), and on completion the JS code is notified that the transaction is committed. The transaction thread then proceeds to flush the data to disk. Once that completes, LMDB updates the durable meta entry and notifies JS that the data is flushed.
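From the application's point of view, that two-phase notification looks roughly like the following. Awaiting `put` corresponds to the commit notification; I am assuming here that the database exposes a `flushed` promise for the flush notification (I believe lmdb-js v2 does, but treat that as an assumption):

```js
import { open } from 'lmdb';

const db = open({ path: 'my-db', overlappingSync: true });

async function writeEntry() {
  // Resolves when the transaction has been committed (written and the
  // write lock released), which can happen before the flush finishes.
  await db.put('greeting', 'hello world');

  // Assumption: db.flushed is a promise that resolves once pending writes
  // have been flushed to disk and the durable meta entry has advanced.
  await db.flushed;
}

writeEntry();
```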
When LMDB is restarted (and no other LMDB processes are open), the "durable" meta entry is always consulted for the latest version of the database; if a "transient" entry is newer, it is ignored/dropped, since it may point to data that was never truly saved to disk.
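As a conceptual sketch (again with invented names, not actual LMDB code), the selection rule is:

```js
// At restart, with no other processes attached, only the durable entry is
// trusted; a newer transient entry may reference data that never hit disk.
function selectMetaOnRestart(meta) {
  return meta.durable;
}

// At runtime, the newest transient entry reflects the latest committed state.
function selectMetaAtRuntime(meta) {
  const [a, b] = meta.transient;
  return (a?.txnId ?? -1) >= (b?.txnId ?? -1) ? a : b;
}
```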
Empirically verifying that this is crash-proof is challenging, since it involves abruptly halting the OS/machine, but I believe this is a robust design. From a performance perspective, it allows greater concurrency between disk I/O and application code execution. An application can certainly still be bottlenecked by I/O, but anecdotally I have seen about a 10-25% performance gain in I/O-bound database testing (but of course YMMV).
It is also worth noting that flushing files is usually prohibitively expensive for large files on Windows (I believe Windows' method of tracking dirty pages requires an O(n) scan of the file to ensure all dirty pages are written), so I do not recommend using overlapping sync mode on Windows.
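One way to follow that recommendation in application code is to make the flag conditional on the platform. This is just an illustrative guard, not something lmdb-js requires:

```js
import { open } from 'lmdb';

// Only enable overlapping sync off Windows, where flushing large files
// can require an expensive scan for dirty pages.
const db = open({
  path: 'my-db',
  overlappingSync: process.platform !== 'win32',
});
```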