
RocksDB vs TuplDB on Optane


TuplDB has a typical B-Tree design, with fixed-size pages, 4KiB by default. I wanted to see how it performs on the Intel Optane 900P drive compared to RocksDB, which has a log-structured merge-tree (LSM) design. An LSM tree outperforms a B-Tree when performing randomly ordered "blind" inserts, that is, inserts which overwrite any existing record without a constraint check. The LSM performance benefit diminishes with larger records, and eventually an LSM tree can become slower than a B-Tree, due to the overhead of periodic merging.

After experimenting with RocksDB configuration, I settled on a set of options which performs well with a balanced workload. It's possible to tune RocksDB for specific benchmarks, but I wanted something practical which didn't cause excessive memory growth and swapping. A sketch of how these options might be applied through the RocksJava API follows the list.

  • Compression type: no compression
  • Compaction style: level
  • Write buffer size: 64MiB
  • Max write buffer number: 3
  • Max background compactions: 16
  • Level 0 file num compaction trigger: 8
  • Level 0 slowdown writes trigger: 17
  • Num levels: 4
  • Max bytes for base level: 512MiB
  • Max bytes for level multiplier: 8
  • Block cache size: 8GiB
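
Below is a rough sketch of how the above options might be applied through the RocksJava binding. It is an illustration only, not the actual benchmark code; option names vary somewhat across RocksDB versions (for example, the block cache size is set through BlockBasedTableConfig here, while newer versions configure a shared LRUCache object instead).

```java
import org.rocksdb.*;

public class RocksConfig {
    public static RocksDB open(String path) throws RocksDBException {
        RocksDB.loadLibrary();

        // The block cache is configured on the table format, not on Options directly.
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockCacheSize(8L * 1024 * 1024 * 1024);    // 8GiB block cache

        Options options = new Options()
            .setCreateIfMissing(true)
            .setCompressionType(CompressionType.NO_COMPRESSION)
            .setCompactionStyle(CompactionStyle.LEVEL)
            .setWriteBufferSize(64L * 1024 * 1024)          // 64MiB write buffer
            .setMaxWriteBufferNumber(3)
            .setMaxBackgroundCompactions(16)
            .setLevelZeroFileNumCompactionTrigger(8)
            .setLevelZeroSlowdownWritesTrigger(17)
            .setNumLevels(4)
            .setMaxBytesForLevelBase(512L * 1024 * 1024)    // 512MiB base level
            .setMaxBytesForLevelMultiplier(8)
            .setTableFormatConfig(tableConfig);

        return RocksDB.open(options, path);
    }
}
```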

For both the RocksDB and TuplDB tests, 16 threads inserted records in a random order, using an 8-byte key. RocksDB stored its files on an EXT4 file system, but TuplDB was configured to write to the Optane drive as a raw block device, using direct I/O. This is the recommended configuration for TuplDB with Optane, to get the best performance from the drive; RocksDB doesn't have an option to write directly to the block device. The TuplDB redo log does depend on a file system, so a smaller EXT4 partition was created on the Optane drive for the redo logs.

  • RAM: 16GB
  • Storage: Intel Optane SSD 900P 480GB
  • CPU: Ryzen 7 1700
  • Kernel: 4.13.0-37
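
As a rough illustration of the workload shape, the insert loop on the RocksDB side might look like the sketch below: each thread generates random 8-byte keys and fixed-size values and performs blind puts. This is not the actual benchmark harness; the class name, value size, and per-thread record count are placeholders.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import java.util.concurrent.ThreadLocalRandom;

public class InsertWorkload {
    public static void run(RocksDB db, int threads, long recordsPerThread, int valueSize)
            throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                byte[] key = new byte[8];           // random 8-byte key
                byte[] value = new byte[valueSize]; // e.g. ~120 bytes for a ~128 byte record
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (long n = 0; n < recordsPerThread; n++) {
                    rnd.nextBytes(key);
                    rnd.nextBytes(value);
                    try {
                        // Blind insert: overwrites any existing record, no constraint check.
                        db.put(key, value);
                    } catch (RocksDBException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }
}
```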

This first test shows the maximum insert rate achieved against the total number of records inserted so far. The horizontal axis essentially shows time, but not in a linear fashion. The total record size is about 128 bytes, excluding any encoding overhead.

It should be clear from these results that RocksDB outperforms TuplDB, but the TuplDB insert rate is much less erratic. An application which favors consistent latency over throughput would likely be better off with TuplDB, even though overall throughput is lower.

The next test is the same except that the record size is doubled, to 256 bytes. The total number of records inserted is halved (to 1 billion) so that the total database size matches.

RocksDB initially inserts at a faster rate, but TuplDB overtakes it as the database gets larger. The crossover record size appears to be about 250 bytes. Applications which insert records at least this large should always do better with TuplDB than with RocksDB. The next test doubled the record size again, and the performance gap is much wider.

Also see the RocksDB vs TuplDB counters test and the RocksDB vs TuplDB on 960 Pro test.
