Binary Log Group Commit with TokuDB

MySQL uses a two-phase commit protocol to synchronize the MySQL binary log with the recovery logs of the various storage engines when a transaction commits. Since fsync's are used to ensure the durability of the data in the various logs, and fsync's can be very slow, the fsync can easily become a system bottleneck. A group commit algorithm amortizes the fsync cost over many log writes. The binary log group commit algorithm amortizes the cost of the binary log fsync over many transactions.
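To make the amortization concrete, here is a minimal leader/follower sketch of group commit. It is an illustration only, not the MySQL, MariaDB, or TokuDB implementation: the class `GroupCommitLog`, its fields, and the injected `sync` callable (standing in for "write the batch and fsync the log") are all hypothetical names, and error handling is left out.

```python
import threading

class GroupCommitLog:
    """Toy log where concurrent committers share one sync call per batch."""

    def __init__(self, sync):
        self._sync = sync          # hypothetical callable: make a batch durable
        self._cond = threading.Condition()
        self._pending = []         # records appended but not yet synced
        self._next_lsn = 0         # sequence number of the last appended record
        self._durable_lsn = 0      # records with lsn <= this are durable
        self._flushing = False     # True while some thread is syncing
        self.sync_count = 0

    def commit(self, record):
        with self._cond:
            self._next_lsn += 1
            my_lsn = self._next_lsn
            self._pending.append(record)
            while self._durable_lsn < my_lsn:
                if self._flushing:
                    # Another thread is syncing; wait, then re-check.
                    self._cond.wait()
                    continue
                # Become the leader: take every pending record as one batch.
                self._flushing = True
                batch, self._pending = self._pending, []
                batch_end = self._next_lsn
                self._cond.release()       # let others append while we sync
                try:
                    self._sync(batch)      # one slow sync covers the batch
                finally:
                    self._cond.acquire()
                    self._flushing = False
                    self._cond.notify_all()
                self.sync_count += 1
                self._durable_lsn = batch_end
```

When many threads call `commit` at once, records that arrive while a sync is in progress pile up and are covered by the next single sync, so `sync_count` can be much smaller than the number of commits. With a single client, every commit still pays for its own sync.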

The implementation of binary log group commit is different on MySQL and MariaDB. Both of these algorithms are discussed here since TokuDB runs on both MySQL and MariaDB, and TokuDB must accommodate the differences.

Binary log group commit in MySQL 5.6

The binary log group commit blog describes how two phase commit works in MySQL 5.6.

When a transaction commits, it runs through a prepare phase and a commit phase. Hey, it is called 2 phase commit for a reason.

During the prepare phase, TokuDB writes a prepare event to its recovery log and uses a group commit algorithm to fsync its recovery log. Since there can be many transactions in the prepare phase concurrently, the transaction prepare throughput scales with the number of transactions.

During the commit phase, the transaction's write events are written to the binary log and the binary log is fsync'ed. MySQL 5.6 uses a group commit algorithm to fsync the binary log.

Also during the commit phase, TokuDB writes a commit event to its recovery log and uses a group commit algorithm to fsync its recovery log. Since the transaction has already been prepared and the binlog has already been written, the fsync of the TokuDB recovery log is not necessary. XA crash recovery will commit all of the prepared transactions that the binary log knows about and abort the others.
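The recovery rule in the last sentence can be sketched as a small function. This is a simplified model of the decision only: the function name and the representation of transaction ids as plain values are assumptions, not the server's XA recovery interface.

```python
def resolve_prepared(prepared_xids, binlog_xids):
    """Split the engine's prepared transactions into (commit, abort) lists.

    prepared_xids: ids found in the prepared state in the engine's recovery log.
    binlog_xids:   ids that the binary log knows about.
    """
    in_binlog = set(binlog_xids)
    to_commit = [xid for xid in prepared_xids if xid in in_binlog]
    to_abort = [xid for xid in prepared_xids if xid not in in_binlog]
    return to_commit, to_abort
```

A transaction that crashed after its binlog write but before its commit event became durable in the engine log still shows up as prepared and is in the binlog, so it lands in the commit list. That is why the commit-phase fsync of the engine log is not needed for correctness.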

Unfortunately, MySQL 5.6 serializes the commit phase so that the commit order is the same as the write order in the binary log. Since the commit phase is serialized, TokuDB's group commit algorithm is ineffective. Luckily, MySQL 5.6 tells TokuDB to ignore durability in the commit phase (the HA_IGNORE_DURABILITY property is set), so TokuDB does not fsync its recovery log. This fixes the throughput bottleneck caused by serialized fsync's of the TokuDB recovery log during the commit phase of 2PC.
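To see why serialized fsync's cap throughput, consider an idealized back-of-the-envelope model in which the fsync is the only cost. The function name and the 2 ms latency used in the comments are hypothetical, chosen for illustration and not measured on any system described here.

```python
def max_commits_per_sec(fsync_ms, commits_per_fsync):
    """Upper bound on commit rate when each fsync takes fsync_ms milliseconds
    and makes commits_per_fsync transactions durable at once."""
    return commits_per_fsync * 1000.0 / fsync_ms

# Serialized: one transaction per fsync, no matter how many clients wait.
serialized = max_commits_per_sec(fsync_ms=2.0, commits_per_fsync=1)

# Group commit: 64 concurrent transactions share a single fsync.
grouped = max_commits_per_sec(fsync_ms=2.0, commits_per_fsync=64)
```

In this model the serialized bound stays fixed as clients are added, while the group commit bound grows with the batch size. Real systems fall short of both bounds because of other costs.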

Binary log group commit in MariaDB

The approach for MariaDB is to identify when MariaDB is running a 2PC transaction and turn off the fsync of TokuDB's recovery log in TokuDB's commit method.

The criteria are:

The transaction is prepared.
The binlog is the transaction coordinator.
The transaction is written into the binlog.

Since the MMAP transaction coordinator erases the transaction from its file during the commit phase of 2PC, the transaction is not known during recovery. The storage engine must ensure that the transaction is committed, not just prepared, when using the MMAP transaction coordinator.
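The three criteria and the MMAP caveat can be expressed as one predicate. This is a sketch under assumed names: `CommitContext`, its fields, and the `"binlog"`/`"mmap"` labels are hypothetical and do not correspond to actual MariaDB or TokuDB identifiers.

```python
from dataclasses import dataclass

@dataclass
class CommitContext:
    prepared: bool      # the transaction went through the prepare phase
    coordinator: str    # "binlog" or "mmap" (hypothetical labels)
    in_binlog: bool     # the transaction has been written into the binlog

def can_skip_commit_fsync(ctx):
    """True only when all three criteria hold, so recovery can rely on the
    binlog to re-commit a prepared transaction after a crash."""
    return ctx.prepared and ctx.coordinator == "binlog" and ctx.in_binlog
```

With the MMAP coordinator the predicate is false, so the engine keeps the fsync in its commit method and the commit is durable in the recovery log itself.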

Why can't the binary log be a TokuDB table like TokuMX?

With enough MySQL hacking, perhaps it could.

Sysbench

Create a single table with 1M rows for the non-indexed update test. This table is small enough to fit in memory so that the test focuses on log I/O.

sysbench --test=$HOME/launchpad/sysbench/sysbench/tests/db/update_non_index.lua --mysql-socket=/tmp/rfp.sock --mysql-user=root --oltp-table-size=1000000 --mysql-table-engine=tokudb prepare

Run the test with a varying number of clients.

for ((n=1;n<=256;n*=2)) ; do
    sysbench --max-requests=0 --max-time=60 --num-threads=$n --test=$HOME/launchpad/sysbench/sysbench/tests/db/update_non_index.lua --mysql-socket=/tmp/rfp.sock --mysql-user=root --oltp-table-size=1000000 --mysql-table-engine=tokudb run
done

Use this set of MySQL configuration variables.

[mysqld]
tokudb-cache-size=8G
tokudb-directio=1

innodb-buffer-pool-size=8G
innodb-flush-method=O_DIRECT

max_connections=1024
table_open_cache=1024

log-bin=mysql-bin
sync_binlog=1
binlog_format=ROW

The following experiments were run on an Intel Core i7-4770 @ 3.4 GHz, 32 GB RAM, and a Samsung 840 EVO 750 GB SSD. The SSD is the bottleneck and has poor performance compared to other SSDs, but the results still validate the group commit algorithm.

Sysbench update throughput on MySQL 5.5

The throughput of the sysbench update test on MySQL 5.5 hits the binlog fsync bottleneck and does NOT scale with concurrent clients. MySQL 5.5 does NOT implement the binlog group commit algorithm, which is the cause of the scaling problem.

sysbench update test on MySQL 5.5.40 and TokuDB 7.5.3

Sysbench update throughput on Percona Server 5.6

MySQL 5.6 (and Percona Server 5.6) implements a binlog group commit algorithm. As a result, sysbench throughput scales with the number of concurrent clients. TokuDB turns off the fsync during the commit phase of 2PC, which avoids the scaling problems that occur when fsync's are serialized. TokuDB and InnoDB throughput are nearly identical for the in-memory sysbench update test.

sysbench update test on Percona Server 5.6.21 and TokuDB 7.5.4

Sysbench update throughput on MariaDB 5.5

The throughput of the sysbench update test on MariaDB 5.5 scales with concurrent clients. MariaDB 5.5 has a nice implementation of the binlog group commit algorithm.

The fsync during the commit phase of the two phase commit sequence used by MariaDB 5.5 degrades throughput a bit, as the attached graph shows. Since the fsync during the commit phase is not necessary, it can be skipped, which is done by InnoDB and by TokuDB when commit sync is OFF.

sysbench update test on MariaDB 5.5.40 and TokuDB 7.5.3