Transactions and Concurrency

TokuDB uses two-phase locking in its transactional concurrency control algorihm. A transaction acquires locks while it executes. This is the lock expansion phase. A transaction releases its locks when it commits or aborts. This is the retiring phase. Within a single transaction, the phases never overlap. This algorithm and its variants have been proven to provide serializability, which is a nice property to have.

Transaction isolation levels

MySQL supports several transaction isolation levels. Each of these transaction isolation levels use different locking strategies.

TokuDB implements locking strategies similar to InnoDB's locking strategies.

Lock types

TokuDB use two types of locks.

Read locks are exclusive locks (yes, exclusive locks). This may limit concurrency in some cases. Read locks were shared locks in previous versions of TokuDB, but we simplified and sped up the TokuDB lock tree by not implementing shared locks. The cost is less concurrency. This can be reexamined.
Write locks are exclusive locks.

Lock tree

TokuDB stores locks in an in memory data structure called the lock tree. The lock tree contains the set of locks currently owned by each live transaction. In addition, the lock tree contains the set of transactions that are waiting for a lock owned by another transaction.

Lock ranges

Each range lock includes a transaction id, a lock type, and a copy of the left and right edges of the range being locked. A point lock is a range lock where the left and right edges are the same.

A lock consists of some overhead bytes, some length fields, and a copy of the key for a point lock, or a copy of the left and right ends of the key range for a range lock. An estimate on the lock size is about two times the size of the key.

Lock tree escalation

Since the lock tree is an in memory data structure with a limit on its size, what happens when the limit is reached? When this happens, lock escalation runs. The goal of lock escalation is to shrink the memory footprint of the locks. It does this by merging adjacent range locks that are owned by the same transaction into a single larger range lock. The merge will include the gap space between the original range locks in the merged lock.

A big transaction is a transaction with a rollback log that got too big and spilled to the rollback file. The rollback log for a small transaction still fits in memory.

The lock tree runs lock escalation when the size of the locks in the lock tree exceeds one of two limits. The first limit applies to a big transaction and is 1/2 of the lock memory. When this limit is hit and a big transaction is trying to acquire a lock, lock escalation is run on this client thread. Other big transactions that try to acquire a lock wait for lock escalation to complete. Other small transaction that try to acquire a lock continue to run. This algorithm assigns the cost of lock escalation to the clients that own a lot of locks, and allows the small transactions to not be stalled by the escalation activity caused by big transactions.

The second limit applies to all transactions and is hit when the lock memory limit is reached. In this case, all transactions get stalled until lock escalation completes.

Lock escalation runs on all of the fractal trees in no particular order.

If the locks are interleaved, then lock escalation has no way to merge adjacent locks since they are owned by different transactions. In this case, lock escalation fails and the transaction must be aborted.

Locks taken for various database operations

Write locks are taken on all keys that were written while executing the transaction.

Read locks are taken on all keys that were read while executing the transaction. Sometimes, TokuDB will prelock key ranges that match the range that is expected to be traversed since prelocking the range has less overhead than locking each key that is iterated over.

Insert

All isolation levels
- Write lock points written

Insert select

Serializable
- Grab read locks on all key ranges read by query
- Grab write locks on all keys written
Repeatable read
- Grab read locks on all key ranges read by query
- Grab write locks on all keys written
Read committed
- Grab write locks on all keys written
Read uncommitted
- Grab write locks on all keys written

Create select

Serializable
- Grab read locks on all key ranges read by query
- Write lock target table
Repeatable read
- Grab read locks on all key ranges read by query
- Write lock target table
Read committed
- Write lock target table
Read uncommitted
- Write lock target table

Select

Serializable
- Grab read locks on all key ranges read by query
Repeatable read
- Snapshot read, no locks taken
Read committed
- Snapshot read, no locks taken
Read uncommitted
- No locks taken

Select for update

All
- Grab write locks on all keys read by query

Select lock in share mode

Serializable
- Grab read locks on all key ranges read by query
Others
- No locks taken

Update

All isolation levels
- Grab read locks on all key ranges read by query
- Grab write locks on all keys updated

Delete

All isolation levels
- Grab read locks on all key ranges read by query
- Grab write locks on all keys deleted

Examples

The keys are just hex dumps of the key memory. Sorry that we don't nicely decode the memory.

Locks taken for inserts into a table with a primary key

When two rows are inserted into a table with a primary key, TokuDB takes point locks of the primary keys in the primary key's fractal tree index.

create table t (x bigint not null, y bigint not null, primary key(x)) engine=tokudb

insert into t values (1,2),(3,4)

select * from information_schema.tokudb_locks

| locks_trx_id | locks_mysql_thread_id | locks_dname   | locks_key_left     | locks_key_right    |
|       146928 |                     2 | ./test/t-main | 000100000000000000 | 000100000000000000 |
|       146928 |                     2 | ./test/t-main | 000300000000000000 | 000300000000000000 |

Locks taken for inserts into a table with a primary and clustering key

When two rows are inserted into a table with a primary key and a clustering secondary key, TokuDB takes point locks of the primary keys in the primary key's fractal tree index, and TokuDB takes point locks of the secondary keys in the secondary clustering key's fractal tree index.

create table t (x bigint not null, y bigint not null, primary key(x), clustering key(y)) engine=tokudb

insert into t values (1,2),(3,4)

| locks_trx_id | locks_mysql_thread_id | locks_dname    | locks_key_left                     | locks_key_right                    |
|       146934 |                     2 | ./test/t-main  | 000100000000000000                 | 000100000000000000                 |
|       146934 |                     2 | ./test/t-main  | 000300000000000000                 | 000300000000000000                 |
|       146934 |                     2 | ./test/t-key-y | 0002000000000000000100000000000000 | 0002000000000000000100000000000000 |
|       146934 |                     2 | ./test/t-key-y | 0004000000000000000300000000000000 | 0004000000000000000300000000000000 |

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transactions and Concurrency