-
Notifications
You must be signed in to change notification settings - Fork 131
Transactions and Concurrency
TokuDB uses two-phase locking in its transactional concurrency control algorihm. A transaction acquires locks while it executes. This is the lock expansion phase. A transaction releases its locks when it commits or aborts. This is the retiring phase. Within a single transaction, the phases never overlap. This algorithm and its variants have been proven to provide serializability, which is a nice property to have.
MySQL supports several transaction isolation levels. Each of these transaction isolation levels use different locking strategies.
TokuDB implements locking strategies similar to InnoDB's locking strategies.
TokuDB use two types of locks.
- Read locks are exclusive locks (yes, exclusive locks). This may limit concurrency in some cases. Read locks were shared locks in previous versions of TokuDB, but we simplified and sped up the TokuDB lock tree by not implementing shared locks. The cost is less concurrency. This can be reexamined.
- Write locks are exclusive locks.
TokuDB stores locks in an in memory data structure called the lock tree. The lock tree contains the set of locks currently owned by each live transaction. In addition, the lock tree contains the set of transactions that are waiting for a lock owned by another transaction.
Each range lock includes a transaction id, a lock type, and a copy of the left and right edges of the range being locked. A point lock is a range lock where the left and right edges are the same.
A lock consists of some overhead bytes, some length fields, and a copy of the key for a point lock, or a copy of the left and right ends of the key range for a range lock. An estimate on the lock size is about two times the size of the key.
Since the lock tree is an in memory data structure with a limit on its size, what happens when the limit is reached? When this happens, lock escalation runs. The goal of lock escalation is to shrink the memory footprint of the locks. It does this by merging adjacent range locks that are owned by the same transaction into a single larger range lock. The merge will include the gap space between the original range locks in the merged lock.
A big transaction is a transaction with a rollback log that got too big and spilled to the rollback file. The rollback log for a small transaction still fits in memory.
The lock tree runs lock escalation when the size of the locks in the lock tree exceeds one of two limits. The first limit applies to a big transaction and is 1/2 of the lock memory. When this limit is hit and a big transaction is trying to acquire a lock, lock escalation is run on this client thread. Other big transactions that try to acquire a lock wait for lock escalation to complete. Other small transaction that try to acquire a lock continue to run. This algorithm assigns the cost of lock escalation to the clients that own a lot of locks, and allows the small transactions to not be stalled by the escalation activity caused by big transactions.
The second limit applies to all transactions and is hit when the lock memory limit is reached. In this case, all transactions get stalled until lock escalation completes.
Lock escalation runs on all of the fractal trees in no particular order.
If the locks are interleaved, then lock escalation has no way to merge adjacent locks since they are owned by different transactions. In this case, lock escalation fails and the transaction must be aborted.
Write locks are taken on all keys that were written while executing the transaction.
Read locks are taken on all keys that were read while executing the transaction. Sometimes, TokuDB will prelock key ranges that match the range that is expected to be traversed since prelocking the range has less overhead than locking each key that is iterated over.
- All isolation levels
- Write lock points written
- Serializable
- Grab read locks on all key ranges read by query
- Grab write locks on all keys written
- Repeatable read
- Grab read locks on all key ranges read by query
- Grab write locks on all keys written
- Read committed
- Grab write locks on all keys written
- Read uncommitted
- Grab write locks on all keys written
- Serializable
- Grab read locks on all key ranges read by query
- Write lock target table
- Repeatable read
- Grab read locks on all key ranges read by query
- Write lock target table
- Read committed
- Write lock target table
- Read uncommitted
- Write lock target table
- Serializable
- Grab read locks on all key ranges read by query
- Repeatable read
- Snapshot read, no locks taken
- Read committed
- Snapshot read, no locks taken
- Read uncommitted
- No locks taken
- All
- Grab write locks on all keys read by query
- Serializable
- Grab read locks on all key ranges read by query
- Others
- No locks taken
- All isolation levels
- Grab read locks on all key ranges read by query
- Grab write locks on all keys updated
- All isolation levels
- Grab read locks on all key ranges read by query
- Grab write locks on all keys deleted
The keys are just hex dumps of the key memory. Sorry that we don't nicely decode the memory.
When two rows are inserted into a table with a primary key, TokuDB takes point locks of the primary keys in the primary key's fractal tree index.
create table t (x bigint not null, y bigint not null, primary key(x)) engine=tokudb
insert into t values (1,2),(3,4)
select * from information_schema.tokudb_locks
| locks_trx_id | locks_mysql_thread_id | locks_dname | locks_key_left | locks_key_right |
| 146928 | 2 | ./test/t-main | 000100000000000000 | 000100000000000000 |
| 146928 | 2 | ./test/t-main | 000300000000000000 | 000300000000000000 |
When two rows are inserted into a table with a primary key and a clustering secondary key, TokuDB takes point locks of the primary keys in the primary key's fractal tree index, and TokuDB takes point locks of the secondary keys in the secondary clustering key's fractal tree index.
create table t (x bigint not null, y bigint not null, primary key(x), clustering key(y)) engine=tokudb
insert into t values (1,2),(3,4)
| locks_trx_id | locks_mysql_thread_id | locks_dname | locks_key_left | locks_key_right |
| 146934 | 2 | ./test/t-main | 000100000000000000 | 000100000000000000 |
| 146934 | 2 | ./test/t-main | 000300000000000000 | 000300000000000000 |
| 146934 | 2 | ./test/t-key-y | 0002000000000000000100000000000000 | 0002000000000000000100000000000000 |
| 146934 | 2 | ./test/t-key-y | 0004000000000000000300000000000000 | 0004000000000000000300000000000000 |