High level design

Introduction

This high level design describes new history feature for MariaDB, mainly inspired by Oracle's Flashback, MS SQL Server Change Data Capture (see also Introduction to Change Data Capture (CDC) in SQL Server 2008) and Point in Time database Architecture (PTA).

There are few different approaches to implement the system versioning feature with their pros and cons depending on the main use case of the feature.

Oracle Flashback seems mostly designated for point-in-time recovery of short terms. So history tables are generated on-the-fly from undo log and so have limited time to live as well as indexing abilities. Maintaining relatively long history using the approach can lead to significant performance issues (see also feature request Bug #75540, Major regression having many row versions). It's also not viable against non-in-line DDL changes.

While the paper Transaction Time Indexing with Version Compression is published by Microsoft Research, it seems MS SQL Server CDC uses a different, somewhat similar to PTA, approach: it monitors undo log changes and writes history records to separate tables. It seems the most straightforward and easy to implement approach. Since new tables are created for historical data, the approach can survive DDL. The cons of the approach is that a row update involves two tables update (however, the history update is done in background) with consequent performance impact.

Transaction Time Indexing with Version Compression proposes to use TSB-tree (Time-Split B-tree), a variation of Write Once B-Tree (WOBT) which has to keep old items since it can't rewrite them on tree balancing and node splitting. However, the tree lookup skips 'old versions' of the items and use the latest ones. TSB-tree splits index or leaf nodes by key value and current time or by current time only. The first split allocates new nodes where actual only items are copied to, while the original node keeps old versions of the items after split. The second split is used when there have been many updates and the number of current versions is so small, thus only current version of items are copied to new nodes. The approach requires less space in comparison with the previous one. However, the resulting index becomes heavier and eviction of too old records, executing on the same table, can affect performance of current data DML.

TSB-tree is a tradeoff between two other extreme index approaches. The first one is R-tree. It's more space greedy, but it provides better performance for time slices.

The second one is extended B-tree using undo log files for historical data. InnoDB makes all changes in-place in the table space, moving old records to undo log such that other concurrent transactions can find required version of a row. Actual row in table space has designated field ROLL_PTR pointing to old versions of the row in undo log. Different versions of a row in undo log are also linked into list, so a transaction looking for a particular row scans the list of all versions of the row instead of scanning whole undo log containing versions of other records. System versioning feature can just reuse undo log mechanism to keep historical data.

Design considerations

System versioning for MariaDB is designed to satisfy following requirements:

SQL standard compliance;
It must be able to quickly fetch one year and older history;
It must be possible to query history over DDL operations (e.g. ALTER TABLE). This feature is known as DDL Survival and is unique for MariaDB implementation.
The implementation must be as database engine independent as possible. It must support at least InnoDB and MyISAM. Besides these storage engines it must be easily extended to support other engines. So most of the logic is implemented on SQL layer.
Database history grows very quickly and significantly impacts performance of a database node, so scenario for System version is to be deployed on an asynchronous slave dedicated to store the database history.
It's more convenient for a user to work with timestamps than with transaction IDs, however timestamps are unreliable. Operating system doesn't guarantee that two calls of getting system time return strictly monotonically increasing time. They can return same time for two events that occur in a very close, but different times, or they can return even lower time for later event on a different CPUs. Thus, system versioning uses timestamps by default, but a user can require to use transaction ID for fine-grained versions controw (available for InnoDB only).

System versioned tables

Each system versioned table is extended by two new hidden columns defining stored rows' versions. InnoDB table with system versioning looks as:

> create table t(x int primary key,
                 sys_trx_start timestamp(6) as row start invisible,
		 sys_trx_end timestamp(6) as row end invisible,
		 period for system_time(sys_trx_start, sys_trx_end))
                ) with system versioning;
> desc t;
+---------------+--------------+------+-----+---------+-----------+
| Field         | Type         | Null | Key | Default | Extra     |
+---------------+--------------+------+-----+---------+-----------+
| x             | int(11)      | NO   | PRI | NULL    |           |
| sys_trx_start | timestamp(6) | NO   |     | NULL    | INVISIBLE |
| sys_trx_end   | timestamp(6) | NO   | PRI | NULL    | INVISIBLE |
+---------------+--------------+------+-----+---------+-----------+

Note that sys_end is added to primary key to make historical queries run faster.

To create InnoDB table versioned by transaction ID you need to set global variable system_versioning_transaction_registry, which is false by default and prevents:

> set global system_versioning_transaction_registry=true;
> create or replace table t(x int primary key,
                            sys_trx_start bigint(20) unsigned as row start invisible,
			    sys_trx_end bigint(20) unsigned as row end invisible,
			    period for system_time(sys_trx_start, sys_trx_end)
			   ) with system versioning;
> desc t;
+---------------+---------------------+------+-----+---------+-----------+
| Field         | Type                | Null | Key | Default | Extra     |
+---------------+---------------------+------+-----+---------+-----------+
| x             | int(11)             | NO   | PRI | NULL    |           |
| sys_trx_start | bigint(20) unsigned | YES  |     | NULL    | INVISIBLE |
| sys_trx_end   | bigint(20) unsigned | NO   | PRI | NULL    | INVISIBLE |
+---------------+---------------------+------+-----+---------+-----------+

System versioned tables in MyISAM, since the engine doesn't have transactions, always use timestamps for row versioning:

> create table tm(x int primary key,
                  sys_trx_start timestamp(6) as row start invisible,
                  sys_trx_end timestamp(6) as row end invisible,
                  period for system_time(sys_trx_start, sys_trx_end))
                 ) engine=myisam with system versioning;
> desc tm;
+---------------+--------------+------+-----+---------+-----------+
| Field         | Type         | Null | Key | Default | Extra     |
+---------------+--------------+------+-----+---------+-----------+
| x             | int(11)      | NO   | PRI | NULL    |           |
| sys_trx_start | timestamp(6) | NO   |     | NULL    | INVISIBLE |
| sys_trx_end   | timestamp(6) | NO   | PRI | NULL    | INVISIBLE |
+---------------+--------------+------+-----+---------+-----------+

Transactions handling

One of the most fundamental change in InnoDB code required by System Versioning is different handling of transaction ID. Standard InnoDB assign ID (and increment the global transaction counter) when a new transaction starts. System versioned InnoDB assigns two IDs to each transaction: one for transaction begin (sys_trx_start, just the same as in original InnoDB) and one for transaction commit (sys_trx_end, the new one). So the global transaction counter is incremented twice for each transaction. The maximum value 18446744073709551615 for sys_trx_end means that a row still alive:

> insert into t values (1);
> select * from t,sys_trx_start,sys_trx_end for system_time all;
+---+---------------+----------------------+
| x | sys_trx_start | sys_trx_end          |
+---+---------------+----------------------+
| 1 |          1288 | 18446744073709551615 |
+---+---------------+----------------------+

Since MyISAM uses timestamps, the maximum timestamp 2038-01-19 06:14:07.999999 is used to specify live records:

> insert into tm values (1);
> select *,sys_trx_start,sys_trx_end from tm for system_time all;
+---+----------------------------+----------------------------+
| x | sys_trx_start              | sys_trx_end                |
+---+----------------------------+----------------------------+
| 1 | 2017-12-13 22:16:33.217435 | 2038-01-19 06:14:07.999999 |
+---+----------------------------+----------------------------+

mysql.transaction_registry described below clearly shows the double-increment nature of transaction ID: if we query the table for just after the insertion from the example above, then we notice the record with two different IDs for the transaction:

> select * from mysql.transaction_registry;
+----------------+-----------+----------------------------+----------------------------+-----------------+
| transaction_id | commit_id | begin_timestamp            | commit_timestamp           | isolation_level |
+----------------+-----------+----------------------------+----------------------------+-----------------+
|           1288 |      1289 | 2017-12-13 21:39:22.146570 | 2017-12-13 21:39:22.175836 | REPEATABLE-READ |
+----------------+-----------+----------------------------+----------------------------+-----------------+

The commit transaction ID makes AS OF queries fast and trivial to implement. For example, if we run one more transaction and query the database state just before the transaction

> update t set x=2;
> select transaction_id,commit_id from mysql.transaction_registry;
+----------------+-----------+
| transaction_id | commit_id |
+----------------+-----------+
|           1288 |      1289 |
|           1291 |      1292 |
+----------------+-----------+
> select * from t for system_time as of transaction 1290;
+---+
| x |
+---+
| 1 |
+---+

, then, to get MVCC-like view just before transaction 1291, we don't need to keep lists of concurrent transactions ever ran in the database as the MVCC engine does. Instead, we just compare sys_trx_end against the transaction ID passed as the query argument. See Ordering of transactions for more details about internal query algorithms.

mysql.transaction_registry system table

The table tracks the whole transactions history and matches transaction IDs to appropriate timestamps as well as stores isolation level of each transaction to satisfy MVCC-like history queries (read more on Ordering of transactions page).

The table is required and used for transactions running against InnoDB tables with enabled transaction ID versioning. I.e. by default, if system_versioning_transaction_registry is false, the table isn't used.

> desc mysql.transaction_registry;
+------------------+----------------------------------------------------------------------------+------+-----+----------------------------+-------+
| Field            | Type                                                                       | Null | Key | Default                    | Extra |
+------------------+----------------------------------------------------------------------------+------+-----+----------------------------+-------+
| transaction_id   | bigint(20) unsigned                                                        | NO   | PRI | NULL                       |       |
| commit_id        | bigint(20) unsigned                                                        | NO   | UNI | NULL                       |       |
| begin_timestamp  | timestamp(6)                                                               | NO   | MUL | 0000-00-00 00:00:00.000000 |       |
| commit_timestamp | timestamp(6)                                                               | NO   | MUL | 0000-00-00 00:00:00.000000 |       |
| isolation_level  | enum('READ-UNCOMMITTED','READ-COMMITTED','REPEATABLE-READ','SERIALIZABLE') | NO   |     | NULL                       |       |
+------------------+----------------------------------------------------------------------------+------+-----+----------------------------+-------+

The two transactions, INSERT and UPDATE for table t, are represented in mysql.transaction_registry as

> select * from mysql.transaction_registry;
+----------------+-----------+----------------------------+----------------------------+-----------------+
| transaction_id | commit_id | begin_timestamp            | commit_timestamp           | isolation_level |
+----------------+-----------+----------------------------+----------------------------+-----------------+
|           1288 |      1289 | 2017-12-13 21:39:22.146570 | 2017-12-13 21:39:22.175836 | REPEATABLE-READ |
|           1291 |      1292 | 2017-12-13 21:42:33.175504 | 2017-12-13 21:42:33.179434 | REPEATABLE-READ |
+----------------+-----------+----------------------------+----------------------------+-----------------+

Each transaction has two IDs for the transaction start and commit and corresponding timestamps, also for the transaction start and commit. Transaction isolation level is used for query algorithms as described in Ordering of transactions.

MariaDB system versioning for InnoDB internally works with transaction ID to fetch appropriate rows, so mysql.transaction_registry is used to convert timestamps to transaction IDs in timestamp-based AS OF queries, e.g.:

> select * from t for system_time as of timestamp '2017-12-13 21:40';
+---+
| x |
+---+
| 1 |
+---+

In this example mysql.transaction_registry is firstly queried to convert the passed timestamp to a ID of last committed transaction before the timestamp, and next appropriate rows are fetched from t by transaction IDs.

Note that updates on MyISAM tables don't affect mysql.transaction_registry anyhow:

> select count(*) from mysql.transaction_registry;
+----------+
| count(*) |
+----------+
|        2 |
+----------+
> update tm set x=2;
> select count(*) from mysql.transaction_registry;
+----------+
| count(*) |
+----------+
|        2 |
+----------+

DML and query algorithms

Besides querying and updating mysql.transaction_registry on InnoDB queries and DML operations, it's worth to note that SELECT and DML operators, such as INSERT, UPDATE and DELETE, against system versioned tables do somewhat more work that usual DML and queries.

SELECT

SELECT statement for historical data adds implicit WHERE clause to a query. In particular query

> select * from t for system_time as of transaction 1290;
+---+
| x |
+---+
| 1 |
+---+

, with all simplifications, works as

> select x from t for system_time all
  where sys_trx_end > 1290 and sys_trx_start <= 1290;
+---+
| x |
+---+
| 1 |
+---+

INSERT

INSERT statement just adds the sys_trx_start and sys_trx_end fields to the new row and inserts it into the table.

UPDATE

UPDATE does two operations:

update current row with current transaction ID (or timestamp for MyISAM) of sys_trx_end field;
insert a new row, which is updated copy of the original row.

DELETE

DELETE actually just updates a row by the new value of sys_trx_end equal to current timestamp for MyISAM or transaction ID for InnoDB.

DELETE can not be used to real deletion of historical data, so all the data ever inserted into a database is persistent and special privileges are required to delete the historical data.

ALTER TABLE

You can add and drop system versioning property on a table:

> show create table t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `x` int(11) NOT NULL,
  `sys_trx_start` bigint(20) unsigned GENERATED ALWAYS AS ROW START INVISIBLE,
  `sys_trx_end` bigint(20) unsigned GENERATED ALWAYS AS ROW END INVISIBLE,
  PRIMARY KEY (`x`,`sys_trx_end`),
  PERIOD FOR SYSTEM_TIME (`sys_trx_start`, `sys_trx_end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 WITH SYSTEM VERSIONING

> alter table t drop system versioning;
> show create table t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `x` int(11) NOT NULL,
  PRIMARY KEY (`x`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Also you can add or drop system versioning for columns separately:

> alter table t add column y int;
> alter table t
  add column sys_trx_start bigint unsigned as row start invisible,
  add column sys_trx_end bigint unsigned as row end invisible,
  add period for system_time(sys_trx_start, sys_trx_end),
  add system versioning;
> set @@versioning_alter_history=keep;
> alter table t modify y int without system versioning;
> show create table t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `x` int(11) NOT NULL,
  `y` int(11) DEFAULT NULL WITHOUT SYSTEM VERSIONING,
  `sys_trx_start` bigint(20) unsigned GENERATED ALWAYS AS ROW START INVISIBLE,
  `sys_trx_end` bigint(20) unsigned GENERATED ALWAYS AS ROW END INVISIBLE,
  PRIMARY KEY (`x`,`sys_trx_end`),
  PERIOD FOR SYSTEM_TIME (`sys_trx_start`, `sys_trx_end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 WITH SYSTEM VERSIONING

Now only changes in column x sill be tracked by the system versioning.

Note usage; of set @@versioning_alter_history=keep command to keep historical system rows and subject them to ALTER TABLE. The default value ERROR of the variable doesn't allow you to alter table if it has some history, i.e. alter table t modify y int without system versioning from the example above would fail if @@versioning_alter_history isn't set to KEEP since there are history data in t. In KEEP mode added columns of history rows are filled with NULL values.

Foreign keys

Technically historical records are common records, i.e. if you delete a row from a system versioned table, then the row still exists in the table. It means that foreign key constrains do not work on system versioned tables as following example shows:

> create table p(x int unique key);
> create table c(px int, foreign key(px) references p(x),
                 sys_trx_start timestamp(6) as row start invisible,
		 sys_trx_end timestamp(6) as row end invisible,
		 period for system_time(sys_trx_start, sys_trx_end))
                ) with system versioning;
> insert into p values(1);
> insert into c values(1);
> delete from c;
> delete from p;
> select *,sys_trx_start,sys_trx_end from c for system_time all;
+------+---------------+-------------+
| px   | sys_trx_start | sys_trx_end |
+------+---------------+-------------+
|    1 |          1308 |        1311 |
+------+---------------+-------------+

History partitioning

Since history data grows rapidly, it's wise to use data partitioning. E.g. monthly paritioning can be done by:

> create table tp (x int,
                   sys_trx_start timestamp(6) as row start invisible,
		   sys_trx_end timestamp(6) as row end invisible,
		   period for system_time(sys_trx_start, sys_trx_end))
                  ) with system versioning
  partition by system_time interval 1 month (
    partition p0 versioning,
    partition p1 versioning,
    partition pn as of current_timestamp);

So history queries running against the table use early partition pruning for fetching data from relevant partitions only.

History purging

Historical data can be deleted using special TRUNCATE statement and only if special priviledges were granted. Only whole history untill particular point in time can be purged, there is no opportunity to purge part of the history from the middle.

> select *,sys_trx_start,sys_trx_end from t for system_time all;
+------+---------------+----------------------+------+
| x    | sys_trx_start | sys_trx_end          | y    |
+------+---------------+----------------------+------+
|    1 |          2885 | 18446744073709551615 | NULL |
|    1 |          2857 |                 2869 | NULL |
|    1 |          2869 |                 2874 | NULL |
|    1 |          2874 |                 2877 | NULL |
|    1 |          2877 |                 2880 | NULL |
|    1 |          2880 |                 2885 | NULL |
+------+---------------+----------------------+------+

> truncate t to system_time timestamp '2017-12-14 19:13:12';

> select *,sys_trx_start,sys_trx_end from t for system_time all;
+------+---------------+----------------------+------+
| x    | sys_trx_start | sys_trx_end          | y    |
+------+---------------+----------------------+------+
|    1 |          2885 | 18446744073709551615 | NULL |
|    1 |          2874 |                 2877 | NULL |
|    1 |          2877 |                 2880 | NULL |
|    1 |          2880 |                 2885 | NULL |
+------+---------------+----------------------+------+

> truncate t to system_time now();
> select *,sys_trx_start,sys_trx_end from t for system_time all;
+------+---------------+----------------------+------+
| x    | sys_trx_start | sys_trx_end          | y    |
+------+---------------+----------------------+------+
|    1 |          2885 | 18446744073709551615 | NULL |
+------+---------------+----------------------+------+

Note that now it's impossible to purge historica data from a partitioned table. Please track issue #403 for this feature.

Replication

System versioned tables in MyISAM and InnoDB with system_versioning_transaction_registry=false (by default) storage engines are just usual tables, so any type of replication works just fine with such tables.

However, System versioning for InnoDB with system_versioning_transaction_registry=true relies on transaction IDs. Meantime, nodes in a replication cluster can have different current transaction IDs, so only asynchronous statement based replication (SBR) works with system versioned tables using transaction IDs.

Backup

MariaDB Backup is fully compatible with system versioning.

DDL survival

DDL survival feature is still in under development process and won't be included into MariaDB 10.3 release. Read DDL survival for the feature design description.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly