-
Notifications
You must be signed in to change notification settings - Fork 5
High level design
This high level design describes new history feature for MariaDB, mainly inspired by Oracle's Flashback, MS SQL Server Change Data Capture (see also Introduction to Change Data Capture (CDC) in SQL Server 2008) and Point in Time database Architecture (PTA).
There are few different approaches to implement the system versioning feature with their pros and cons depending on the main use case of the feature.
Oracle Flashback seems mostly designated for point-in-time recovery of short terms. So history tables are generated on-the-fly from undo log and so have limited time to live as well as indexing abilities. Maintaining relatively long history using the approach can lead to significant performance issues (see also feature request Bug #75540, Major regression having many row versions). It's also not viable against non-in-line DDL changes.
While the paper Transaction Time Indexing with Version Compression is published by Microsoft Research, it seems MS SQL Server CDC uses a different, somewhat similar to PTA, approach: it monitors undo log changes and writes history records to separate tables. It seems the most straightforward and easy to implement approach. Since new tables are created for historical data, the approach can survive DDL. The cons of the approach is that a row update involves two tables update (however, the history update is done in background) with consequent performance impact.
Transaction Time Indexing with Version Compression proposes to use TSB-tree (Time-Split B-tree), a variation of Write Once B-Tree (WOBT) which has to keep old items since it can't rewrite them on tree balancing and node splitting. However, the tree lookup skips 'old versions' of the items and use the latest ones. TSB-tree splits index or leaf nodes by key value and current time or by current time only. The first split allocates new nodes where actual only items are copied to, while the original node keeps old versions of the items after split. The second split is used when there have been many updates and the number of current versions is so small, thus only current version of items are copied to new nodes. The approach requires less space in comparison with the previous one. However, the resulting index becomes heavier and eviction of too old records, executing on the same table, can affect performance of current data DML.
TSB-tree is a tradeoff between two other extreme index approaches. The first one is R-tree. It's more space greedy, but it provides better performance for time slices.
The second one is extended B-tree using undo log files for historical data.
InnoDB makes all changes in-place in the table space, moving old records to
undo log such that other concurrent transactions can find required version of
a row. Actual row in table space has designated field ROLL_PTR
pointing to
old versions of the row in undo log. Different versions of a row in undo log
are also linked into list, so a transaction looking for a particular row scans
the list of all versions of the row instead of scanning whole undo log
containing versions of other records. System versioning feature can just reuse
undo log mechanism to keep historical data.
System versioning for MariaDB is designed to satisfy following requirements:
-
SQL standard compliance;
-
It must be able to quickly fetch one year and older history;
-
It must be possible to query history over DDL operations (e.g. ALTER TABLE). This feature is known as DDL Survival and is unique for MariaDB implementation.
-
The implementation must be as database engine independent as possible. It must support at least InnoDB and MyISAM. Besides these storage engines it must be easily extended to support other engines. So most of the logic is implemented on SQL layer.
-
Database history grows very quickly and significantly impacts performance of a database node, so scenario for System version is to be deployed on an asynchronous slave dedicated to store the database history.
-
Records must be versioned by transaction IDs since timestamps are unreliable. Operating system doesn't guarantee that two calls of getting system time return strictly monotonically increasing time. They can return same time for two events that occur in a very close, but different times, or they can return even lower time for later event on a different CPUs.
Each system versioned table is extended by two new hidden columns defining stored rows' versions. InnoDB table with system versioning looks as:
> create table t(x int primary key) with system versioning;
> desc t;
+---------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| x | int(11) | NO | PRI | NULL | |
| sys_trx_start | bigint(20) unsigned | YES | | NULL | |
| sys_trx_end | bigint(20) unsigned | NO | PRI | NULL | |
+---------------+---------------------+------+-----+---------+-------+
Note that sys_trx_end
is added to primary key to make historical queries
run faster.
System versioned tables in MyISAM, since the engine doesn't have transactions, use timestamps for row versioning:
> create table tm(x int primary key) engine=myisam with system versioning;
> desc tm;
+---------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| x | int(11) | NO | PRI | NULL | |
| sys_trx_start | timestamp(6) | YES | | NULL | |
| sys_trx_end | timestamp(6) | NO | PRI | NULL | |
+---------------+--------------+------+-----+---------+-------+
One of the most fundamental change in InnoDB code required by System Versioning
is different handling of transaction ID. Standard InnoDB assign ID (and
increment the global transaction counter) when a new transaction starts.
System versioned InnoDB assigns two IDs to each transaction: one for transaction
begin (sys_trx_start
, just the same as in original InnoDB) and one for
transaction commit (sys_trx_end
, the new one). So the global transaction
counter is incremented twice for each transaction. The maximum value
18446744073709551615
for sys_trx_end
means that a row still alive:
> insert into t values (1);
> select * from t for system_time all;
+---+---------------+----------------------+
| x | sys_trx_start | sys_trx_end |
+---+---------------+----------------------+
| 1 | 1288 | 18446744073709551615 |
+---+---------------+----------------------+
Since MyISAM uses timestamps, the maximum timestamp 2038-01-19 06:14:07.999999
is used to specify live records:
> insert into tm values (1);
> select * from tm for system_time all;
+---+----------------------------+----------------------------+
| x | sys_trx_start | sys_trx_end |
+---+----------------------------+----------------------------+
| 1 | 2017-12-13 22:16:33.217435 | 2038-01-19 06:14:07.999999 |
+---+----------------------------+----------------------------+
mysql.transaction_registry
described below clearly shows the double-increment
nature of transaction ID: if we query the table for just after the insertion
from the example above, then we notice the record with two different IDs for
the transaction:
> select * from mysql.transaction_registry;
+----------------+-----------+----------------------------+----------------------------+-----------------+
| transaction_id | commit_id | begin_timestamp | commit_timestamp | isolation_level |
+----------------+-----------+----------------------------+----------------------------+-----------------+
| 1288 | 1289 | 2017-12-13 21:39:22.146570 | 2017-12-13 21:39:22.175836 | REPEATABLE-READ |
+----------------+-----------+----------------------------+----------------------------+-----------------+
The commit transaction ID makes AS OF queries fast and trivial to implement. For example, if we run one more transaction and query the database state just before the transaction
> update t set x=2;
> select transaction_id,commit_id from mysql.transaction_registry;
+----------------+-----------+
| transaction_id | commit_id |
+----------------+-----------+
| 1288 | 1289 |
| 1291 | 1292 |
+----------------+-----------+
> select * from t for system_time as of transaction 1290;
+---+
| x |
+---+
| 1 |
+---+
, then, to get MVCC-like view just before transaction 1291, we don't need to
keep lists of concurrent transactions ever ran in the database as the MVCC engine
does. Instead, we just compare sys_trx_end
against the transaction ID passed
as the query argument. See
Ordering of transactions
for more details about internal query algorithms.
The table tracks the whole transactions history and matches transaction IDs to appropriate timestamps as well as stores isolation level of each transaction to satisfy MVCC-like history queries (read more on Ordering of transactions page).
> desc mysql.transaction_registry;
+------------------+----------------------------------------------------------------------------+------+-----+----------------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------------------------------------------------------------------------+------+-----+----------------------------+-------+
| transaction_id | bigint(20) unsigned | NO | PRI | NULL | |
| commit_id | bigint(20) unsigned | NO | UNI | NULL | |
| begin_timestamp | timestamp(6) | NO | MUL | 0000-00-00 00:00:00.000000 | |
| commit_timestamp | timestamp(6) | NO | MUL | 0000-00-00 00:00:00.000000 | |
| isolation_level | enum('READ-UNCOMMITTED','READ-COMMITTED','REPEATABLE-READ','SERIALIZABLE') | NO | | NULL | |
+------------------+----------------------------------------------------------------------------+------+-----+----------------------------+-------+
The two transactions, INSERT and UPDATE for table t
, are represented in
mysql.transaction_registry
as
> select * from mysql.transaction_registry;
+----------------+-----------+----------------------------+----------------------------+-----------------+
| transaction_id | commit_id | begin_timestamp | commit_timestamp | isolation_level |
+----------------+-----------+----------------------------+----------------------------+-----------------+
| 1288 | 1289 | 2017-12-13 21:39:22.146570 | 2017-12-13 21:39:22.175836 | REPEATABLE-READ |
| 1291 | 1292 | 2017-12-13 21:42:33.175504 | 2017-12-13 21:42:33.179434 | REPEATABLE-READ |
+----------------+-----------+----------------------------+----------------------------+-----------------+
Each transaction has two IDs for the transaction start and commit and corresponding timestamps, also for the transaction start and commit. Transaction isolation level is used for query algorithms as described in Ordering of transactions.
MariaDB system versioning for InnoDB internally works with transaction ID to
fetch appropriate rows, so mysql.transaction_registry
is used to convert timestamps
to transaction IDs in timestamp-based AS OF queries, e.g.:
> select * from t for system_time as of timestamp '2017-12-13 21:40';
+---+
| x |
+---+
| 1 |
+---+
In this example mysql.transaction_registry
is firstly queried to convert the
passed timestamp to a ID of last committed transaction before the timestamp,
and next appropriate rows are fetched from t
by transaction IDs.
Note that updates on MyISAM tables don't affect mysql.transaction_registry
anyhow:
> select count(*) from mysql.transaction_registry;
+----------+
| count(*) |
+----------+
| 2 |
+----------+
> update tm set x=2;
> select count(*) from mysql.transaction_registry;
+----------+
| count(*) |
+----------+
| 2 |
+----------+
Besides querying and updating mysql.transaction_registry
on InnoDB queries
and DML operations, it's worth to note that SELECT and DML operators,
such as INSERT, UPDATE and DELETE, against system versioned tables do somewhat
more work that usual DML and queries.
SELECT statement for historical data adds implicit WHERE clause to a query. In particular query
> select * from t for system_time as of transaction 1290;
+---+
| x |
+---+
| 1 |
+---+
, with all simplifications, works as
> select x from t for system_time all
where sys_trx_end > 1290 and sys_trx_start <= 1290;
+---+
| x |
+---+
| 1 |
+---+
INSERT statement just adds the sys_trx_start
and sys_trx_end
fields to the
new row and inserts it into the table.
UPDATE does two operations:
-
update current row with current transaction ID (or timestamp for MyISAM) of
sys_trx_end
field; -
insert a new row, which is updated copy of the original row.
DELETE actually just updates a row by the new value of sys_trx_end
equal to
current timestamp for MyISAM or transaction ID for InnoDB.
DELETE can not be used to real deletion of historical data, so all the data ever inserted into a database is persistent and special privileges are required to delete the historical data.
You can add and drop system versioning property on a table:
> show create table t \G
*************************** 1. row ***************************
Table: t
Create Table: CREATE TABLE `t` (
`x` int(11) NOT NULL,
`sys_trx_start` bigint(20) unsigned GENERATED ALWAYS AS ROW START,
`sys_trx_end` bigint(20) unsigned GENERATED ALWAYS AS ROW END,
PRIMARY KEY (`x`,`sys_trx_end`),
PERIOD FOR SYSTEM_TIME (`sys_trx_start`, `sys_trx_end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 WITH SYSTEM VERSIONING
> alter table t drop system versioning;
> show create table t \G
*************************** 1. row ***************************
Table: t
Create Table: CREATE TABLE `t` (
`x` int(11) NOT NULL,
PRIMARY KEY (`x`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Also you can add or drop system versioning for columns separately:
> alter table t add column y int;
> alter table t add system versioning;
> set @@versioning_alter_history=keep;
> alter table t modify y int without system versioning;
> show create table t \G
*************************** 1. row ***************************
Table: t
Create Table: CREATE TABLE `t` (
`x` int(11) NOT NULL,
`y` int(11) DEFAULT NULL WITHOUT SYSTEM VERSIONING,
`sys_trx_start` bigint(20) unsigned GENERATED ALWAYS AS ROW START,
`sys_trx_end` bigint(20) unsigned GENERATED ALWAYS AS ROW END,
PRIMARY KEY (`x`,`sys_trx_end`),
PERIOD FOR SYSTEM_TIME (`sys_trx_start`, `sys_trx_end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 WITH SYSTEM VERSIONING
Now only changes in column y
sill be tracked by the system versioning.
FIXME This actually doesn't work, see
issue #401.
Note usage; of set @@versioning_alter_history=keep
command to keep
historical system rows and subject them to ALTER TABLE. The default value
ERROR
of the variable doesn't allow you to alter table if it has some
history, i.e. alter table t modify y int without system versioning
from
the example above would fail if @@versioning_alter_history
isn't set to
KEEP
since there are history data in t
. In KEEP
mode added columns
of history rows are filled with NULL
values.
Technically historical records are common records, i.e. if you delete a row from a system versioned table, then the row still exists in the table. It means that foreign key constrains do not work on system versioned tables as following example shows:
> create table p(x int unique key);
> create table c(px int, foreign key(px) references p(x)) with system versioning;
> insert into p values(1);
> insert into c values(1);
> delete from c;
> delete from p;
> select * from c for system_time all;
+------+---------------+-------------+
| px | sys_trx_start | sys_trx_end |
+------+---------------+-------------+
| 1 | 1308 | 1311 |
+------+---------------+-------------+
Since history data grows rapidly, it's wise to use data partitioning. E.g. monthly paritioning can be done by:
> create table tp (x int) with system versioning
partition by system_time interval 1 month (
partition p0 versioning,
partition p1 versioning,
partition pn as of current_timestamp);
So history queries running against the table use early partition pruning for fetching data from relevant partitions only.
Historical data can be deleted using special TRUNCATE statement and only if special priviledges were granted. Only whole history untill particular point in time can be purged, there is no opportunity to purge part of the history from the middle.
> select * from t for system_time all;
+------+---------------+----------------------+------+
| x | sys_trx_start | sys_trx_end | y |
+------+---------------+----------------------+------+
| 1 | 2885 | 18446744073709551615 | NULL |
| 1 | 2857 | 2869 | NULL |
| 1 | 2869 | 2874 | NULL |
| 1 | 2874 | 2877 | NULL |
| 1 | 2877 | 2880 | NULL |
| 1 | 2880 | 2885 | NULL |
+------+---------------+----------------------+------+
> truncate t to system_time timestamp '2017-12-14 19:13:12';
> select * from t for system_time all;
+------+---------------+----------------------+------+
| x | sys_trx_start | sys_trx_end | y |
+------+---------------+----------------------+------+
| 1 | 2885 | 18446744073709551615 | NULL |
| 1 | 2874 | 2877 | NULL |
| 1 | 2877 | 2880 | NULL |
| 1 | 2880 | 2885 | NULL |
+------+---------------+----------------------+------+
> truncate t to system_time now();
> select * from t for system_time all;
+------+---------------+----------------------+------+
| x | sys_trx_start | sys_trx_end | y |
+------+---------------+----------------------+------+
| 1 | 2885 | 18446744073709551615 | NULL |
+------+---------------+----------------------+------+
Note that now it's impossible to purge historica data from a partitioned table. Please track issue #403 for this feature.
System versioning for InnoDB relies on transaction IDs. Meantime, remote InnoDB nodes in replication cluster, even with synchronous Galera replication, can have different current transaction IDs, so there is lemitations on which replication can be used with system versioning.
Galera replication doesn't work with system versioning. Basically, history data from system versioned table can be replicated by Galera, but transaction IDs in the replicated data won't be correct and historical queryes will return wrong results.
If you need Galera replication for a database having system versioned tables, it's recommended to purge whole history from all the system versioned tables, alter all the tables to switch off history tracking, and only after that start synchronous replication.
Row based asynchronous replication (RBR) also doesn't work with system versioning due to the same reasons as for Galera replication. Asynchronous statement based replication (SBR) is the only one replication type which works with system versioned tables. However, if asynchronous replication for InnoDB system versioned tables was configured in RBR mode, then it's automatically (implicitly) converted to SBR.
System versioned tables in MyISAM storage engine are just usual tables, so any type of replication works just fine with such tables.
MariaDB Backup is fully compatible with system versioning.
DDL survival feature is still in under development process and won't be included into MariaDB 10.3 release. Read DDL survival for the feature design description.