br: incremental restore does not handle CREATE INDEX (ADD INDEX) correctly, causing data inconsistency #54426

kennytm · 2024-07-03T16:16:10Z

Bug Report


1. Minimal reproduce step (Required)

use test;

-- 1. prepare the full backup

drop table if exists t;
create table t (pk bigint primary key, val int not null);
insert into t values (1, 11), (2, 22), (3, 33), (4, 44);
backup table t to 'local:///tmp/tidb_54426_full_backup';
-- ^ note down the reported BackupTS.
admin check table t;
select * from t;
/*
+----+-----+
| pk | val |
+----+-----+
|  1 |  11 |
|  2 |  22 |
|  3 |  33 |
|  4 |  44 |
+----+-----+
*/

-- 2. prepare the incr backup

create index index_val on t (val);
update t set val = 66 - val;
backup table t to 'local:///tmp/tidb_54426_incr_backup' last_backup = «fill_in_the_BackupTS_from_above_here»;
admin check table t;
select * from t use index (index_val);
/*
+----+-----+
| pk | val |
+----+-----+
|  4 |  22 |
|  3 |  33 |
|  2 |  44 |
|  1 |  55 |
+----+-----+
*/

-- 3. clean up
drop table t;

-- 4. perform the full restore
restore schema * from 'local:///tmp/tidb_54426_full_backup';
admin check table t;
select * from t;
/*
+----+-----+
| pk | val |
+----+-----+
|  1 |  11 |
|  2 |  22 |
|  3 |  33 |
|  4 |  44 |
+----+-----+
*/

-- 5. perform the incr restore
restore schema * from 'local:///tmp/tidb_54426_incr_backup';
admin check table t;
select * from t use index (index_val);

2. What did you expect to see? (Required)

In step 5, `admin check table` passes and the `select *` returns the same output as in step 2.

3. What did you see instead (Required)

mysql> admin check table t;
ERROR 8223 (HY000): data inconsistency in table: t, index: index_val, handle: 1, index-values:"handle: 1, values: [KindInt64 11]" != record-values:"handle: 1, values: [KindInt64 55]"

mysql> select * from t use index (index_val);
+----+-----+
| pk | val |
+----+-----+
|  1 |  11 |
|  2 |  22 |
|  4 |  22 |
|  3 |  33 |
|  2 |  44 |
|  4 |  44 |
|  1 |  55 |
+----+-----+
7 rows in set (0.01 sec)

4. What is your TiDB version? (Required)

Release Version: v8.1.0
Edition: Community
Git Commit Hash: 945d07c5d5c7a1ae212f6013adfb187f2de24b23
Git Branch: HEAD
UTC Build Time: 2024-05-21 03:52:40
GoVersion: go1.21.10
Race Enabled: false
Check Table Before Drop: false
Store: tikv


kennytm · 2024-07-03T16:58:48Z

If we use tidb-ctl mvcc to inspect all keys after step 5, we find MVCC records like these:

+------------------------+--------------------+--------------------+---------+
| key                    | startTS            | commitTS           | value   |
+------------------------+--------------------+--------------------+---------+
| row - pk: 1            | 450893690177585154 | 450893690177585155 | val: 11 |
| row - pk: 2            | 450893690177585154 | 450893690177585155 | val: 22 |
| row - pk: 3            | 450893690177585154 | 450893690177585155 | val: 33 |
| row - pk: 4            | 450893690177585154 | 450893690177585155 | val: 44 |
| row - pk: 1            | 450893773434519588 | 450893773434519588 | val: 55 |
| row - pk: 2            | 450893773434519588 | 450893773434519588 | val: 44 |
| row - pk: 4            | 450893773434519588 | 450893773434519588 | val: 22 |
| index - val: 11, pk: 1 | 450893773434519588 | 450893773434519588 | DELETED |
| index - val: 22, pk: 2 | 450893773434519588 | 450893773434519588 | DELETED |
| index - val: 22, pk: 4 | 450893773434519588 | 450893773434519588 | null    |
| index - val: 33, pk: 3 | 450893773434519588 | 450893773434519588 | null    |
| index - val: 44, pk: 2 | 450893773434519588 | 450893773434519588 | null    |
| index - val: 44, pk: 4 | 450893773434519588 | 450893773434519588 | DELETED |
| index - val: 55, pk: 1 | 450893773434519588 | 450893773434519588 | null    |
| index - val: 11, pk: 1 | 450893773749092360 | 450893773749092360 | null    |
| index - val: 22, pk: 2 | 450893773749092360 | 450893773749092360 | null    |
| index - val: 33, pk: 3 | 450893773749092360 | 450893773749092360 | null    |
| index - val: 44, pk: 4 | 450893773749092360 | 450893773749092360 | null    |
+------------------------+--------------------+--------------------+---------+

Here,

450893690177585154/155 are the start/commit timestamps of the initial INSERT transaction. The full restore did not perform any RewriteTS operation.
450893773434519588 is the reported Cluster TS of the incremental restore.
450893773749092360 is the ingestion TS of the CREATE INDEX DDL executed during the incremental restore.
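To see why the index scan in step 5 returns seven rows, the Go sketch below replays only the index_val records from the table above and applies the usual MVCC read rule: for each key, the newest version committed at or before the read TS wins. It assumes the "null" rows are puts and that the read happens at any TS after the restore finished; the type and function names are illustrative, not TiDB's API.

```go
package main

import (
	"fmt"
	"sort"
)

// version is one MVCC record of an index key: its commit TS and whether it
// is a delete marker ("DELETED") or a put (shown as "null" by tidb-ctl).
type version struct {
	commitTS uint64
	deleted  bool
}

// indexKey identifies an entry of index_val by (val, pk).
type indexKey struct{ val, pk int }

const (
	incrRestoreTS = 450893773434519588 // reported Cluster TS of the incremental restore
	backfillTS    = 450893773749092360 // ingestion TS of CREATE INDEX during the restore
	readTS        = backfillTS + 1     // assumed: any snapshot taken after the restore
)

// records holds the index_val versions copied from the tidb-ctl mvcc output.
var records = map[indexKey][]version{
	{11, 1}: {{incrRestoreTS, true}, {backfillTS, false}},
	{22, 2}: {{incrRestoreTS, true}, {backfillTS, false}},
	{22, 4}: {{incrRestoreTS, false}},
	{33, 3}: {{incrRestoreTS, false}, {backfillTS, false}},
	{44, 2}: {{incrRestoreTS, false}},
	{44, 4}: {{incrRestoreTS, true}, {backfillTS, false}},
	{55, 1}: {{incrRestoreTS, false}},
}

// isLive reports whether the newest version committed at or before ts
// exists and is a put, i.e. whether a snapshot read at ts sees the key.
func isLive(versions []version, ts uint64) bool {
	var newest *version
	for i := range versions {
		v := &versions[i]
		if v.commitTS <= ts && (newest == nil || v.commitTS > newest.commitTS) {
			newest = v
		}
	}
	return newest != nil && !newest.deleted
}

// visibleEntries returns the index entries a scan at ts would return,
// ordered by (val, pk) as an index scan would produce them.
func visibleEntries(recs map[indexKey][]version, ts uint64) []indexKey {
	var out []indexKey
	for k, vs := range recs {
		if isLive(vs, ts) {
			out = append(out, k)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].val != out[j].val {
			return out[i].val < out[j].val
		}
		return out[i].pk < out[j].pk
	})
	return out
}

func main() {
	for _, e := range visibleEntries(records, readTS) {
		fmt.Printf("pk=%d val=%d\n", e.pk, e.val)
	}
}
```

Because the backfilled puts at 450893773749092360 are newer than the incremental DELETEs at 450893773434519588, the stale entries (11, 1), (22, 2) and (44, 4) resurface, which gives exactly the seven rows seen in step 5.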

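The bug is that the Cluster TS obtained by the incremental restore (tidb/br/pkg/task/restore.go, line 827 in b4052bd)

restoreTS, err := restore.GetTSWithRetry(ctx, mgr.GetPDClient())

is computed well before the DDLs are executed (tidb/br/pkg/task/restore.go, line 986 in b4052bd)

err = client.ExecDDLs(ctx, ddlJobs)

so the RewriteTS operation fails to overwrite the keys generated by CREATE INDEX. The incremental KVs are rewritten to commit at 450893773434519588, while the index KVs backfilled by CREATE INDEX commit later at 450893773749092360, so stale backfilled entries such as (val: 11, pk: 1) end up as the newest versions.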

The restoreTS should be fetched again after executing the DDLs.

(Alternatively, we could modify DDL execution so that it never writes any t-prefixed keys, i.e. skip backfilling. This should be much more efficient and would remove the need for RewriteTS.)

lance6716 · 2024-07-04T02:30:25Z

I'm not sure about the timeline of these actions. Should ExecDDLs be invoked after the RewriteTS operation, so that ADD INDEX sees the newest data and generates the correct index KVs?

kennytm · 2024-07-04T04:15:46Z

@lance6716 the current situation is:

1. get "RestoreTS" (let's say the value is 51)
2. set up a GCSafePoint on RestoreTS=51
3. perform ExecDDL (tso = 52...60)
4. import data with NewTimestamp set to RestoreTS=51 ❌

The correct action should be:

1. get "MinRestoreTS" = 51
2. set up a GCSafePoint on MinRestoreTS=51
3. perform ExecDDL (tso = 52...60)
4. get a new "DataRestoreTS" = 61 🆕
5. import data with NewTimestamp set to DataRestoreTS=61 ✅
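The two orderings above can be compared with a toy timestamp oracle. In this Go sketch, fakeTSO, execDDLs and restore are hypothetical simplifications (not BR's actual code) that reuse the numbers from the steps: 51 for the first timestamp and 52...60 for the DDLs. The check of interest is whether the NewTimestamp used for imported data is larger than every timestamp the DDLs consumed.

```go
package main

import "fmt"

// fakeTSO is a hypothetical stand-in for PD's timestamp oracle: every call
// returns a strictly larger timestamp, starting from next.
type fakeTSO struct{ next uint64 }

func (o *fakeTSO) getTS() uint64 {
	ts := o.next
	o.next++
	return ts
}

// execDDLs stands in for client.ExecDDLs: the DDLs (including the index
// backfill) consume n timestamps. It returns the largest one used.
func execDDLs(o *fakeTSO, n int) uint64 {
	var last uint64
	for i := 0; i < n; i++ {
		last = o.getTS()
	}
	return last
}

// restore returns the NewTimestamp used for imported data and the largest
// DDL timestamp. refetch selects the corrected flow.
func restore(refetch bool) (dataTS, maxDDLTS uint64) {
	o := &fakeTSO{next: 51}
	minRestoreTS := o.getTS() // the GC safepoint would be set on this value
	maxDDLTS = execDDLs(o, 9) // tso = 52...60
	dataTS = minRestoreTS     // current flow: reuse the early timestamp
	if refetch {
		dataTS = o.getTS() // corrected flow: DataRestoreTS = 61
	}
	return dataTS, maxDDLTS
}

func main() {
	for _, refetch := range []bool{false, true} {
		dataTS, maxDDLTS := restore(refetch)
		fmt.Printf("refetch=%v NewTimestamp=%d maxDDLTS=%d ok=%v\n",
			refetch, dataTS, maxDDLTS, dataTS > maxDDLTS)
	}
}
```

Only the refetched DataRestoreTS is larger than every DDL timestamp, so only then can the rewritten data shadow the KVs written by the backfill.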

Alternatively, ensure DDLs won't generate any t-prefixed KVs, so that no overwriting can happen:

1. get "RestoreTS" = 51
2. set up a GCSafePoint on RestoreTS=51
3. perform ExecDDL with backfilling set to do nothing 🆕 (tso = 52...60)
4. import data with NewTimestamp set to RestoreTS=51, or without rewriting TS at all ⭕️

But if we do want to keep rewriteTS, it should use 61 rather than 51 to preserve the global chronological order.

If we disable rewriteTS, we should ideally also find a way to force ExecDDL to commit the DDL jobs on the "backup time axis" rather than the "restore time axis", so that @@tidb_snapshot still works.
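The effect of each option on a single stale index entry, (val: 11, pk: 1), can be modeled in a few lines. This Go sketch is a hypothetical simplification, not BR's implementation: it assumes the backfill commits at 60 as in the timelines above, and that the incremental backup carries a DELETE for that entry, as the tidb-ctl output shows.

```go
package main

import "fmt"

// version is one MVCC record: commit TS and whether it is a delete marker.
type version struct {
	commitTS uint64
	deleted  bool
}

// newestIsLive reports whether the newest version of a key is a put, which
// is what any read after the restore would observe.
func newestIsLive(vs []version) bool {
	if len(vs) == 0 {
		return false
	}
	newest := vs[0]
	for _, v := range vs[1:] {
		if v.commitTS > newest.commitTS {
			newest = v
		}
	}
	return !newest.deleted
}

// staleEntryLive models the index entry (val: 11, pk: 1) after an
// incremental restore and reports whether it is (wrongly) still visible.
func staleEntryLive(backfill bool, newTS uint64) bool {
	const ddlBackfillTS = 60 // assumed commit TS of the ADD INDEX backfill (tso 52...60)
	var vs []version
	if backfill {
		// Backfilling derives the entry from the full-restore row, where val = 11.
		vs = append(vs, version{commitTS: ddlBackfillTS})
	}
	// The incremental backup holds a DELETE for this entry, rewritten to newTS.
	vs = append(vs, version{commitTS: newTS, deleted: true})
	return newestIsLive(vs)
}

func main() {
	fmt.Println("current flow (backfill, NewTimestamp=51):", staleEntryLive(true, 51))
	fmt.Println("refetched DataRestoreTS=61:", staleEntryLive(true, 61))
	fmt.Println("skip backfill, NewTimestamp=51:", staleEntryLive(false, 51))
}
```

Either fix makes the incremental DELETE the newest version of the key, so the stale entry disappears; only the current flow leaves the backfilled put on top.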

kennytm added the type/bug, component/br and affects-8.1 labels Jul 3, 2024

3pointer mentioned this issue Jul 4, 2024

Incremental restore: fix the issue that backfill data is not covered by newTS #54430

Merged

3pointer added affects-6.5 affects-7.1 affects-7.5 labels Jul 5, 2024

jebter added the impact/inconsistency label Jul 5, 2024

3pointer added the severity/major label Jul 8, 2024

ti-chi-bot added the may-affects-5.4 and may-affects-6.1 labels Jul 8, 2024

3pointer removed the may-affects-5.4 and may-affects-6.1 labels Jul 8, 2024

ti-chi-bot bot closed this as completed in #54430 Jul 8, 2024

ti-chi-bot bot closed this as completed in 08147e7 Jul 8, 2024

ti-chi-bot mentioned this issue Jul 8, 2024

Incremental restore: fix the issue that backfill data is not covered by newTS (#54430) #54505

Open