title | summary | category |
---|---|---|
Handle Sharding DDL Locks Manually in DM |
Learn how to handle sharding DDL locks manually in DM. |
reference |
DM uses the sharding DDL lock to ensure operations are performed in the correct order. This locking mechanism resolves sharding DDL locks automatically in most cases, but you need to use the unlock-ddl-lock
or break-ddl-lock
command to manually handle the abnormal DDL locks in some abnormal scenarios.
Warning:
- Do not use
unlock-ddl-lock
orbreak-ddl-lock
unless you are totally aware of the possible impacts brought by the command and you can accept them.- Before manually handling the abnormal DDL locks, make sure that you have already read the DM shard merge principles.
This command queries the current DDL lock information on DM-master
.
show-ddl-locks [--worker=127.0.0.1:8262] [task-name]
-
worker
:- Flag; string;
--worker
; optional - It can be specified repeatedly multiple times.
- If it is not specified, this command queries the lock information related to all DM-workers; if it is specified, this command queries the lock information related only to the specified DM-worker.
- Flag; string;
-
task-name
:- Non-flag; string; optional
- If it is not specified, this command queries the lock information related to all tasks; if it is specified, this command queries the lock information related only to the specified task.
» show-ddl-locks test
{
"result": true, # The result of the query for the lock information.
"msg": "", # The additional message for the failure to query the lock information or other descriptive information (for example, the lock task does not exist).
"locks": [ # The lock information list on DM-master.
{
"ID": "test-`shard_db`.`shard_table`", # The lock ID, which is made up of the current task name and the schema/table information corresponding to the DDL.
"task": "test", # The name of the task to which the lock belongs.
"owner": "127.0.0.1:8262", # The owner of the lock.
"DDLs": [ # The DDL list corresponding to the lock.
"USE `shard_db`; ALTER TABLE `shard_db`.`shard_table` DROP COLUMN `c2`;"
],
"synced": [ # The list of DM-workers that have received all sharding DDL events in the corresponding MySQL instance.
"127.0.0.1:8262"
],
"unsynced": [ # The list of DM-workers that have not yet received all sharding DDL events in the corresponding MySQL instance.
"127.0.0.1:8263"
]
}
]
}
This command actively requests DM-master
to unlock the specified DDL lock, including requesting the owner to execute the DDL statement, requesting all other DM-workers that are not the owner to skip the DDL statement, and removing the lock information on DM-master
.
unlock-ddl-lock [--worker=127.0.0.1:8262] [--owner] [--force-remove] <lock-ID>
-
worker
:- Flag; string;
--worker
; optional - It can be specified repeatedly multiple times.
- If it is not specified, this command sends requests for all DM-workers (except for the owner) that are waiting for the lock to skip the DDL statement; if it is specified, this command sends requests only for the specified DM-worker to skip the DDL statement.
- Flag; string;
-
owner
:- Flag; string;
--owner
; optional - If it is not specified, this command requests for the default owner (the owner in the result of
show-ddl-locks
) to execute the DDL statement; if it is specified, this command requests for the DM-worker (the alternative of the default owner) to execute the DDL statement.
- Flag; string;
-
force-remove
:- Flag; boolean;
--force-remove
; optional - If it is not specified, this command removes the lock information only when the owner succeeds to execute the DDL statement; if it is specified, this command forcefully removes the lock information even though the owner fails to execute the DDL statement (after doing this you cannot query or operate on the lock again).
- Flag; boolean;
-
lock-ID
:- Non-flag; string; required
- It specifies the ID of the DDL lock that needs to be unlocked (the
ID
in the result ofshow-ddl-locks
).
» unlock-ddl-lock test-`shard_db`.`shard_table`
{
"result": true, # The result of the unlocking operation.
"msg": "", # The additional message for the failure to unlock the lock.
"workers": [ # The result list of the executing or skipping DDL operation of each DM-worker.
{
"result": true, # The result of the executing or skipping DDL operation.
"worker": "127.0.0.1:8262", # The DM-worker ID.
"msg": "" # The reasons why the DM-worker failed to execute or skip the DDL statement.
}
]
}
This command actively asks the DM-worker to forcefully break the DDL lock that is to be unlocked, including asking the DM-worker to execute/skip the DDL and removing the DDL lock information on the DM-worker.
break-ddl-lock <--worker=127.0.0.1:8262> [--remove-id] [--exec] [--skip] <task-name>
-
worker
:- Flag; string;
--worker
; required - It specifies the DM-worker that needs to execute the breaking operation.
- Flag; string;
-
remove-id
: deprecated. -
exec
:- Flag; boolean;
--exec
; optional - It cannot be specified simultaneously with the
--skip
parameter. - If it is specified, this command asks the DM-worker to execute the corresponding DDL statement of the lock.
- Flag; boolean;
-
skip
:- flag; boolean;
--skip
; optional - It cannot be specified simultaneously with the
--exec
parameter. - If it is specified, this command asks the DM-worker to skip the corresponding DDL statement of the lock.
- flag; boolean;
-
task-name
:- Non-flag; string; required
- It specifies the name of the task containing the lock that is going to execute the breaking operation (you can check whether a task contains the lock via query-status).
» break-ddl-lock -w 127.0.0.1:8262 --exec test
{
"result": true, # The result of the lock breaking operation.
"msg": "", # The reason why the breaking lock operation failed.
"workers": [ # The list of DM-workers which break the lock (currently the lock can be broken by only one DM-worker at a single operation).
{
"result": false, # The result of the lock breaking operation by the DM-worker.
"worker": "127.0.0.1:8262", # The DM-worker ID.
"msg": "" # The reason why the DM-worker failed to break the lock.
}
]
}
Currently, the unlock-ddl-lock
or break-ddl-lock
command only supports handling sharding DDL locks in the following three abnormal scenarios.
Before DM-master
tries to automatically unlock the sharding DDL lock, all the DM-workers need to receive the sharding DDL events (for details, see shard merge principles). If the sharding DDL event is already in the replication process, and some DM-workers have gone offline and are not to be restarted (these DM-workers have been removed according to the application demand), then the sharding DDL lock cannot be automatically replicated and unlocked because not all the DM-workers can receive the DDL event.
Note:
If you need to make some DM-workers offline when not in the process of replicating sharding DDL events, a better solution is to use
stop-task
to stop the running tasks first, make the DM-workers go offline, remove the corresponding configuration information from the task configuration file, and finally usestart-task
and the new task configuration to restart the replication task.
Suppose that there are two instances MySQL-1
and MySQL-2
in the upstream, and there are two tables shard_db_1
.shard_table_1
and shard_db_1
.shard_table_2
in MySQL-1
and two tables shard_db_2
.shard_table_1
and shard_db_2
.shard_table_2
in MySQL-2
. Now we need to merge the four tables and replicate them into the table shard_db
.shard_table
in the downstream TiDB.
The initial table structure is:
mysql> SHOW CREATE TABLE shard_db_1.shard_table_1;
+---------------+------------------------------------------+
| Table | Create Table |
+---------------+------------------------------------------+
| shard_table_1 | CREATE TABLE `shard_table_1` (
`c1` int(11) NOT NULL,
PRIMARY KEY (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+---------------+------------------------------------------+
The following DDL operation will be executed on the upstream sharded tables to alter the table structure:
ALTER TABLE shard_db_*.shard_table_* ADD COLUMN c2 INT;
The operation processes of MySQL and DM are as follows:
-
The corresponding DDL operations are executed on the two sharded tables of
DM-worker-1
inMySQL-1
to alter the table structures.ALTER TABLE shard_db_1.shard_table_1 ADD COLUMN c2 INT;
ALTER TABLE shard_db_1.shard_table_2 ADD COLUMN c2 INT;
-
DM-worker-1
sends the DDL information related toMySQL-1
toDM-master
, andDM-master
creates the corresponding DDL lock. -
Use
show-ddl-lock
to check the information of the current DDL lock.» show-ddl-locks test { "result": true, "msg": "", "locks": [ { "ID": "test-`shard_db`.`shard_table`", "task": "test", "owner": "127.0.0.1:8262", "DDLs": [ "USE `shard_db`; ALTER TABLE `shard_db`.`shard_table` ADD COLUMN `c2` int(11);" ], "synced": [ "127.0.0.1:8262" ], "unsynced": [ "127.0.0.1:8263" ] } ] }
-
Due to the application demand, the
DM-worker-2
data inMySQL-2
is no longer needed to be replicated to the downstream TiDB, andDM-worker-2
is made offline. -
The lock whose ID is
test-`shard_db`.`shard_table`
onDM-master
cannot receive the DDL information ofDM-worker-2
.- The returned result
unsynced
byshow-ddl-locks
has always included the information ofDM-worker-2
(127.0.0.1:8263
).
- The returned result
-
Use
unlock-dll-lock
to askDM-master
to actively unlock the DDL lock.-
If the owner of the DDL lock has gone offline, you can use the parameter
--owner
to specify another DM-worker as the new owner to execute the DDL. -
If any DM-worker reports an error,
result
will be set tofalse
, and at this point you should check carefully if the errors of each DM-worker is acceptable and within expectations.- DM-workers that have gone offline will return the error
rpc error: code = Unavailable
, which is within expectations and can be neglected; but if other online DM-workers return errors, then you should deal with them based on the scenario.
» unlock-ddl-lock test-`shard_db`.`shard_table` { "result": false, "msg": "github.com/pingcap/tidb-enterprise-tools/dm/master/server.go:1472: DDL lock test-`shard_db`.`shard_table` owner ExecuteDDL successfully, so DDL lock removed. but some dm-workers ExecuteDDL fail, you should to handle dm-worker directly", "workers": [ { "result": true, "worker": "127.0.0.1:8262", "msg": "" }, { "result": false, "worker": "127.0.0.1:8263", "msg": "rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8263: connect: connection refused\"" } ] }
- DM-workers that have gone offline will return the error
-
-
Use
show-ddl-locks
to confirm if the DDL lock is unlocked successfully.» show-ddl-locks test { "result": true, "msg": "no DDL lock exists", "locks": [ ] }
-
Check whether the table structure is altered successfully in the downstream TiDB.
mysql> SHOW CREATE TABLE shard_db.shard_table; +-------------+--------------------------------------------------+ | Table | Create Table | +-------------+--------------------------------------------------+ | shard_table | CREATE TABLE `shard_table` ( `c1` int(11) NOT NULL, `c2` int(11) DEFAULT NULL, PRIMARY KEY (`c1`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin | +-------------+--------------------------------------------------+
-
Use
query-status
to confirm if the replication task is normal.
After you have manually unlocked the lock by using unlock-ddl-lock
, if you don't deal with the offline DM-workers included in the task configuration information, the lock might still be unable to be replicated automatically when the next sharding DDL event is received.
Therefore, after you have manually unlocked the DDL lock, you should perform the following operations:
- Use
stop-task
to stop the running tasks. - Update the task configuration file, and remove the related information of the offline DM-worker from the configuration file.
- Use
start-task
and the new task configuration file to restart the task.
Note:
After you run
unlock-ddl-lock
, if the DM-worker that went offline becomes online again and tries to replicate the data of the sharded tables, a match error between the data and the downstream table structure might occur.
After DM-master
receives the DDL events of all DM-workers, automatically running unlock DDL lock
mainly include the following steps:
- Ask the owner of the lock to execute the DDL and update the checkpoints of corresponding sharded tables.
- Remove the DDL lock information stored on
DM-master
after the owner successfully executes the DDL. - Ask all other DM-workers to skip the DDL and update the checkpoints of corresponding sharded tables after the owner successfully executes the DDL.
Currently, the above unlocking process is not atomic. Therefore, after the owner successfully executes the DDL, if a DM-worker restarts during the period of asking other DM-workers to skip the DDL, then the DM-worker might fail to skip the DDL.
At this point, the lock information on DM-master
has been removed and the restarted DM-worker will continue to replicate the DDL, but as other DM-workers (including the previous owner) has replicated the DDL and continued the replication process, this DM-worker will never see the DDL lock be unlocked automatically.
Suppose that now we have the same upstream and downstream table structures and the same demand for merging tables and replication as in the manual solution of Some DM-workers go offline.
When DM-master
automatically executes the unlocking process, the owner (DM-worker-1
) successfully executes the DDL and continues the replication process, and the DDL lock information has been removed from DM-master
. But at this point, if DM-worker-2
restarts during the period of asking DM-worker-2
to skip the DDL, then the skipping process might fail.
After DM-worker-2
restarts, it will try to replicate the waiting DDL lock before it restarted. At this point, a new lock will be created on DM-master
, and the DM-worker will become the owner of the lock (other DM-workers have executed/skipped the DDL by now and are continuing the replication process).
The operation processes are:
-
Use
show-ddl-locks
to confirm if the corresponding lock of the DDL exists onDM-master
.Only the restarted DM-worker (
127.0.0.1:8263
) is at thesynced
state.» show-ddl-locks { "result": true, "msg": "", "locks": [ { "ID": "test-`shard_db`.`shard_table`", "task": "test", "owner": "127.0.0.1:8263", "DDLs": [ "USE `shard_db`; ALTER TABLE `shard_db`.`shard_table` ADD COLUMN `c2` int(11);" ], "synced": [ "127.0.0.1:8263" ], "unsynced": [ "127.0.0.1:8262" ] } ] }
-
Use
unlock-ddl-lock
to askDM-master
to unlock the lock.-
Use the parameter
--worker
to limit the operation to only target at the restarted DM-worker (127.0.0.1:8263
). -
The DM-worker will try to execute the DDL to the downstream again during the unlocking process (the owner before restarting has executed the DDL to the downstream), so as to make sure that the DDL can be executed multiple times.
» unlock-ddl-lock --worker=127.0.0.1:8263 test-`shard_db`.`shard_table` { "result": true, "msg": "", "workers": [ { "result": true, "worker": "127.0.0.1:8263", "msg": "" } ] }
-
-
Use
show-ddl-locks
to confirm if the DDL lock has been successfully unlocked. -
Use
query-status
to confirm if the replication task is normal.
After manually unlocking the lock, the following sharding DDL can be replicated automatically and normally.
This scenario has the similar reason for the abnormal lock in Scenario 3: Some DM-workers restart during the DDL unlocking process. If the DM-worker is temporarily unreachable when you request the DM-worker to skip the DDL statement, this DM-worker might fail to skip the DDL statement. At this point, the lock information is removed from DM-master
, but the DM-worker will continue to be waiting for a DDL lock which is no longer existing.
The difference between Scenario 3 and Scenario 2: Some DM-workers restart during the DDL unlocking process is that the DM-master does not have a lock in Scenario 3, but the DM-master has a new lock in Scenario 2.
Suppose that now we have the same upstream and downstream table structures and the same demand for merging tables and replication as in the manual solution of Some DM-workers go offline.
When DM-master
automatically executes the unlocking operation, the owner (DM-worker-1
) successfully executes the DDL and continues the replication process, and the DDL lock information has been removed from DM-master
. But at this point, if DM-worker-2
is temporarily unreachable due to the Internet failure during the period of asking DM-worker-2
to skip the DDL, then the skipping process might fail.
The operation processes are:
-
Use
show-ddl-locks
to confirm if the corresponding lock of the DDL no longer exists onDM-master
. -
Use
query-status
to confirm if the DM-worker is still waiting for the lock to replicate.» query-status test { "result": true, "msg": "", "workers": [ ... { ... "worker": "127.0.0.1:8263", "subTaskStatus": [ { ... "unresolvedDDLLockID": "test-`shard_db`.`shard_table`", "sync": { ... "blockingDDLs": [ "USE `shard_db`; ALTER TABLE `shard_db`.`shard_table` ADD COLUMN `c2` int(11);" ], "unresolvedGroups": [ { "target": "`shard_db`.`shard_table`", "DDLs": [ "USE `shard_db`; ALTER TABLE `shard_db`.`shard_table` ADD COLUMN `c2` int(11);" ], "firstPos": "(mysql-bin|000001.000003, 1752)", "synced": [ "`shard_db_2`.`shard_table_1`", "`shard_db_2`.`shard_table_2`" ], "unsynced": [ ] } ], "synced": false } } ] ... } ] }
-
Use
break-ddl-lock
to compulsorily break the DDL lock which the DM-worker is waiting for.As the owner has executed the DDL to the downstream, you should use the parameter
--skip
to break the lock.» break-ddl-lock --worker=127.0.0.1:8263 --skip test { "result": true, "msg": "", "workers": [ { "result": true, "worker": "127.0.0.1:8263", "msg": "" } ] }
-
Use
query-status
to confirm if the replication task is normal and no longer at the state of waiting for the lock.
After manually breaking the lock, the following sharding DDL can be replicated automatically and normally.