
Release the read lock while creating connections in refresh_connections #191

Open
wants to merge 1 commit into base: main
Conversation

@barshaul barshaul commented Sep 13, 2024

PR Description:

Main Changes:

  • Lock Management Improvement:
    In the previous implementation, the read lock (inner.conn_lock.read()) was held for the entire connection refresh process (i.e., for every connection being refreshed), including while connection attempts were in flight (via get_or_create_conn). If connections were slow or timed out, the lock was held for an extended duration, blocking other tasks that required the write lock.

    The new implementation releases the read lock before making connection attempts. Once a connection is successfully established, the read lock is reacquired to update the connection container. This ensures that other operations needing the lock (e.g., write operations) can proceed while connections are being established. (A simplified sketch of this pattern appears below, after the deadlock description.)

  • Unclear Deadlock Behavior:
    A deadlock was observed on the update_slotmap_moved branch (on amazon-contributing/redis-rs) during failover testing; its root cause remains unclear. The branch introduces changes that acquire a write lock on the connection container, and that is where the issue surfaces. However, even after removing the body of the update_upon_moved function (leaving only the lock acquisition), the deadlock persisted, suggesting that the problem isn't directly tied to the function's logic itself.

There appears to be an unusual race condition that leaves the lock in an undefined state where neither readers nor writers can acquire it. As a result, every task that touches the lock blocks, and the client deadlocks.

The issue arose in the following situation:

  1. A failover is initiated by terminating a node.
  2. refresh_connections is triggered and acquires the read lock, while get_or_create_conn is waiting for a connection to complete.
  3. Meanwhile, update_upon_moved tries to acquire the write lock but is blocked since the read lock is held by refresh_connections.
  4. After refresh_connections fails with a Connection refused (os error 111) and exits, the lock is not properly released.
  5. Despite the function returning, the read lock remains stuck, resulting in a system-wide deadlock where both read and write lock tasks are stalled. Logs only show timeouts at the socket_listener level, with the internal redis-rs client being completely stuck.

Important: It is unclear why this "deadlock" occurs and why the lock isn't released after the function exits. Despite attempts to explicitly drop the lock right before the function returns, the issue persisted. However, with the new lock-release-before-connection strategy, the problem no longer appears.
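
For illustration, here is a minimal sketch of the new locking pattern described above. The types and names (Inner, Connection, connect) are simplified placeholders, not the actual redis-rs implementation; the real connection container allows updates under a read guard, whereas the plain HashMap below needs a write guard for the final insert:

use std::{collections::HashMap, sync::Arc};
use tokio::sync::RwLock;

#[derive(Clone)]
struct Connection; // placeholder for the real connection type

struct Inner {
    conn_lock: RwLock<HashMap<String, Connection>>, // stand-in for the connection container
}

// Placeholder for the potentially slow get_or_create_conn call.
async fn connect(_address: &str) -> Result<Connection, std::io::Error> {
    Ok(Connection)
}

async fn refresh_connection(inner: Arc<Inner>, address: String) {
    // Old approach: the read guard stayed alive across connect(), blocking
    // writers (e.g. update_upon_moved) for the whole connection attempt.
    //
    // New approach: take what is needed under the read lock, drop the guard,
    // connect without holding any lock, then reacquire the lock only for the
    // short container update.
    let _existing = {
        let guard = inner.conn_lock.read().await;
        guard.get(&address).cloned()
    }; // read guard dropped here, before any network I/O

    match connect(&address).await {
        Ok(conn) => {
            inner.conn_lock.write().await.insert(address, conn);
        }
        Err(err) => eprintln!("Failed to refresh connection for {address}: {err}"),
    }
}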

Testing:

  • Failover Handling:
    This issue and change were tested by simulating node failovers on the update_slotmap_moved branch, verifying that the client recovers without getting stuck and quickly finds the promoted replica while continuing to serve operations.

We still need to investigate the root cause of the lock issue (possibly a tokio bug?), but this change resolves the deadlock and improves lock management.


Deadlock Test Logs:

13:51:07.460972  WARN No connection found for route `Route(1200, Master)`. Attempting redirection to a random node.
13:51:07.460991  INFO connectionCheck::RandomConnection
13:51:07.461007  INFO connectionCheck::RandomConnection acquired lock
13:51:07.461057  INFO connectionCheck::RandomConnection dropped lock
13:51:07.461107  INFO validate_all_user_connections acquired lock
13:51:07.461299  INFO validate_all_user_connections lock is dropped
13:51:07.461353  INFO validate_all_user_connections calls refresh_connections
13:51:07.461369  INFO Started refreshing connections to ["host:6379"]
13:51:07.461369  INFO refresh_connections acquired read lock
13:51:07.462057  INFO Creating TCP with TLS connection for node: "host:6379", IP: x.x.x.172
13:51:07.481366  WARN Received request error An error was signalled by the server - Moved: 1200 host:6379 on node "other_host:6379".
13:51:07.481535  INFO update_upon_moved_error is called, waiting to acquire write lock (no log after, meaning lock isn't acquired)
13:51:07.481583  INFO refresh_slots_inner is called, waiting to acquire read lock (no log after, meaning lock isn't acquired)
13:51:07.481904  WARN Failed to refresh connection for node host:6379. Error: `Connection refused (os error 111)`
13:51:07.481939  INFO refresh connections completed, function exits
13:51:07.481964  INFO validate_all_user_connections checks again if I can acquire read lock (no log after, meaning lock isn't acquired)
13:51:07.710412  WARN received error - timed out
13:51:07.710453  DEBUG received error - for callback 82
13:51:07.961725  WARN received error - timed out
13:51:07.961786  DEBUG received error - for callback 82
..... more and more timeout errors are raised from the socket_listener; the redis-rs client is stuck

@barshaul changed the title from "Changed refresh_connections to release the connections container lock…" to "Fix deadlock caused by refresh_connections by adjusting lock management" on Sep 13, 2024
@barshaul changed the title from "Fix deadlock caused by refresh_connections by adjusting lock management" to "Release the read lock while creating connections in refresh_connections" on Sep 14, 2024
)
.await;
tasks.push(async move {
let connections_container = inner.conn_lock.read().await;

Making a lock "public" is not a good idea. We should an atomic API and not the lock itself.

For example:

fn do_something() -> Result<(), Box<dyn Error>>{
  let _lk = self.lock.write()?;
  ...
}

if the "something" is complex, we should add an API:

fn write_lock_and_do<F>(callback: F) -> Result<(), Box<dyn Error>>
    where F: Fn() -> Result<(), Box<dyn Error>> {
    let _lk = self.lock.write()?;
    callback()
}

this way we have a full control over the lock and we can avoid misuse of the lock
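
For the async tokio::sync::RwLock used in this client, an equivalent helper could look like the following. This is a sketch only; the names Core, ConnectionsContainer, and write_lock_and_do are illustrative, not an existing API:

use std::error::Error;
use tokio::sync::RwLock;

struct ConnectionsContainer; // placeholder for the real container type

struct Core {
    conn_lock: RwLock<ConnectionsContainer>,
}

impl Core {
    // Run `callback` while holding the write lock, so callers never see the
    // lock itself and cannot accidentally hold it across unrelated awaits.
    async fn write_lock_and_do<F>(&self, callback: F) -> Result<(), Box<dyn Error>>
    where
        F: FnOnce(&mut ConnectionsContainer) -> Result<(), Box<dyn Error>>,
    {
        let mut guard = self.conn_lock.write().await;
        callback(&mut *guard)
    }
}

Callers would then write something like core.write_lock_and_do(|container| { /* mutate container */ Ok(()) }).await, never touching conn_lock directly.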

let connections_container = inner.conn_lock.read().await;
let cluster_params = &inner.cluster_params;
let subscriptions_by_address = &inner.subscriptions_by_address;
let glide_connection_options = &inner.glide_connection_options;

Why the temporary assignment here?

node_option,
&cluster_params,
conn_type,
glide_connection_options.clone(),

inner.glide_connection_options.clone()

match result {
(address, Ok(node)) => {
let connections_container = inner.conn_lock.read().await;
connections_container.replace_or_add_connection_for_address(address, node);

We should expose an API on inner for this function (replace_or_add_connection_for_address) and avoid exposing the lock here.
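
A possible shape for such a wrapper, sketched with placeholder types (Core, ConnectionsContainer, Node are assumptions, not the actual redis-rs types):

use tokio::sync::RwLock;

struct Node; // placeholder for the refreshed connection/node

struct ConnectionsContainer; // placeholder for the real container

impl ConnectionsContainer {
    fn replace_or_add_connection_for_address(&self, _address: String, _node: Node) {
        // the real container updates its internal, concurrently updatable map here
    }
}

struct Core {
    conn_lock: RwLock<ConnectionsContainer>,
}

impl Core {
    // Forward the update through `inner` so callers never touch conn_lock.
    async fn replace_or_add_connection_for_address(&self, address: String, node: Node) {
        // A read guard suffices here because the real container supports
        // concurrent insertion, mirroring the PR's current code.
        let container = self.conn_lock.read().await;
        container.replace_or_add_connection_for_address(address, node);
    }
}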

);
}
}
}
info!("refresh connections completed");

Is this something that happens often? If it does, please move this to debug!
