
Release the read lock while creating connections in refresh_connections #191

Open
wants to merge 1 commit into base: main
Conversation

@barshaul barshaul commented Sep 13, 2024

PR Description:

Main Changes:

  • Lock Management Improvement:
    In the previous implementation, the read lock (inner.conn_lock.read()) was held for the entire connection refresh process (i.e., for every connection being refreshed), including while connection attempts were in flight (via get_or_create_conn). If connections were slow or timed out, the lock was held for an extended duration, blocking other tasks that required the write lock.

    The new implementation releases the read lock before making connection attempts. Once a connection is successfully established, the read lock is reacquired to update the connection container. This ensures that other operations needing the lock (e.g., write operations) can proceed while connections are being established. (A simplified sketch of this pattern appears below, after the deadlock description.)

  • Unclear Deadlock Behavior:
    A deadlock was observed on the update_slotmap_moved branch (on amazon-contributing/redis-rs) during failover testing; its root cause remains unclear. The branch introduces changes that acquire a write lock on the connection container, and that is where the issue surfaces. However, even after removing the body of the update_upon_moved function (leaving only the lock acquisition), the deadlock persisted, suggesting that the problem isn't directly tied to the function's logic itself.

There appears to be an unusual race condition that leaves the lock in an undefined state where neither readers nor writers can acquire it. As a result, every task that touches the lock blocks, and the client deadlocks.

The issue arose in the following situation:

  1. A failover is initiated by terminating a node.
  2. refresh_connections is triggered and acquires the read lock, while get_or_create_conn is waiting for a connection to complete.
  3. Meanwhile, update_upon_moved tries to acquire the write lock but is blocked since the read lock is held by refresh_connections.
  4. After refresh_connections fails with a Connection refused (os error 111) and exits, the lock is not properly released.
  5. Despite the function returning, the read lock remains stuck, resulting in a system-wide deadlock where both read and write lock tasks are stalled. Logs only show timeouts at the socket_listener level, with the internal redis-rs client being completely stuck.

Important: It is unclear why this "deadlock" occurs and why the lock isn't released after the function exits. Despite attempts to explicitly drop the lock right before the function returns, the issue persisted. However, with the new lock-release-before-connection strategy, the problem no longer appears.
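
For illustration, here is a minimal sketch of the new locking pattern described above. The types and names (Inner, Connection, connect) are simplified placeholders, not the actual redis-rs implementation; the real connection container allows updates under a read guard, whereas the plain HashMap below needs a write guard for the final insert:

use std::{collections::HashMap, sync::Arc};
use tokio::sync::RwLock;

#[derive(Clone)]
struct Connection; // placeholder for the real connection type

struct Inner {
    conn_lock: RwLock<HashMap<String, Connection>>, // stand-in for the connection container
}

// Placeholder for the potentially slow get_or_create_conn call.
async fn connect(_address: &str) -> Result<Connection, std::io::Error> {
    Ok(Connection)
}

async fn refresh_connection(inner: Arc<Inner>, address: String) {
    // Old approach: the read guard stayed alive across connect(), blocking
    // writers (e.g. update_upon_moved) for the whole connection attempt.
    //
    // New approach: take what is needed under the read lock, drop the guard,
    // connect without holding any lock, then reacquire the lock only for the
    // short container update.
    let _existing = {
        let guard = inner.conn_lock.read().await;
        guard.get(&address).cloned()
    }; // read guard dropped here, before any network I/O

    match connect(&address).await {
        Ok(conn) => {
            inner.conn_lock.write().await.insert(address, conn);
        }
        Err(err) => eprintln!("Failed to refresh connection for {address}: {err}"),
    }
}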

Testing:

  • Failover Handling:
    This issue and change were tested by simulating node failovers on the update_slotmap_moved branch, verifying that the client recovers without getting stuck and quickly finds the promoted replica while continuing to serve operations.

We still need to investigate the root cause of the lock issue (possibly a tokio bug?), but this change resolves the deadlock and improves lock management.


Deadlock Test Logs:

13:51:07.460972  WARN No connection found for route `Route(1200, Master)`. Attempting redirection to a random node.
13:51:07.460991  INFO connectionCheck::RandomConnection
13:51:07.461007  INFO connectionCheck::RandomConnection acquired lock
13:51:07.461057  INFO connectionCheck::RandomConnection dropped lock
13:51:07.461107  INFO validate_all_user_connections acquired lock
13:51:07.461299  INFO validate_all_user_connections lock is dropped
13:51:07.461353  INFO validate_all_user_connections calls refresh_connections
13:51:07.461369  INFO Started refreshing connections to ["host:6379"]
13:51:07.461369  INFO refresh_connections acquired read lock
13:51:07.462057  INFO Creating TCP with TLS connection for node: "host:6379", IP: x.x.x.172
13:51:07.481366  WARN Received request error An error was signalled by the server - Moved: 1200 host:6379 on node "other_host:6379".
13:51:07.481535  INFO update_upon_moved_error is called, waiting to acquire write lock (no log after, meaning lock isn't acquired)
13:51:07.481583  INFO refresh_slots_inner is called, waiting to acquire read lock (no log after, meaning lock isn't acquired)
13:51:07.481904  WARN Failed to refresh connection for node host:6379. Error: `Connection refused (os error 111)`
13:51:07.481939  INFO refresh connections completed, function exits
13:51:07.481964  INFO validate_all_user_connections checks again if I can acquire read lock (no log after, meaning lock isn't acquired)
13:51:07.710412  WARN received error - timed out
13:51:07.710453  DEBUG received error - for callback 82
13:51:07.961725  WARN received error - timed out
13:51:07.961786  DEBUG received error - for callback 82
..... more and more timeout errors are raised from the socket_listener; the redis-rs client is stuck

@barshaul changed the title from "Changed refresh_connections to release the connections container lock…" to "Fix deadlock caused by refresh_connections by adjusting lock management" on Sep 13, 2024
@barshaul changed the title from "Fix deadlock caused by refresh_connections by adjusting lock management" to "Release the read lock while creating connections in refresh_connections" on Sep 14, 2024
)
.await;
tasks.push(async move {
let connections_container = inner.conn_lock.read().await;

Making a lock "public" is not a good idea. We should an atomic API and not the lock itself.

For example:

fn do_something() -> Result<(), Box<dyn Error>>{
  let _lk = self.lock.write()?;
  ...
}

if the "something" is complex, we should add an API:

fn write_lock_and_do<F>(callback: F) -> Result<(), Box<dyn Error>>
    where F: Fn() -> Result<(), Box<dyn Error>> {
    let _lk = self.lock.write()?;
    callback()
}

this way we have a full control over the lock and we can avoid misuse of the lock
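
For the async tokio::sync::RwLock used in this client, an equivalent helper could look like the following. This is a sketch only; the names Core, ConnectionsContainer, and write_lock_and_do are illustrative, not an existing API:

use std::error::Error;
use tokio::sync::RwLock;

struct ConnectionsContainer; // placeholder for the real container type

struct Core {
    conn_lock: RwLock<ConnectionsContainer>,
}

impl Core {
    // Run `callback` while holding the write lock, so callers never see the
    // lock itself and cannot accidentally hold it across unrelated awaits.
    async fn write_lock_and_do<F>(&self, callback: F) -> Result<(), Box<dyn Error>>
    where
        F: FnOnce(&mut ConnectionsContainer) -> Result<(), Box<dyn Error>>,
    {
        let mut guard = self.conn_lock.write().await;
        callback(&mut *guard)
    }
}

Callers would then write something like core.write_lock_and_do(|container| { /* mutate container */ Ok(()) }).await, never touching conn_lock directly.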

let connections_container = inner.conn_lock.read().await;
let cluster_params = &inner.cluster_params;
let subscriptions_by_address = &inner.subscriptions_by_address;
let glide_connection_options = &inner.glide_connection_options;

Why the temporary assignment here?

node_option,
&cluster_params,
conn_type,
glide_connection_options.clone(),

inner.glide_connection_options.clone()

match result {
(address, Ok(node)) => {
let connections_container = inner.conn_lock.read().await;
connections_container.replace_or_add_connection_for_address(address, node);

We should expose an API on inner for this function (replace_or_add_connection_for_address) and avoid exposing the lock here.
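
A possible shape for such a wrapper, sketched with placeholder types (Core, ConnectionsContainer, Node are assumptions, not the actual redis-rs types):

use tokio::sync::RwLock;

struct Node; // placeholder for the refreshed connection/node

struct ConnectionsContainer; // placeholder for the real container

impl ConnectionsContainer {
    fn replace_or_add_connection_for_address(&self, _address: String, _node: Node) {
        // the real container updates its internal, concurrently updatable map here
    }
}

struct Core {
    conn_lock: RwLock<ConnectionsContainer>,
}

impl Core {
    // Forward the update through `inner` so callers never touch conn_lock.
    async fn replace_or_add_connection_for_address(&self, address: String, node: Node) {
        // A read guard suffices here because the real container supports
        // concurrent insertion, mirroring the PR's current code.
        let container = self.conn_lock.read().await;
        container.replace_or_add_connection_for_address(address, node);
    }
}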

);
}
}
}
info!("refresh connections completed");

Is this something that happens often? If it does, please move this to debug!
