Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix more tsan bugs #229

Merged
merged 3 commits into from
Jan 24, 2020
Merged

Fix more tsan bugs #229

merged 3 commits into from
Jan 24, 2020

Conversation

justinboswell
Copy link
Contributor

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -239,15 +239,16 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) {
struct aws_channel_creation_callbacks callbacks = {
.on_setup_completed = s_channel_setup_test_on_setup_completed,
.setup_user_data = &test_args,
.on_shutdown_completed = s_channel_setup_test_on_setup_completed,
.shutdown_user_data = &test_args,
.on_shutdown_completed = NULL,
Copy link
Contributor

@nchong-at-aws nchong-at-aws Jan 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a copy-paste error. I assumed waiting on shutdown being complete isn't needed in this test so made them NULL.

ASSERT_SUCCESS(aws_condition_variable_wait(&test_args.condition_variable, &test_args.mutex));
ASSERT_SUCCESS(aws_mutex_unlock(&test_args.mutex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should unlock after condition wait.

@@ -281,8 +280,10 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) {

/* Release hold 2/2. The handler and channel should be destroyed. */
aws_channel_release_hold(channel);
ASSERT_SUCCESS(aws_mutex_lock(&destroy_mutex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduce size of the critical section by moving this lock down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this mutex/condition variable need to exist anymore. We can just loop until destroy_called is true.

Before our 1st pass of tsan fixes, this test was broken because it didn't acquire the lock before changing the bool and signaling the condvar.

We changed the bool to be atomic. We're using the mutex when waiting for the value and not using wait_pred(). But we're not using the mutex when changing the value and signaling the condvar. That means, if we set the bool and signal before the main thread waits for it, the main thread will never be woken up (except for spurious wakeups).

So a fully proper solution is either:

  1. use an atomic, main thread just checks in in a while loop. No mutex or condvar required.
  2. OR use the lock AND the condition variable when setting the bool, use a predicate when waiting for the bool. It's not necessary for the bool to be atomic at this point.

First off, I think this test had been timing dependent and could theoretically fail
they weren't using predicates so the

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes agreed on the pathological case. I'll fix with (1) as we can remove some code with that option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, thinking about this further, it is ok to keep the bool atomic and the condition variable if the main thread does a check with aws_condition_variable_wait_pred. In this case, the thread executing destroy atomically updates the bool and signals the condition variable (no lock necessary). The main thread locks the mutex and calls aws_condition_variable_wait_pred with a predicate that atomically loads from the bool. The spurious wakeup case is purely due to the use of aws_condition_variable_wait.

@@ -31,7 +32,7 @@ struct task_args {
enum aws_task_status status;
struct aws_mutex mutex;
struct aws_condition_variable condition_variable;
bool thread_complete;
struct aws_atomic_var thread_complete;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be atomic since it is written-by the cleanup thread and read-by the main thread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data-race / ordering violation is:

  • cleanup thread launched and detached
  • main thread exits and frees allocator
  • cleanup thread tries to use allocator to free exit_callback_data

@nchong-at-aws nchong-at-aws requested a review from graebm January 22, 2020 21:44
@justinboswell justinboswell requested a review from a team January 22, 2020 21:53
@@ -1070,7 +1071,7 @@ AWS_TEST_CASE(event_loop_group_setup_and_shutdown, test_event_loop_group_setup_a
/* mark the thread complete when the async shutdown thread is done */
static void s_async_shutdown_thread_exit(void *user_data) {
struct task_args *args = user_data;
args->thread_complete = true;
aws_atomic_store_int(&args->thread_complete, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lock is always acquired in the predicate, and unlocked during wait. why does this need to be atomic? You can just lock in the callback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an ordering violation: we need to ensure the cleanup thread exit happens-before the test finishes. A lock by itself won't do. Do you mean use a condition variable? That would work, but this use of an atomic is simpler, I think.

Comment on lines 284 to 286
ASSERT_SUCCESS(aws_condition_variable_wait(&destroy_condition_variable, &destroy_mutex));
ASSERT_TRUE(aws_atomic_load_int(&destroy_called));
ASSERT_SUCCESS(aws_mutex_unlock(&destroy_mutex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outside the scope of fixing immediate tsan errors, but:

Should we replace all uses of aws_condition_variable_wait() with aws_condition_variable_wait_pred()? I would even vote for completely removing the vanilla wait() call from aws-c-common. These tests are only using vanilla wait() because it required less code to get working than wait_pred(). But vanilla wait() can spuriously wake up and break the assumptions of the test.

From the pthread docs:

When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t necessarily want to remove it, but we should definitely not be using it in tests. This particular test was written prior to us adding the predicate variant.

Copy link
Contributor

@nchong-at-aws nchong-at-aws Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes to replacing these calls with their _pred equivalents in the tests. Also, document this behavior for future users: awslabs/aws-c-common#580

@@ -281,8 +280,10 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) {

/* Release hold 2/2. The handler and channel should be destroyed. */
aws_channel_release_hold(channel);
ASSERT_SUCCESS(aws_mutex_lock(&destroy_mutex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this mutex/condition variable need to exist anymore. We can just loop until destroy_called is true.

Before our 1st pass of tsan fixes, this test was broken because it didn't acquire the lock before changing the bool and signaling the condvar.

We changed the bool to be atomic. We're using the mutex when waiting for the value and not using wait_pred(). But we're not using the mutex when changing the value and signaling the condvar. That means, if we set the bool and signal before the main thread waits for it, the main thread will never be woken up (except for spurious wakeups).

So a fully proper solution is either:

  1. use an atomic, main thread just checks in in a while loop. No mutex or condvar required.
  2. OR use the lock AND the condition variable when setting the bool, use a predicate when waiting for the bool. It's not necessary for the bool to be atomic at this point.

First off, I think this test had been timing dependent and could theoretically fail
they weren't using predicates so the

@nchong-at-aws
Copy link
Contributor

nchong-at-aws commented Jan 23, 2020

Travis failure needs investigating. Looks like channel_refcount_delays_clean_up isn't terminating (update: caused by spurious wakeup discussed above; fixed with condvar wait_pred). Still failing for host_resolver: test_resolver_ttls, test_resolver_connect_failure_recording, test_resolver_ttl_refreshes_on_resolve (could all be same root issue).

This avoids spurious wakeup behavior.

Also, refactor test channel setup into single blocking call.
@nchong-at-aws nchong-at-aws marked this pull request as ready for review January 24, 2020 22:35
@nchong-at-aws nchong-at-aws merged commit 8a75bac into master Jan 24, 2020
@nchong-at-aws nchong-at-aws deleted the fix-more-tsan branch January 5, 2021 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants