Fix more tsan bugs #229

justinboswell · 2020-01-22T21:37:49Z

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nchong-at-aws · 2020-01-22T21:40:28Z

tests/channel_test.c

@@ -239,15 +239,16 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) {
    struct aws_channel_creation_callbacks callbacks = {
        .on_setup_completed = s_channel_setup_test_on_setup_completed,
        .setup_user_data = &test_args,
-        .on_shutdown_completed = s_channel_setup_test_on_setup_completed,
-        .shutdown_user_data = &test_args,
+        .on_shutdown_completed = NULL,


Looks like a copy-paste error: on_shutdown_completed was pointing at the setup-completed callback. I assumed this test doesn't need to wait for shutdown to complete, so I made them NULL.

nchong-at-aws · 2020-01-22T21:41:45Z

tests/channel_test.c

    ASSERT_SUCCESS(aws_condition_variable_wait(&test_args.condition_variable, &test_args.mutex));
+    ASSERT_SUCCESS(aws_mutex_unlock(&test_args.mutex));


Should unlock after condition wait.

nchong-at-aws · 2020-01-22T21:42:26Z

tests/channel_test.c

@@ -281,8 +280,10 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) {

    /* Release hold 2/2. The handler and channel should be destroyed. */
    aws_channel_release_hold(channel);
+    ASSERT_SUCCESS(aws_mutex_lock(&destroy_mutex));


Reduce size of the critical section by moving this lock down.

I don't think this mutex/condition variable needs to exist anymore. We can just loop until destroy_called is true.

Before our 1st pass of tsan fixes, this test was broken because it didn't acquire the lock before changing the bool and signaling the condvar.

We changed the bool to be atomic. We're using the mutex when waiting for the value and not using wait_pred(). But we're not using the mutex when changing the value and signaling the condvar. That means, if we set the bool and signal before the main thread waits for it, the main thread will never be woken up (except for spurious wakeups).

So a fully proper solution is either:

1. Use an atomic; the main thread just checks it in a while loop. No mutex or condvar required.

2. OR use the lock AND the condition variable when setting the bool, and use a predicate when waiting for the bool. The bool no longer needs to be atomic in that case.
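A minimal sketch of the first option above, assuming POSIX threads and C11 atomics rather than the aws-c-common wrappers used in the test. The names `s_destroy_called`, `s_destroy_thread`, and `wait_for_destroy_by_polling` are hypothetical stand-ins, not code from this PR.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the test's destroy_called flag. */
static atomic_bool s_destroy_called;

/* Hypothetical stand-in for the handler's destroy callback, run on another thread. */
static void *s_destroy_thread(void *arg) {
    (void)arg;
    atomic_store(&s_destroy_called, true);
    return NULL;
}

/* Polls the atomic flag until it is set; no mutex or condvar is involved. */
static bool wait_for_destroy_by_polling(void) {
    pthread_t thread;

    atomic_store(&s_destroy_called, false);
    if (pthread_create(&thread, NULL, s_destroy_thread, NULL) != 0) {
        return false;
    }

    /* It does not matter whether the store happens before or after this loop
     * starts: the flag is sticky, so there is no wakeup to lose. */
    while (!atomic_load(&s_destroy_called)) {
        sched_yield();
    }

    pthread_join(thread, NULL);
    return atomic_load(&s_destroy_called);
}
```

The trade-off is that the waiting thread burns CPU while it polls, which is usually acceptable in a short-lived test.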

First off, I think this test had been timing dependent and could theoretically fail
they weren't using predicates so the

I see. Yes agreed on the pathological case. I'll fix with (1) as we can remove some code with that option.

Actually, thinking about this further, it is ok to keep the bool atomic and the condition variable if the main thread does a check with aws_condition_variable_wait_pred. In this case, the thread executing destroy atomically updates the bool and signals the condition variable (no lock necessary). The main thread locks the mutex and calls aws_condition_variable_wait_pred with a predicate that atomically loads from the bool. The spurious wakeup case is purely due to the use of aws_condition_variable_wait.
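A rough sketch of waiting with a predicate, assuming POSIX threads and C11 atomics. `wait_pred` here is a hypothetical helper that only mirrors the shape of a predicate-based wait; it is not the aws-c-common implementation, and all other names are illustrative. The sketch conservatively holds the mutex while publishing the flag, which closes the window between the waiter checking the predicate and blocking on the condition variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef bool(predicate_fn)(void *ctx);

struct destroy_state {
    pthread_mutex_t mutex;
    pthread_cond_t condvar;
    atomic_bool destroy_called;
};

/* Predicate: atomically loads the flag. */
static bool s_destroy_called_pred(void *ctx) {
    struct destroy_state *state = ctx;
    return atomic_load(&state->destroy_called);
}

/* Caller must hold the mutex. Re-checks the predicate after every wakeup,
 * so spurious wakeups and signals sent before the wait are both harmless. */
static int wait_pred(pthread_cond_t *condvar, pthread_mutex_t *mutex, predicate_fn *pred, void *ctx) {
    int err = 0;
    while (err == 0 && !pred(ctx)) {
        err = pthread_cond_wait(condvar, mutex);
    }
    return err;
}

/* Hypothetical destroy callback running on another thread. */
static void *s_destroy_thread(void *arg) {
    struct destroy_state *state = arg;
    pthread_mutex_lock(&state->mutex);
    atomic_store(&state->destroy_called, true);
    pthread_mutex_unlock(&state->mutex);
    pthread_cond_signal(&state->condvar);
    return NULL;
}

static bool run_predicate_wait_demo(void) {
    struct destroy_state state;
    pthread_t thread;
    bool ok = false;

    pthread_mutex_init(&state.mutex, NULL);
    pthread_cond_init(&state.condvar, NULL);
    atomic_init(&state.destroy_called, false);

    if (pthread_create(&thread, NULL, s_destroy_thread, &state) == 0) {
        pthread_mutex_lock(&state.mutex);
        int err = wait_pred(&state.condvar, &state.mutex, s_destroy_called_pred, &state);
        /* The wait returns with the mutex held, so unlock afterwards. */
        pthread_mutex_unlock(&state.mutex);
        pthread_join(thread, NULL);
        ok = (err == 0) && atomic_load(&state.destroy_called);
    }

    pthread_cond_destroy(&state.condvar);
    pthread_mutex_destroy(&state.mutex);
    return ok;
}
```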

nchong-at-aws · 2020-01-22T21:43:04Z

tests/event_loop_test.c

@@ -31,7 +32,7 @@ struct task_args {
    enum aws_task_status status;
    struct aws_mutex mutex;
    struct aws_condition_variable condition_variable;
-    bool thread_complete;
+    struct aws_atomic_var thread_complete;


Should be atomic since it is written by the cleanup thread and read by the main thread.

The data race / ordering violation is:

1. The cleanup thread is launched and detached.

2. The main thread exits and frees the allocator.

3. The cleanup thread tries to use the allocator to free exit_callback_data.
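One way to rule out that ordering can be sketched as follows, assuming POSIX threads and C11 atomics in place of the aws-c-common types. `counting_allocator`, `cleanup_args`, and `run_cleanup_ordering_demo` are hypothetical illustrations: the cleanup thread publishes completion with a release store only after its last use of the allocator, and the main thread observes it with an acquire load before tearing anything down.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator: only tracks live allocations. */
struct counting_allocator {
    atomic_int live_allocations;
};

struct cleanup_args {
    struct counting_allocator *allocator;
    void *exit_callback_data;
    atomic_bool thread_complete;
};

static void *s_cleanup_thread(void *arg) {
    struct cleanup_args *args = arg;

    /* Last use of the allocator by this thread. */
    free(args->exit_callback_data);
    atomic_fetch_sub(&args->allocator->live_allocations, 1);

    /* Release store: everything above happens-before a matching acquire load.
     * The thread must not touch args after this point. */
    atomic_store_explicit(&args->thread_complete, true, memory_order_release);
    return NULL;
}

/* Returns the number of live allocations seen at teardown time (0 is expected),
 * or -1 if setup failed. */
static int run_cleanup_ordering_demo(void) {
    struct counting_allocator allocator;
    struct cleanup_args args;
    pthread_attr_t attr;
    pthread_t thread;

    atomic_init(&allocator.live_allocations, 0);
    args.allocator = &allocator;
    args.exit_callback_data = malloc(16);
    if (args.exit_callback_data == NULL) {
        return -1;
    }
    atomic_fetch_add(&allocator.live_allocations, 1);
    atomic_init(&args.thread_complete, false);

    /* Launch the cleanup thread detached, as in the scenario described above. */
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int err = pthread_create(&thread, &attr, s_cleanup_thread, &args);
    pthread_attr_destroy(&attr);
    if (err != 0) {
        free(args.exit_callback_data);
        return -1;
    }

    /* Do not tear down the allocator until the cleanup thread is done with it. */
    while (!atomic_load_explicit(&args.thread_complete, memory_order_acquire)) {
        sched_yield();
    }

    return atomic_load(&allocator.live_allocations);
}
```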

JonathanHenson · 2020-01-22T22:22:48Z

tests/event_loop_test.c

@@ -1070,7 +1071,7 @@ AWS_TEST_CASE(event_loop_group_setup_and_shutdown, test_event_loop_group_setup_a
 /* mark the thread complete when the async shutdown thread is done */
 static void s_async_shutdown_thread_exit(void *user_data) {
    struct task_args *args = user_data;
-    args->thread_complete = true;
+    aws_atomic_store_int(&args->thread_complete, true);


The lock is always acquired in the predicate, and unlocked during the wait. Why does this need to be atomic? You can just lock in the callback.

It's an ordering violation: we need to ensure the cleanup thread exit happens-before the test finishes. A lock by itself won't do. Do you mean use a condition variable? That would work, but this use of an atomic is simpler, I think.

graebm · 2020-01-22T22:18:08Z

tests/channel_test.c

    ASSERT_SUCCESS(aws_condition_variable_wait(&destroy_condition_variable, &destroy_mutex));
    ASSERT_TRUE(aws_atomic_load_int(&destroy_called));
+    ASSERT_SUCCESS(aws_mutex_unlock(&destroy_mutex));


outside the scope of fixing immediate tsan errors, but:

Should we replace all uses of aws_condition_variable_wait() with aws_condition_variable_wait_pred()? I would even vote for completely removing the vanilla wait() call from aws-c-common. These tests are only using vanilla wait() because it required less code to get working than wait_pred(). But vanilla wait() can spuriously wake up and break the assumptions of the test.

From the pthread docs:

When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.

I don’t necessarily want to remove it, but we should definitely not be using it in tests. This particular test was written prior to us adding the predicate variant.

Yes to replacing these calls with their _pred equivalents in the tests. Also, document this behavior for future users: awslabs/aws-c-common#580

nchong-at-aws · 2020-01-23T16:11:36Z

Travis failure needs investigating. ~~Looks like channel_refcount_delays_clean_up isn't terminating~~ (update: caused by spurious wakeup discussed above; fixed with condvar wait_pred). Still failing for host_resolver: test_resolver_ttls, test_resolver_connect_failure_recording, test_resolver_ttl_refreshes_on_resolve (could all be same root issue).


nchong-at-aws added 2 commits January 22, 2020 16:34

Fix more tsan bugs

40d180f

Fix clang-format

03cf5b6

nchong-at-aws reviewed Jan 22, 2020


nchong-at-aws requested a review from graebm January 22, 2020 21:44

justinboswell requested a review from a team January 22, 2020 21:53

JonathanHenson reviewed Jan 22, 2020


graebm reviewed Jan 22, 2020


Channel tests use condvar pred everywhere

dd7b331

This avoids spurious wakeup behavior. Also, refactor test channel setup into single blocking call.

graebm approved these changes Jan 24, 2020


nchong-at-aws marked this pull request as ready for review January 24, 2020 22:35

nchong-at-aws merged commit 8a75bac into master Jan 24, 2020

nchong-at-aws deleted the fix-more-tsan branch January 5, 2021 15:29
