-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix more tsan bugs #229
Fix more tsan bugs #229
Conversation
@@ -239,15 +239,16 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) { | |||
struct aws_channel_creation_callbacks callbacks = { | |||
.on_setup_completed = s_channel_setup_test_on_setup_completed, | |||
.setup_user_data = &test_args, | |||
.on_shutdown_completed = s_channel_setup_test_on_setup_completed, | |||
.shutdown_user_data = &test_args, | |||
.on_shutdown_completed = NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a copy-paste error. I assumed waiting on shutdown being complete isn't needed in this test so made them NULL.
tests/channel_test.c
Outdated
ASSERT_SUCCESS(aws_condition_variable_wait(&test_args.condition_variable, &test_args.mutex)); | ||
ASSERT_SUCCESS(aws_mutex_unlock(&test_args.mutex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should unlock after condition wait.
@@ -281,8 +280,10 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) { | |||
|
|||
/* Release hold 2/2. The handler and channel should be destroyed. */ | |||
aws_channel_release_hold(channel); | |||
ASSERT_SUCCESS(aws_mutex_lock(&destroy_mutex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduce size of the critical section by moving this lock down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this mutex/condition variable need to exist anymore. We can just loop until destroy_called is true.
Before our 1st pass of tsan fixes, this test was broken because it didn't acquire the lock before changing the bool and signaling the condvar.
We changed the bool to be atomic. We're using the mutex when waiting for the value and not using wait_pred(). But we're not using the mutex when changing the value and signaling the condvar. That means, if we set the bool and signal before the main thread waits for it, the main thread will never be woken up (except for spurious wakeups).
So a fully proper solution is either:
- use an atomic, main thread just checks in in a while loop. No mutex or condvar required.
- OR use the lock AND the condition variable when setting the bool, use a predicate when waiting for the bool. It's not necessary for the bool to be atomic at this point.
First off, I think this test had been timing dependent and could theoretically fail
they weren't using predicates so the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yes agreed on the pathological case. I'll fix with (1) as we can remove some code with that option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, thinking about this further, it is ok to keep the bool atomic and the condition variable if the main thread does a check with aws_condition_variable_wait_pred
. In this case, the thread executing destroy atomically updates the bool and signals the condition variable (no lock necessary). The main thread locks the mutex and calls aws_condition_variable_wait_pred
with a predicate that atomically loads from the bool. The spurious wakeup case is purely due to the use of aws_condition_variable_wait
.
@@ -31,7 +32,7 @@ struct task_args { | |||
enum aws_task_status status; | |||
struct aws_mutex mutex; | |||
struct aws_condition_variable condition_variable; | |||
bool thread_complete; | |||
struct aws_atomic_var thread_complete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be atomic since it is written-by the cleanup thread and read-by the main thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data-race / ordering violation is:
- cleanup thread launched and detached
- main thread exits and frees allocator
- cleanup thread tries to use allocator to free exit_callback_data
@@ -1070,7 +1071,7 @@ AWS_TEST_CASE(event_loop_group_setup_and_shutdown, test_event_loop_group_setup_a | |||
/* mark the thread complete when the async shutdown thread is done */ | |||
static void s_async_shutdown_thread_exit(void *user_data) { | |||
struct task_args *args = user_data; | |||
args->thread_complete = true; | |||
aws_atomic_store_int(&args->thread_complete, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lock is always acquired in the predicate, and unlocked during wait. why does this need to be atomic? You can just lock in the callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an ordering violation: we need to ensure the cleanup thread exit happens-before the test finishes. A lock by itself won't do. Do you mean use a condition variable? That would work, but this use of an atomic is simpler, I think.
tests/channel_test.c
Outdated
ASSERT_SUCCESS(aws_condition_variable_wait(&destroy_condition_variable, &destroy_mutex)); | ||
ASSERT_TRUE(aws_atomic_load_int(&destroy_called)); | ||
ASSERT_SUCCESS(aws_mutex_unlock(&destroy_mutex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outside the scope of fixing immediate tsan errors, but:
Should we replace all uses of aws_condition_variable_wait()
with aws_condition_variable_wait_pred()
? I would even vote for completely removing the vanilla wait() call from aws-c-common. These tests are only using vanilla wait() because it required less code to get working than wait_pred(). But vanilla wait() can spuriously wake up and break the assumptions of the test.
From the pthread docs:
When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t necessarily want to remove it, but we should definitely not be using it in tests. This particular test was written prior to us adding the predicate variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes to replacing these calls with their _pred
equivalents in the tests. Also, document this behavior for future users: awslabs/aws-c-common#580
@@ -281,8 +280,10 @@ static int s_test_channel_refcount(struct aws_allocator *allocator, void *ctx) { | |||
|
|||
/* Release hold 2/2. The handler and channel should be destroyed. */ | |||
aws_channel_release_hold(channel); | |||
ASSERT_SUCCESS(aws_mutex_lock(&destroy_mutex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this mutex/condition variable need to exist anymore. We can just loop until destroy_called is true.
Before our 1st pass of tsan fixes, this test was broken because it didn't acquire the lock before changing the bool and signaling the condvar.
We changed the bool to be atomic. We're using the mutex when waiting for the value and not using wait_pred(). But we're not using the mutex when changing the value and signaling the condvar. That means, if we set the bool and signal before the main thread waits for it, the main thread will never be woken up (except for spurious wakeups).
So a fully proper solution is either:
- use an atomic, main thread just checks in in a while loop. No mutex or condvar required.
- OR use the lock AND the condition variable when setting the bool, use a predicate when waiting for the bool. It's not necessary for the bool to be atomic at this point.
First off, I think this test had been timing dependent and could theoretically fail
they weren't using predicates so the
Travis failure needs investigating. |
This avoids spurious wakeup behavior. Also, refactor test channel setup into single blocking call.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.