fix: Fix test_network_disconnect_during_migration test #4224

Merged
merged 6 commits into from
Dec 2, 2024

Conversation

Collaborator

@chakaz chakaz commented Nov 28, 2024

There are actually a few failures fixed in this PR, only one of which is a test bug:

  • db_slice_->Traverse() can yield, so the value of fiber_cancelled_ may change while the traversal is running
  • When a migration is cancelled, WaitForInflightToComplete() may never return, because in_flight_bytes_ counts bytes that will never reach the destination once the migration is cancelled
  • IterateMap() with numeric keys/values overwrote the key's buffer with the value's buffer

Fixes #4207
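
The third bullet is easiest to see in isolation. The sketch below is not the actual IterateMap() code; it assumes a hypothetical FormatInt helper and a single scratch buffer, purely to illustrate how formatting a numeric value into the same buffer as the key clobbers the key.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper (not from the PR): formats v into buf and returns a view into buf.
inline std::string_view FormatInt(std::int64_t v, char* buf, std::size_t len) {
  auto res = std::to_chars(buf, buf + len, v);
  return std::string_view(buf, static_cast<std::size_t>(res.ptr - buf));
}

// Buggy pattern: key and value are formatted into the same scratch buffer,
// so the key view ends up pointing at the value's digits.
inline std::pair<std::string, std::string> FormatPairShared(std::int64_t key,
                                                            std::int64_t value) {
  char scratch[32];
  std::string_view k = FormatInt(key, scratch, sizeof(scratch));
  std::string_view v = FormatInt(value, scratch, sizeof(scratch));  // clobbers k's bytes
  return {std::string(k), std::string(v)};
}

// Fixed pattern: each view is backed by its own buffer.
inline std::pair<std::string, std::string> FormatPairSeparate(std::int64_t key,
                                                              std::int64_t value) {
  char key_buf[32];
  char value_buf[32];
  std::string_view k = FormatInt(key, key_buf, sizeof(key_buf));
  std::string_view v = FormatInt(value, value_buf, sizeof(value_buf));
  return {std::string(k), std::string(v)};
}

// Small self-check that exercises both variants.
inline void CheckFormatPair() {
  assert(FormatPairShared(12, 34).first == "34");    // key was clobbered
  assert(FormatPairSeparate(12, 34).first == "12");  // key survives
}
```

Giving the key and the value separate buffers is the simplest fix; copying the key out before formatting the value would work equally well.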

@chakaz chakaz requested a review from kostasrim November 28, 2024 20:35
@kostasrim
Contributor

🕺 🚀

kostasrim
kostasrim previously approved these changes Nov 28, 2024
Contributor

@kostasrim kostasrim left a comment

LGTM

@@ -79,7 +81,9 @@ void JournalStreamer::Cancel() {
   VLOG(1) << "JournalStreamer::Cancel";
   waker_.notifyAll();
   journal_->UnregisterOnChange(journal_cb_id_);
-  WaitForInflightToComplete();
+  if (!cntx_->IsCancelled()) {
+    WaitForInflightToComplete();
Contributor

I was about to suggest moving the cntx_->IsCancelled() check into WaitForInflightToComplete(), but then I realized it's only called from this one place, so it's not really needed.

@@ -41,7 +41,9 @@ JournalStreamer::JournalStreamer(journal::Journal* journal, Context* cntx)
 }

 JournalStreamer::~JournalStreamer() {
-  DCHECK_EQ(in_flight_bytes_, 0u);
+  if (!cntx_->IsCancelled()) {
+    DCHECK_EQ(in_flight_bytes_, 0u);
Contributor

How did we not trigger this before? Or did we just deadlock because WaitForInflightToComplete() would never make progress?

Collaborator Author

idk why we didn't trigger this before, but indeed this deadlocks
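
To make the deadlock discussed in the two threads above concrete, here is a minimal single-threaded sketch. It is not the JournalStreamer implementation: ContextSketch, StreamerSketch, the poll_ack callback, and the bounded polling loop are all assumptions standing in for the real fiber-based wait. A return value of false models a wait that would never finish.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-in for the cancellation context.
struct ContextSketch {
  bool cancelled = false;
  bool IsCancelled() const { return cancelled; }
};

class StreamerSketch {
 public:
  // poll_ack returns how many bytes the peer acknowledged since the last poll.
  StreamerSketch(const ContextSketch* cntx, std::function<std::size_t()> poll_ack)
      : cntx_(cntx), poll_ack_(std::move(poll_ack)) {}

  void Write(std::size_t n) { in_flight_bytes_ += n; }
  std::size_t in_flight_bytes() const { return in_flight_bytes_; }

  // Bounded stand-in for the blocking wait. Returning false models a wait
  // that never completes because no more acknowledgements will arrive.
  bool WaitForInflightToComplete(int max_polls = 100) {
    for (int i = 0; i < max_polls && in_flight_bytes_ > 0; ++i) {
      in_flight_bytes_ -= std::min(poll_ack_(), in_flight_bytes_);
    }
    return in_flight_bytes_ == 0;
  }

  // Guarded cancel: only wait for in-flight bytes when the context is still live.
  bool Cancel() {
    if (!cntx_->IsCancelled()) return WaitForInflightToComplete();
    return true;  // cancelled: the bytes will never arrive, so do not wait
  }

 private:
  const ContextSketch* cntx_;
  std::function<std::size_t()> poll_ack_;
  std::size_t in_flight_bytes_ = 0;
};

// Small self-check covering the live and the cancelled case.
inline void CheckCancel() {
  ContextSketch live;
  StreamerSketch ok(&live, [] { return std::size_t{4}; });
  ok.Write(8);
  assert(ok.Cancel() && ok.in_flight_bytes() == 0);

  ContextSketch dead{true};
  StreamerSketch stuck(&dead, [] { return std::size_t{0}; });
  stuck.Write(8);
  assert(!stuck.WaitForInflightToComplete());  // unguarded wait never drains
  assert(stuck.Cancel());                      // guarded cancel returns at once
}
```

In the cancelled case the in-flight counter stays non-zero after Cancel(), which is why an unconditional zero check in a destructor would also need the same guard.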

@@ -112,6 +112,8 @@ MultiCommandSquasher::SquashResult MultiCommandSquasher::TrySquash(StoredCmd* cm

   cmd->Fill(&tmp_keylist_);
   auto args = absl::MakeSpan(tmp_keylist_);
+  if (args.size() == 0)
Contributor

Curious: doesn't DetermineKeys below already handle this case?

Also, some general small nits (I don't mind whether you apply them or not 😄):

  1. span has an empty() method
  2. We could also use NumArgs() and avoid the two calls above.

Collaborator Author

Switched to use empty(), thanks!
Re NumArgs(): that's a property of the command, not of the args

@@ -215,6 +219,9 @@ void RestoreStreamer::Run() {
       return;

     cursor = db_slice_->Traverse(pt, cursor, [&](PrimeTable::bucket_iterator it) {
+      if (fiber_cancelled_)  // Could be cancelled any time as Traverse may preempt
Collaborator

How can Traverse preempt if we don't have the big value serialization merged yet?

Contributor

in the callback

Collaborator

which callback can preempt?

Contributor

By the way, I think we can use snapshot_version_ instead of fiber_cancelled_, because we always update them together

Contributor

which callback can preempt?

db_slice_->FlushChangeToEarlierCallbacks(0 /*db_id always 0 for cluster*/,
                                         DbSlice::Iterator::FromPrime(it), snapshot_version_);

Contributor

Also, WriteBucket(it) can yield

Collaborator

So I think the fiber_cancelled_ check should also be repeated after the call to FlushChangeToEarlierCallbacks
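
The point of this thread, that a cancellation flag has to be re-read after any call that may yield, can be sketched as follows. This is an illustration under assumptions, not the RestoreStreamer code: TraversalSketch and the may_yield callback are hypothetical, with may_yield standing in for a call such as FlushChangeToEarlierCallbacks during which another fiber may run.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical model of a per-bucket traversal callback.
struct TraversalSketch {
  bool cancelled = false;    // plays the role of a fiber-cancelled flag
  std::vector<int> written;  // buckets that were actually serialized

  void VisitBucket(int bucket, const std::function<void()>& may_yield) {
    if (cancelled) return;  // check on entry
    may_yield();            // another fiber may set `cancelled` while we are suspended
    if (cancelled) return;  // re-check before doing any further work
    written.push_back(bucket);
  }
};

// Small self-check: cancellation arrives during the second bucket's yield point.
inline void CheckTraversal() {
  TraversalSketch t;
  t.VisitBucket(1, [] {});
  t.VisitBucket(2, [&t] { t.cancelled = true; });
  t.VisitBucket(3, [] {});
  assert(t.written.size() == 1 && t.written[0] == 1);
}
```

Without the second check, bucket 2 would still be written even though cancellation was observed before the write started.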

@adiholden
Collaborator

There are actually a few failures fixed in this PR, only one of which is a test bug.

Fixes #4207

I suggest listing all the bugs that were fixed in the description

@BorysTheDev
Contributor

BorysTheDev commented Dec 2, 2024

@chakaz I think we also need to update the following methods:
void OutgoingMigration::SliceSlotMigration::Cancel() {
  cntx_.Cancel();
  streamer_.Cancel();
}

OutgoingMigration::SliceSlotMigration::~SliceSlotMigration() {
  Cancel();
  cntx_.JoinErrorHandler();
}

@chakaz
Collaborator Author

chakaz commented Dec 2, 2024

Test now passes, and I think I responded to / applied all comments. PTAL :)

@BorysTheDev
Contributor

@chakaz Please run the tests a few more times, because they sometimes pass even with the bug present

BorysTheDev
BorysTheDev previously approved these changes Dec 2, 2024
@chakaz chakaz merged commit 779bba7 into main Dec 2, 2024
9 checks passed
@chakaz chakaz deleted the chakaz/network_disconnect branch December 2, 2024 13:55
Successfully merging this pull request may close these issues.

test_network_disconnect_during_migration with big_value_serialization and compression off
4 participants