
chore: remove DbSlice mutex and add ConditionFlag in SliceSnapshot #4073

Merged
kostasrim merged 43 commits into main from kpr2 on Dec 5, 2024

Conversation

kostasrim
Contributor

  • remove DbSlice mutex
  • add ConditionFlag in SliceSnapshot

Partially solves #4072

@kostasrim kostasrim self-assigned this Nov 6, 2024
};

// Helper class used to guarantee atomicity between serialization of buckets
class ConditionGuard {
Contributor Author

This is code we checked in and removed in previous PRs; I just brought it back.

Contributor Author

Same for all the changes in this header

@kostasrim kostasrim requested a review from adiholden November 6, 2024 12:24
@kostasrim
Contributor Author

Comment on lines 358 to 392
struct ConditionFlag {
util::fb2::CondVarAny cond_var;
bool flag = false;
};

// Helper class used to guarantee atomicity between serialization of buckets
class ConditionGuard {
Contributor

How is this not just a thread-local mutex? 🤔 Topological proof 🤓: you can move the constructor/destructor code into lock() and unlock() of ConditionFlag and use it with lock_guard.
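For readers following along, a minimal standalone sketch of that suggestion, with std:: primitives standing in for util::fb2::CondVarAny; the assumed semantics are that locking waits until the flag is clear and then sets it, while unlocking clears it and wakes the next waiter:

#include <condition_variable>
#include <mutex>

// Give ConditionFlag the BasicLockable interface so std::lock_guard (or the
// fiber-aware equivalent) can replace the hand-rolled ConditionGuard.
struct ConditionFlag {
  void lock() {
    std::unique_lock lk(mu);
    cond_var.wait(lk, [this] { return !flag; });  // wait until no one holds it
    flag = true;                                   // take ownership
  }

  void unlock() {
    {
      std::lock_guard lk(mu);
      flag = false;                                // release ownership
    }
    cond_var.notify_one();                         // wake the next waiter
  }

  std::mutex mu;  // only needed to make this standalone std:: version well defined
  std::condition_variable_any cond_var;
  bool flag = false;
};

// Usage: std::lock_guard<ConditionFlag> guard(bucket_ser_);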

Contributor Author

💯

ConditionFlag* enclosing_;
};

class LocalBlockingCounter {
Contributor

We have EmbeddedBlockingCounter in helio, but its atomicity is not needed here.

Contributor Author

Yep that's why I introduced it
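For context, a minimal standalone sketch of what such a local (non-atomic) blocking counter can look like; the real class uses the fiber-aware util::fb2 primitives, and the names here are illustrative rather than the exact code in this PR:

#include <condition_variable>
#include <cstddef>
#include <mutex>

// Counts in-flight preemptive mutations on a single shard thread. Waiters
// (e.g. a flush path) suspend until the count drops back to zero.
class LocalBlockingCounter {
 public:
  void Increment() {
    std::lock_guard lk(mu_);
    ++count_;
  }

  void Decrement() {
    std::lock_guard lk(mu_);
    if (--count_ == 0)
      cv_.notify_all();  // wake everyone blocked in Wait()
  }

  // Suspends the caller until no mutation is in flight.
  void Wait() {
    std::unique_lock lk(mu_);
    cv_.wait(lk, [this] { return count_ == 0; });
  }

 private:
  // The fiber-local original needs no mutex or atomics; the lock is only here
  // to make this standalone std:: version well defined.
  std::mutex mu_;
  std::condition_variable_any cv_;
  std::size_t count_ = 0;
};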

@@ -243,6 +243,7 @@ void SliceSnapshot::SwitchIncrementalFb(Context* cntx, LSN lsn) {
}

bool SliceSnapshot::BucketSaveCb(PrimeIterator it) {
ConditionGuard guard(&bucket_ser_);
Collaborator

Why here and in OnDbChange, and not inside SerializeBucket?

Contributor Author

You have asked me the exact same question in June 🤣

The answer is that BucketSaveCb calls FlushChangeToEarlierCallbacks, which might preempt, and when the fiber resumes the version check above might no longer hold. The code within BucketSaveCb should be atomic / treated as a critical section.

Collaborator

Do you mean the version check below the lambda?
You could also check the snapshot version inside BucketSaveCb and, if it is bigger, not serialize the bucket.

@@ -106,6 +106,8 @@ class RestoreStreamer : public JournalStreamer {
cluster::SlotSet my_slots_;
bool fiber_cancelled_ = false;
bool snapshot_finished_ = false;

ConditionFlag bucket_ser_;
Collaborator

Why did you add it to the streamer?

Contributor Author

Because it has similar semantics to the snapshot: it traverses and serializes the buckets (see RestoreStreamer::Run).

Collaborator

We don't have any preemption while serializing buckets in RestoreStreamer, as we do not yet support big value serialization in migration.

// TODO: iterate over all namespaces
DbSlice& db_slice = namespaces.GetDefaultNamespace().GetDbSlice(shard_id());
// Skip heartbeat if we are serializing a big value
if (db_slice.HasBlockingCounterMutating()) {
Contributor Author

This condition can be violated if we proceed, preempt and then resume

Collaborator

We should add some logging here so we can tell when heartbeat has not been running for a long time, whether because we are running lots of big value serializations or because we have a bug somewhere causing this skip.

Contributor Author
@kostasrim kostasrim Nov 13, 2024

+100 on this one. I thought the same tbh -- I will add it

@@ -1158,6 +1143,11 @@ DbSlice::PrimeItAndExp DbSlice::ExpireIfNeeded(const Context& cntx, PrimeIterato
const_cast<DbSlice*>(this)->PerformDeletion(Iterator(it, StringOrView::FromView(key)),
ExpIterator(expire_it, StringOrView::FromView(key)),
db.get());
// Replicate expiry
Collaborator

what is the reason you moved this here?

Contributor Author

Good question. The answer is preemption, because heartbeat is not transactional. If we first write to the journal, preempt, and then delete, we risk another transaction coming in between the preemption and the deletion. This is not good.

Collaborator

RecordExpiry cannot preempt; await is set to false.
Does this assumption break now?

Contributor Author
@kostasrim kostasrim Nov 13, 2024

Nope, you are right about this assumption, because serializer_->WriteJournalEntry(item.data) never calls FlushIfNeeded, so it won't flush. I would still rather keep it this way though: we always perform the operation first and then write to the journal, not the other way around.

Collaborator

My concern here is that this assumption is not right anymore :(
Even if await is false, we can wait on the ConditionGuard because some other entry is in the middle of serialization.
But we don't run heartbeat while serialization of a big value is in progress, so I am not sure we really have a problem; it is just not written well.

  1. We need to add a FiberGuard when we call the journal write with await=false, to make sure the assumption is fulfilled (see the sketch after this list).
  2. We need to rewrite the code, as the flow is not clear: we pass await=false, but under some guard we might still wait.
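As an aside, a standalone sketch of what such a "no preemption in this scope" guard could look like; this is an assumption about its shape, not the FiberGuard that already exists in the codebase:

#include <cassert>

// RAII guard that forbids fiber suspension while it is alive. The fiber
// runtime would call AssertSuspendAllowed() from its hook right before
// suspending the current fiber.
class FiberGuard {
 public:
  FiberGuard() { ++forbid_suspend_depth_; }
  ~FiberGuard() { --forbid_suspend_depth_; }

  static void AssertSuspendAllowed() { assert(forbid_suspend_depth_ == 0); }

 private:
  static inline thread_local int forbid_suspend_depth_ = 0;
};

// Usage around the non-preemptive journal write (names from the discussion above):
//   FiberGuard fg;
//   RecordExpiry(...);  // await=false, must not suspend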

Contributor Author
@kostasrim kostasrim Nov 13, 2024

My concern here is that this assumption is not right anymore :(

I think it is, with a small exception which is easy to address (see below -- in fact your comment/discussion is GOLD).

  1. We can only enter Heartbeat if we are not serializing a big value.
  2. Within heartbeat we do two things: a) we expire, b) we evict. Neither of those flows will preempt (we preempt at the end when we flush).

However, for Heartbeat to proceed it relies on LocalBlockingCounter. This is ok for all but one flow. Both DbSlice::CallChangeCallbacks and DbSlice::FlushChangeToEarlierCallbacks are fine, since if they preempt the counter will be incremented and Heartbeat won't run. The only exception is BucketSaveCb, which will

  1. FlushChangeToEarlierCallbacks
  2. Serialize the bucket

The problem is that if (2) preempts because of a big value, the counter might be zero and heartbeat will run normally.

The fix is simple though: we should increment the counter in BucketSaveCb as well, which will provide mutual exclusion with Heartbeat (see the sketch below).

Last but not least, I agree with adding a FiberGuard, although that will merely assert that Heartbeat did not preempt, not that there is no big value serialization happening (luckily, if that were the case the tests would fail because the journal would contain junk).
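A minimal sketch of that fix, building on the LocalBlockingCounter sketch earlier in this thread; the guard, accessor, and call flow below are illustrative, not the exact code that landed:

// Tiny RAII helper: raise the blocking counter for the lifetime of one bucket
// save, so Heartbeat sees a mutation in flight even if serializing a big value
// preempts this fiber between the flush and the bucket serialization.
class ScopedMutationBlock {
 public:
  explicit ScopedMutationBlock(LocalBlockingCounter* c) : counter_(c) { counter_->Increment(); }
  ~ScopedMutationBlock() { counter_->Decrement(); }

 private:
  LocalBlockingCounter* counter_;
};

// Pseudo-flow inside SliceSnapshot::BucketSaveCb (assumed accessor names):
//   ConditionGuard guard(&bucket_ser_);
//   ScopedMutationBlock block(&db_slice_->block_counter());  // Heartbeat now skips
//   FlushChangeToEarlierCallbacks(...);  // may preempt
//   SerializeBucket(...);                // may preempt on big values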

// could lead to UB or assertion failures (while DashTable::Traverse is iterating over
// a logical bucket).
util::fb2::LockGuard lk(local_mu_);
block_counter_.Wait();
Collaborator

Is this true, that we cannot flush while we are in the middle of a big value serialization?
When snapshotting we can run FLUSHDB, since we use an intrusive_ptr there.

Contributor Author

Well, it did mess with Traverse in the past, as it caused the assertion to trigger. Now that we have bucket (not logical bucket) traversal it's probably ok. I would rather keep this as a safeguard though, and have another branch run without this Wait. We can incrementally remove those parts if you are ok with that.

Collaborator

I don't see how this is related to traversing a logical vs. non-logical bucket; we copy the pointer aside while flushing and snapshotting, so when flush runs we just reset the pointer for the next transactions.
Actually, if we do have a bug here I would like to understand it before we merge this PR.

Contributor Author

Now that I am thinking about it, consider the following:

We call IterateBucketsFb, which calls Traverse, which calls BucketSaveCb on the bucket. Now this serializes an entry. Let's assume this entry is a list, so it calls SaveListObject, which iterates over each node in the list. Now let's say that each node is also a big value, so at the end of each iteration we call FlushIfNeeded, which happily preempts. Now we issue FLUSHALL, which cleans up the DbTable. When SaveListObject resumes, the pointers to the list/nodes are invalidated (freed by the FLUSHALL). Isn't that a problem?

Collaborator

They should not be freed by FLUSHALL. What is the difference between preempting from snapshotting in the middle of serializing an entry and preempting at the end of serializing a bucket?
When FLUSHALL runs, the snapshot should continue to create a point-in-time snapshot of the data even though FLUSHALL has finished and subsequent transactions see an empty db.
This is why we use an intrusive_ptr in snapshotting (see the toy example below).
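For illustration only, a minimal standalone example of that reference-counting behaviour with toy types (not the real DbTable):

#include <boost/smart_ptr/intrusive_ptr.hpp>
#include <boost/smart_ptr/intrusive_ref_counter.hpp>
#include <cassert>

// Toy stand-in for DbTable: the snapshot copies the intrusive_ptr aside, so a
// FLUSHALL that drops the slice's pointer only releases one reference; the
// table (and the nodes inside it) stays alive until the snapshot lets go.
struct Table : boost::intrusive_ref_counter<Table> {
  int payload = 42;
};

int main() {
  boost::intrusive_ptr<Table> slice_table(new Table);       // owned by the slice
  boost::intrusive_ptr<Table> snapshot_copy = slice_table;  // StartSnapshotInShard-style copy

  slice_table.reset();                    // "FLUSHALL": later transactions see an empty db
  assert(snapshot_copy->payload == 42);   // snapshot still reads valid memory
  return 0;
}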

Contributor Author

// We use reference counting semantics of DbTable when doing snapshotting.
// There we need to preserve the copy of the table in case someone flushes it during
// the snapshot process. We copy the pointers in StartSnapshotInShard function.
using DbTableArray = std::vector<boost::intrusive_ptr<DbTable>>;

and

   db_array_ = slice->databases();

Now it makes sense.

Contributor Author

Something that pops into my mind:

Within FlushDbIndexes:

auto cb = [indexes, flush_db_arr = std::move(flush_db_arr)]() mutable {
  flush_db_arr.clear();
  ServerState::tlocal()->DecommitMemory(ServerState::kDataHeap | ServerState::kBackingHeap |
                                        ServerState::kGlibcmalloc);
};

fb2::Fiber("flush_dbs", std::move(cb)).Detach();

The fiber will clear flush_db_arr and then call the memory decommit. This won't have any effect if we are snapshotting. Shall we run the memory decommit in the destructor instead?

Collaborator

yes we can

Contributor Author

nice. It's minor but I will patch this at some point.

replica = df_factory.create(
    cluster_mode=cluster_mode,
    cluster_announce_ip=announce_ip,
    announce_port=announce_port,
    disable_serialization_max_chunk_size=0,
)
Collaborator
@adiholden adiholden Nov 13, 2024

why disable in this test?

Contributor Author

dfly_version = "v1.19.2" does not support the flag, so launching the instance will fail

Collaborator

In which version did you introduce this flag? Maybe we can update this test to run that dfly version.

Contributor Author

In which version did you introduce this flag?

I will need to check.

In which version did you introduce this flag? Maybe we can update this test to run that dfly version.

There is no point, because a) the bug was present in that version, and b) we won't run the regression tests with this flag on in the future anyway.

Collaborator

I don't think we should use this flag in the regression tests so that a small percentage of entries in the replication/snapshot tests will hit this flow.
The default for this flag will be much higher though.

Contributor Author

I don't think we should use this flag in the regression tests so that a small percentage of entries in the replication/snapshot tests will hit this flow.

Only for extreme testing. I can disable this, no issue with that.

@@ -124,6 +124,12 @@ def __init__(self, params: DflyParams, args):
        if threads > 1:
            self.args["num_shards"] = threads - 1

        if "disable_serialization_max_chunk_size" not in self.args:
            # Add 1 byte limit for big values
            self.args["serialization_max_chunk_size"] = 1
Collaborator

Does it really make sense to always run the big value serialization flow in regression tests?
It is good for extensive testing at the beginning, but then we don't test the other flow in the code.

Contributor Author

I wanted this to run for a week or so to make sure that we don't see any other new bugs. I will revert this afterwards. It was Roman's idea back in the summer 😄

Contributor Author

The point is, I either keep this PR open and run the regression tests manually every 3 hours, or I introduce this change and wait for a week 😄

@adiholden
Collaborator

It would be best if you add statistics for big value serialization, i.e. expose the number of preemptions during serialization in the stats.
This way we will know this logic is executed in the pytest flows and can track it on running Dragonfly servers.
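For illustration, a rough sketch of such a counter; the names and the place it would be exposed are assumptions, not existing Dragonfly code:

#include <cstdint>

// Hypothetical per-shard counter, bumped whenever big value serialization has
// to preempt the fiber, and later summed into the INFO/metrics output.
struct SerializationStats {
  uint64_t big_value_preemptions = 0;
};

inline SerializationStats& TlSerializationStats() {
  static thread_local SerializationStats stats;
  return stats;
}

// Sketch of the call site inside the serializer's flush path:
//   if (serialized_bytes > serialization_max_chunk_size) {
//     ++TlSerializationStats().big_value_preemptions;  // record before yielding
//     FlushAndPreempt();
//   }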

@kostasrim
Contributor Author

It would be best if you add statistics for big value serialization, i.e. expose the number of preemptions during serialization in the stats. This way we will know this logic is executed in the pytest flows and can track it on running Dragonfly servers.

Generally this PR is about removing the lock from DbSlice, not about adding statistics. I am happy to do this as part of this PR, as it's a straightforward change, but usually I prefer to separate tasks like that.

// Does not check for non supported events. Callers must parse the string and reject it
// if it's not empty and not EX.
void SetNotifyKeyspaceEvents(std::string_view notify_keyspace_events);

bool HasBlockingCounterMutating() const {
Collaborator

What we want to expose with this API is whether we will preempt if we try to do any mutation on the db.
I.e., if the heartbeat function wants to mutate for eviction/expiry and we do not allow preempting there right now, then it should invoke a DbSlice API IsMutatingPreemptive, and if that returns true we will not invoke the heartbeat eviction/expiry step.
Note: we should call this in heartbeat and in PrimeEvictionPolicy::Evict.

I think that, besides renaming the function, the logic will be clearer if we register two callbacks with DbSlice::RegisterOnChange: one is the change callback we currently have, and the other is a new callback that returns true or false depending on whether invoking the change callback would preempt.
This way we only need to iterate through these callbacks to know IsMutatingPreemptive (see the sketch below).
On one hand it is not as simple as just returning block_counter_ as you do now.
But on the other hand I find it more robust and the flow clearer when reading: it is simpler and more obvious from the registration side what this function does, and we will not need to remember to increment the blocking counter from different places in the code.
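A rough, standalone sketch of that two-callback idea; the type and function names are illustrative, not the existing DbSlice API:

#include <functional>
#include <utility>
#include <vector>

struct ChangeReq {};  // placeholder for the real change-request type

// Each subscriber registers the usual change callback plus a predicate that
// answers "would invoking the change callback preempt the fiber right now?".
struct ChangeSubscriber {
  std::function<void(const ChangeReq&)> on_change;
  std::function<bool()> is_preemptive;
};

class ChangeRegistry {
 public:
  void Register(ChangeSubscriber sub) {
    subs_.push_back(std::move(sub));
  }

  // Heartbeat / PrimeEvictionPolicy::Evict would consult this before mutating.
  bool IsMutatingPreemptive() const {
    for (const auto& s : subs_) {
      if (s.is_preemptive && s.is_preemptive())
        return true;
    }
    return false;
  }

 private:
  std::vector<ChangeSubscriber> subs_;
};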

Contributor Author

As discussed internally, we can use this but for now I am aiming for correctness. Change should be simple though, we will discuss this offline.

@kostasrim kostasrim requested a review from adiholden December 4, 2024 15:03
@@ -1098,6 +1090,7 @@ DbSlice::PrimeItAndExp DbSlice::ExpireIfNeeded(const Context& cntx, PrimeIterato
LOG(ERROR) << "Invalid call to ExpireIfNeeded";
return {it, ExpireIterator{}};
}
block_counter_.Wait();
Collaborator

I think we discussed that this should be removed

Collaborator

?

    status = StopFullSyncInThread(flow, &replica_ptr->cntx, shard);
    if (*status != OpStatus::OK) {
      return;
    }
    StartStableSyncInThread(flow, &replica_ptr->cntx, shard);
  };
  shard_set->RunBlockingInParallel(std::move(cb));

  if (*status != OpStatus::OK)
Collaborator

Why did you remove these lines?

if (db_slice.WillBlockOnJournalWrite()) {
  const auto elapsed = std::chrono::system_clock::now() - start;
  if (elapsed > std::chrono::seconds(1)) {
    LOG(WARNING) << "Stalled heartbeat() fiber for " << elapsed.count()
Collaborator

You should use LOG_EVERY_T here; if the heartbeat is stalled due to big value writes, we will be flooded with these prints.
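For illustration, the lines above would become roughly the following, assuming glog's LOG_EVERY_T macro is available through the project's logging header (the 10-second interval is an arbitrary choice):

#include <chrono>

if (db_slice.WillBlockOnJournalWrite()) {
  const auto elapsed = std::chrono::system_clock::now() - start;
  if (elapsed > std::chrono::seconds(1)) {
    // Emit at most one warning every 10 seconds instead of one per heartbeat
    // tick, so long stretches of big value serialization do not flood the log.
    LOG_EVERY_T(WARNING, 10) << "Stalled heartbeat() fiber for "
                             << std::chrono::duration_cast<std::chrono::seconds>(elapsed).count()
                             << " seconds";
  }
}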

@@ -597,6 +597,10 @@ void OpScan(const OpArgs& op_args, const ScanOpts& scan_opts, uint64_t* cursor,
auto& db_slice = op_args.GetDbSlice();
DCHECK(db_slice.IsDbValid(op_args.db_cntx.db_index));

// We need to make sure we don't preempt below, because we don't hold any locks to the keys
Collaborator

I don't think it's related to holding locks on the keys.
We do not allow preemption in the traverse callback here, because another fiber could edit the bucket while we yield.
Please write:
Because ScanCb can preempt due to journaling expired entries, we need to make sure that we enter the callback at a time when journaling will not cause a preemption.

Contributor Author
@kostasrim kostasrim Dec 5, 2024

It's both. Let's ignore bucket splits when we preempt. Now let's say the bucket has 5 slots.
We iterate over the first and second, and on the third we call ExpireIfNeeded from ScanCb. The issue is that we preempt when we write to the journal, but we have not yet deleted the entry from the dash table. If we preempt on that first step, we could get a transaction on the same key that simply does a blind update. What's the problem there? That once we resume, we will expire the updated value. That's a bug, and it's a side effect of not holding the locks on the relevant key -- you just broke the isolation level of the datastore.

  • ScanCb preempts on RecordExpiry -> a transaction updates the same slot/key to a new value -> ScanCb resumes and deletes/expires the key.

Why, for example, is it ok to call ExpireIfNeeded and preempt in Find? Because that path is transactional and we hold the locks, so even if we preempt we don't risk the corner case described above.

I will leave a big comment here.

Contributor Author

Hmm, I think I might be wrong: if we preempt while writing the journal change, a transaction that tries to modify the same key will block in PreUpdate(), since it will try to call the registered callback, which locks the same mutex and blocks. In other words, the write to the journal within ScanCb will synchronize with PreUpdate().

@kostasrim kostasrim merged commit 267d5ab into main Dec 5, 2024
14 checks passed
@kostasrim kostasrim deleted the kpr2 branch December 5, 2024 11:24