
Parallel ledger close #4543

Open

marta-lokhova wants to merge 12 commits into master from parallelLedgerClose
Conversation

marta-lokhova (Contributor) commented Nov 14, 2024

Resolves #4317
Concludes #4128

The implementation of this proposal requires massive changes to the stellar-core codebase and touches almost every subsystem. There are some paradigm shifts in how the program executes, which I discuss below for posterity. The same ideas are reflected in code comments as well, as they will be important for code maintenance and extensibility.

Database access

Currently, only the Postgres DB backend is supported, as it required minimal changes to how DB queries are structured (Postgres provides a fairly nice concurrency model).

SQLite's concurrency support is a lot more rudimentary: only a single writer is allowed, and the whole database is locked during writing. This necessitates further changes in core (such as splitting the database into two). Given that most network infrastructure is on Postgres right now, SQLite support can be added later.

Reduced responsibilities of SQL

SQL tables have been trimmed as much as possible to avoid conflicts: essentially, we only store persistent state such as the latest LCL and SCP history, as well as the legacy OFFER table.

Asynchronous externalize flow

There are three important subsystems in core that are in charge of tracking consensus, externalizing and applying ledgers, and advancing the state machine to catchup or synced state:

  • Herder: receives SCP messages, forwards them to SCP, decides if a ledger is externalized, and triggers voting for the next ledger.
  • LedgerManager: implements closing of a ledger, sets catchup vs synced state, and advances and persists the last closed ledger.
  • CatchupManager: keeps track of any externalized ledgers that are not LCL+1. That is, it keeps track of future externalized ledgers, attempts to apply them to keep core in sync, and triggers catchup if needed.

Prior to this change, there were two different externalize flows:

  • If core received LCL+1, it would immediately apply it, meaning the whole flow externalize → closeLedger → set “synced” state happened in one synchronous function. After application, core would trigger the next ledger, usually asynchronously, as it needs to wait to meet the 5s ledger requirement.
  • If core received ledger LCL+2..LCL+N, it would asynchronously buffer it, and continue buffering new ledgers. If core couldn’t close the gap and apply everything sequentially, it would go into the catchup flow.

With the new changes, triggering the ledger close flow moved entirely to CatchupManager. Essentially, CatchupManager::processLedger became the centralized place to decide whether to apply a ledger or trigger catchup. Because ledger close happens in the background, the transition between externalize and “closeLedger → set synced” becomes asynchronous.
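
A minimal sketch of that decision flow in C++ (member and helper names are simplified and partly hypothetical; the real CatchupManager::processLedger carries more bookkeeping):

    // Sketch only: centralized decision point for an externalized ledger.
    // Returns true if the ledger was buffered/scheduled while in sync,
    // false if the node needs to go into catchup.
    bool
    CatchupManagerImpl::processLedger(LedgerCloseData const& ledgerData)
    {
        // Remember the externalized ledger; ledgers may arrive out of
        // order, so this buffer may contain gaps.
        mSyncingLedgers.emplace(ledgerData.getLedgerSeq(), ledgerData);

        if (inSync()) // hypothetical helper
        {
            // Pop any strictly sequential ledgers off the buffer and post
            // them to the ledger-close thread for background application.
            tryApplySyncingLedgers();
            return true;
        }
        // Out of sync: the caller (LM::valueExternalized) transitions the
        // node into catchup.
        return false;
    }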

Concurrent ledger close

Below is the list of core items that moved to the background, each followed by an explanation of why it is safe to do so:

Emitting meta

Ledger application is the only process that touches the meta pipe, so there are no conflicts with other subsystems.

Writing checkpoint files

Only the background thread writes in-progress checkpoint files. The main thread deals exclusively with “complete” checkpoints, which, once complete, must not be touched by any subsystem except publishing.

Updating ledger state

The rest of the system operates strictly on read-only BucketList snapshots, and is unaffected by changing state. Note: there are still some calls to LedgerTxn in the codebase, but those only appear on startup during setup (when the node is not operational) or in offline commands.
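
As an illustration of the read-only pattern (the snapshot interface shown here is hypothetical, meant only to convey the idea of readers never touching live state):

    // Sketch only: readers work off an immutable BucketList snapshot and
    // never observe a half-applied ledger, even while the close thread is
    // mutating the live BucketList in the background.
    void
    queryEntryOnMainThread(Application& app, LedgerKey const& key)
    {
        // Hypothetical snapshot handle: stays internally consistent even
        // if the background thread publishes a newer snapshot meanwhile.
        auto snapshot = app.getBucketManager().getSearchableSnapshot();
        auto entry = snapshot->load(key);
        // ... use `entry`; no locking of the live BucketList is needed.
    }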

Incrementing current LCL

This implies that the meaning of getLastClosedLedger/getLastClosedLedgerHeader has changed. The value returned by these functions should be handled with care, under the assumption that it may be different on the next call (if the background thread applied a new ledger). Depending on the use case in the codebase, the value returned as-is may or may not be safe to use:

  • When it is safe: in cases where LCL is used more like a heuristic or an approximation, and program correctness does not depend on its exact state. Example: post-externalize cleanup of the transaction queue. We load LCL’s close time to purge invalid transactions from the queue. This is safe because even if LCL is updated while we make this call, the queue remains in a consistent state; the transaction queue is an approximation anyway, so an LCL from a particular point in time (even though it may currently be changing) is safe to use (see the sketch after this list).
  • When it is not safe: when LCL is needed in places where the latest ledger state is critical, like voting in SCP, validating blocks, etc. To avoid any unnecessary headaches, we introduce a new invariant: “applying” is a new state in the state machine, which does not allow voting or triggering next ledgers. Core must first complete applying to be able to vote on the “latest state”. In the meantime, if ledgers arrive while applying, we treat them like “future ledgers” and apply the same procedures in Herder that we do today (don’t perform validation checks, don’t vote on them, and buffer them in a separate queue). The state machine remains on the main thread only, which ensures SCP can safely execute as long as the state transitions are correct (for example, a block production function can safely grab the LCL at the beginning of the function without worrying that it might change in the background).
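
A minimal sketch of the “safe” usage pattern (the purge helper is hypothetical; the point is to snapshot LCL once and not re-read it mid-computation):

    // Sketch only: take one copy of LCL at the top of the function and use
    // it consistently, instead of calling getLastClosedLedgerHeader()
    // repeatedly and risking mixing two different ledgers.
    void
    purgeInvalidTransactionsOnMainThread(Application& app)
    {
        releaseAssert(threadIsMain());
        // Heuristic use of LCL: a value that is one ledger stale is still
        // a valid basis for purging the transaction queue.
        auto const lcl = app.getLedgerManager().getLastClosedLedgerHeader();
        auto closeTime = lcl.header.scpValue.closeTime;

        // Hypothetical helper: drop transactions whose time bounds expired.
        purgeExpiredFromTxQueue(app, closeTime);

        // NOT safe: re-reading getLastClosedLedgerHeader() here and assuming
        // it still refers to the same ledger as `lcl`.
    }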

Reflecting state change in the bucketlist

Closing a ledger is the only place in the code that updates the BucketList; other subsystems may only read it. An example is garbage collection, which queries the latest BucketList state to decide which buckets to delete. These accesses are protected with a mutex (the same LCL mutex used in LM, as the BucketList is conceptually a part of LCL as well).
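
A minimal sketch of how that protection looks, assuming hypothetical member names (the actual PR adds a new LCL lock inside LedgerManager, as discussed in the review below):

    // Sketch only: one mutex guards both LCL and the BucketList, since the
    // BucketList is conceptually part of the last closed ledger.

    // Ledger-close (background) thread: the only writer.
    void
    LedgerManagerImpl::advanceLastClosedLedger(
        LedgerHeaderHistoryEntry const& newLcl)
    {
        std::lock_guard<std::recursive_mutex> guard(mLclStateMutex);
        mLastClosedLedger = newLcl;
        // The BucketList was already updated as part of closing this ledger.
    }

    // Main thread: read-only consumer, e.g. bucket garbage collection.
    void
    BucketManagerImpl::forgetUnreferencedBuckets()
    {
        std::lock_guard<std::recursive_mutex> guard(
            mApp.getLedgerManager().getLclStateMutex()); // hypothetical accessor
        auto referenced = getAllReferencedBuckets();
        // ... delete bucket files that are not in `referenced`.
    }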

marta-lokhova force-pushed the parallelLedgerClose branch 7 times, most recently from 1e53b11 to 6d1ce1f on November 16, 2024 01:14
marta-lokhova force-pushed the parallelLedgerClose branch 4 times, most recently from db9a9e0 to 097cb43 on December 16, 2024 18:37
marta-lokhova marked this pull request as ready for review December 16, 2024 18:38
// bloated, with lots of legacy code, so to ensure safety, annotate all
// functions using mApp with `releaseAssert(threadIsMain())` and avoid
// accessing mApp in the background. Safety invariant: lock acquisition must
// always be LCL lock -> BucketManager lock, and never the other direction
Contributor

This is ok but I think the comment on the lock mBucketMutex, below, is out of date: it's no longer just for protecting the file accesses, right?

Contributor Author

mm, I didn't change the semantics of mBucketMutex though; instead I added a new lock to LM which guards changes to LCL.
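
To illustrate the ordering invariant from the diff comment above (mBucketMutex is the name from this discussion; the LCL mutex name is hypothetical, and the snippet is only a sketch):

    // Sketch only: any path that needs both locks must take the LCL lock
    // first and the BucketManager lock second, never the other way around,
    // so two such paths cannot deadlock against each other.
    {
        std::lock_guard<std::recursive_mutex> lclGuard(mLclStateMutex);
        std::lock_guard<std::recursive_mutex> bucketGuard(mBucketMutex);
        // ... touch LCL-derived state and bucket bookkeeping together.
    }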

marta-lokhova self-assigned this Dec 17, 2024
graydon (Contributor) commented Dec 18, 2024

Ok so my guide-level understanding of this follows. Could you confirm I've got it right and am not missing anything?

  1. The database gets multiple sessions, one per thread, and nearly everything that touches it gets session-qualified. This allows a degree of concurrent access.
  2. The hilariously-named "persistent state" table, full of miscellaneous stuff, gets split up to avoid serialization conflicts arising from various table-scan predicates that might otherwise execute concurrently.
  3. A new ledger close thread is added.
  4. The ledger close path gets modified so that:
    • Herder on main thread calls LM::valueExternalized as before
    • LM::valueExternalized does not close ledger anymore, passes ledger to CM::processLedger to enqueue.
    • CM::processLedger enqueues ledger in mSyncingLedgers
      • If CM is in sync:
        • Call CM::tryApplySyncingLedgers, which posts a task over to the close thread to call LM::closeLedger
        • Return true to LM::valueExternalized, which then returns to herder to carry on doing SCP.
      • Else CM is not in sync, return false so LM::valueExternalized can transition LM and CM to catchup.
    • Task on close thread runs LM::closeLedger
      • Which is mostly the same as the old closeLedger path, except now it's racing the main thread on the BL, DB, and HM
      • When LM::closeLedger completes, it posts a task back to the main thread that does history steps, bucket GC, and notifies Herder of completion (previously this notification happened synchronously after LM::valueExternalized). See the sketch after this list.
  5. The BL and HM (and Config and a few other things) get made more threadsafe in general.
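
A minimal sketch of the task-handoff pattern in step 4, assuming asio-style posting helpers (the posting function names and the completion hook are illustrative, not necessarily the exact ones in the PR):

    // Sketch only: main thread schedules the background close; the close
    // thread posts completion work back to the main thread.
    void
    CatchupManagerImpl::scheduleApply(LedgerCloseData const& lcd)
    {
        releaseAssert(threadIsMain());
        mApp.postOnLedgerCloseThread( // hypothetical posting helper
            [this, lcd]() {
                // Background thread: tx apply, bucket transfer, DB writes
                // through this thread's own session.
                mApp.getLedgerManager().closeLedger(lcd, /* externalize */ true);

                mApp.postOnMainThread(
                    [this, seq = lcd.getLedgerSeq()]() {
                        // Main thread: history steps, bucket GC, and notify
                        // Herder so SCP can trigger the next ledger.
                        ledgerAppliedOnMainThread(seq); // hypothetical hook
                    },
                    "ledger-close completion");
            },
            "background ledger close");
    }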

If my understanding is correct... I think this should basically work. And I should express tremendous congratulations for finding a cut-point that seems like it might work. This is no small feat! It's brilliant.

That said, I remain quite nervous about the details, in I think 3 main ways:

  • General fine-grained lack-of-threadsafety or data-consistency issues. It's hard to be sure you protected the last race.
  • The potential for bugs in the ledger sequence arithmetic and state-transition conditions of LM, CM and Herder. Just because so much has changed here. It's very hard to keep straight the whole set of possible system-state transitions and queue contents, and what the correct thing is to happen in all cases.
  • The potential for bigger blocks of logic on the "two sides of the split" -- main thread doing SCP/herder/enqueue/completion, and close thread doing dequeue/tx-apply/bucket-transfer -- having a subtle unstated assumption of coordinated / synchronous operation. In other words the risk that the cut is "not clean", and some correctness invariant of the system I've forgotten about is violated by running the close thread concurrently.

All 3 of these are diffuse, vague worries that I can't point to any specific code as actually inhabiting. You've done great here; I would never have thought to make this cut point. But I remain worried. I wonder if there are ways we could audit, detect or mitigate any of those risks.

graydon (Contributor) left a comment

Generally great work. Handful of minor nits, lots of questions to make sure I'm understanding things, handful of potential clarifications to consider around naming / comments / logic. Plus I wrote a larger "overview question" in the PR comments.

But all that aside, congratulations on the accomplishment here!

src/bucket/BucketManager.cpp (outdated; resolved)
@@ -22,6 +22,10 @@ class Work;

class CatchupManagerImpl : public CatchupManager
{
// Maximum number of ledgers that can be queued to apply (this only applies
// when Config.parallelLedgerClose() == true)
Contributor

Expand comment: what happens when this limit is exceeded?

@@ -53,7 +53,12 @@ class CatchupManager

// Process ledgers that could not be applied, and determine if catchup
// should run
virtual void processLedger(LedgerCloseData const& ledgerData) = 0;

// Return true is latest ledger was applied, and there are no syncing
Contributor

Suggested change
// Return true is latest ledger was applied, and there are no syncing
// Return true if latest ledger was applied, and there are no syncing

src/catchup/CatchupManager.h (outdated; resolved)
"Close of ledger {} buffered. mSyncingLedgers has {} ledgers",
ledgerData.getLedgerSeq(), mSyncingLedgers.size());
releaseAssert(threadIsMain());
if (!mLastQueuedToApply)
Contributor

It seems like this method always updates mLastQueuedToApply, so I think it should not be named "maybe". But I am also confused because I thought mLastQueuedToApply would only be set to a value at all when we're in parallel ledger close mode (at least according to the comment in LedgerManagerImpl.h); but it seems it's just always an incrementally-updated max?

Contributor Author

Yes, I switched to mLastQueuedToApply for simplicity - without parallel ledger close, ledgers are applied right away, so mLastQueuedToApply will always be identical to getLastClosedLedgerNum(). That seems fairly harmless, and still satisfies the definition of mLastQueuedToApply (LCL <= Q), but let me know if you feel strongly.

app.getDatabase().clearPreparedStatementCache();
app.getDatabase().clearPreparedStatementCache(
app.getDatabase().getSession());
app.getDatabase().clearPreparedStatementCache(
Contributor

why twice?

Contributor Author

rebase fallout, fixed.

@@ -2255,9 +2277,15 @@ HerderImpl::purgeOldPersistedTxSets()
}
}

// Tracking -> not tracking should only happen if there is nothing to apply
// If there's something to apply, it's possible the rest of the network is
Contributor

Another todo to add to the task list.

// If we have too many ledgers queued to apply, just stop scheduling
// more and let the node gracefully go into catchup.
releaseAssert(mLastQueuedToApply >= lcl);
if (nextToClose - lcl >= MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT)
Contributor

Is this a condition we observe in reality? Or, I mean, why is there new code here to handle it? Is it something you're concerned will happen due to speed mismatches between the two threads?

marta-lokhova (Contributor Author) commented Dec 18, 2024

So the scenario I'm thinking of is a node externalizing ledgers much faster than it can apply them. This may create a weird situation where the ledger close queue grows without bound and the node technically reports as being "in sync", whereas in reality it's arbitrarily behind. This is also not great for flooding: while it'll technically "work", the node will likely flood obsolete/irrelevant traffic. So with this change we allow nodes to fall behind by about a minute of consensus; if they can't keep up, we stop scheduling new ledgers to apply and let core go into catchup (it will wait until all queued ledgers are applied first, then start catchup after the trigger ledger).

Note that this condition is very unlikely to happen in reality on the network today. I've artificially created the condition in testing by making ledger close take an arbitrarily long time (e.g. 5+ seconds).
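
A minimal sketch of the bound being discussed (MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT is the constant from the diff above; the value 12 is only an assumed illustration of "about a minute" at 5-second ledgers):

    // Sketch only: cap how far externalization may run ahead of application.
    // If the run of ledgers queued for background apply grows past the cap,
    // stop scheduling new applies; the node drains the queue and then goes
    // into catchup instead of flooding stale traffic while "in sync".
    constexpr uint32_t MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT = 12; // assumed value

    bool
    canScheduleApply(uint32_t nextToClose, uint32_t lcl)
    {
        return nextToClose - lcl < MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT;
    }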

auto const& ledgerHeader =
mApp.getLedgerManager().getLastClosedLedgerHeader();
releaseAssert(threadIsMain());
uint32_t nextToClose = *mLastQueuedToApply + 1;
Contributor

I don't know that this is right? Or rather the condition below that checks it isn't right? mLastQueuedToApply is basically "the back of the queue", right? And we're taking lcd off the front of the queue?

Let me do an example: if the queue is, say, {4, 5, 6}, then LCL is 3, lcd is 4, mLastQueuedToApply is 6, and so nextToClose is 7. The condition we check below (nextToClose != lcd.getLedgerSeq()) will be comparing 7 with 4, they are unequal, so we break.

In the old code (before this change) it was comparing LCL+1 with lcd, which will be comparing 4 with 4, which are equal, so we break. I think the old code is correct.

But then .. I am a little surprised if the new code works at all. Even when the queue is just a single entry, say {4}, then LCL is 3, lcd is 4, mLastQueuedToApply is also 4, nextToClose is 5, and so again it will compare 5 with 4, they are unequal, it will break.

What am I misunderstanding here?

Contributor Author

So mLastQueuedToApply tracks the back of the queue of ledgers currently being applied in the background. It's different from the queue tracked by CM (sorry, many queues here!). The CM queue is "any ledgers we received from the network", which may arrive out of order and may contain gaps. mLastQueuedToApply tracks the queue of strictly sequential ledgers that are guaranteed to be applied in the background (lambdas posted to ledgerCloseThread via asio). So once we've determined that a buffered ledger can be applied, it's popped from the buffered ledgers queue and scheduled to be applied in the background (and mLastQueuedToApply is incremented).
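
A minimal sketch of the relationship between the two queues described above (simplified; the scheduling helper is hypothetical):

    // Sketch only: drain the gap-free prefix of the CM buffer and hand it
    // to the close thread; mLastQueuedToApply tracks the back of that
    // strictly sequential run.
    void
    CatchupManagerImpl::tryApplySyncingLedgers()
    {
        releaseAssert(threadIsMain());
        // mLastQueuedToApply is assumed to be initialized by this point.
        uint32_t nextToClose = *mLastQueuedToApply + 1;
        auto it = mSyncingLedgers.begin();
        while (it != mSyncingLedgers.end() &&
               it->second.getLedgerSeq() == nextToClose)
        {
            // Sequential ledger: schedule for background application.
            scheduleApplyOnLedgerCloseThread(it->second); // hypothetical helper
            mLastQueuedToApply = nextToClose;
            it = mSyncingLedgers.erase(it);
            ++nextToClose;
        }
        // Anything left in mSyncingLedgers has a gap before it; it stays
        // buffered until the missing ledgers arrive or catchup triggers.
    }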

{
mApp.getLedgerManager().closeLedger(lcd, /* externalize */ true);
}
mLastQueuedToApply = lcd.getLedgerSeq();
Contributor

oh wow now I am very confused. I thought mLastQueuedToApply was tracking the back of the queue. Now it looks like it's tracking the front.

Successfully merging this pull request may close these issues:

background ledger close: rewrite externalize path to continue buffering ledgers during ledger close