Parallel ledger close #4543
base: master
Conversation
// bloated, with lots of legacy code, so to ensure safety, annotate all
// functions using mApp with `releaseAssert(threadIsMain())` and avoid
// accessing mApp in the background. Safety invariant: lock acquisition must
// always be LCL lock -> BucketManager lock, and never the other direction
This is ok but I think the comment on the lock mBucketMutex, below, is out of date: it's no longer just for protecting the file accesses, right?
mm, I didn't change the semantics of mBucketMutex though; instead I added a new lock to LM which guards changes to the LCL.
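For reference, here is a minimal sketch of the lock-ordering invariant described in that comment; the mutex names and functions are stand-ins, not the actual stellar-core identifiers:

```cpp
#include <mutex>

// Stand-ins for the two locks named in the comment above.
std::recursive_mutex gLclMutex;    // LedgerManager's LCL lock
std::recursive_mutex gBucketMutex; // BucketManager's bucket lock

void
closeLedgerAndUpdateBuckets()
{
    // Correct order: LCL lock first, then the BucketManager lock.
    std::lock_guard<std::recursive_mutex> lcl(gLclMutex);
    std::lock_guard<std::recursive_mutex> buckets(gBucketMutex);
    // ... update LCL state and the bucket files ...
}

void
readBucketsConsistentWithLcl()
{
    // Any path that needs both locks must use the same order; taking the
    // bucket lock first and then reaching for the LCL lock could deadlock
    // against the path above.
    std::lock_guard<std::recursive_mutex> lcl(gLclMutex);
    std::lock_guard<std::recursive_mutex> buckets(gBucketMutex);
    // ... read bucket state that must be consistent with the LCL ...
}
```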
Ok so my guide-level understanding of this follows. Could you confirm I've got it right and am not missing anything?
If my understanding is correct... I think this should basically work. And I should express tremendous congratulations for finding a cut-point that seems like it might work. This is no small feat! It's brilliant. That said, I remain quite nervous about the details, in I think 3 main ways:
All 3 of these are diffuse, vague worries that I can't point to any specific code actually inhabiting. You've done great here; I would never have thought to make this cut point. But I remain worried. I wonder if there are ways we could audit, detect or mitigate any of those risks.
Generally great work. Handful of minor nits, lots of questions to make sure I'm understanding things, handful of potential clarifications to consider around naming / comments / logic. Plus I wrote a larger "overview question" in the PR comments.
But all that aside, congratulations on the accomplishment here!
src/catchup/CatchupManagerImpl.h
@@ -22,6 +22,10 @@ class Work;

class CatchupManagerImpl : public CatchupManager
{
    // Maximum number of ledgers that can be queued to apply (this only applies
    // when Config.parallelLedgerClose() == true)
Expand comment: what happens when this limit is exceeded?
src/catchup/CatchupManager.h
@@ -53,7 +53,12 @@ class CatchupManager

    // Process ledgers that could not be applied, and determine if catchup
    // should run
    virtual void processLedger(LedgerCloseData const& ledgerData) = 0;

    // Return true is latest ledger was applied, and there are no syncing
Suggested change:
- // Return true is latest ledger was applied, and there are no syncing
+ // Return true if latest ledger was applied, and there are no syncing
"Close of ledger {} buffered. mSyncingLedgers has {} ledgers", | ||
ledgerData.getLedgerSeq(), mSyncingLedgers.size()); | ||
releaseAssert(threadIsMain()); | ||
if (!mLastQueuedToApply) |
It seems like this method always updates mLastQueuedToApply, so I think it should not be named "maybe". But I am also confused because I thought mLastQueuedToApply would only be set to a value at all when we're in parallel ledger close mode (at least according to the comment in LedgerManagerImpl.h); but it seems it's just always an incrementally-updated max?
Yes, I switched to mLastQueuedToApply for simplicity - without parallel ledger close, ledgers are applied right away, so mLastQueuedToApply will always be identical to getLastClosedLedgerNum(). That seems fairly harmless, and still satisfies the definition of mLastQueuedToApply (LCL <= Q), but let me know if you feel strongly.
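As a reference point, a minimal sketch of that invariant (names are hypothetical, not the actual implementation): mLastQueuedToApply behaves as an incrementally-updated max that never falls behind the last closed ledger, so LCL <= Q holds with or without parallel ledger close.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

struct ApplyQueueTracker
{
    std::optional<uint32_t> mLastQueuedToApply;

    void
    noteQueued(uint32_t ledgerSeq, uint32_t lastClosedLedgerNum)
    {
        if (!mLastQueuedToApply)
        {
            // First update: start from whichever value is further ahead.
            mLastQueuedToApply = std::max(ledgerSeq, lastClosedLedgerNum);
        }
        else
        {
            // Afterwards: an incrementally-updated max, never moving backward.
            *mLastQueuedToApply = std::max(*mLastQueuedToApply, ledgerSeq);
        }
        // Without parallel ledger close, ledgers are applied right away, so
        // this equals lastClosedLedgerNum; with it, it may run ahead (LCL <= Q).
        assert(*mLastQueuedToApply >= lastClosedLedgerNum);
    }
};
```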
src/test/FuzzerImpl.cpp
app.getDatabase().clearPreparedStatementCache();
app.getDatabase().clearPreparedStatementCache(
    app.getDatabase().getSession());
app.getDatabase().clearPreparedStatementCache(
why twice?
rebase fallout, fixed.
src/herder/HerderImpl.cpp
@@ -2255,9 +2277,15 @@ HerderImpl::purgeOldPersistedTxSets()
    }
}

// Tracking -> not tracking should only happen if there is nothing to apply
// If there's something to apply, it's possible the rest of the network is
Another todo to add to the task list.
// If we have too many ledgers queued to apply, just stop scheduling
// more and let the node gracefully go into catchup.
releaseAssert(mLastQueuedToApply >= lcl);
if (nextToClose - lcl >= MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT)
Is this a condition we observe in reality? Or, I mean, why is there new code here to handle it? Is it something you're concerned will happen due to speed mismatches between the two threads?
So the scenario I'm thinking of is the node externalizing ledgers much faster than it can apply them. This may create a weird situation where the ledger close queue grows without bound and the node technically reports as being "in sync" whereas in reality it's arbitrarily far behind. This is also not great for flooding: while it'll technically "work", the node will likely flood obsolete/irrelevant traffic. So with this change we allow nodes to fall behind by about a minute of consensus; if they can't keep up, we stop scheduling new ledgers to apply and let core go into catchup (it will wait until all queued ledgers are applied first, then start catchup after the trigger ledger).
Note that this condition is very unlikely to happen on the network today. I've artificially created it in testing by making ledger close take an arbitrarily long time (e.g. 5+ seconds).
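To illustrate the gate, here is a sketch of the check; the constant value (12 ledgers, roughly one minute at ~5 seconds per ledger) is an assumption for the example, not necessarily the value used in this PR:

```cpp
#include <cstdint>

// Assumed value for illustration only.
constexpr uint32_t MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT = 12;

// Returns true if it is still ok to schedule `nextToClose` for background
// apply; false means the node stops scheduling, drains the queue, and lets
// catchup take over after the trigger ledger.
bool
canScheduleApply(uint32_t nextToClose, uint32_t lcl)
{
    return (nextToClose - lcl) < MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT;
}
```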
auto const& ledgerHeader =
    mApp.getLedgerManager().getLastClosedLedgerHeader();
releaseAssert(threadIsMain());
uint32_t nextToClose = *mLastQueuedToApply + 1;
I don't know that this is right? Or rather the condition below that checks it isn't right? mLastQueuedToApply is basically "the back of the queue", right? And we're taking lcd off the front of the queue?
Let me do an example: if the queue is, say, {4, 5, 6}, then LCL is 3, lcd is 4, mLastQueuedToApply is 6, and so nextToClose is 7. The condition we check below (nextToClose != lcd.getLedgerSeq()) will be comparing 7 with 4, they are unequal, so we break.
In the old code (before this change) it was comparing LCL+1 with lcd, which will be comparing 4 with 4, which are equal, so we break. I think the old code is correct.
But then .. I am a little surprised if the new code works at all. Even when the queue is just a single entry, say {4}, then LCL is 3, lcd is 4, mLastQueuedToApply is also 4, nextToClose is 5, and so again it will compare 5 with 4, they are unequal, it will break.
What am I misunderstanding here?
So mLastQueuedToApply tracks the back of the queue of ledgers currently being applied in the background. It's different from the queue tracked by CM (sorry, many queues here!). The CM queue is "any ledgers we received from the network", which may arrive out of order, and may contain gaps. mLastQueuedToApply tracks the queue of strictly sequential ledgers that are guaranteed to be applied in the background (lambdas posted to ledgerCloseThread via asio). So once we've determined that a buffered ledger can be applied, it's popped from the buffered ledgers queue and scheduled to be applied in the background (and mLastQueuedToApply is incremented).
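A simplified sketch of that scheduling step (not the actual stellar-core code; the class and member names are illustrative): the main thread advances mLastQueuedToApply and posts the apply work to the dedicated ledger-close thread via asio.

```cpp
#include <asio.hpp>
#include <cstdint>
#include <thread>

struct ApplySchedulerSketch
{
    asio::io_context mCtx;
    asio::executor_work_guard<asio::io_context::executor_type> mWork{
        asio::make_work_guard(mCtx)};
    std::thread mLedgerCloseThread{[this] { mCtx.run(); }};
    uint32_t mLastQueuedToApply = 0;

    // Called on the main thread once a buffered ledger is known to be the
    // next sequential one.
    void
    scheduleApply(uint32_t ledgerSeq)
    {
        mLastQueuedToApply = ledgerSeq; // advance the back of the apply queue
        asio::post(mCtx, [ledgerSeq] {
            // Runs on the ledger-close thread, strictly in the order posted.
            applyLedger(ledgerSeq);
        });
    }

    static void
    applyLedger(uint32_t /*ledgerSeq*/)
    {
        // Stand-in for LedgerManager::closeLedger (apply txs, write buckets,
        // emit meta, etc.).
    }

    ~ApplySchedulerSketch()
    {
        mWork.reset();
        mLedgerCloseThread.join();
    }
};
```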
{
    mApp.getLedgerManager().closeLedger(lcd, /* externalize */ true);
}
mLastQueuedToApply = lcd.getLedgerSeq();
oh wow now I am very confused. I thought mLastQueuedToApply was tracking the back of the queue. Now it looks like it's tracking the front.
Resolves #4317
Concludes #4128
The implementation of this proposal requires massive changes to the stellar-core codebase and touches almost every subsystem. There are some paradigm shifts in how the program executes, which I will discuss below for posterity. The same ideas are reflected in code comments as well, since they will be important for code maintenance and extensibility.
Database access
Currently, only the Postgres DB backend is supported, as it required minimal changes to how DB queries are structured (Postgres provides a fairly nice concurrency model).
SQLite's concurrency support is a lot more rudimentary: only a single writer is allowed, and the whole database is locked during writing. This necessitates further changes in core (such as splitting the database into two). Given that most network infrastructure is on Postgres right now, SQLite support can be added later.
Reduced responsibilities of SQL
SQL tables have been trimmed as much as possible to avoid conflicts; essentially, we only store persistent state such as the latest LCL and SCP history, as well as the legacy OFFER table.
Asynchronous externalize flow
There are three important subsystems in core that are in charge of tracking consensus, externalizing and applying ledgers, and advancing the state machine to catchup or synced state: Herder, LedgerManager, and CatchupManager.
Prior to this change, externalize followed two different flows: when the node was in sync, Herder triggered ledger close directly, and when it was out of sync, externalized ledgers were buffered in CatchupManager, which decided whether to trigger catchup.
With the new changes, triggering ledger close moved entirely to CatchupManager. Essentially, CatchupManager::processLedger became the centralized place to decide whether to apply a ledger or trigger catchup. Because ledger close happens in the background, the transition between externalize and "closeLedger→set synced" becomes asynchronous.
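A self-contained sketch of that centralized decision (hypothetical names and types, not the actual CatchupManager API): buffer everything, schedule the strictly sequential prefix for background apply, and stop scheduling when apply falls too far behind so catchup can take over.

```cpp
#include <cstdint>
#include <map>

struct LedgerCloseDataStub
{
    uint32_t ledgerSeq;
};

struct CatchupDecisionSketch
{
    uint32_t mLastQueuedToApply = 0;
    std::map<uint32_t, LedgerCloseDataStub> mSyncingLedgers;

    void
    processLedger(LedgerCloseDataStub const& ledgerData)
    {
        uint32_t const seq = ledgerData.ledgerSeq;
        if (seq <= mLastQueuedToApply)
        {
            return; // already applied or already queued for apply
        }
        // Buffer first: ledgers may arrive from the network out of order.
        mSyncingLedgers.emplace(seq, ledgerData);

        // Schedule every strictly sequential ledger we have buffered, unless
        // apply has fallen too far behind externalize.
        while (true)
        {
            auto it = mSyncingLedgers.find(mLastQueuedToApply + 1);
            if (it == mSyncingLedgers.end() || applyTooFarBehind())
            {
                // Gap in the buffer, or back-pressure: stop scheduling and
                // let the node drain the queue, then run catchup if needed.
                break;
            }
            scheduleApplyInBackground(it->second);
            mLastQueuedToApply = it->first;
            mSyncingLedgers.erase(it);
        }
    }

    bool
    applyTooFarBehind() const
    {
        return false; // stand-in for the externalize/apply drift check
    }

    void
    scheduleApplyInBackground(LedgerCloseDataStub const&)
    {
        // Stand-in for posting LedgerManager::closeLedger to the close thread.
    }
};
```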
Concurrent ledger close
Below is a list of core items that moved to the background, each followed by an explanation of why it is safe to do so:
Emitting meta
Ledger application is the only process that touches the meta pipe, so there are no conflicts with other subsystems.
Writing checkpoint files
Only the background thread writes in-progress checkpoint files. The main thread deals exclusively with "complete" checkpoints, which, once completed, must not be touched by any subsystem other than publishing.
Updating ledger state
The rest of the system operates strictly on read-only BucketList snapshots, and is unaffected by the changing state. Note: there are still some calls to LedgerTxn in the codebase, but those only appear on startup during setup (when the node is not operational) or in offline commands.
Incrementing current LCL
This implies that the meaning of getLastClosedLedger/getLastClosedLedgerHeader has changed. The value returned by these functions should be handled with care, under the assumption that it may be different on the next call (if the background thread has applied a new ledger).
Depending on the use case in the codebase, the value returned as-is may or may not be safe to use.
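A small sketch of the safe usage pattern this implies (types and the getter are stand-ins): take one local copy of the LCL header and derive everything from it, rather than calling the getter repeatedly and assuming the results describe the same ledger.

```cpp
#include <cstdint>

// Stand-in for the ledger header type and getter; the real getter may return
// a newer value whenever the background thread applies another ledger.
struct LedgerHeaderStub
{
    uint32_t ledgerSeq;
};

LedgerHeaderStub
getLastClosedLedgerHeaderStub()
{
    return {100}; // value may advance between calls
}

void
useLclSafely()
{
    // One consistent snapshot...
    LedgerHeaderStub const lcl = getLastClosedLedgerHeaderStub();

    // ...and everything derived from that snapshot, not from a re-read.
    uint32_t const nextLedgerSeq = lcl.ledgerSeq + 1;
    (void)nextLedgerSeq;
}
```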
Reflecting state change in the bucketlist
Ledger close is the only place in the code that updates the BucketList; other subsystems may only read it. An example is garbage collection, which queries the latest BucketList state to decide which buckets to delete. These reads are protected with a mutex (the same LCL mutex used in LM, as the BucketList is conceptually part of the LCL as well).
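A sketch of the read-side pattern described above (names are illustrative): readers such as garbage collection copy what they need under the shared mutex and do the slow work outside of it.

```cpp
#include <mutex>
#include <set>
#include <string>

// Updated only by ledger close; read by garbage collection and others.
std::mutex gLclAndBucketListMutex;
std::set<std::string> gLiveBucketHashes;

std::set<std::string>
snapshotLiveBuckets()
{
    // Copy the referenced-bucket set under the lock; scan the bucket
    // directory and delete unreferenced files outside of it.
    std::lock_guard<std::mutex> guard(gLclAndBucketListMutex);
    return gLiveBucketHashes;
}
```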