Backfilling history within RPC's retention window #323

mollykarcher · 2024-11-01T19:00:25Z

mollykarcher
Nov 1, 2024
Collaborator

What

RPC currently populates its retention window only via "forward fill": it starts out with no data and does not prune any old data until the retention window fills up. This means that when spinning up a new node, an operator needs to wait the full duration of their retention window before that RPC is actually retaining the amount of history it is configured to retain. We propose changing that, so that an RPC can have a full retention window immediately (or quickly) on startup.

Why

There have been several contexts in which this has come up, some discussed internally and some from various provider use cases. Some of these include:

- Provider expects their retention window to be full on initial startup, rather than waiting for forward fill
  - Falls under the "feature parity with other L1s" bucket
  - ~~Note that I haven't actually seen/heard anyone complain about this~~ @overcat has indicated they'd want this
- Provider expands their retention window, so the window is now partially empty, and they want to fill it ASAP rather than waiting for forward fill
  - We will experience this to some degree when updating our internal instance from 1 day -> 7 days
- Provider has already built application logic against RPC's APIs and plans to use it on a going-forward basis, but wants a one-time backfill/stream of all Soroban data
  - At least 3 known requests from real users for something like this
  - RPC used almost as a proxied streamer/ingester to churn through all Soroban history
- Provider wants the "archival node" equivalent of RPC
  - Recommendation here will likely be different than backfilling and storing full history (i.e. CDP-backed RPC?)
- [potential / future use case] Provider wants to reindex data in RPC
  - We debated this internally prior to deciding to support data migrations
  - Potentially some future state where, if RPC has filtering capabilities, this would enable you to change filters and reingest

How

How exactly we should implement this from a product-perspective is debatable, and we have a few different options. For example:

1. Backfill synchronously on startup
   - The most "industry standard" approach when compared to other L1 RPCs
   - Startup is slower, presuming all getHealth requests fail until backfill is complete
   - Slows down the "default" case; local development, quickstart, tests, etc
   - "Slow" issues unlikely to affect real/production instances, since most employ multi-replica or blue/green setups for upgrades
2. Backfill asynchronously in the background after startup
   - Is this even technically possible? Would it mean we'd have to run 2 captive cores?
   - Potentially confusing for end-users if it starts serving requests, but history retention/API responses changes over time
3. Backfill offline via a different command (a la horizon reingest), where backfill is mutually exclusive with "live" ingestion
   - Potentially easier to understand for existing stellar operators due to the Horizon analog
   - Backfilling must be a very intentional operator action (could be a pro or a con)

I think either option 1 or option 3 would be reasonable, but I'm open to other thoughts. Other things/options to consider:

Should whether we backfill at all be configurable? Probably
Should we allow backfilling via both Captive Core and CDP? Probably, though we can prioritize which option makes the most sense to start with

chowbao · 2024-11-01T22:28:05Z

chowbao
Nov 1, 2024
Collaborator

> Should whether we backfill at all be configurable? Probably

+1

> Should we allow backfilling via both Captive Core and CDP? Probably, though we can prioritize which option makes the most sense to start with

Does backfilling need a Captive Core option? I would actually argue that backfill should be CDP only. Reasons being:

- As proven by @Shaptic's CDP RPC backfill experiment, using CDP to backfill RPC is extremely easy and removes the need to run/set up a Captive Core for a potentially large backfill time range
- Pushes more users to use CDP for use cases CDP is good at

> How exactly we should implement this

I'm in favor of option 3 (backfill offline via a different command), but I'm not the most familiar with RPC, so I will defer to others' opinions.

1 reply

Shaptic Nov 4, 2024
Collaborator

I agree that we should make it CDP only if we could, but I doubt that'll fly.

tomerweller · 2024-11-04T14:28:14Z

tomerweller
Nov 4, 2024
Maintainer

My vote is for 1 and sync via captive core by default with an option to use CDP instead

2 replies

Shaptic Nov 4, 2024
Collaborator

I'd vote for the opposite defaults because we want to push people to CDP, but still have the captive core option if they decide to opt-out: the additional friction might get us more CDP users.

tomerweller Nov 8, 2024
Maintainer

So the idea would be that a default configuration would introduce a hard requirement on CDP? How is the operator expected to have access to CDP?

tamirms · 2024-11-04T18:26:29Z

tamirms
Nov 4, 2024
Collaborator

There are a few challenges with option 3, Backfill offline via a different command (a la horizon reingest):

- Having a separate ingest command introduces the possibility of gaps in RPC's dataset. For example, if you ingest ledgers [5000, 10000] into an empty RPC DB and then start up RPC when the latest ledger is 15000, that will result in a gap between ledgers 10000 and 15000.
- I don't think SQLite allows concurrent write access, so it will not be possible to run the ingest command while RPC is running or vice versa. This could lead to a confusing experience for the operator if they aren't aware of this issue.
- RPC will immediately trim data which is outside its retention window. So, if you ingest a ledger range which is in the distant past (e.g. outside the 7 day retention window), all of it will be immediately deleted when you start up RPC. This is another issue which could frustrate the operator if they aren't careful about adjusting the retention window after running the ingest command.
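
To make the gap and trimming points concrete, here is a minimal Python sketch. It is not RPC's actual code: `find_gaps` and `trim_to_retention` are hypothetical helpers, stored ledgers are modeled as sorted, non-overlapping inclusive `(start, end)` pairs, and the retention window is expressed in ledgers. The numbers are taken from the example above.

```python
def find_gaps(ranges):
    """Return the missing inclusive ledger ranges between stored ranges.

    Assumes the input ranges do not overlap (hypothetical data model).
    """
    gaps = []
    ordered = sorted(ranges)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


def trim_to_retention(ranges, latest_ledger, retention_window):
    """Drop every ledger older than the retention window (counted in ledgers)."""
    oldest_kept = latest_ledger - retention_window + 1
    kept = []
    for start, end in ranges:
        if end < oldest_kept:
            continue  # the whole range is outside the window and is deleted
        kept.append((max(start, oldest_kept), end))
    return kept


# Offline ingest of [5000, 10000], then live ingestion starts at ledger 15000.
stored = [(5000, 10000), (15000, 15000)]
gaps = find_gaps(stored)

# With a (hypothetical) 1000-ledger window, the offline range is trimmed entirely.
after_trim = trim_to_retention(stored, latest_ledger=15000, retention_window=1000)
```

Here `gaps` contains the single missing range 10001 to 14999, and `after_trim` keeps only ledger 15000, which is the "all of it will be immediately deleted" case.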

> Should whether we backfill at all be configurable? Probably

+1 for making it configurable.

> Should we allow backfilling via both Captive Core and CDP?

this could be challenging from an operator experience perspective because there are 2 sources of ledgers (captive-core and cdp) and 2 phases of ingestion (backfill and live). So in total there are 4 different combinations that they could consider:

backfill from CDP, live ingestion from captive core
backfill from CDP, live ingestion from CDP
backfill from captive core, live ingestion from CDP
backfill from captive core, live ingestion from captive core

To minimize confusion, I am inclined to allow the operator to choose only a single source of ledgers (captive core vs CDP) and not allow them to customize the ingestion source for each phase of ingestion.
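
As a rough illustration of how the configuration surface shrinks, the sketch below enumerates both models in Python. The names `SOURCES`, `all_combinations`, and `single_source` are made up for this example and do not correspond to any existing RPC setting.

```python
from itertools import product

# The two ledger sources discussed in this thread.
SOURCES = ("CDP", "captive core")

# Every (backfill source, live ingestion source) pair an operator could pick
# if each phase were configurable independently.
all_combinations = list(product(SOURCES, repeat=2))

# The restricted model: one source setting applies to both phases.
single_source = [(source, source) for source in SOURCES]
```

Independent per-phase settings give the 4 combinations listed above, while a single source setting leaves only 2 to document and test.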

1 reply

mollykarcher Nov 8, 2024
Collaborator Author

> this could be challenging from an operator experience perspective because there are 2 sources of ledgers

I think we may deal with this eventually anyway, since we're also playing around with the idea of enabling RPC requests for data older than its retention window to proxy over to CDP. I don't necessarily want to conflate that discussion with this one, but just to point out that this matrix of possibilities may exist anyway at a later point in time.

Shaptic · 2024-11-04T22:56:59Z

Shaptic
Nov 4, 2024
Collaborator

I'm inclined to support option (1), synchronous backfill. I'll echo the problems others have brought up with the other options:

In (2), asynchronous backfill is untenable with Captive Core, since you'd need one instance in catchup mode and one for live ingestion. I could get on board with asynchronous CDP backfill if we can nail down the multiple-writer complexity of SQLite, but that also introduces the divergent functionality paths Tamir mentions.

In (3), you enter the world of state machine management hell that we have in Horizon, which includes dealing with gaps as Tamir mentioned.

With (1) we have the cleanest possible approach, and we could theoretically avoid some of the issues outlined by Molly:

> presuming all getHealth requests fail until backfill is complete

I'm pretty sure that if we synced to the database on every ledger, we could safely allow endpoints to return results within the in-progress backfill window. This would at least enable historical queries quickly, with the window increasing as backfill progresses.

> Slows down the "default" case; local development, quickstart, tests, etc

Only if backfill is enabled! Since it's opt-in, people will be aware of the performance risks.

In order to avoid gaps and retention window issues, I think we should allow only a single parameter which specifies the number of ledgers backwards from the current tip to backfill on startup. If this parameter is set, it becomes equivalent to the retention window, meaning only one of these can be used at a time: --backfill N will catch up N ledgers behind the current LCL and maintain N ledgers in the DB from then on, while --retention-window M will maintain M ledgers in the DB moving forward from an empty DB. (I think we'd want to rename --retention-window with this, too.)

4 replies

tamirms Nov 5, 2024
Collaborator

> I think we should allow only a single parameter which specifies the number of ledgers backwards from the current tip to backfill on startup. If this parameter is set, it becomes equivalent to the retention window

why not use the existing retention window parameter instead of introducing another backfill parameter?

Shaptic Nov 5, 2024
Collaborator

In principle I agree, but then you're introducing breaking behavior on the existing parameter because it will lead to slow startup immediately rather than what we have today. I still think we should preserve the forward fill behavior.

tamirms Nov 5, 2024
Collaborator

what I had in mind was that backfill could be a boolean command line flag (defaulting to false) and the amount to backfill would be inferred from the retention window configuration. So, running with --retention-window M will preserve the existing forward fill behavior and running --retention-window M --backfill will backfill any missing ledgers so that the retention window will contain M ledgers
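
A minimal Python sketch of that inference, under stated assumptions: `backfill_range` is a hypothetical helper (not an existing RPC function), the retention window is counted in ledgers, ledger sequence numbers start at 1, and whatever is already stored is contiguous up to the current tip, so only the older part of the window can be missing.

```python
from typing import Optional, Tuple


def backfill_range(
    latest_ledger: int,
    retention_window: int,
    oldest_stored: Optional[int],
    backfill: bool,
) -> Optional[Tuple[int, int]]:
    """Return the inclusive ledger range to backfill, or None if nothing is needed."""
    if not backfill:
        return None  # flag defaults to false: keep the existing forward fill behavior
    # Oldest ledger that belongs in a full window of `retention_window` ledgers.
    window_start = max(1, latest_ledger - retention_window + 1)
    if oldest_stored is None:
        return (window_start, latest_ledger)  # empty DB: fill the whole window
    if oldest_stored <= window_start:
        return None  # the window is already full
    return (window_start, oldest_stored - 1)  # fill only the missing older part
```

With a latest ledger of 15000 and a window of 1000, an empty DB would backfill 14001 through 15000, while a DB that already holds 14500 onward would only backfill 14001 through 14499.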

Shaptic Nov 5, 2024
Collaborator

Ah nice, I like that!

2opremio · 2024-11-06T18:11:14Z

2opremio
Nov 6, 2024
Collaborator

> Backfill offline via a different command (a la horizon reingest), where backfill is mutually exclusive with "live" ingestion

I agree with everyone else that doing it synchronously is a better option.

> Backfill asynchronously in the background after startup

I think this option would be ideal and I don't think that it would be a lot more complicated than doing it synchronously (apart from coordinating it with reaping). But it does require running a second copy of Captive Core. Unless, of course, CDP is used.

Here is the Epic I created a while ago about it: #196

> Should we allow backfilling via both Captive Core and CDP

We should. What ingestion backend we use shouldn't make a difference.

EDIT: BTW, regardless of the option we choose, I think we can reuse some of the tickets I created for #196. For instance, #203 will be necessary regardless of what we choose.

0 replies

overcat · 2024-11-08T03:51:33Z

overcat
Nov 8, 2024

I personally prefer (1). It may be very simple for multi-node users to use it. Before completing the backfill, health should be set to false, and once completed, it should be set to true, allowing us to direct traffic to the new node.

0 replies

mollykarcher · 2024-11-08T16:51:20Z

mollykarcher
Nov 8, 2024
Collaborator Author

To summarize the discussions thus far (comment if you disagree):

Option (1) is broadly preferred by everyone
Backfill will be configured with a new --backfill parameter, which will default to false (proposed by @tamirms)
getHealth should be updated to not report healthy until retention window is full (proposed by @overcat)
Mixed opinions on backfill mechanism (Captive Core vs CDP)
- Should you be allowed to mix + match your backfill + live ingestion sources (posed by @tamirms)?
- What should the "default" source be? (Transitively, which source should we prioritize working on first?)
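
For reference, the getHealth point could look roughly like the Python sketch below. It is illustrative only: `is_window_full` is a hypothetical helper, not the current getHealth implementation, and it assumes the retention window is counted in ledgers and that ledger sequence numbers start at 1.

```python
from typing import Optional


def is_window_full(
    oldest_ledger: Optional[int],
    latest_ledger: Optional[int],
    retention_window: int,
) -> bool:
    """Report whether the DB already holds a full retention window of ledgers."""
    if oldest_ledger is None or latest_ledger is None:
        return False  # nothing ingested yet
    stored = latest_ledger - oldest_ledger + 1
    # Assumption: a network younger than the window (e.g. a fresh local
    # network) cannot hold more than `latest_ledger` ledgers in total.
    required = min(retention_window, latest_ledger)
    return stored >= required
```

A load balancer in a multi-node setup could then keep traffic away from a new node until this check passes, which matches the flow @overcat described.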

0 replies
