Permit non-consecutive increments in validator set #1137

Open
wants to merge 2 commits into main
Conversation

Lederstrumpf
Contributor

Changes the validator set succession rule: validator set ids must still be strictly increasing, but no longer need to be consecutive.

This can greatly improve the cost-efficiency of the bridge, and is also useful in recovery scenarios like paritytech/polkadot-sdk#3080, where only one signed commitment would have to be crafted for a recovery, instead of one for every missed session.

Rationale

For bridge security against collusion by validators, it's important that these validators are still bonded, but the timing of the signature itself is secondary. Only bonded validators are still slashable, and for a bonded validator, neither the probability of succeeding with an attack nor the probability of getting caught is influenced by the timing of the signature: it is in fact irrelevant whether they sign a payload while being part of a particular validator set n, or whether they happen to be part of set n+m later (as long as they are still bonded).

While it is important that BEEFY produces a commitment for every handover of validator sets, it would be beneficial to not relay a block every session unless the active identities have changed.

The BEEFY light client will naturally still check a signed commitment against the authorities defined in the currentValidatorSet (read: the validator set last known on Ethereum [VSLKOE]). As long as the signatures are still ordered as expected for the bitfield checks, the commitment can pass even if it was signed outside of the VSLKOE's mandate within the BEEFY voting protocol.

While the validator set does change over time, it is largely preserved from session to session.
As such, even if some validators have rotated out, this change permits keeping the bridge alive so long as 2/3rds of the currentValidatorSet are still bonded and sign the commitment - either for recovery or cost optimization purposes.
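To make the relaxed rule concrete, here is a minimal sketch of the routing check, reusing names from this thread (currentValidatorSet, nextValidatorSet, validatorSetID); it is illustrative only and not the actual BeefyClient.sol diff:

```solidity
// SPDX-License-Identifier: Apache-2.0
pragma solidity ^0.8.20;

/// Minimal sketch of the relaxed succession rule. Illustrative only: signature
/// and Merkle-proof verification are omitted, and this is not the actual
/// BeefyClient.sol change.
contract SuccessionRuleSketch {
    struct ValidatorSetState {
        uint128 id;
        uint128 length;
        bytes32 root; // Merkle root of the authority keys
    }

    ValidatorSetState public currentValidatorSet;
    ValidatorSetState public nextValidatorSet;

    error InvalidCommitment();

    /// Returns true if the commitment should be verified on the handover path
    /// (against nextValidatorSet), false if against currentValidatorSet.
    function useHandoverPath(uint128 commitmentValidatorSetID) public view returns (bool) {
        if (commitmentValidatorSetID == currentValidatorSet.id) {
            return false;
        }
        // Old rule: require commitmentValidatorSetID == currentValidatorSet.id + 1
        // (strictly consecutive). Relaxed rule: any strictly newer set id is
        // accepted, as long as its signatures still verify against the validator
        // set last known on Ethereum.
        if (commitmentValidatorSetID > currentValidatorSet.id) {
            return true;
        }
        revert InvalidCommitment();
    }
}
```

The only behavioral change is in the handover branch: any strictly newer set id is accepted, rather than exactly the consecutive one.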

Next Steps

This change is a prerequisite to allowing relayers to skip sessions for block relaying:
https://github.com/Snowfork/snowbridge/blob/main/relayer/relays/beefy/polkadot-listener.go#L115-L125

For a cost-efficient bridge, the next step is to adjust the (common-good) relayer logic to only relay blocks if there has been a change in the keys of the upcoming validator set.

More aggressive approaches that additionally don't relay if only a small number of keys has changed are also possible, but require a more bespoke implementation.

cc @AlistairStewart @bhargavbh

Changes the validator set successions to still be strictly increasing, but not necessarily consecutive.

For bridge security against collusion by validators, it's important that
these validators are still bonded, but the timing of the signature itself
is secondary.

As such, even if some validators have rotated out, this change permits
keeping the bridge alive so long as 2/3rds of the `currentValidatorSet`
are still bonded and sign the commitment.

Co-authored-by: bhargavbh <[email protected]>
Co-authored-by: Alistair Stewart <[email protected]>

codecov bot commented Jan 31, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (ec6efea) 49.66% compared to head (a223f97) 49.66%.
Report is 9 commits behind head on main.

| Files | Patch % | Lines |
|-------|---------|-------|
| contracts/src/BeefyClient.sol | 66.66% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1137   +/-   ##
=======================================
  Coverage   49.66%   49.66%           
=======================================
  Files          63       63           
  Lines        3707     3707           
  Branches       72       72           
=======================================
  Hits         1841     1841           
  Misses       1849     1849           
  Partials       17       17           
| Flag | Coverage Δ |
|------|------------|
| solidity | 80.71% <66.66%> (ø) |

Flags with carried forward coverage won't be shown.


@bhargavbh

Thanks @Lederstrumpf for summarising our discussions and creating this PR.

Just to add: for the light client to not necessarily follow every consecutive validatorSet change, there are two requirements:

  1. there is an overlap of 2/3rds of the validators between the expected nextValidatorSet and the validator set of the session (strictly greater than the current one) in which the commitment is being crafted.
  2. some logic on the relayer side which maps the overlapping 2/3rds of validators (of the session where the commitment is crafted) into a bitfield that respects the order of the expected nextValidatorSet. This could possibly be triggered manually on the relayer side in scenarios like the stalled BEEFY Rococo⇄Sepolia deployment (paritytech/polkadot-sdk#3080).

@alistair-singh
Contributor

  1. there is an overlap of 2/3rds of the validators between the expected nextValidatorSet and the validator set of the session (strictly greater than the current one) in which the commitment is being crafted.

There must also exist a commitment where the overlapping 2/3rds have signed. Would this be the mandatory commitment for the session? Do all validators have to sign a mandatory commitment?

@Lederstrumpf
Contributor Author

Lederstrumpf commented Jan 31, 2024

  1. there is an overlap of 2/3rds of the validators between the expected nextValidatorSet and the validator set of the session (strictly greater than the current one) in which the commitment is being crafted.

There must also exist a commitment where the overlapping 2/3rds have signed. Would this be the mandatory commitment for the session? Do all validators have to sign a mandatory commitment?

There must also exist a commitment where the overlapping 2/3rds have signed.

Do all validators have to sign a mandatory commitment?

No, a voting round concludes once the threshold of required signatures is reached:
https://github.com/lederstrumpf/polkadot-sdk/blob/e7718fc8e75ed7133b3c06061374f32beefa7b8f/substrate/client/consensus/beefy/src/round.rs#L54-L58
https://github.com/lederstrumpf/polkadot-sdk/blob/e7718fc8e75ed7133b3c06061374f32beefa7b8f/substrate/client/consensus/beefy/src/round.rs#L166-L174

So at least 2/3rds have to sign it to produce a finalized commitment. Hence, I would start off with the conservative implementation below.

There must also exist a commitment where the overlapping 2/3rds have signed. Would this be the mandatory commitment for the session?

It's mandatory in BEEFY to sign the first block of the new session, i.e. where the new authority set kicks in. Although the previous authority set signs to hand over to the new one, there isn't a guarantee that there's a 2/3rds overlap between the two sets. For instance, in principle, validators could all rotate their session keys from one session to the next, so the BEEFY authorities would seem distinct despite the underlying economic identities remaining the same.
But practically, the vast majority of keys remain the same from one session to the next, hence it's sensible to optimize for this from a cost perspective, so long as all blocks with authority changes are still mandatory to sign.

We could relax the requirement that validators sign every first block of a session, making it mandatory to sign only if any keys have changed since the last session in the voting protocol itself (mirroring GRANDPA), but the real benefit lies in just relaxing this requirement for the relayer:

  1. In the most conservative implementation, the relayer only relays if there has been any change in the BEEFY authority set.
  2. In a more aggressive implementation, the relayer has a threshold of keys that need to change in the authority set before they relay a new commitment (say for instance, 5% of keys). This may require some remapping of indices from the signatures in the SignedCommitment produced by the new BEEFY authority set to the indices of the overlapping authorities in the merkelized validator set so that this can be verified in isValidatorInSet(...). Note that this remapping should take place in the relayer: both BEEFY voting and the BEEFY light client should be oblivious to it.
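To illustrate the shape of that remapping, here is a rough sketch. It is written in Solidity only for uniformity with the other snippets in this thread; in practice this logic would live in the relayer, and all names are placeholders:

```solidity
// SPDX-License-Identifier: Apache-2.0
pragma solidity ^0.8.20;

/// Illustrative sketch: given the addresses of the validator set last known on
/// Ethereum (VSLKOE) and the addresses that actually signed the new
/// SignedCommitment, build a bitfield ordered by the VSLKOE so the light
/// client's existing bitfield checks and isValidatorInSet(...) proofs line up.
library BitfieldRemapSketch {
    function remapToVslkoeOrder(address[] memory vslkoe, address[] memory signers)
        internal
        pure
        returns (bool[] memory bitfield)
    {
        bitfield = new bool[](vslkoe.length);
        for (uint256 i = 0; i < vslkoe.length; i++) {
            for (uint256 j = 0; j < signers.length; j++) {
                if (vslkoe[i] == signers[j]) {
                    bitfield[i] = true; // this signer overlaps with the VSLKOE at index i
                    break;
                }
            }
        }
    }
}
```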

@yrong
Contributor

yrong commented Feb 2, 2024

currentValidatorSet = nextValidatorSet;

Does this line also require a change? Assuming there is no BEEFY authority change from session 4 to 6 and we skip these mandatory commitments, then instead of switching the (current, next) pair from (3,4) to (7,8), the next pair will be (4,8).

@Lederstrumpf
Contributor Author

currentValidatorSet = nextValidatorSet;

Does this line also require a change? Assuming there is no BEEFY authority change from session 4 to 6 and we skip these mandatory commitments, then instead of switching the (current, next) pair from (3,4) to (7,8), the next pair will be (4,8).

good observation, and an important point to discuss:

Does this line also require change?

I reckon we don't need to change this line, but the minimal change does have a caveat:

TL;DR: after the session hop, the next accepted commitment will have to be on the is_next_session path too, i.e. it would be a commitment from session 8 in your example. So nothing changes if we are just moving between mandatory commitments.

Assuming there is no BEEFY authority change from session 4 to 6 and we skip these mandatory commitments, then instead of switching the (current, next) pair from (3,4) to (7,8), the next pair will be (4,8).

This is correct. In your example, after the update from a commitment from session 7, the client will deem session 4 to be the current one, but it will have updated the latestBeefyBlock to the one from a commitment in session 7.
Commitments from any sessions later than session 4 will trigger the is_next_session path, hence will be checked against the signatures of session 8's authorities. As such, without assuming any overlap in session authorities (i.e. the "worst" case), the light client will only accept commitments from session 8.
These then trigger the is_next_session path, and will update the (current, next) pair to (8, 9), at which stage the light client logic will operate exactly as if it had arrived there from consecutive session changes (i.e. current behavior).

Before seeing a commitment from session 8, the light client would still see the authorities from session 4 as the current ones, but since submitInitial(...) already checks that commitment.blockNumber > latestBeefyBlock, commitments older than the session-7 one will be rejected, hence the light client's state can never regress below the latestBeefyBlock set by the commitment from session 7.

Does this line also require a change?

The caveat of the minimal change (as it is now) is that the light client will ignore any commitments from session 7, unless we remap signatures from session 8 to it (in which case they would also trigger the is_next_session path, since the light client still compares to session 4).
For skipping sessions, I reckon this is acceptable:

  1. if you jump directly to session 7 while it is still the current session, then unless you remap signatures from the authorities from session 8, you'd have to wait for session 8 to send any new commitment.
  2. if you want to jump to session 7, just send a commitment for session 6 first

If you want to jump without skipping the !is_next_session path for a session, then you would have to update the commitment's leaf to also include the current authority.
It's an option to consider, and we may want that for other reasons eventually too, but I reckon it's not necessary at this stage.
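To make the double hop concrete, here is a minimal sketch of the handover update described above; all names are placeholders rather than the actual BeefyClient.sol code, and commitment verification against the next set is assumed to have already happened:

```solidity
// SPDX-License-Identifier: Apache-2.0
pragma solidity ^0.8.20;

/// Illustrative sketch of the double-hop behavior discussed above.
contract HandoverSketch {
    uint128 public currentValidatorSetID; // e.g. 3
    uint128 public nextValidatorSetID;    // e.g. 4
    uint64 public latestBeefyBlock;

    function applyHandover(
        uint128 commitmentValidatorSetID,
        uint128 leafNextAuthoritySetID, // announced in the commitment's MMR leaf
        uint64 commitmentBlockNumber
    ) external {
        // Relaxed rule: strictly newer, not necessarily consecutive.
        require(commitmentValidatorSetID > currentValidatorSetID, "not strictly newer");
        // This is why stale commitments (e.g. from sessions 5-7 after the hop)
        // can never roll the state back.
        require(commitmentBlockNumber > latestBeefyBlock, "stale commitment");

        currentValidatorSetID = nextValidatorSetID;  // the unchanged line under discussion
        nextValidatorSetID = leafNextAuthoritySetID; // e.g. 8, from session 7's leaf
        latestBeefyBlock = commitmentBlockNumber;
    }
}
```

Starting from (3, 4), a verified commitment from session 7 whose leaf announces set 8 leaves the pair at (4, 8); the next accepted handover, necessarily from session 8 in the no-overlap case, then normalizes it to (8, 9).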

@yrong
Contributor

yrong commented Feb 4, 2024

Thanks @Lederstrumpf for the detailed explanation; it all makes sense.

I just checked the last 10 historical beefyMmrLeaf.beefyAuthorities entries on the Kusama network and they all seem to be the same, so I prefer to start with the conservative implementation (only relay if there has been any change in the BEEFY authority set) as you mentioned. I've added a companion PR with the relayer changes in #1139 and welcome comments.

We may improve towards the aggressive implementation (with the threshold and index remapping) later.

@alistair-singh
Contributor

alistair-singh commented Feb 5, 2024

It's mandatory in BEEFY to sign the first block of the new session, i.e. where the new authority set kicks in.

I cannot reproduce all validators signing the mandatory commitment in our local testnet. It seems validators stop once 2/3rds have signed the mandatory commitment. This can be an issue:

3 steps that the relayer performs:

  1. submitInitial where the relayer creates an initial bitfield with the validators who signed.
    • Here we make sure that 2/3rds of the validators have signed. We MUST provide a bitfield with at least 2/3rds. The issue is that the 2/3rds that have signed are not necessarily the 2/3rds that are common with the VSLKOE. This means the bitfield will contain some validators from the VSLKOE and some validators from the current session.
  2. commitPrevRandao where it randomly subsamples the initial bitfield.
    • Here the light client will choose validators from the bitfield provided, which may include validators that have signed but are not in the VSLKOE.
  3. submitFinal where it validates the random subsample.
    • This will fail validation for any chosen validator that is unknown to the light client.
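A minimal sketch of why step 3 fails, with placeholder names rather than the actual BeefyClient.sol code: each sampled signer must prove membership in the validator set the light client knows (the VSLKOE), so a signer from the current session that is not in the VSLKOE has no valid proof and the submission reverts:

```solidity
// SPDX-License-Identifier: Apache-2.0
pragma solidity ^0.8.20;

/// Illustrative sketch of the membership check behind submitFinal; the real
/// BeefyClient also verifies ECDSA signatures and bitfield counts.
library SubsampleCheckSketch {
    /// The light client only knows the VSLKOE's Merkle root, so a sampled
    /// signer from the crafting session that is not in the VSLKOE cannot
    /// produce a proof that hashes up to that root.
    function verifySampledSigner(
        bytes32 vslkoeRoot,
        address signer,
        uint256 index,
        bytes32[] memory proof
    ) internal pure returns (bool) {
        bytes32 node = keccak256(abi.encodePacked(signer));
        for (uint256 i = 0; i < proof.length; i++) {
            node = (index & 1) == 0
                ? keccak256(abi.encodePacked(node, proof[i]))
                : keccak256(abi.encodePacked(proof[i], node));
            index >>= 1;
        }
        return node == vslkoeRoot; // false for signers outside the VSLKOE
    }
}
```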

Recovery depends on:

  1. 2/3rds of current validators in VSLKOE.
  2. VSLKOE makes up a large portion of the 2/3rds that sign the mandatory commitment.
  3. Chance that the light client subsamples N signatures that are also from the VSLKOE.

So potentially the recovery would need to retry until getting a PrevRandao that chooses validators from the VSLKOE.

We may need a new call on the light client, specifically for recovery, that uses the same principles as this commit.

@bhargavbh

bhargavbh commented Feb 6, 2024

Hi @alistair-singh, BEEFY letting only 2/3rds+1 validators sign in a round seems problematic for this change; the modification requires some more thought.
From the spec, the claim that validators stop voting in a BEEFY round once they see 2/3rds+1 votes seems to be the case. We should, however, confirm this by looking at the code as well. This implies that the session in which the recovery commitment is crafted not only needs a 2/3rds overlap, but the same 2/3rds that overlap with the VSLKOE must also be among the first ones to sign. Even if only a few validators outside the overlapping set sign in a BEEFY round, they would take up the space of the overlapping validators' signatures, which are the ones we really want (for the recovery). We will get back to you once we take a closer look into the issue.

@bhargavbh

I am not so sure of the concerns about the light client making multiple attempts to subsample the right set of validators. We first make a mapping between validators from recovery_crafting_session and VSLKOE. The light client now would subsample only from bitfields (obtained after the mapping transformation) that are set to 1 in commitPrevRandao. I fail to see how this would be different from any other subsampling situation. If we have the signatures of the overlapping 2/3rd in VSLKOE (which may not be the case due to the issues described above), then subsampling should not be a problem. To conclude, subsampling would only be an issue in conjunction with the issue of "2/3rd current validators in VSLKOE", and is not independently an issue. Is my understanding right?

@alistair-singh
Contributor

The light client now would subsample only from bitfields (obtained after the mapping transformation) that are set to 1 in commitPrevRandao. I fail to see how this would be different from any other subsampling situation.

So the subsample will fail if it chooses a validator not in the VSLKOE. We have to include validators that are not in the VSLKOE when calling submitInitial to pass the check that 2/3rds have signed the commitment. This makes it unlike a normal subsample scenario, where any of the sampled signed validators would validate successfully.

To conclude, subsampling would only be an issue in conjunction with the issue of "2/3rd current validators in VSLKOE", and is not independently an issue. Is my understanding right?

Yes. But I would say the issue is the commitment not being signed by 2/3rds of the VSLKOE, rather than "2/3rd current validators in VSLKOE". Random sampling that might fail in this context will make it hard to predict the time and cost of recovery.

@bhargavbh

Thanks for clarifying. Agreed! Indeed the subsampling issue stems from the fact that 2/3rds of the VSLKOE validators are not signing the crafted commitment. This in turn is a result of validators not signing a commitment in a BEEFY round once they see 2/3rds of signatures from other validators. At an initial glance, the BEEFY rule of moving to the next round once 2/3rds of signatures are seen is there for efficiency reasons, and is not necessarily needed for safety/liveness. One approach could be relaxing the rule, but there are several trade-offs between the efficiency of BEEFY and redundant signatures that are useful only in case of bridge recovery. We also have to take into consideration how complex the changes are on the BEEFY side. We will put some more thought into it and get back.

@bhargavbh

We had a discussion internally and this is our opinion.
I use the terminology "crafting validator set" to refer to the set in which the commitment is being crafted to move the syncing forward.

The BEEFY rule that makes validators skip signing a commitment if they have already seen 2/3rds of other signatures can actually be relaxed, and this seems like a good idea, especially for mandatory blocks.
Regardless of the above change, it still makes sense to go ahead with the changes on the light client side suggested in this PR. The reason is as follows:

  • the validator sets usually remain the same after session changes, so the VSLKOE and the CVS are the same within each era (1 era equals 6 sessions, and there are small changes to the authority set at every era change). This assumption breaks if validators set new keys on session changes, but it is very rare that they do so. So the light client can possibly skip sessions within the same era. This helps bridge efficiency, but also recovery in cases where the commitment is crafted within the same era.

It is probably an incremental improvement and does not completely solve the recovery problem.

Future Plans:
Potentially we change the BEEFY voting rules such that the mandatory blocks in each session are signed by all honest validators. This is currently not the case: mandatory blocks are always BEEFY-finalised, but not necessarily signed by every honest validator (an honest validator can skip signing a mandatory block if it has already seen 2/3rds). This would lead to a more efficient bridge, as sessions across eras could be skipped (as long as there is a 2/3rds overlap between the VSLKOE and the CVS). However, we probably should not wait for this change, at least for the current launch.

@alistair-singh
Contributor

Thanks @bhargavbh, I will discuss moving forward with this merge with the team. But I think the main blocker for me is that we need some unit tests for this scenario. I will probably have to add these tests, as I know how to generate data from our testnet.

@Lederstrumpf
Contributor Author

Regardless of the above change, it still makes sense to go ahead with the changes on the light client side suggested in this PR. The reason is as follows:

  • the validator sets usually remain the same after session changes, so the VSLKOE and the CVS are the same within each era (1 era equals 6 sessions, and there are small changes to the authority set at every era change). This assumption breaks if validators set new keys on session changes, but it is very rare that they do so. So the light client can possibly skip sessions within the same era. This helps bridge efficiency, but also recovery in cases where the commitment is crafted within the same era.

Some data to support our opinion here:

The conservative change proposed above performs no remapping and sends a commitment whenever the auth set's keys effectively change (note that the BEEFY auth set id changes on every session via paritytech/substrate#11269, irrespective of key changes).
The GRANDPA authority set id can also change on every session, but only does so when some keys in the auth set change, i.e. it provides exactly the indicator we're looking for wrt. the frequency of auth set changes:

Historic stats:

Kusama

grandpa.currentSetId: 8461
session.currentIndex: 36760
➡ grandpa auth set changes only every 4.34 sessions on average

Polkadot

grandpa.currentSetId: 1809
session.currentIndex: 8110
➡ grandpa auth set changes only every 4.48 sessions on average

So even with only the conservative change for session skipping, the expected operational cost for relaying commitments for altruistic relayers* is less than a quarter of the current cost.
I.e. the expected cost** of operating Kusama (at current gas & ETH prices ~ $400k per year for only mandatory commitments) would drop to under $100k per year, i.e. less than what we'd have expected to spend on Polkadot before (which would now be under $25k).

*: i.e. the ones relaying commitments only for keeping the bridge alive - if more demand warrants more frequent commitment updates: great!
**: All numbers subject to ETH not heading into a congested bull run ;)
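For reference, the arithmetic behind these estimates (the roughly $100k/year Polkadot baseline is inferred from the text above rather than stated explicitly):

$$\frac{36760}{8461} \approx 4.34 \;\Rightarrow\; \frac{\$400\mathrm{k/yr}}{4.34} \approx \$92\mathrm{k/yr} \quad\text{(Kusama)}$$

$$\frac{8110}{1809} \approx 4.48 \;\Rightarrow\; \frac{\approx\$100\mathrm{k/yr}}{4.48} \approx \$22\mathrm{k/yr} \quad\text{(Polkadot)}$$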

@Lederstrumpf
Contributor Author

But I think the main blocker for me is that we need some unit tests for this scenario. I will probably have to add these tests, as I know how to generate data from our testnet.

If you send me / push some additional finality proofs from your testnet à la https://github.com/Snowfork/snowbridge/tree/a223f97a2c21f14a83aa9786644f9e81f94c1fce/contracts/test/data/beefy-final-proof-0.json, I'm happy to add the unit tests I had in mind wrt. the double hop behavior.

Before seeing a commitment from session 8, the light client would still see the authorities from session 4 as the current ones, but since submitInitial(...) already checks that commitment.blockNumber > latestBeefyBlock, commitments older than the session-7 one will be rejected, hence the light client's state can never regress below the latestBeefyBlock set by the commitment from session 7.

E.g. this should be tested.

@claravanstaden
Contributor

Hey @Lederstrumpf! I am closing this PR for now, since it looks like we won't address this until launch. Happy to reopen once we have decided to make this improvement.

@vgeddes
Collaborator

vgeddes commented Feb 22, 2024

Oops sorry @Lederstrumpf @bhargavbh we closed this too hastily. I am keen on exploring this so that we can add it post launch. It seems we are going to actually launch within the next few weeks. This means that for better or worse, we now have a soft code freeze.

The gas savings introduced here are compelling. I'm just a bit worried about the complexity arising in this thread #1137 (comment) (remapping of validator sets).

After adding your improvements, the existing currentValidatorSet = nextValidatorSet; line in our algorithm seems to introduce a lot of chaotic behavior that is not simple to follow. So I think we need to change our algorithm, so that the new rules are codified in some sort of explicit state machine.

@vgeddes vgeddes reopened this Feb 22, 2024
@alistair-singh
Contributor

E.g. this should be tested.

So here is the documentation for re-generating the test data.

  1. You can generate commitments by following this section of the README:
https://github.com/Snowfork/snowbridge/blob/main/contracts/README.md#test-fixtures
  2. Then run the following commands in order:
https://github.com/Snowfork/snowbridge/tree/main/web/packages/test-helpers#regenerate-beefy-test-data

But it’s not useful on its own. It needs to be modified to support this case. The process goes like this:

  1. Copy a live commitment to test/data/beefy-commitment.json, then run pnpm generateBeefyValidatorSet. This will create a new validator set with 300 keys, but re-use the validator set ID and the commitment hash from the live commitment. The output of this process is ~/contracts/test/data/beefy-final-bitfield-0.json, which basically contains a list of validators that have signed the commitment and the validator set root.
  2. Run (cd ../../../contracts && forge test --match-test testRegenerateBitField), which runs the subsampling algorithm based on PREVRANDAO. The output of this process is contracts/test/data/beefy-final-bitfield-0.json, the bitfield containing the chosen validators that will be used for signature validation.
  3. Run pnpm generateBeefyFinalProof, which takes in the bitfield from step 2 and then creates the final payload for submitFinal.

Now the above 3 steps generate data for a single SubmitInitial/SubmitFinal cycle built from the same commitment and validator set from step 1 above. There are two sets of data for two cases which we have specialized files for.

  1. Files ending with -0.json are specialized for the case where the dynamic signature count is 0.
  2. Files ending with -3.json are specialized for the case where the dynamic signature count is 3.

The two cases above have two tests each:

  1. Accepting a commitment, when not a handover, is successful.
  2. Submitting a commitment, when a handover, is successful.

The test case you will need will have to exercise multiple SubmitInitial/SubmitFinal cycles through the light client. There will be 3 cycles of SubmitInitial/SubmitFinal:

  1. An initial commitment which you can re-use -0.json files for.
  2. A commitment that jumps forward some validator sets ahead. You will need to generate a new test/data/beefy-commitment.json from the testnet by waiting for a few validator sets.
  3. A commitment that comes after the jump to normalize the current and next validator set ids in the light client.

Within those 3 cycles of SubmitInitial/SubmitFinal we would probably want to test the following cases which will also need to be generated.

  1. Jump ahead and then process a non-handover. From validator set 6 -> 8 -> 8 (skipping 7).
  2. Jump ahead and then process a handover. From validator set 6 -> 8 -> 9 (skipping 7).
  3. Jump ahead and then jump ahead again. From validator set 6 -> 8 -> 10 (skipping 7, 9).
  4. A test to make sure we cannot accept a commitment with a lower block number.
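A rough Foundry-style skeleton for the cases above; the contract name, fixture wiring, and assertions are placeholders to be filled in once the test data has been regenerated, not part of the actual test suite:

```solidity
// SPDX-License-Identifier: Apache-2.0
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

/// Skeleton for the jump-ahead scenarios listed above.
contract BeefyClientSkipSessionTest is Test {
    function testJumpAheadThenNonHandover() public {
        // Cycle 1: initial commitment from validator set 6 (reuse the -0.json fixtures).
        // Cycle 2: handover commitment signed by set 8, skipping set 7.
        // Cycle 3: another set-8 commitment; assert it is accepted and that
        //          latestBeefyBlock only ever increases.
    }

    function testCannotAcceptLowerBlockNumber() public {
        // After the jump, replay a commitment with a lower block number and
        // assert that submitInitial reverts.
    }
}
```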

@acatangiu

I'd like to see some rough on-chain data analysis supporting that this will make any difference in practice, regardless of approach. The bridge is getting good traffic and this will only increase; I expect there will be at least one P->E message every 4 hours anyway, so at least one finality relay will happen because of demand anyway.

The optimization is worth it if data shows inactivity periods where this actually saves money.
