
How to sign the stateroot? #1526

Open
Tommo-L opened this issue Apr 1, 2020 · 70 comments
Labels: Question

@Tommo-L
Contributor

Tommo-L commented Apr 1, 2020

Q1: Should stateroot be part of core or plugin?

As we discussed at last night's meeting, most people agreed that it should be part of the core to ensure data consistency across all CN nodes.

Q2: Which method should we use to sign state root?

This is still under discussion, there are some options:

  • Option A: Using a consensus message, binding the stateroot to the proposal block to complete the signature.
  • Option B: Each CN node just broadcasts its own stateroot signature directly.
  • Option C: Using a stateroot contract: the consensus node that sends the proposal block adds a stateroot transaction to it, just like the MinerTransaction in Neo 2.x.
  • Option D: Add the stateroot to the block header.

What do you think?
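To make Option D concrete, here is a minimal sketch (in Python, with invented field names; this is not the actual Neo header layout) of why a header-carried state root is automatically covered by anything that signs the header hash:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Header:
    """Toy block header; field names are invented, not Neo's real layout."""
    version: int
    prev_hash: str
    merkle_root: str
    prev_state_root: str  # Option D: state root carried in the header

    def hash(self) -> str:
        data = f"{self.version}|{self.prev_hash}|{self.merkle_root}|{self.prev_state_root}"
        return hashlib.sha256(data.encode()).hexdigest()

# Because the state root is hashed into the header, any CN signature over
# the header hash also commits to the state: change the root, change the hash.
h1 = Header(0, "00" * 32, "aa" * 32, "bb" * 32)
h2 = Header(0, "00" * 32, "aa" * 32, "cc" * 32)  # same block, different state
assert h1.hash() != h2.hash()
```

Light clients that only track headers then get state commitments for free, which is part of the argument made for this option later in the thread.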

@Tommo-L Tommo-L added the Question Used in questions label Apr 1, 2020
@Tommo-L
Contributor Author

Tommo-L commented Apr 1, 2020

For me, I think Option A is the best way at present, even though I don't like it. Consider that if the stateroot is to be provided to the outside, e.g. for SPV or cross-chain use, it should be verifiable, requiring the signatures of all the consensus nodes.

@shargon
Member

shargon commented Apr 1, 2020

They can send the current state root in the prepare request; it will be signed automatically. But then there is no difference from including it in the header.

@ZhangTao1596

They can send the current state root in the prepare request; it will be signed automatically. But then there is no difference from including it in the header.

They can sign header and stateroot separately.

@roman-khimov
Contributor

it should be verifiable, requiring the signatures of all the consensus nodes.

That's exactly why it should be in the header, because that's what CNs naturally agree upon and sign. Also, speaking of outside users, any P2P-distributed solution would require those users to implement P2P protocol support, which may not be something they want to do.

And to me that's also very important from the other perspective of ensuring users' trust in the network. This whole state is something that matters to users, and making the header contain the state hash gives a clear guarantee that whatever this state is, it won't suddenly change. If there is a need to change the state because of some bug or whatever, then it should be done in a clear and explicit manner in subsequent blocks (like described in #1287), not by rewriting history.

So, Option D, definitely.

@shargon
Member

shargon commented Apr 7, 2020

My favorite version is to send the StateRoot of the previous block in the proposal and include it in the header.

@Tommo-L
Contributor Author

Tommo-L commented Apr 8, 2020

Store in block footer?

|  header: version, hash, ... |
|  body: txs |
|  footer: stateRoot |

All consensus nodes sign the whole block, and set block.hash = header.hash.
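A rough sketch of the footer idea above (hypothetical serialization, just to illustrate the two different hashing scopes): the block id stays the header hash, while CN signatures are made over header plus footer, so the stateRoot is still signed:

```python
import hashlib

def sha256(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# Invented serialization: block.hash covers only the header, while the
# consensus signing digest covers header + footer, so the stateRoot in
# the footer is still covered by all CN signatures.
header = "version|prevhash|merkleroot|timestamp"
footer = "stateRoot=" + "bb" * 32

block_hash = sha256(header)               # block.hash == header.hash
signing_digest = sha256(header + footer)  # what the CNs actually sign

# The two scopes differ: a light client tracking only header hashes
# never sees the footer, which is the downside raised in the thread.
assert block_hash != signing_digest
```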

@ZhangTao1596

ZhangTao1596 commented Apr 11, 2020

We should consider VM upgrade compatibility before deciding where to put the root.
For example, if we accept the versioned VM proposed by @shargon , the same height will give the same result no matter which version the node runs. Then we can put the root anywhere without inconvenience.

@roman-khimov
Contributor

My favorite version is to send the StateRoot of the previous block in the proposal and include it in the header.

I've been thinking about it for a while, and although it has the obvious downside of state lagging one block behind, it still gives us clear, predictable behaviour: for a transaction in block N you get its state at block N + 1. And it fits our current transaction handling model nicely; it's very easy to implement. So it probably really is the best option we have at the moment.

Store in block footer?

Wouldn't that complicate life for light nodes/wallets? I think it would be nice for them to be able to operate just using headers. Also, stateRoot should probably be added to MerkleRoot calculation then (the same way ConsensusData is now), so I'm not sure what this move to footer would buy us.

For example, if we accept the versioned VM proposed by @shargon , the same height will give the same result no matter which version the node runs.

This versioning is essential IMO, but not just for VM, we should also consider whole contract execution environment like native contracts or syscalls (that might also be changed in some incompatible way for whatever reasons).

@Tommo-L
Contributor Author

Tommo-L commented Jun 12, 2020

Description
Previously, we stored the stateRoot in a P2P message for two main reasons:

  • If we separate state persistence from block persistence, the consensus node can quickly process transactions without waiting for state writes and smart contract execution. Originally posted at Detach state persistence and block persistence #302 (comment).
    (If we store the previous block's state root in the block header, this problem may also be avoided.)
  • As shown in the figure below, if a bug in the VM caused an incorrect calculation of 1+1=3, we want to fix it.

(figure)

For the above problem, there are 4 ways to deal with:

Solution1

Ignore the incorrect past results; we only correct things going forward.
(figure)

Effects

  • Layer1: neo-core needs to support different NeoVM versions and syscalls
  • Layer2+: do nothing
  • "Victim": has to face reality

Applicable scope:

  • Tiny bug
  • Tolerable impact
  • "Victim" < layer2+

Solution2

(figure)

Effects

  • Layer1: neo-core needs to support different NeoVM versions and syscalls
  • Layer2+: they all need to roll back; this may also cause double spends or loss of assets on layer 2+ (DEXes, exchanges, leasing, AMMs, etc.)
  • "Victim": will be happy

Applicable scope:

  • Serious problem
  • "Victim" > layer2+

Solution3

Roll back the blocks and re-execute them with the new version of the VM.

(figure)

Effects

  • Layer1: neo-core needs to support different NeoVM versions and syscalls, but this may have additional negative effects:
    • Transactions that failed in the past may now succeed.
    • Transactions that succeeded in the past may now fail.
    • Some related variables will also change.
  • Layer2+: they all need to re-check the status of their transactions, their account balances, their dApps, etc. It may lead to new problems, such as double spends or loss of assets on layer 2+ (DEXes, exchanges, leasing, AMMs, etc.).
  • "Victim": will be happy

Applicable scope:

  • Serious problem
  • Secondary impact is relatively small
  • "Victim" > layer2+

Solution4

(figure)

Effects

  • Layer1: just use the latest node to recalculate everything starting from the first block. But this may also have additional negative effects:
    • Transactions that failed in the past may now succeed.
    • Transactions that succeeded in the past may now fail.
    • Some related variables will also change.
  • Layer2+: they all need to re-check the status of their transactions, their account balances, their dApps, etc. It may lead to new problems, such as double spends or loss of assets on layer 2+ (DEXes, exchanges, leasing, AMMs, etc.).
  • "Victim": will be happy

Applicable scope:

  • Serious problem
  • "Victim" > layer2+

Summary

  • Solutions 1, 2 and 3 are easy to accept in some cases.
  • Solutions 1, 2 and 3 need neo-core to support different NeoVM versions and syscalls.
  • Solutions 3 and 4 may have additional effects, which are somewhat uncertain and unpredictable for users.

I think putting the state root in the block header is better than in a P2P message, as it will also benefit upgrades. What do you think?

@shargon
Member

shargon commented Jun 12, 2020

I vote for Solution 3

@vncoelho
Member

We discussed this before. @igormcoelho emphasized that failed transactions should not be converted to successful ones. Perhaps we need another solution, which is 3 with restrictions.
I will talk to Igor and recall the past discussions.

@roman-khimov
Contributor

Oh, it'd be so easy to discuss that in one room with a whiteboard, but anyway.

My take on solutions 3 and 4 is that they're dangerous, as they would change the state for old blocks, which I think is not acceptable for several reasons:

  • it breaks the state root: the recalculated state is going to be different
  • it's not cross-chain friendly, because if another blockchain refers to some state root and we change it, that link is lost
  • probably the most important: it breaks users' trust in the system; as a user I expect the state of the system to remain consistent no matter how the software changes, and all changes to the state must be explicit and traceable

Solution 2 basically has the same issues, but it also directly contradicts one of the core Neo features, one-block finality.

And I don't think that we're only left with solution №1; we should take it and amend it with one important addition from #1287, corrective transactions, to get solution number 5.

Solution 5
It would look something like this (sorry, it would take me like a week to redraw it following your style):
(figure: corrective-tx-flow)

It basically follows solution number one in that we're releasing VM version 2 and all blocks starting with some number N get processed using this new VM. But at the same time, if we know that the bug fixed in VM version 2 has caused some wrong states, we explicitly correct them with a corrective transaction in one of the subsequent blocks. So if accounts A1 and A2 have wrong balances, we adjust them with this transaction.

Obviously it's a very powerful (and dangerous) mechanism, so this transaction would be signed by the CNs or even the whole governance committee (now that we have it). And it would only be applied when there is a need to: if some wrong cryptocuties were generated because of a bug, we may as well decide that they're way too cute to change (and thus they just won't be corrected). But if there is a change in the state, the user would know why it happened, and at the same time it won't break any references to old state.
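A toy sketch of how such a corrective transaction could work (all names here are invented for illustration; the real thing would be a committee-signed on-chain transaction type): old state roots stay untouched, and the fix is itself an auditable entry in a later block.

```python
# Sketch of a "corrective transaction" (Solution 5). The wrong balances
# stay in history; the correction is an explicit, committee-signed delta
# applied in a later block, so every state change remains traceable.
state = {"A1": 1003, "A2": 997}   # wrong balances produced by a VM bug

corrective_tx = {
    "type": "Corrective",          # invented type name
    "deltas": {"A1": -3, "A2": +3},
    "committee_signed": True,      # stand-in for real signature checks
}

def apply_corrective(state, tx):
    # Only the committee may correct state; real code verifies signatures.
    assert tx["committee_signed"], "only the committee may correct state"
    for account, delta in tx["deltas"].items():
        state[account] += delta

apply_corrective(state, corrective_tx)
assert state == {"A1": 1000, "A2": 1000}
```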

Effects

  • Layer1: neo-core needs to add support for VM/syscall versioning and corrective transactions
  • Layer2+: will follow the corrective transaction changes; this may still cause some problems here, but IMO they'd be easier to solve than with any other solution because the state change is explicit.
  • "Victim": 👍

Applicable scope: just about anything.

@igormcoelho
Contributor

igormcoelho commented Jun 12, 2020

Nice discussion @Tommo-L , as always. I know many options are already on the table, but like @roman-khimov , I still feel some important information/options are missing for decision-making. Let me try to contribute.

General idea: divide deployed contracts into those that accept bugfixes (BUGFIX), whose states may eventually change, and those that don't (NOBUGFIX); if such a "tricky" situation is ever presented, the contract is either frozen or migrated to a new one (no state ever changes).

There are two things still left out of the reasoning: the "in-time" consensus perspective and the nature of contract programming itself.
Issues with In-time Consensus
As discussed a few times with Yong Qiang, we need the "in-time" requirement for consensus to work. We tend to forget that, but if we could go back in time and re-process blocks, those CNs that were elected in the past and later removed (due to bad behavior) would still have access to their "past private keys", meaning they could still collude (in the past) to generate bad blocks (in the past). So someone who is "left behind" may process bad data without any way to discover that, causing a hard fork that could only be resolved by a "world consensus" (suddenly some blocks would sync and others wouldn't, i.e. two realities). So, in my opinion, this definitively rules out Solutions 2, 3 and 4, leaving just Solutions 1 and 5 on the table.

Issues with Contract Programming
Now, I'll present my perspective on contract programming. We are used to C, C#, C++ and Java, but forget about the D language and others that support the concept of contract programming. Usually, contracts can basically be expressed with assert() statements (something we remove in Release builds for efficiency), but the blockchain world can benefit a lot from this perspective. Some time ago I argued that users should be able to defend themselves from any undesired state change, and proposed a draft NEP called Solid States. The idea is simple: if a transaction passes, it should always pass, and if it fails, it should always fail. So if some user wants to protect their assets (with or without state root hashes), they simply put an "assert()" in their "code" (any fail/exception-throwing opcode is enough for this); if this "assert()" guarantees that their balance will be 2000, then since the tx passed, they can rest assured that it will always be 2000 at that point. This is very good and, in my view, currently necessary on Neo2 for DEXes and any cross-chain mechanism.

What happens is that if users "abuse" this, and suppose they "assert()" every operation in their entry script, paying a tiny extra GAS cost, they could ensure that "known bad operations" are kept forever (for example, a user knows that the DIV operation does not fail on zero, so they quickly submit a tx with an assert that x/0 equals 5, and it passes, locking away our ability to fix that forever).
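A toy illustration of this "cheap assert" lock-in, using plain Python functions as a stand-in for VM opcodes (everything here is hypothetical, just to show the mechanism):

```python
# A user pins a buggy VM result (here, a hypothetical VM where 1 + 1 == 3)
# with an assert in their entry script. Once such a tx is accepted, fixing
# the bug would make the old tx fail retroactively under re-execution.
def buggy_add(a, b):
    return a + b + (1 if (a, b) == (1, 1) else 0)  # the "1+1=3" bug

def entry_script(add):
    assert add(1, 1) == 3   # user "locks in" the buggy behaviour
    return True

assert entry_script(buggy_add)   # passes under the buggy VM

def fixed_add(a, b):
    return a + b

try:
    entry_script(fixed_add)      # same tx fails once the bug is fixed
    raise RuntimeError("unexpectedly passed")
except AssertionError:
    pass
```

This is exactly the tension between Solid States (a passing tx always passes) and the ability to ever fix the VM.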
So, for me, we can progress more quickly in this discussion if we decide a few things regarding Neo3 contract programming logic:
(a) Do we want to verify states for every deployed contract, without any bugfix capability, or should we let the user choose?
(b) Do we want any possible "cheap user assert" to lock away our ability to bugfix anything in the future?

From my perspective (which I don't think has been discussed here before), here is my guess at what we should do:


Solution 6: We should allow users to mark deployed contracts as BUGFIX or NOBUGFIX.

  • If a contract is NOBUGFIX, it will have its state stored in the NEXT BLOCK (in practice, for efficiency, we take ALL NOBUGFIX contracts and create a Merkle tree just to aggregate those contracts' root states). This effectively resolves the problem for all NEP-X tokens that want cross-chain or DEX usage. If they exhibit some buggy behavior in some block, some day, their state will never be changed; that's the rule for them.
  • The other contracts, which are BUGFIX, will only have their states distributed via P2P, meaning that external entities should be aware that those operations may, in some unexpected future situation, have some state changed due to a bugfix.
  • Transactions that fail may unfail, or the opposite, since the entry script will not be protected from changes.
  • Deployed contracts should be renewed periodically (automatically renewed by default in the network), and this renewal may be blocked in the future if a contract is marked NOBUGFIX and exploits a known bug (this prevents intentional injection of bugs into the ecosystem). In this case, network governance would be able to push a migration, via a previously authorized operation "force_migration_by_network" that must exist in all NOBUGFIX contracts. This "force_migration_by_network" could also be used for future network migrations (such as Neo3 -> Neo4).

Now we come to the point: which contracts should be BUGFIX and which shouldn't? Here, we let users choose, and we let the Neo Foundation choose as well.

Ecosystem adjustments and Two NEO's
My opinion is that:

  • Neo/Gas native should be BUGFIX
  • NEP-X tokens on general should be NOBUGFIX (or a hybrid, explained in the end)

This also means that we can create an alternative NEP-5-NEO-TOKEN-NOBUGFIX, collateralized by native Neo, which is BUGFIX.
So, if some major theft (such as the ETH DAO) ever happens, native Neo will be fixed, but the NOBUGFIX version may suffer future losses in its global collateral balances.
This way, a DEX could trade NEO-NOBUGFIX without any worry about state changes, and if some fix needs to be made to native Neo, the community in general would need to decide what to do about the collateral in NEO-NOBUGFIX. Seems quite reasonable to me. As long as the collateral is low, the risk is low, and if something "strange" happens, the NOBUGFIX contract is temporarily locked for investigation.

Hybrid BUGFIX/NOBUGFIX
Finally, I've also defended, some time ago, a contract inheritance strategy, meaning that we should be able to inherit from native Neo/Gas logic, thus creating a hybrid BUGFIX/NOBUGFIX token, where its balance logic may undergo changes (and fixes) like native Neo, while preserving its governance logic as rock solid. In this case, it should be marked as BUGFIX, otherwise it would become inconsistent when its "base" changed.
For proper "logic", a NOBUGFIX contract would only access/invoke other NOBUGFIX contracts, meaning that it could only receive Neo from a NOBUGFIX NEO... so the process of "minting" would probably involve transforming your "volatile" Neo into a "non-volatile" Neo and using that for minting. The Neo blockchain itself could provide a "non-volatile" version and do this "automatically", as long as the holding logic for the "possibly lost future assets" is clear on NOBUGFIX NEO (the challenge of someone else coding this is precisely the limitation that NOBUGFIX cannot invoke BUGFIX).
To implement this "bridge", NOBUGFIX Neo could invoke some oracle-like operation, like getCurrentStateForContractX, that would append to the tx header the P2P state at that time (from the speaker node, which would also be validated by other nodes when putting the tx in a block), giving "allowance" for "unsafe Neo" to become "solid Neo". If state changes in the future, what happens is a discrepancy in what is effectively stored in the NOBUGFIX Neo collateral balance. This NOBUGFIX NEO could also be FRACTIONAL, unlike original Neo, and could eventually store some Neo for future insurance on the collateral (in case the backing state is no longer the state, without exposing this to other platforms).

My opinion is that minting when raising funds in a NOBUGFIX contract could be done this way, accepting "volatile" BUGFIX Neo, even if the contract token itself is NOBUGFIX. The only risk is such a state-changing unexpected event happening on the NEO token during crowdfunding, so losses may affect only the fundraising itself, at that particular time, without any risk of "contaminating" third parties (or DEXes), since states wouldn't change for that specific token.

My vote
So, I vote for this Solution 6, being operationalized by some "future fix operation" as proposed in Solution 5.

For the original question, we would need A (distribute global state via consensus P2P) and D (store in the block header the last state of NOBUGFIX contracts).
Regarding versioning: I don't think it's fully possible to version everything, but one small useful piece of versioning is the syscall list at least, so that introducing new syscalls won't break anything in the future.

Effects

  • Layer1: neo-core needs to add support for VM/syscall versioning and corrective transactions (for state changes on BUGFIX contracts, ONLY in rare situations where a Future Tx is used)
  • Layer2+: no problem if following NOBUGFIX contracts; and if these are disabled in the future, this will also be done properly/publicly on the blockchain, for the future only, without any state change. The issues are the same as Solution 5 if following BUGFIX contracts. Some DEXes may simply list only NOBUGFIX contracts, which is simpler to implement, so a NEO-TOKEN-NOBUGFIX would be very welcome.
  • "Victim": 👍 (if using NOBUGFIX, it's NOBUGFIX, so don't complain; and if someone else is using a hybrid BUGFIX contract and you prefer that, just use it. The "victim" can choose.)

@vncoelho
Member

vncoelho commented Jun 12, 2020

I agree with these points, Igor. I vote for Solution 6.

@ZhangTao1596

ZhangTao1596 commented Jun 15, 2020

I have a crazy idea here.

First, should we keep the storages and balances from vm-1.x?

I think yes. If we all agree, we can continue.

Solution 7: GetSnapshot from the network

After we implement MPT, we can use MPT proofs to verify storage. So we can even sync storages from other nodes.
We don't keep all the different execution environments in one node; instead we use the power of the whole Neo network and keep the different execution environments in the network.
It will act like this in vm-2.x:
(figure)

In the Neo network, vm-1.x storages are in node-1.x and vm-2.x storages are in node-2.x.
When node-2.x first syncs blocks with ver-1.x, it doesn't execute them but just verifies them.
When node-2.x starts executing blocks with ver-2.x, it syncs the storages it needs from other nodes and uses MPT to verify them.
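The sync-storage-with-proof idea can be illustrated with a simplified binary Merkle proof (the real structure would be Neo's MPT; this only shows the verification principle a node would use on storage items fetched from peers):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Simplified binary-Merkle stand-in for an MPT proof: a node that did not
# execute old blocks can still verify a storage item fetched from a peer,
# as long as it trusts the state root.
def verify_proof(leaf: bytes, proof: list, root: bytes) -> bool:
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

# Build a tiny two-leaf tree and check a proof for leaf_a.
leaf_a = b"key=balance,value=100"
leaf_b = b"key=nonce,value=7"
root = h(h(leaf_a) + h(leaf_b))

assert verify_proof(leaf_a, [(h(leaf_b), False)], root)
assert not verify_proof(b"key=balance,value=999", [(h(leaf_b), False)], root)
```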

And what will happen in vm-1.x?
(figure)
Maybe we can stop execution and persistence when syncing a higher-version block.
(figure)
Or we can even keep storage going by syncing.
(figure)

Advantages

  • No need to put all the versioning logic in one node.
  • No need to re-execute blocks and txs.
  • Keeps historical storages and balances.

Disadvantages

  • Keeps different node versions in the network.
  • Adds extra P2P messages to sync storage.
  • May execute slowly when persisting the first few blocks after an upgrade.

@ZhangTao1596

@erikzhang Can you please have a look at these solutions, which one do you prefer?

@erikzhang
Member

Option B is good to me.

@roman-khimov
Contributor

Solution 7

This one seems to be cross-chain compatible, but it still has the problem of state changing suddenly. It's like you have $1000 in your account at block N and now you have $100 at block N+1. You may wonder why that happened, but the only answer you get is "we've had a software upgrade". I think that for every state change there has to be a clear, traceable answer to that question of why.

Option B is good to me.

How are we going to solve cross-chain issues and the problem of synchronization between two chains in the same network (like #1702 (comment))?

And what's the advantage of it? If we're to return to the two main points reminded to us by @Tommo-L:

If we separate state persistence from block persistence, the consensus node can quickly process transactions without waiting for state writes and smart contract execution. Originally posted at #302 (comment)

CNs have to have an up-to-date state to participate in consensus; they can't do anything useful until they have this state (as can be seen even in the current neox-2.x implementation), so I don't see how detached state makes CNs more performant. In fact, it may even slow things down because of networking issues (or even completely break consensus for the same reason).

As shown in the figure below, if a bug in the VM caused an incorrect calculation of 1+1=3, we want to fix it.

And this one now has like 7 different solutions right in this thread.

I want to make sure we're doing the best we can for Neo 3.

@igormcoelho
Contributor

@KickSeason if I understood correctly, the issue I see with Solution 7 is precisely that we may not want to keep some states from vm-1.x, for example if they resulted in asset losses that were fixable in vm-2.x.

We already have 7 solutions, maybe we can agree on the basics first:

  • Why do we want to allow bugfixes? For me, the reason is: if native assets suffer a "sudden crazy change" because some poor implementation slipped into the VM/interop layer and went to production, we can fix it.
  • Why do we want to prevent bugfixes? For cross-chain and DEX use we want to ensure that past states are immutable, otherwise we can generate severe external inconsistencies.

@Tommo-L @KickSeason @roman-khimov @shargon and especially @erikzhang , is there any chance you agree with the idea of having bugfixing limited to some contracts (like native Neo) and some states stored in the header for other contracts (like some token that wants immutability)? (as presented in Solution 6)

We can find some hybrid final solution, as long as we agree on fundamentals that we want network to provide.

@vncoelho
Member

@KickSeason,

After we implement MPT, we can use MPT proofs to verify storage. So we can even sync storages from other nodes.
We don't keep all the different execution environments in one node; instead we use the power of the whole Neo network and keep the different execution environments in the network.

That is a good point! Nice insight.
It is possible to sync storages and then just check and sync with the current P2P broadcasts from the validators. In this way, you can speed up syncing considerably.

@roman-khimov
Contributor

Why do we want to allow bugfixes?

Because there is some intention we have when writing software, and the source code is just an attempt at formalizing this intention. It can be a nice attempt, but it still (quite often) happens that the formalization is not exactly what we intended. A bit shorter version: there are bugs.

Why do we want to prevent bugfixes?

Because "we" are the ones exploiting this bug? Sorry, but that's probably the only possibility I can think of. In general, I think that once a bug is known, most people would want to fix it, because their intentions didn't include this bug. And to be fair, that's actually why I think marking contracts as bugfix-impossible (part of solution number 6) won't be used much, and thus one generic bugfixing path is enough.

But at the same time, IMO it's not correct to make a direct relationship between allowing/disallowing bug fixes and allowing/disallowing state root changes. It is possible to fix bugs and not change old state roots at the same time, and to make any state changes resulting from bug fixes traceable.

We can find some hybrid final solution, as long as we agree on fundamentals that we want network to provide.

We should, and therefore I should also note that I'm basing this on the following set of expected characteristics:

  • ability to fix bugs (otherwise the chain will only live until the first serious bug is found)
  • a 1:1 immutable relationship between a block and the state of the chain (and 100% reproducibility of this state when processing the chain from the genesis block, otherwise it's not possible to reliably refer to the state of the chain)
  • the state is only changed by blocks (yes, we're talking blockchain here)
  • any change in the state must be fully auditable (no sudden changes in the state out of nowhere; the chain should explain the change by itself)

Basically, it's all about making the behavior of the system predictable.

@ZhangTao1596

This one seems to be cross-chain compatible, but it still has the problem of state changing suddenly. It's like you have $1000 in your account at block N and now you have $100 at block N+1. You may wonder why that happened, but the only answer you get is "we've had a software upgrade".

@roman-khimov When node-2.x persists the first block after the upgrade, it uses the new execution logic but the storages are from node-1.x. Why is there a sudden change? If there is a change, it must happen in some tx in block N + 1.

@roman-khimov
Contributor

When node-2.x persists the first block after the upgrade, it uses the new execution logic but the storages are from node-1.x. Why is there a sudden change?

Ah, maybe I've misunderstood this one a little. So basically it's the same as solution number 1, it's just that the new node doesn't contain the logic for VM 1.x and the only way to get the state for old blocks is via P2P from old nodes?

@ZhangTao1596

When node-2.x persists the first block after the upgrade, it uses the new execution logic but the storages are from node-1.x. Why is there a sudden change?

Ah, maybe I've misunderstood this one a little. So basically it's the same as solution number 1, it's just that the new node doesn't contain the logic for VM 1.x and the only way to get the state for old blocks is via P2P from old nodes?

Yea!

@roman-khimov
Contributor

OK, thanks for clarifying that. For some reason my first impression was that new (VM 2.x) nodes bring the new state as if they were running from the genesis block.

But then solution number 7 (S7) has the same basic characteristics (limited scope) as solution number 1 (S1) and is mostly concerned with questions of compatibility and maintenance: where S1 keeps the VM 1.x code in the node, S7 makes the node cleaner by removing it and relying on the network to get proper states for the VM 1.x epoch.

It's a nice hack, but at the same time I think that shifting this maintenance burden from the node code to node instances is a bit problematic, as nothing guarantees long-term node-1.x existence in the network. And we could have like 10 VM versions, so there would have to exist nodes for each VM version, and someone would have to maintain them. And what if some non-VM bug needed to be fixed? We'd have to update all 10 versions of the node, and it would be hard to ensure we're not breaking anything. I think in practice this would outweigh the effort required to maintain compatibility in one code base; a single code base is more reliable, and it's trivial to test the code against known behavior (the state root should match).

This mechanism of P2P state sharing may be useful for some other purposes, though.

@Tommo-L
Contributor Author

Tommo-L commented Nov 13, 2020

Some other options we can consider in the future:

  • SMT
  • MPT with binary tree
  • IAVL tree
  • lightweight checksum in header and put state root outside the header
  • ....

@roman-khimov
Contributor

Some other options we can consider in the future:

These are different approaches to how to calculate the state, and I'm open to any better suggestions on this topic. But this is separate from the question of where to store the state data, which is the most important one. I think the advantages of a header-included state root are well covered here, so let's concentrate on possible negative effects. Performance and size concerns are now ruled out by #1526 (comment) (and #1526 (comment)), so we have just two problems left:

  • repairing the state without regenerating the blocks
  • block generation may stop frequently in the future

Repairing the state
First of all, this claim is based on the assumption that the state root chain (if it is separate from the block chain) can be replaced and thereby "repaired". But in fact doing so creates more problems than it solves:

  • cross-chain references become invalid
  • the network becomes split on state (old nodes will still have the old one, requiring an update and resynchronization)
  • state finality is broken (Neo advertises one-block finality, but what's the value of that if we can't guarantee state finality?)

It essentially is a fork, albeit in the state chain, and forks are better avoided (at least in Neo's case). It is also changing the rules of the game while playing, which is not good practice. Instead, if we're in a situation where we need to repair the state, we need to... actually repair it. Not replace it, not rerun old transactions under new rules, not fork the state (or block) chain, but repair the state we've got. And this can be done following the scheme from #1526 (comment), with explicit committee-signed corrective transactions that will fix the state. It of course also needs execution environment versioning, but that's not a big problem either.

Potential consensus failures
It is true that consensus can fail if CNs do not agree on the current state. But sorry, that is a feature.

First of all, can a state difference cause consensus to fail now? Yes, it can. There is a potential for this even now, without any state roots. If CNs disagree on the GAS contract state they can fail the primary node's proposal validation, and depending on mempools this situation can happen again and again after view changes. But good luck debugging that without clear state data.

Second, how likely are we in practice to see a complete consensus failure (network stop)? Not very. For it to happen we need at least f + 1 nodes holding a state different from the majority of CNs. If we're talking about a single implementation running on all nodes, that means f + 1 nodes failing in a strange way (alive, but with incorrect state) simultaneously, which is not very likely. If we're talking about different implementations, then we'd need at least f + 1 non-C# nodes on the network, and I just don't see that at the moment. If that is ever the case, I think we'll have mature enough implementations by then to make it even less likely. What's more likely is for one node (irrespective of its implementation) to go wild, which is quickly detected and fixed. That's exactly what we want for the network's reliability: quickly detect misbehaving CNs and fix them.
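The f + 1 figure follows from the standard BFT bound used by dBFT: with n validators the protocol tolerates at most f = (n - 1) / 3 (integer division) faulty ones. A quick arithmetic check:

```python
def max_faulty(n: int) -> int:
    """Max faulty nodes f tolerated by BFT consensus with n validators."""
    return (n - 1) // 3

# With Neo's 7 consensus nodes, f = 2, so at least f + 1 = 3 nodes would
# have to hold a divergent state before consensus could stall entirely.
n = 7
f = max_faulty(n)
assert f == 2 and f + 1 == 3
```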

Third, even if we imagine a CN state split, the question is: is it better to keep running the network with unknown state, or to fix the problem and continue running a properly synchronized network? This is exactly the case where stopping the network is appropriate, because doing so can prevent bigger problems (like the ones requiring state repair). Remember #1989: it's not a critical bug (luckily), but it is a perfect demonstration of a problem that can be detected and prevented by (different implementations of) CNs exchanging their state during the consensus process. Continuing to run the network unsynchronized can potentially lead even to accepting wrong blocks. And that is a catastrophic scenario for the network, especially one declaring one-block finality. It must be prevented even at the cost of a network outage.

So I'm absolutely sure that exchanging state data during the consensus process (and including it in the header) actually improves the network's reliability. We're likely to see some CNs go wild occasionally, but that will be a signal: either the CN needs to be fixed, or there is an implementation problem, or there is some protocol problem that caused it. Without this signal we're just blind to these problems. We're not likely to see f + 1 CNs with different state, but if we ever do, there will be some reason for it, and we'd better deal with that reason rather than, again, crossing our fingers and hoping it's fine.
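The "signal" is easy to extract once CNs exchange roots: compare each node's reported root with the majority root and flag the outliers. A minimal sketch (node names and the plain-dict exchange format are illustrative, not actual consensus messages):

```python
from collections import Counter

def diverged_nodes(roots: dict) -> list:
    """Given node -> reported state root, return nodes off the majority root."""
    majority_root, _ = Counter(roots.values()).most_common(1)[0]
    return [node for node, root in roots.items() if root != majority_root]

# Three nodes agree, one has gone wild: it is immediately identifiable.
reported = {"cn1": "aa11", "cn2": "aa11", "cn3": "aa11", "cn4": "bb22"}
assert diverged_nodes(reported) == ["cn4"]
```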

On the importance of this issue for Neo 3
IMO this issue is absolutely critical to solve before the Neo 3 launch precisely because it is a cornerstone: it's easy to have it from block 0 and not that simple to add it at any block after the genesis. It directly affects the header format, and rolling out this change later would require updating every client/node implementation while keeping backwards compatibility at the same time (we're not starting a new chain because of this). It can be done, but it's better avoided, which is why I think leaving this question open for post-3.0 is not an option. There are questions that can easily be left for post-3.0, but not this one.

@roman-khimov (Contributor)

Regarding this and #2905/#2974. The way to do this is via block version 1 with an additional field that works the same way the NeoGo StateRootInHeader option does. This means calculating and storing the MPT root for the block. But if we want 3.7 to be the last version requiring resynchronization, then we have a problem, since the MPT is only calculated by the StateService plugin, which 99.9% of people don't use. Even if we integrate it into the core in 3.8, this would still mean a resynchronization at least to get the state. So either we have 3.8 with a resync, or we should integrate MPT calculation into 3.7 (even though the root is not a part of the block yet).

@shargon (Member) commented Mar 5, 2024

> Regarding this and #2905/#2974. The way to do this is via block version one with an additional field that works the same way NeoGo StateRootInHeader option does. This means calculating and storing MPT root for the block. But if we want 3.7 to be the last resynchronized version then we have a problem since MPT is only calculated by the StateService plugin which 99.9% of people don't use. Even if we're to integrate it into the core in 3.8 this would still mean resynchronization at least to get the state. So either we have 3.8 with resync or we should integrate MPT into 3.7 to be calculated already (even though it's not a part of the block yet).

We will need to resync in 3.8 or we will delay 3.7 forever...

@roman-khimov (Contributor)

A non-resynchronizing update can be performed with a trick:

  • calculate current MPT over existing KV data stored (will take some time, but it's done once)
  • process new blocks updating MPT in a regular way
  • switch to v1 block at some HF and have stateroot in the block header
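The last step of that trick can be sketched as a version switch keyed on a hardfork height: below it, headers keep the old format; at and above it, they carry the stateroot field. The height, field names, and dict-based "header" below are all hypothetical, just to show the shape of the switch:

```python
HARDFORK_HEIGHT = 1000  # assumed activation height, not a real network value

def make_header(height: int, mpt_root: str) -> dict:
    """Emit a v0 header before the hardfork, a v1 header (with root) after."""
    if height >= HARDFORK_HEIGHT:
        return {"version": 1, "height": height, "stateroot": mpt_root}
    return {"version": 0, "height": height}

h0 = make_header(999, "abc")
h1 = make_header(1000, "abc")
assert "stateroot" not in h0 and h1["stateroot"] == "abc"
```

Since the MPT is already maintained incrementally by the time the hardfork height is reached, no resynchronization is needed at the switch itself.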

@roman-khimov (Contributor)

Prototyped in nspcc-dev/neo-go#3566; works as expected. The block version switches at the given hardfork, and version 1 blocks then have one more field. The problem for us, though, is that we can't have #3175/#3178 in the same hardfork as version 1 blocks without breaking NeoFS network compatibility for C# nodes (the trick from #3175 (comment) won't work with stateroot-enabled blocks, it must happen somewhere before). So we're currently blocked by those PRs.

@roman-khimov roman-khimov modified the milestones: v3.8.0, v3.9.0 Dec 23, 2024