Broadcasting of invalid voluntary_exit messages to mesh peers #24

Open
cortze opened this issue May 16, 2024 · 3 comments

@cortze
Contributor

cortze commented May 16, 2024

Description

We've seen that after 2 to 2.5 hours of running, Hermes starts experiencing sudden spikes in GRAFT and PRUNE events affecting all topics.

Although we couldn't see any direct impact on the number of peers in each mesh, this is a clear concern: it could point to a decreasing peer score, which could prevent us from establishing stable connections with other nodes in the meshes.

Because Hermes does not validate messages on any PubSub topic, our node may be forwarding invalid messages to its mesh peers, which lowers our score.

We have already observed this on our control Prysm node, where erigon/caplin peers have been sending invalid voluntary_exit messages.

time="2024-05-16 12:35:01" level=debug msg="Gossip message was rejected" agent="erigon/caplin" error="non-active validator cannot exit" gossipScore=-6182.725625534806 multiaddress="/ip4/120.31.71.167/tcp/55742" peerID=16Uiu2HAkzNLy2S3voLw3CFxET1kXYSZVLV6QwkHuP3RaDdGJSk2E prefix=sync topic="/eth2/6a95a1a9/voluntary_exit/ssz_snappy"
time="2024-05-16 12:35:01" level=debug msg="Gossip message was rejected" agent="erigon/caplin" error="non-active validator cannot exit" gossipScore=-6182.725625534806 multiaddress="/ip4/120.31.71.167/tcp/55742" peerID=16Uiu2HAkzNLy2S3voLw3CFxET1kXYSZVLV6QwkHuP3RaDdGJSk2E prefix=sync topic="/eth2/6a95a1a9/voluntary_exit/ssz_snappy"

Possible Solution

I suggest not subscribing to the voluntary_exit topic for now. Interest in debugging that particular topic is rather low, and the problem seems to be isolated to it.
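A minimal sketch of that quick fix, assuming the tool builds its subscriptions from a plain list of topic strings shaped like the one in the Prysm logs above (`/eth2/<fork_digest>/<name>/<encoding>`). The `filterTopics` helper and the `skip` map are hypothetical names for illustration, not Hermes code:

```go
package main

import "strings"

// filterTopics returns the topics whose name segment is not listed in skip.
// Assumption: topics follow the "/eth2/<fork_digest>/<name>/<encoding>" layout.
// Topics that do not match that layout are kept untouched.
func filterTopics(topics []string, skip map[string]bool) []string {
	kept := make([]string, 0, len(topics))
	for _, t := range topics {
		// After trimming, parts is ["eth2", digest, name, encoding].
		parts := strings.Split(strings.Trim(t, "/"), "/")
		if len(parts) == 4 && skip[parts[2]] {
			continue // do not subscribe to this topic
		}
		kept = append(kept, t)
	}
	return kept
}
```

Calling `filterTopics(allTopics, map[string]bool{"voluntary_exit": true})` before subscribing would leave every other topic in place.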

@yiannisbot
Member

Great catch, which definitely deserves a deeper look! Two quick questions:

  • why would we see this behaviour only after 2-2.5hrs and not continuously? Do we suspect that receipt of those invalid messages is a random event which happened to start after 2hrs of running our node/experiment?
  • given that meshes and peer scores are per topic, why would our node get PRUNE'd from topics other than the voluntary_exit one?

@guillaumemichel
Contributor

We can unsubscribe from the voluntary_exits for now as a quick fix 👍🏻

In the long run, could we copy the validation logic for this topic over to Hermes as well?

@cortze
Contributor Author

cortze commented May 17, 2024

replying to @yiannisbot

why would we see this behaviour only after 2-2.5hrs and not continuously? Do we suspect that receipt of those invalid messages is a random event which happened to start after 2hrs of running our node/experiment?

Voluntary exits are rather infrequent messages, as each one represents a validator announcing its voluntary exit from the list of active validators. Thus, they are pretty sporadic.

given that meshes and peer scores are per topic, why would our node get PRUNE'd from topics other than the voluntary_exit one?

If your score gets too low, it can actually affect other topics as well ->

... The score is computed across all (configured) topics with a weighted mix, such that faulty behaviour in one topic percolates to other topics. ....
...
Heartbeat Maintenance
The score is checked explicitly during heartbeat maintenance such that:
- Peers with a negative score are pruned from all meshes.
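To make the quoted behaviour concrete, here is a simplified sketch of a weighted mix across topics. It ignores the caps, decay, and non-topic terms of the real scoring function, and all names (`topicScore`, `peerScore`, `pruneFromAllMeshes`) are made up for illustration:

```go
package main

// topicScore is one topic's contribution to a peer's score.
// Both fields are illustrative, not the real parameter set.
type topicScore struct {
	Weight float64 // how much this topic counts in the mix
	Score  float64 // per-topic score, negative after invalid messages
}

// peerScore sums the weighted per-topic scores into a single value,
// so a very negative score in one topic can drag the total below zero.
func peerScore(topics map[string]topicScore) float64 {
	total := 0.0
	for _, ts := range topics {
		total += ts.Weight * ts.Score
	}
	return total
}

// pruneFromAllMeshes mirrors the heartbeat rule quoted above:
// a peer with a negative overall score is pruned from every mesh.
func pruneFromAllMeshes(score float64) bool {
	return score < 0
}
```

With this model, a peer that scores 10 on a block topic (weight 0.5) and -100 on voluntary_exit (weight 1.0) ends up at -95 overall, so it would be pruned from the block mesh too.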

to @guillaumemichel

We can unsubscribe from the voluntary_exits for now as a quick fix

I applied the quick fix for the overnight run I did locally (fingers crossed). If that improves mesh connectivity, I'll add a quick patch and think about a longer-term solution.

One idea: fetch the list of active validators from our trusted Prysm node right when the tool starts, then judge each incoming exit against that list and update the list on the go 🤷🏽
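A rough sketch of that idea, assuming the active validator indices have already been fetched from the trusted node (the fetch itself is not shown). The `activeSet` type and its methods are hypothetical. The check covers only the "is this validator active" condition, not signatures or epoch rules, so it would not be a full replacement for Prysm's validation:

```go
package main

import "errors"

// errNotActive reuses the wording of the rejection seen in the Prysm logs.
var errNotActive = errors.New("non-active validator cannot exit")

// activeSet tracks which validator indices are currently considered active.
// It is not safe for concurrent use without external locking.
type activeSet struct {
	active map[uint64]bool
}

// newActiveSet builds the set from a snapshot of active validator indices.
func newActiveSet(indices []uint64) *activeSet {
	s := &activeSet{active: make(map[uint64]bool, len(indices))}
	for _, idx := range indices {
		s.active[idx] = true
	}
	return s
}

// processExit accepts an exit only for a validator that is still active,
// then removes it so a repeated exit for the same index is rejected.
func (s *activeSet) processExit(validatorIndex uint64) error {
	if !s.active[validatorIndex] {
		return errNotActive
	}
	delete(s.active, validatorIndex)
	return nil
}
```

The snapshot would also need refreshing over time, since validators activated after startup would otherwise be treated as non-active.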
