Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove provider checks at startup #5337

Merged
merged 1 commit into from
Jun 19, 2024
Merged

Conversation

mangas
Copy link
Contributor

@mangas mangas commented Apr 11, 2024

Fixes #3937

  • New chain now created with ChainIdentifier::default() on start
  • Adapter genesis+net_version now checked by ProviderManager
  • Adapter checks will be disabled temporarily after an intermittent failure
  • Adapter won't be returned by the ProviderManager if the ident returned is not the expected value.
  • BlockIngestor now takes a ChainClient and tries to get an adapter with each poll, this means it won't stop if the first poll fails
  • node/src/main.rs refactored so that most of the init code is automatically executed by the new Networks type.
  • graphman run now uses the same init code as the node, should be able to run using any chain
  • EthereumNetworks type removed
  • FirehoseNetworks type removed
  • New chains require a working adapter for creation

Future improvements:

  • Cache genesis once it's set
  • Cache chain ident once it's set

@mangas mangas marked this pull request as draft April 11, 2024 09:09
@mangas mangas changed the title Remove provider checks at startup Remove provider checks at startup [WIP] Apr 11, 2024
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 4d95fd8 to 64963c5 Compare April 15, 2024 16:20
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 2 times, most recently from b486f8a to 1c43770 Compare April 22, 2024 13:23
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 8 times, most recently from 9ad4249 to d6c76f8 Compare May 13, 2024 10:41
@mangas mangas marked this pull request as ready for review May 13, 2024 15:11
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from f15c7dd to 002def6 Compare May 13, 2024 15:20
@mangas mangas changed the title Remove provider checks at startup [WIP] Remove provider checks at startup May 13, 2024
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from cdfca0f to c4886a3 Compare May 14, 2024 09:32
@mangas mangas requested a review from lutter May 14, 2024 09:35
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from c4886a3 to c9784a8 Compare May 14, 2024 15:11
@fordN fordN requested review from incrypto32 and zorancv and removed request for lutter May 14, 2024 15:56
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from c828c7d to d3a0e68 Compare May 16, 2024 10:30
@fordN fordN requested review from lutter and removed request for incrypto32 May 16, 2024 15:41
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 808fd97 to ec0306a Compare May 17, 2024 14:38
}

/// get_all will trigger the verification of the endpoints for the provided chain_id, hence the
/// async. If this is undesirable, check `get_all_verified` as an alternatives that does not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: get_all_unverified

@@ -593,7 +593,7 @@ impl<S: Store> IndexNodeResolver<S> {
}
BlockchainKind::Starknet => {
let unvalidated_subgraph_manifest =
UnvalidatedSubgraphManifest::<graph_chain_substreams::Chain>::resolve(
UnvalidatedSubgraphManifest::<graph_chain_starknet::Chain>::resolve(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix.

Copy link
Contributor

@zorancv zorancv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that you did refactoring along with functional improvements.

node/src/main.rs Outdated
.cleanup_ethereum_shallow_blocks(eth_network_names)
.unwrap();
}
None => todo!(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this ever be reached? Is unreachable!() more suitable here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep good catch I've updated and added an explanation

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 380de39 to 738af4b Compare June 6, 2024 08:13
node/src/network_setup.rs Outdated Show resolved Hide resolved
Comment on lines +125 to +128
#[cfg(debug_assertions)]
call_only_adapters.iter().for_each(|a| {
a.is_call_only();
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason we need this here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just debug time sanity check

chain/ethereum/src/network.rs Outdated Show resolved Hide resolved
) -> Result<(), Error> {
if call_only {
warn!(logger, "Call only providers not supported"; "provider" => provider);
return Err(anyhow!("Call only providers not supported"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nicer to put the provider into the error, too; it'll save some hunting through logs if this ever happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code hasn't been added in this PR, I need to check why this difference is showing, might have been part of a rebase


// Get chain head ptr from store
let head_block_ptr_opt = self.chain_store.cheap_clone().chain_head_ptr().await?;

// To check if there is a new block or not, fetch only the block header since that's cheaper
// than the full block. This is worthwhile because most of the time there won't be a new
// block, as we expect the poll interval to be much shorter than the block time.
let latest_block = self.latest_block().await?;
let latest_block = self.latest_block(&logger, &eth_adapter).await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but logger is already a &Logger, so no need to pass &logger here

.cheapest()
.await
.ok_or_else(|| graph::anyhow::anyhow!("unable to get eth adapter"))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How stable is this selection of an adapter? Meaning: if everything is working fine, will we be always using the same adapter? This is important since switching between providers can produce pretty confusing results if one of them is lagging behind another (i.e., the chain head could be jumping back and forth) or if they are on different branches.

It also feels a little weird that most of the methods now take an eth_adapter argument, but the ingestor also has a way to select one. That might be an indication that there is some abstraction missing for adapter selection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do_pool will loop inside IIRC and will only exit if there was an error, So if this will only be called on a retry. Before, if the adapter was broken it would never be replaced. Where are these methods taken an eth_adapter? They are probably called after selection, most place that do adapter selection should have a ChainClient IIRC and then inner methods just get the simpler eth_adapter so they don't need to deal with error. That's the general pattern, this was one of the few places where this behavior hadn't been implemented.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant methods like latest_block or ingest_block just preceding this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I'm not sure I follow, once the ingestor fetches the adapter it will re-use the same one until there is an error, in which case we would get a new one which goes through the selection that's already implemented everywhere else. Did I miss some part of it?

impl EthereumNetworkAdapters {
pub fn new(retest_percent: Option<f64>) -> Self {
pub fn empty() -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be #[cfg(debug_assertions)] as it's only used in tests, and having a scarier name like empty_for_testing might be good, too, to assure the casual reader that this isn't used in production code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to revert to renaming because of the build in release mode step on CI that builds the test crate for some reason
image

.unwrap_or_default();

Self::available_with_capabilities(all, required_capabilities)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that these two methods and available_with_capabilities should be methods on the ProviderManager.

Copy link
Contributor Author

@mangas mangas Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should those methods do for firehose or substreams?

.manager
.get_all(&self.chain_id)
.await
.unwrap_or_default();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intent behind the .unwrap_or_default() that we'll return an empty iterator? It might be better if this returned Result<impl Iterator<Item = &EthereumNetworkAdapter> + '_, SomeError> to force the caller to handle that condition. If not, the method should at least have a comment on the circumstances under which this returns an empty iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment

let all = self
.manager
.get_all_unverified(&self.chain_id)
.unwrap_or_default();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on the unwrap_or_default as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

}
_ => AvailableCapacity::High,
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this discretize the SubgraphLimit? I would have thought you want to select the provider with the biggest available capacity and/or the lowest usage. But it seems this message first classifies providers into 3 groups, and presumably something than picks one from one of the groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code wasn't modified by this, it was just moved, probably because of a rebase and I don't really want to open the scope of the review to that code since it's not part of the PR

.map_or(Ok(None), |key| {
key.parse::<MetadataValue<Ascii>>().map(Some)
})
.expect("Firehose key is invalid");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these expect and panics will cause a configuration error to stop graph-node from starting, and AFAICT, we don't even have tooling to check that part of the configuration. So a typo like url = "htps://some.where.com/" will make it impossible to start graph-node. These issues should result in an error, and that part of the config should be ignored, but the rest of graph-node should start up successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as the previous comment, this wasn't changed, this code has been reviewed elsewhere and hasn't been modified

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize that this code was moved over. I went through the PR to look at places where it can produce a panic, as we've had several instances now where panics can cause huge operational issues and we need to avoid them as much as possible.

(k, BlockchainKind::Substreams) => k,
(k, _) => k,
})
.expect("each chain should have at least 1 adapter");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to log an error and ignore that chain instead of aborting startup of the entire process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a valid state of configuration where an empty section can be provided then sure, I would also like to know what we should do with that setup.

Do you have a suggestion for the error message?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If reading the config file validates that there is at least one provider, logs an error and removes the chain from the config otherwise, it would be ok to leave the expect there to say something like validation should have checked we have at least one provider. Otherwise, the error should be something like Chain {name} does not have any providers configured. Ignoring it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@lutter
Copy link
Collaborator

lutter commented Jun 7, 2024

I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that on this PR but happy to chase that separately 😛

When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement.

@mangas
Copy link
Contributor Author

mangas commented Jun 8, 2024

I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that on this PR but happy to chase that separately 😛

When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement.

The way I see it, we can either learn to live with these "out of scope" improvements in a way that will be addressed somewhere in the future, since they are, after all, out of the scope of this particular PR or we need to essentially accept that whenever we write any PR we will need to fix the world in one go. I think it's already a really bad practice to have a PR large as this as the standard and would prefer to avoid it when possible.

I also don't think whoever works on a specific piece should be "owning" any future improvements since most of those require time that someone needs to spend but anyone can really spend that time.

In this case if you are unwilling to accept the compromise then I guess I would like to understand what you are willing to accept from this PR so that I can get the needed functionality merged in.

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 9549e19 to a568ed5 Compare June 10, 2024 10:27
@mangas mangas requested a review from lutter June 10, 2024 10:59
@lutter
Copy link
Collaborator

lutter commented Jun 10, 2024

I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that on this PR but happy to chase that separately 😛

When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement.

The way I see it, we can either learn to live with these "out of scope" improvements in a way that will be addressed somewhere in the future, since they are, after all, out of the scope of this particular PR or we need to essentially accept that whenever we write any PR we will need to fix the world in one go. I think it's already a really bad practice to have a PR large as this as the standard and would prefer to avoid it when possible.

I agree that such a large PR is bad practice, and it would have been better to split it into multiple PRs focused on specific aspects (like 'restructure code, no functional change', 'Delay provider checking') especially when the individual commits are not really reviewable by themselves and the commit messages give no hint at what the commit tries to achieve (a message of 'fix tests' does not give me confidence that the implementation followed a plan)

Fixing 'out of scope' improvements is part and parcel of working with a large, mature codebase; there will always be things we understand better now than we did when the code was first written. The work for aggregations for example was 70-80% just refactoring how we deal with the subgraph schema to bring some sanity to that code; the actual functionality was a relatively small amount of the overall work.

I also don't think whoever works on a specific piece should be "owning" any future improvements since most of those require time that someone needs to spend but anyone can really spend that time.

I think this is a little facile, and sounds a lot like "I don't want to deal with it, somebody else should do it", especially in light of PR 4916 which copy-pasted some very intricate code and which I approved with the understanding that you would fix that - it's been 6 months.

The attitude of "let's commit this and we'll fix issues with it later (never)" is what makes the code around ingestion from providers so complex: the code wasn't in the greatest shape when we had just RPC, we then shoehorned Firehose and substreams on top of it, and now how any of that works is incredibly hard to follow from the code as it is scattered across a million places. This code needs an owner who has a keen interest in improving it - that requires rethinking how we do that, writing up a plan, and actually doing it. Without somebody who understands this code at a deep level, fixing things in isolation will only make matters worse.

In this case if you are unwilling to accept the compromise then I guess I would like to understand what you are willing to accept from this PR so that I can get the needed functionality merged in.

At a minimum, please file tickets for the various issues I pointed out (like crashing on misconfiguration) We also need to figure out who will own the whole front side of the house so that we can make progress on improving this code.

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 06523cf to 7f9388f Compare June 13, 2024 08:25
Copy link
Collaborator

@lutter lutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this with the understanding that we will schedule a more thorough revamp of the whole ingestion side of graph-node soon.

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 492c073 to 6428635 Compare June 18, 2024 13:23
@mangas
Copy link
Contributor Author

mangas commented Jun 18, 2024

Approving this with the understanding that we will schedule a more thorough revamp of the whole ingestion side of graph-node soon.

Turned both cards into issues so I can link them here:

#5499
#5500

They are, however, still not prioritised

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from b9b59ec to b2c51cf Compare June 19, 2024 10:13
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from b2c51cf to 0d321df Compare June 19, 2024 11:50
@mangas mangas merged commit 9b9b8f9 into master Jun 19, 2024
7 checks passed
@mangas mangas deleted the filipe/remove-start-provider-checks branch June 19, 2024 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make graph-node tolerate chains not being available during startup
4 participants