
Discussion: Keep Indexer Data on Disk #1634

Open
containerman17 opened this issue Oct 4, 2024 · 5 comments · May be fixed by #1660

@containerman17
Contributor

containerman17 commented Oct 4, 2024

I propose storing the indexer data on disk. Right now, it's all kept in RAM, then copied to disk, and later restored from disk to RAM during node startup. Speed shouldn't be a concern: modern SSDs are cheap and abundant, and with the OS page cache, performance should be solid. In production, a simple caching server could be placed in front for added speed.

I love having a built-in indexer. Indexers in EVM are a pain, so let's make this a fully functional one—we're already 99% there.

I also suggest removing the const maxBlockWindow uint64 = 1_000_000 limit on stored blocks. Since the data will be on disk, it’s no longer necessary. Instead, we can limit the size with an option like --max-indexer-size=4TB.
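
For illustration, here's a minimal sketch of what a size-based limit could look like. The --max-indexer-size flag, pruneToSize, and the deleteOldest hook (which would drop the lowest-height block and its index entries) are all hypothetical names, not existing HyperSDK code:

```go
package indexer

import (
	"os"
	"path/filepath"
)

// pruneToSize deletes the oldest blocks until the database directory fits
// within maxIndexerSize bytes (the budget a --max-indexer-size flag would set).
func pruneToSize(dbDir string, maxIndexerSize int64, deleteOldest func() error) error {
	for {
		size, err := dirSize(dbDir)
		if err != nil {
			return err
		}
		if size <= maxIndexerSize {
			return nil
		}
		// Drop the lowest-height block (and its ID/tx index entries).
		if err := deleteOldest(); err != nil {
			return err
		}
	}
}

// dirSize sums the sizes of all files under dir.
func dirSize(dir string) (int64, error) {
	var total int64
	err := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}
```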

Indexer nodes shouldn't be validators, and validator nodes shouldn't index. That's how it works in EVM, and I envision the same for HyperSDK.

P.S. The only issue I see is that block history won't be syncable across nodes, but we've never discussed keeping the entire chain history anyway.

@aaronbuchwald
Collaborator

We do currently keep the full retention window on disk, but we explicitly avoid storing a blockID -> height mapping because it would produce a heavy random write workload. To support lookups by both height and blockID, we iterate over the full height-based mapping on startup and load the blockID -> height mapping into memory. This motivates setting an upper bound on the window to prevent prolonged load times on startup.
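
To make that startup scan concrete, a rough sketch follows; the heightIter interface is an assumed stand-in for the node's actual KV store iterator, not a real HyperSDK type:

```go
package indexer

// heightIter is an assumed stand-in for an iterator over the on-disk
// height -> block mapping.
type heightIter interface {
	Next() bool        // advance; false when exhausted
	Height() uint64    // height of the current entry
	BlockID() [32]byte // ID of the block at that height
}

// loadBlockIDIndex rebuilds the in-memory blockID -> height map by walking
// the full retention window. This is O(window) work on every startup, which
// is why an upper bound on the window keeps load times tolerable.
func loadBlockIDIndex(it heightIter) map[[32]byte]uint64 {
	idx := make(map[[32]byte]uint64)
	for it.Next() {
		idx[it.BlockID()] = it.Height()
	}
	return idx
}
```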

There are a couple of tradeoffs here:

  • do we support an on-disk blockID -> block height mapping (heavy random workload)?
  • do we set a maximum retention window?
  • how much do we rely on cache vs. read from disk? (if we set a low enough limit, no problem with relying on cache)
  • to what extent should we push users away from depending directly on HyperSDK node APIs (which would trap us into supporting them) and toward scalable services outside of the node?

The big question is where to draw the line: should this be an API served by the HyperSDK inside the node, a sidecar using code provided by the HyperSDK, or an external service built to scale out horizontally?

@containerman17
Contributor Author

HyperSDK should provide sufficient tooling for at least 80% of projects by default without needing any additional software, IMHO. Let me know if you disagree.

I'll run my benchmarks on NVMe and EBS and will get back to you with the results so we can continue the conversation with data.

@containerman17
Contributor Author

containerman17 commented Oct 8, 2024

I made a benchmark, and the performance is more than adequate.

The benchmark fills the database with 100k blocks, then queries them randomly.

Setup

The transactions are Transfer transactions with a TransferResult result, signed with a dummy key. This setup is as close to reality as possible. Each block contains 1000 transactions, so 100k blocks contain 100 million transactions total, occupying 1.2GB on disk.

Database structure

Blocks are stored as height -> block bytes pairs. To find blocks by ID, we store blockID -> height pairs. We don't store transactions separately; instead, we keep txID -> block height pairs in a separate database, and every transaction lookup fetches the entire block at that height and extracts the transaction from it.
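
Roughly, the transaction lookup path looks like the sketch below. It assumes a generic Get-style KV interface; the benchmark branch uses Pebble, but these types and names are illustrative:

```go
package indexer

import "encoding/binary"

// kvStore is an assumed minimal interface over the underlying database.
type kvStore interface {
	Get(key []byte) ([]byte, error)
}

// heightKey encodes a block height as a big-endian key for the
// height -> block bytes mapping.
func heightKey(h uint64) []byte {
	k := make([]byte, 8)
	binary.BigEndian.PutUint64(k, h)
	return k
}

// getTxBlock resolves a transaction: txID -> block height in txDB, then the
// whole block at that height from blockDB. Extracting the single transaction
// from the block bytes is omitted here.
func getTxBlock(txDB, blockDB kvStore, txID [32]byte) ([]byte, error) {
	heightBytes, err := txDB.Get(txID[:])
	if err != nil {
		return nil, err
	}
	h := binary.BigEndian.Uint64(heightBytes)
	return blockDB.Get(heightKey(h))
}
```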

Write benchmark

The benchmark calls indexer.Accept in batches of 1000 blocks at a time. Note: this isn't actual API-level batching, but rather calling indexer.Accept 1000 times with wait groups. On my Mac, it takes 250 to 500 ms to record 1000 blocks, containing 1 million transactions.
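
In pseudo-Go, that batching looks like the sketch below; Block and Indexer are stand-ins, and the real indexer.Accept signature may differ:

```go
package indexer

import (
	"log"
	"sync"
)

// Block and Indexer are illustrative stand-ins for the real HyperSDK types.
type Block struct{ Height uint64 }

type Indexer interface {
	Accept(*Block) error
}

// acceptBatch mirrors the benchmark's "batching": 1000 concurrent Accept
// calls coordinated with a WaitGroup, not real API-level batching.
func acceptBatch(idx Indexer, batch []*Block) {
	var wg sync.WaitGroup
	for _, blk := range batch {
		wg.Add(1)
		go func(b *Block) {
			defer wg.Done()
			if err := idx.Accept(b); err != nil {
				log.Println(err)
			}
		}(blk)
	}
	wg.Wait() // one batch of 1000 blocks takes 250-500 ms on the author's Mac
}
```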

Read benchmark

In the read benchmark, we retrieve (see the sketch after this list):

  • 2000 random blocks using random(0, chainHeight-1)
  • From these random blocks, get their IDs and then retrieve the same 2000 blocks by ID
  • From the random blocks, get one transaction per block, and retrieve those by ID
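
A sketch of those three steps, with assumed accessor names (the benchmark's real lookup methods may differ):

```go
package indexer

import "math/rand"

// readIndexer is an assumed read interface for this sketch.
type readIndexer interface {
	GetBlockByHeight(h uint64) (*IndexedBlock, error)
	GetBlockByID(id [32]byte) (*IndexedBlock, error)
	GetTxByID(id [32]byte) ([]byte, error)
}

// IndexedBlock is an illustrative stand-in carrying just what the bench needs.
type IndexedBlock struct {
	ID    [32]byte
	TxIDs [][32]byte
}

// readBench follows the three steps above for n random blocks.
func readBench(idx readIndexer, chainHeight uint64, n int) error {
	for i := 0; i < n; i++ {
		h := rand.Uint64() % chainHeight // random(0, chainHeight-1)
		blk, err := idx.GetBlockByHeight(h)
		if err != nil {
			return err
		}
		if _, err := idx.GetBlockByID(blk.ID); err != nil { // same block, by ID
			return err
		}
		if _, err := idx.GetTxByID(blk.TxIDs[0]); err != nil { // one tx per block
			return err
		}
	}
	return nil
}
```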

Overall results:

  • Retrieving blocks by ID: ~3300 requests per second
  • Retrieving blocks by height: ~3700 requests per second
  • Retrieving transactions by ID: ~1500 requests per second

Problems

  1. I often hit OOM when retrieving more than 2000 transactions at once, but that's probably due to the benchmark structure.
  2. Occasionally, I get a negative WaitGroup counter error in Pebble. Not sure if I'm using Pebble incorrectly or if it's a race condition bug.
  3. I haven't tested on network block storage (like default AWS EC2 storage), but I'm 100% sure performance will be significantly lower due to the high latency of network-based block storage.

Conclusions

For 100k TPS (writing 100 blocks per second), enabling the indexer could be a major slowdown for a validating node. However, for a node dedicated to indexing, even an underpowered machine such as a MacBook Air could sustain at least 2k blocks per second with 1000 transactions each, around 2 million TPS, which is more than enough.

Possible improvements

  • Batch writes to disk once per second (I believe this should yield at least a 10x improvement on ingestion; see the sketch after this list)
  • Add read caching
  • Keep the last 100 blocks in memory
  • Store duplicate copies of transactions. This would make tx lookups much faster but requires almost double the storage space
  • Add pruning
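
For the first item, a once-per-second write batcher could look like the sketch below; flush stands in for a single batched Pebble commit, and all names are illustrative. The tradeoff is that a crash loses up to a second of staged writes:

```go
package indexer

import (
	"sync"
	"time"
)

// batchedWriter accumulates writes in memory and flushes them once per
// second; flush is a stand-in for a single batched Pebble commit.
type batchedWriter struct {
	mu      sync.Mutex
	pending map[string][]byte
	flush   func(map[string][]byte) error
}

func newBatchedWriter(flush func(map[string][]byte) error) *batchedWriter {
	w := &batchedWriter{pending: make(map[string][]byte), flush: flush}
	go func() {
		for range time.Tick(time.Second) {
			w.mu.Lock()
			batch := w.pending
			w.pending = make(map[string][]byte)
			w.mu.Unlock()
			if len(batch) > 0 {
				_ = w.flush(batch) // error handling elided in this sketch
			}
		}
	}()
	return w
}

// Put stages a key/value pair for the next flush.
func (w *batchedWriter) Put(key string, value []byte) {
	w.mu.Lock()
	w.pending[key] = value
	w.mu.Unlock()
}
```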

Here's the branch with the benchmark: indexer-on-disk
Benchmark log

@containerman17
Contributor Author

From an offline conversation: we need to test databases of 100+ GB.

@containerman17
Contributor Author

Updated benchmark for an 11GB database:

Database size on disk: 11GB. Blocks are digested at around 400-500 blocks per second without parallelization enabled. Each block contains 1000 transactions, so that's 400-500k TPS. The read speed has decreased slightly but remains solid—2k+ RPS for whole blocks and 1k+ RPS for individual transactions. Overall, this benchmark processed over 2 million blocks with a rolling window of 1 million blocks.

2024/10/10 11:31:31 accepted 9359 blocks
2024/10/10 11:31:32 accepted 9691 blocks
2024/10/10 11:31:32 accepted 10k blocks containing 10m txs (height=2m) in 21.654585233s. Database occupies 11G on disk. 461k TPS
2024/10/10 11:31:33 Retrieved 1000 blocks by ID in 439.584919ms (2274.87 RPS)
2024/10/10 11:31:34 Retrieved 1000 blocks by height in 405.051033ms (2468.82 RPS)
2024/10/10 11:31:35 Retrieved 1000 transactions by ID in 821.499821ms (1217.29 RPS)
2024/10/10 11:31:38 accepted 359 blocks
2024/10/10 11:31:39 accepted 720 blocks

containerman17 linked a pull request Oct 11, 2024 that will close this issue