# bypass PageCache for compact_level0_phase1 (#8184)
This test reproduces the case of a writer creating a deep stack of L0 layers. It uses realistic layer sizes and writes several gigabytes of data, so it runs as a performance test, although it validates memory footprint rather than performance per se. It acts as a regression test for two recent fixes:
- #8401
- #8391

In the future it will demonstrate the larger improvement of using a k-merge iterator for L0 compaction (#8184). This test can be extended to enforce limits on the memory consumption of other housekeeping steps, by restarting the pageserver and then running other operations under the same "how much did RSS increase" measurement.
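The "how much did RSS increase" measurement boils down to reading the pageserver's `VmRSS` line from `/proc/<pid>/status` before and after the step under test. A minimal sketch of that parsing, assuming the standard Linux procfs format (the helper name is ours, not from the test suite):

```rust
/// Extract the resident set size in kB from the contents of
/// /proc/<pid>/status (Linux procfs format: "VmRSS:\t  123456 kB").
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}

fn main() {
    // In the real test we would read /proc/<pageserver-pid>/status twice
    // (before and after the housekeeping step) and compare; here we just
    // demonstrate the parsing on a canned sample.
    let sample = "VmPeak:\t  200000 kB\nVmRSS:\t  123456 kB\nThreads:\t8\n";
    println!("RSS = {:?} kB", parse_vm_rss_kb(sample)); // Some(123456)
}
```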
Did some initial scouting work on this:
Part of #8184.

# Problem

We want to bypass the PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader::read_blk`, which hits the PageCache [here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks.

Other notable changes:
- Change `DiskBtreeReader::into_stream` to buffer the node instead of holding a `PageCache` `PageReadGuard`.
  - Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`.
  - Generally, `PageReadGuard`s aren't supposed to be held across await points, so this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being developed as part of:
- #8002

Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod:
- With validation:
  - increased CPU usage
  - ~doubled VirtualFile read bytes/second metric
  - no change in disk IO usage, because the kernel page cache will likely have the pages buffered on the second read
- Without validation:
  - slightly higher DRAM usage, because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
  - less disk IO if previously there were repeat PageCache misses (the likely case on a busy production Pageserver)
  - lower CPU usage: with the PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled by default. This gets us validation everywhere by default, specifically in:
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior:
- neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled, because Direct IO would double the IOPS & latency for each compaction read (#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of memory per participating layer. But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction. Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.
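Conceptually, a streaming k-merge holds only one buffered item per participating sorted input and repeatedly yields the global minimum via a heap, instead of relying on a shared page cache. A toy sketch of that idea (names and shapes are illustrative, not the actual `MergeIterator` API):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy streaming k-merge over k sorted runs of (key, value) pairs.
/// Only the head of each run is buffered in the heap at any time.
fn kmerge(runs: Vec<Vec<(u64, String)>>) -> Vec<(u64, String)> {
    let mut iters: Vec<_> = runs.into_iter().map(|r| r.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first element of each run; Reverse turns
    // the max-heap into a min-heap on (key, run index).
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some((key, val)) = it.next() {
            heap.push(Reverse((key, idx, val)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((key, idx, val))) = heap.pop() {
        out.push((key, val));
        // Refill from the run we just consumed from.
        if let Some((k, v)) = iters[idx].next() {
            heap.push(Reverse((k, idx, v)));
        }
    }
    out
}

fn main() {
    let merged = kmerge(vec![
        vec![(1, "a".to_string()), (4, "d".to_string())],
        vec![(2, "b".to_string()), (3, "c".to_string())],
    ]);
    let keys: Vec<u64> = merged.iter().map(|(k, _)| *k).collect();
    println!("{:?}", keys); // [1, 2, 3, 4]
}
```

The real implementation streams values out of vectored blob reads rather than in-memory `Vec`s, but the heap-driven merge loop is the same shape.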
Status update:
Status update:
Plan / needs decision:
Status update: validation mode enabled in pre-prod.

Pre-Prod Analysis

First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing the results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199 However, here's the list of dashboards I looked at:
Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens):
Screenshot from the log scraping query, which I found quite insightful.

Can we enable it in prod? What's the practical impact? 2x wall-clock-time-slower compactions means double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other terms, it means we only achieve half the usual compaction throughput. So, is prod compaction throughput bottlenecked on the global semaphore? We can use the following query to approximate the busyness of the semaphore (percentage of tenants waiting for a permit):
There are some places where we have sampling skew, so we clamp the values.
The p99.9 instance in that plot looks like this:

The average looks like this:
For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849
Decision from today's sync meeting:
This week, as per discussion thread:
Results from pre-prod are looking good.
Plan:
Results from the rollout were shared in this Slack thread; tl;dr:
…idation

After this PR is merged, deployed, and guaranteed to not be rolled back, we can remove the field from `pageserver.toml`s. refs #8184
refs #8184 We are running streaming-kmerge without validation everywhere and won't roll back.
refs #8184 Our staging and production `pageserver.toml` doesn't contain this field anymore. It was already being ignored by the last release.
…-kmerge without validation (#8934)

refs #8184 PR neondatabase/infra#1905 enabled streaming-kmerge without validation everywhere. It rolls into prod sooner than, or in the same release as, this PR.
#8935)

refs #8184 Stacked atop #8934. This PR changes from ignoring the config field to rejecting configs that contain it. PR neondatabase/infra#1903 removes the field usage from `pageserver.toml`. It rolls into prod sooner than, or in the same release as, this PR.
`compact_level0_phase1` currently uses `ValueRef::load` here, which internally uses `read_blob` with the `FileBlockReader` against the delta layer's VirtualFiles. This still goes through the `PageCache` for the data pages.

(We do use vectored get for `create_image_layers`, which also happens during compaction. But I missed `compact_level0_phase1`.)

Complete PageCache Bypass
We can extend the `load_keys` step here to also load the lengths of each blob into memory (instead of just the offset):

neon/pageserver/src/tenant/timeline/compaction.rs, lines 498 to 503 in 9b98823

This allows us to go directly to the VirtualFile when we use the `ValueRef` here:

neon/pageserver/src/tenant/timeline/compaction.rs, line 623 in 9b98823
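With both offset and length known up front, each value can be read straight from the file with a single positional read, with no page-level cache in between. A minimal stand-in sketch using `std::fs::File` in place of VirtualFile (the `BlobRef` struct and the flat blob framing here are invented for illustration):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// What the extended load_keys step would record per value: where the
/// blob lives and how long it is, so it can be read without a cache.
#[derive(Clone, Copy)]
struct BlobRef {
    offset: u64,
    len: usize,
}

/// Read one blob directly from the file at its recorded position.
fn read_blob(file: &mut File, blob: BlobRef) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; blob.len];
    file.seek(SeekFrom::Start(blob.offset))?;
    file.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Build a tiny "layer file" of concatenated blobs, recording a
    // (offset, len) ref for each one as we write it.
    let mut path = std::env::temp_dir();
    path.push("blob_demo.bin");
    let mut refs = Vec::new();
    {
        let mut f = File::create(&path)?;
        let mut offset = 0u64;
        for payload in [b"first".as_slice(), b"second".as_slice()] {
            f.write_all(payload)?;
            refs.push(BlobRef { offset, len: payload.len() });
            offset += payload.len() as u64;
        }
    }
    let mut f = File::open(&path)?;
    let second = read_blob(&mut f, refs[1])?;
    println!("{}", String::from_utf8_lossy(&second)); // second
    std::fs::remove_file(&path)?;
    Ok(())
}
```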
The problem with this: we'd lose the hypothetical benefits of PageCache'ing the data block if multiple ValueRefs are on the same page.
Do we rely on the PageCache for performance in this case?
Yes, production shows we do have >80% hit rate for compaction, even on very busy pageservers.
One instance by example:
Quick Fix 1: RequestContext-scoped mini page cache.
In earlier experiments, I used a RequestContext-scoped mini page cache for this.
The problem with this is that if more layers need to be compacted than there are pages in the mini page cache, it will start thrashing.
Proper Fix
Use streaming compaction with iterators where each iterator caches the current block.
We do have the diskbtree async stream now.
We could wrap that stream to provide a cache for the last-read block.
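Such a wrapper only needs to remember one (block number, bytes) pair per iterator; repeated reads of the same block then hit the local copy instead of going back to the file or a shared cache. A hypothetical sketch (the trait and names are ours, not the actual diskbtree API):

```rust
/// Minimal block-reader abstraction standing in for the real one.
trait BlockReader {
    fn read_blk(&mut self, blkno: u32) -> Vec<u8>;
}

/// Wraps a reader and caches the single most recently read block,
/// which is all a sequential btree-stream consumer needs.
struct LastBlockCache<R: BlockReader> {
    inner: R,
    last: Option<(u32, Vec<u8>)>,
}

impl<R: BlockReader> LastBlockCache<R> {
    fn new(inner: R) -> Self {
        Self { inner, last: None }
    }

    fn read_blk(&mut self, blkno: u32) -> Vec<u8> {
        if let Some((cached_no, bytes)) = &self.last {
            if *cached_no == blkno {
                return bytes.clone(); // hit: no underlying read
            }
        }
        let bytes = self.inner.read_blk(blkno);
        self.last = Some((blkno, bytes.clone()));
        bytes
    }
}

/// Fake backing reader that counts how often it is actually hit.
struct CountingReader {
    reads: u32,
}

impl BlockReader for CountingReader {
    fn read_blk(&mut self, blkno: u32) -> Vec<u8> {
        self.reads += 1;
        vec![blkno as u8; 4] // fake 4-byte "block"
    }
}

fn main() {
    let mut cached = LastBlockCache::new(CountingReader { reads: 0 });
    // Typical btree-stream access pattern: many consecutive reads of
    // the same block before moving on to the next one.
    for _ in 0..3 {
        cached.read_blk(7);
    }
    cached.read_blk(8);
    println!("underlying reads: {}", cached.inner.reads); // 2
}
```

Because each iterator owns its one-block cache, memory use scales with the number of concurrently compacting layers rather than with a fixed global pool, avoiding the thrashing problem of the shared mini page cache.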