
bypass PageCache for compact_level0_phase1 #8184

Open
20 of 21 tasks
Tracked by #7386
problame opened this issue Jun 27, 2024 · 10 comments
problame commented Jun 27, 2024

Tasks

`compact_level0_phase1` currently uses `ValueRef::load`, which internally uses `read_blob` with the `FileBlockReader` against the delta layer's `VirtualFile`s. This still goes through the PageCache for the data pages.

(We do use vectored get for create_image_layers, which also happens during compaction. But I missed the compact_level0_phase1.)

Complete PageCache Bypass

We can extend the `load_keys` step to also load the length of each blob into memory (instead of just the offset):

```rust
let mut all_keys = Vec::new();
for l in deltas_to_compact.iter() {
    all_keys.extend(l.load_keys(ctx).await?);
}
```
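For illustration, a minimal sketch of deriving per-blob lengths from the sorted offsets. `DeltaEntry` and the assumption that blobs are laid out back-to-back in the layer file are mine, not the actual pageserver types:

```rust
/// Hypothetical per-key index entry: where the blob starts in the layer file.
#[derive(Debug)]
struct DeltaEntry {
    offset: u64,
}

/// Given entries sorted by offset and the file's end position, compute each
/// blob's (offset, length), so a later read can go straight to the VirtualFile
/// instead of probing the PageCache to find block boundaries.
fn offsets_to_ranges(entries: &[DeltaEntry], file_end: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::with_capacity(entries.len());
    for (i, e) in entries.iter().enumerate() {
        // A blob ends where the next one begins (or at end of file).
        let end = entries.get(i + 1).map(|n| n.offset).unwrap_or(file_end);
        ranges.push((e.offset, end - e.offset));
    }
    ranges
}
```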

This allows us to go directly to the `VirtualFile` when we use the `ValueRef` here:

```rust
let value = val.load(ctx).await?;
```

The problem with this: we'd lose the hypothetical benefit of PageCache-ing the data block when multiple `ValueRef`s point into the same page.

Do we rely on the PageCache for performance in this case?

Yes, production shows we have a >80% hit rate for compaction, even on very busy pageservers.
One instance, by way of example:

[screenshot: PageCache hit rate for compaction on one instance]

Quick Fix 1: RequestContext-scoped mini page cache.

In earlier experiments, I used a RequestContext-scoped mini page cache for this.

The problem: if more layers need to be compacted than the mini cache has pages, it starts thrashing.

Proper Fix

Use streaming compaction with iterators where each iterator caches the current block.

We do have the diskbtree async stream now.

We could wrap that stream to provide a cache for the last-read block.
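As a sketch of that wrapper idea, here's a synchronous stand-in (the real stream is async and reads from a `VirtualFile`; `BlockReader`, the block size, and the hit/miss counters are illustrative assumptions, not the actual API):

```rust
const BLOCK_SIZE: u64 = 8192; // assumed page/block size

/// Stand-in for the underlying reader (the real one is async, against a VirtualFile).
trait BlockReader {
    fn read_block(&mut self, blk_no: u64) -> Vec<u8>;
}

/// Wrapper caching the single last-read block: each compaction iterator keeps
/// exactly one block buffered, so memory is bounded by the number of
/// participating layers rather than relying on a shared evicting PageCache.
struct LastBlockCache<R: BlockReader> {
    inner: R,
    cached: Option<(u64, Vec<u8>)>,
    hits: u64,
    misses: u64,
}

impl<R: BlockReader> LastBlockCache<R> {
    fn new(inner: R) -> Self {
        Self { inner, cached: None, hits: 0, misses: 0 }
    }

    fn read_at(&mut self, offset: u64) -> Vec<u8> {
        let blk_no = offset / BLOCK_SIZE;
        if let Some((n, buf)) = &self.cached {
            if *n == blk_no {
                // Consecutive ValueRefs on the same block are served from memory.
                let out = buf.clone();
                self.hits += 1;
                return out;
            }
        }
        self.misses += 1;
        let buf = self.inner.read_block(blk_no);
        self.cached = Some((blk_no, buf.clone()));
        buf
    }
}
```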

jcsp added a commit that referenced this issue Jul 17, 2024
This test reproduces the case of a writer creating a deep stack of L0
layers. It uses realistic layer sizes and writes several gigabytes of
data, therefore runs as a performance test although it is validating
memory footprint rather than performance per se.

It acts as a regression test for two recent fixes:
- #8401
- #8391

In future it will demonstrate the larger improvement of using a k-merge
iterator for L0 compaction (#8184)

This test can be extended to enforce limits on the memory consumption of
other housekeeping steps, by restarting the pageserver and then running
other things to do the same "how much did RSS increase" measurement.
@problame

Did some initial scouting work on this:

  • the holes functionality of compaction (introduced in "Skip largest N holes during compaction" #3597) requires scanning all keys before scanning all values
    • tl;dr for what holes does:
      • The trade-off is that we'd rather create smaller L1s than create sparse delta space on top of image layers.
      • If we're ingesting a lot of data on both sides of the hole, that is definitely the right trade-off, because we will have full-sized L1s on either side.
      • But if we have little L0 data, we create the small L1s.
      • All of this is necessary because we have the stupid `count_deltas` as the trigger for image layer creation.
    • More details in a private Slack DM
  • we need to preserve the holes functionality until we have a better approach for image layer creation at the top
  • testing: I can't find any dedicated Rust unit tests. It would be nice to extract the existing logic enough to get coverage for existing behaviors, but that's a lot of work.

problame pushed a commit that referenced this issue Jul 22, 2024
@problame problame self-assigned this Jul 29, 2024
problame added a commit that referenced this issue Jul 29, 2024
…ck expressions

Byproduct of scouting done for #8184

refs #8184
problame added a commit that referenced this issue Jul 30, 2024
…ck expressions (#8544)

Byproduct of scouting done for
#8184

refs #8184
problame added a commit that referenced this issue Jul 31, 2024
part of #8184

# Problem

We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader::read_blk`, which hits the
PageCache
[here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.
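To illustrate the k-merge ordering (not the actual `MergeIterator` implementation, which is async and streams from layer files), a minimal sketch over in-memory layers, with keys and LSNs simplified to integers:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// k-way merge of per-layer iterators, each sorted by (key, lsn).
/// Each participating iterator contributes one buffered item at a time,
/// which is why memory scales with the number of layers, not total keys.
fn kmerge(layers: Vec<Vec<(u64, u64)>>) -> Vec<(u64, u64)> {
    let mut iters: Vec<_> = layers.into_iter().map(|l| l.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the head of each layer's iterator.
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(item) = it.next() {
            heap.push(Reverse((item, i))); // min-heap on (key, lsn), tie-break by layer index
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((item, i))) = heap.pop() {
        out.push(item);
        // Refill from the layer we just consumed from.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}
```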

Other notable changes:
* change `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`
  * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`.
  * Generally, `PageReadGuard`s aren't supposed to be held across await points, so this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being
developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.
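The once-per-10-seconds limit can be implemented with a simple timestamp check; a minimal sketch (the actual rate-limit helper in the codebase may differ):

```rust
use std::time::{Duration, Instant};

/// Log-flood protection: allows at most one event per `interval`.
struct RateLimit {
    interval: Duration,
    last: Option<Instant>,
}

impl RateLimit {
    fn new(interval: Duration) -> Self {
        Self { interval, last: None }
    }

    /// Returns true if the caller may emit the warning now.
    fn check(&mut self) -> bool {
        let now = Instant::now();
        match self.last {
            // Still within the quiet period: suppress.
            Some(prev) if now.duration_since(prev) < self.interval => false,
            // First event, or quiet period elapsed: allow and reset the clock.
            _ => {
                self.last = Some(now);
                true
            }
        }
    }
}
```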

Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
  * no change in disk IO usage, because the kernel page cache will likely have the pages buffered on the second read
* without validation:
  * slightly higher DRAM usage, because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
  * less disk IO if previously there were repeat PageCache misses (the likely case on a busy production Pageserver)
  * lower CPU usage: with the PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled by default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:

* neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of
memory per participating layer.

But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
@problame

Status update:

arpad-m pushed a commit that referenced this issue Aug 5, 2024
…ck expressions (#8544)

Byproduct of scouting done for
#8184

refs #8184
arpad-m pushed a commit that referenced this issue Aug 5, 2024
part of #8184

problame commented Aug 12, 2024

Status update:

Plan / needs decision:


problame commented Aug 16, 2024

Status update: validation mode enabled in pre-prod

Pre-Prod Analysis

First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199

However, here's the list of dashboards I looked at:

Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens):

  • no noticeable CPU / disk IOPS impact
  • but compaction iterations take about 2x wall clock time
    • makes sense, because validation does about twice the number of VirtualFile calls, with no concurrency
    • we're not bottlenecking on disk, however

Screenshot from the log scraping query, which I found quite insightful

Can we enable it in prod?

What's the practical impact? Compactions that are 2x slower in wall clock time mean double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other terms, we only achieve half the usual compaction throughput.

So, is prod compaction throughput bottlenecked on the global semaphore?

We can use the following query to approximate the busyness of the semaphore (percentage of tenants waiting for a permit):

```promql
(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/ on(instance) pageserver_tenant_states_count{state="Active"}
```

There are some places where we have sampling skew, so we clamp:

```promql
clamp(
  (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
  / on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
```


The p99.9 instance in that plot looks like this:

```promql
quantile(0.999,
  clamp(
    (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
    / on(instance) sum by (instance) (pageserver_tenant_states_count)
  , 0, 1)
)
```


and the average like this:

```promql
avg(
  clamp(
    (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
    / on(instance) sum by (instance) (pageserver_tenant_states_count)
  , 0, 1)
)
```


@problame

For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849


problame commented Aug 19, 2024

Decision from today's sync meeting:

  1. https://github.com/neondatabase/infra/pull/1745
  2. Create a metric to measure semaphore contention.
  3. Table the decision for the remaining regions until EOW / next week.


problame commented Aug 26, 2024

This week, as per discussion thread:

@problame

Results from pre-prod are looking good.



problame commented Sep 2, 2024

Plan:


problame commented Sep 5, 2024

Results from rollout shared in this Slack thread

tl;dr:

  • halved the PS PageCache eviction rate, and stabilized it a lot
  • halved the metric "wall clock time spent on compaction / ingested bytes" (see query below)

```promql
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval]) / 1e6)
```

problame added a commit that referenced this issue Sep 5, 2024
…idation

After this PR is merged, deployed, and guaranteed to not be rolled back,
we can remove the field from `pageserver.toml`s.

refs #8184
problame added a commit that referenced this issue Sep 5, 2024
refs #8184

We are running streaming-kmerge without validation everywhere
and won't roll back.
problame added a commit that referenced this issue Sep 5, 2024
refs #8184

Our staging and production `pageserver.toml` doesn't contain
this field anymore. It was already being ignored by the last release.
problame added a commit that referenced this issue Sep 6, 2024
…-kmerge without validation (#8934)

refs #8184

PR neondatabase/infra#1905 enabled
streaming-kmerge without validation everywhere.

It rolls into prod sooner than, or in the same release as, this PR.
problame added a commit that referenced this issue Sep 23, 2024
#8935)

refs #8184
stacked atop #8934

This PR changes from ignoring the config field to rejecting configs that
contain it.

PR neondatabase/infra#1903 removes the field
usage from `pageserver.toml`.

It rolls into prod sooner than, or in the same release as, this PR.