
bypass PageCache for compact_level0_phase1 #8184

Open
20 of 21 tasks
Tracked by #7386
problame opened this issue Jun 27, 2024 · 10 comments
problame commented Jun 27, 2024

Tasks

`compact_level0_phase1` currently uses `ValueRef::load`, which internally uses `read_blob` with the `FileBlockReader` against the delta layer's `VirtualFile`s. This still goes through the PageCache for the data pages.

(We do use vectored get for create_image_layers, which also happens during compaction. But I missed the compact_level0_phase1.)

Complete PageCache Bypass

We can extend the `load_keys` step to also load the length of each blob into memory (instead of just the offset):

```rust
let mut all_keys = Vec::new();
for l in deltas_to_compact.iter() {
    all_keys.extend(l.load_keys(ctx).await?);
}
```
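For illustration, a minimal sketch of deriving per-blob lengths from the sorted offsets. `DeltaEntry` and the assumption that blobs are laid out back-to-back in the layer file are mine, not the actual pageserver types:

```rust
/// Hypothetical per-key index entry: where the blob starts in the layer file.
#[derive(Debug)]
struct DeltaEntry {
    offset: u64,
}

/// Given entries sorted by offset and the file's end position, compute each
/// blob's (offset, length), so a later read can go straight to the VirtualFile
/// instead of probing the PageCache to find block boundaries.
fn offsets_to_ranges(entries: &[DeltaEntry], file_end: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::with_capacity(entries.len());
    for (i, e) in entries.iter().enumerate() {
        // A blob ends where the next one begins (or at end of file).
        let end = entries.get(i + 1).map(|n| n.offset).unwrap_or(file_end);
        ranges.push((e.offset, end - e.offset));
    }
    ranges
}
```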

This allows us to go directly to the `VirtualFile` when we use the `ValueRef` here:

```rust
let value = val.load(ctx).await?;
```

The problem with this: we'd lose the hypothetical benefit of PageCache-ing the data block when multiple `ValueRef`s point into the same page.

Do we rely on the PageCache for performance in this case?

Yes, production shows we have a >80% hit rate for compaction, even on very busy pageservers.
One instance, by way of example:

[screenshot: PageCache hit rate for compaction on one instance]

Quick Fix 1: RequestContext-scoped mini page cache.

In earlier experiments, I used a RequestContext-scoped mini page cache for this.

The problem: if more layers need to be compacted than the mini cache has pages, it starts thrashing.

Proper Fix

Use streaming compaction with iterators where each iterator caches the current block.

We do have the diskbtree async stream now.

We could wrap that stream to provide a cache for the last-read block.
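As a sketch of that wrapper idea, here's a synchronous stand-in (the real stream is async and reads from a `VirtualFile`; `BlockReader`, the block size, and the hit/miss counters are illustrative assumptions, not the actual API):

```rust
const BLOCK_SIZE: u64 = 8192; // assumed page/block size

/// Stand-in for the underlying reader (the real one is async, against a VirtualFile).
trait BlockReader {
    fn read_block(&mut self, blk_no: u64) -> Vec<u8>;
}

/// Wrapper caching the single last-read block: each compaction iterator keeps
/// exactly one block buffered, so memory is bounded by the number of
/// participating layers rather than relying on a shared evicting PageCache.
struct LastBlockCache<R: BlockReader> {
    inner: R,
    cached: Option<(u64, Vec<u8>)>,
    hits: u64,
    misses: u64,
}

impl<R: BlockReader> LastBlockCache<R> {
    fn new(inner: R) -> Self {
        Self { inner, cached: None, hits: 0, misses: 0 }
    }

    fn read_at(&mut self, offset: u64) -> Vec<u8> {
        let blk_no = offset / BLOCK_SIZE;
        if let Some((n, buf)) = &self.cached {
            if *n == blk_no {
                // Consecutive ValueRefs on the same block are served from memory.
                let out = buf.clone();
                self.hits += 1;
                return out;
            }
        }
        self.misses += 1;
        let buf = self.inner.read_block(blk_no);
        self.cached = Some((blk_no, buf.clone()));
        buf
    }
}
```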

jcsp added a commit that referenced this issue Jul 17, 2024
This test reproduces the case of a writer creating a deep stack of L0
layers. It uses realistic layer sizes and writes several gigabytes of
data, therefore runs as a performance test although it is validating
memory footprint rather than performance per se.

It acts as a regression test for two recent fixes:
- #8401
- #8391

In future it will demonstrate the larger improvement of using a k-merge
iterator for L0 compaction (#8184)

This test can be extended to enforce limits on the memory consumption of
other housekeeping steps, by restarting the pageserver and then running
other things to do the same "how much did RSS increase" measurement.
@problame

Did some initial scouting work on this:

  • the holes functionality of compaction (introduced in "Skip largest N holes during compaction" #3597) requires scanning all keys before scanning all values
    • tl;dr for what holes does:
      • The trade-off is that we'd rather create smaller L1s than create sparse delta space on top of image layers.
      • If we're ingesting a lot of data on both sides of the hole, that is definitely the right trade-off, because we will have full-sized L1s on either side.
      • But if we have little L0 data, we create the small L1s.
      • All of this is necessary because we have the stupid `count_deltas` as the trigger for image layer creation.
    • More details in a private Slack DM
  • we need to preserve the holes functionality until we have a better approach for image layer creation at the top
  • testing: I can't find any dedicated Rust unit tests. It would be nice to extract the existing logic enough to get coverage for existing behaviors, but that's a lot of work.

problame pushed a commit that referenced this issue Jul 22, 2024
@problame problame self-assigned this Jul 29, 2024
problame added a commit that referenced this issue Jul 29, 2024
…ck expressions

Byproduct of scouting done for #8184

refs #8184
problame added a commit that referenced this issue Jul 30, 2024
…ck expressions (#8544)

Byproduct of scouting done for
#8184

refs #8184
problame added a commit that referenced this issue Jul 31, 2024
part of #8184

# Problem

We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader::read_blk`, which hits the
PageCache
[here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.
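To illustrate the k-merge ordering (not the actual `MergeIterator` implementation, which is async and streams from layer files), a minimal sketch over in-memory layers, with keys and LSNs simplified to integers:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// k-way merge of per-layer iterators, each sorted by (key, lsn).
/// Each participating iterator contributes one buffered item at a time,
/// which is why memory scales with the number of layers, not total keys.
fn kmerge(layers: Vec<Vec<(u64, u64)>>) -> Vec<(u64, u64)> {
    let mut iters: Vec<_> = layers.into_iter().map(|l| l.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the head of each layer's iterator.
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(item) = it.next() {
            heap.push(Reverse((item, i))); // min-heap on (key, lsn), tie-break by layer index
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((item, i))) = heap.pop() {
        out.push(item);
        // Refill from the layer we just consumed from.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}
```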

Other notable changes:
* change `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`
  * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`.
  * Generally, `PageReadGuard`s aren't supposed to be held across await points, so this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being
developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.
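The once-per-10-seconds limit can be implemented with a simple timestamp check; a minimal sketch (the actual rate-limit helper in the codebase may differ):

```rust
use std::time::{Duration, Instant};

/// Log-flood protection: allows at most one event per `interval`.
struct RateLimit {
    interval: Duration,
    last: Option<Instant>,
}

impl RateLimit {
    fn new(interval: Duration) -> Self {
        Self { interval, last: None }
    }

    /// Returns true if the caller may emit the warning now.
    fn check(&mut self) -> bool {
        let now = Instant::now();
        match self.last {
            // Still within the quiet period: suppress.
            Some(prev) if now.duration_since(prev) < self.interval => false,
            // First event, or quiet period elapsed: allow and reset the clock.
            _ => {
                self.last = Some(now);
                true
            }
        }
    }
}
```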

Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
  * no change in disk IO usage, because the kernel page cache will likely have the pages buffered on the second read
* without validation:
  * slightly higher DRAM usage, because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
  * less disk IO if previously there were repeat PageCache misses (the likely case on a busy production Pageserver)
  * lower CPU usage: with the PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled by default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:

* neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of
memory per participating layer.

But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
@problame

Status update:

arpad-m pushed a commit that referenced this issue Aug 5, 2024
…ck expressions (#8544)

Byproduct of scouting done for
#8184

refs #8184
arpad-m pushed a commit that referenced this issue Aug 5, 2024
part of #8184

problame commented Aug 12, 2024

Status update:

Plan / needs decision:


problame commented Aug 16, 2024

Status update: validation mode enabled in pre-prod

Pre-Prod Analysis

First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199

However, here's the list of dashboards I looked at:

Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens):

  • no noticeable CPU / disk IOPS impact
  • but compaction iterations take about 2x wall clock time
    • makes sense, because validation does about twice the number of VirtualFile calls, with no concurrency
    • we're not bottlenecking on disk, however

Screenshot from the log scraping query, which I found quite insightful

Can we enable it in prod?

What's the practical impact? Compactions that are 2x slower in wall clock time mean double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other terms, we only achieve half the usual compaction throughput.

So, is prod compaction throughput bottlenecked on the global semaphore?

We can use the following query to approximate the busyness of the semaphore (percentage of tenants waiting for a permit):

```promql
(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/ on(instance) pageserver_tenant_states_count{state="Active"}
```

There are some places where we have sampling skew, so we clamp:

```promql
clamp(
  (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
  / on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
```


The p99.9 instance in that plot looks like this:

```promql
quantile(0.999,
  clamp(
    (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
    / on(instance) sum by (instance) (pageserver_tenant_states_count)
  , 0, 1)
)
```


and the average like this:

```promql
avg(
  clamp(
    (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
    / on(instance) sum by (instance) (pageserver_tenant_states_count)
  , 0, 1)
)
```


@problame

For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849


problame commented Aug 19, 2024

Decision from today's sync meeting:

  1. https://github.com/neondatabase/infra/pull/1745
  2. Create a metric to measure semaphore contention.
  3. Table the decision for the remaining regions until EOW / next week.


problame commented Aug 26, 2024

This week, as per discussion thread:

@problame

Results from pre-prod are looking good.



problame commented Sep 2, 2024

Plan:


problame commented Sep 5, 2024

Results from rollout shared in this Slack thread

tl;dr:

  • halved the PS PageCache eviction rate, and stabilized it a lot
  • halved the metric "wall clock time spent on compaction / ingested bytes" (see query below)

```promql
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval]) / 1e6)
```

problame added a commit that referenced this issue Sep 5, 2024
…idation

After this PR is merged, deployed, and guaranteed to not be rolled back,
we can remove the field from `pageserver.toml`s.

refs #8184
problame added a commit that referenced this issue Sep 5, 2024
refs #8184

We are running streaming-kmerge without validation everywhere
and won't roll back.
problame added a commit that referenced this issue Sep 5, 2024
refs #8184

Our staging and production `pageserver.toml` doesn't contain
this field anymore. It was already being ignored by the last release.
problame added a commit that referenced this issue Sep 6, 2024
…-kmerge without validation (#8934)

refs #8184

PR neondatabase/infra#1905 enabled
streaming-kmerge without validation everywhere.

It rolls into prod sooner than, or in the same release as, this PR.
problame added a commit that referenced this issue Sep 23, 2024
#8935)

refs #8184
stacked atop #8934

This PR changes from ignoring the config field to rejecting configs that
contain it.

PR neondatabase/infra#1903 removes the field
usage from `pageserver.toml`.

It rolls into prod sooner than, or in the same release as, this PR.