Epic: pageserver image layer compression #5431

jcsp · 2023-10-02T08:30:49Z

Background

We may substantially decrease the capacity & bandwidth footprint of tenants by compressing data in their image layers.

There are many possible implementations, from compressing whole layers files as streams, to introducing some chunked format and decompressing a chunk at a time, to simply compressing individual pages.

Compressing individual pages in image layers is by far the simplest thing to do, and should have a high payoff as:

image layers are often the majority of a tenant's storage footprint.
image layers provide 8kib pages that should be large enough to meaningfully compress.

Compressing deltas is a harder problem (individual deltas are likely too small to usefully compress), and is left as a possible future change.

Implementation

There is a preliminary version here: #7091, which demonstrates that per-page compression in image layers may be added as a relatively lightweight code change.

To get this ready for production, there is more work to do:

Evaluate compression algorithms on realistic datasets. We should analyze:
- zstd
- LZ4
- zstd/LZ4 plus dictionaries: we could craft a dictionary-per-layer to get better compression of each page in the layer.
- Pay particular attention to read performance: this is the part that will be in the hot path for getpage latency.
Revise page header format to enable stashing compression flags -- we currently have a four byte header which is gratuitously large, and we should be able to store compression info in there without adding more header bytes (discussed at Compress image layer #7091 (comment))
Handle compressed user data efficiently: if the user's data is already compressed, we should detect that and avoid re-compressing it on the pageserver.
Define a phased roll-out approach: there maybe significantly more CPU load once compression is in use.

PRs/issues

The text was updated successfully, but these errors were encountered:

We'd like to get some bits reserved in the length field of image layers for future usage (compression). This PR bases on the assumption that we don't have any blobs that require more than 28 bits (3 bytes + 4 bits) to store the length, but as a preparation, before erroring, we want to first emit warnings as if the assumption is wrong, such warnings are less disruptive than errors. A metric would be even less disruptive (log messages are more slow, if we have a LOT of such large blobs then it would take a lot of time to print them). At the same time, likely such 256 MiB blobs will occupy an entire layer file, as they are larger than our target size. For layer files we already log something, so there shouldn't be a large increase in overhead. Part of #5431

problame · 2024-05-27T13:37:00Z

Last week:

arpad wrote a tool to compress image layers Add a pagectl tool to recompress image layers #7879

This week:

identify interesting / representative tenants / layers
determine achievable space savings by running the tool against the identified layers

koivunej · 2024-06-10T13:44:32Z

This week:

implement decompression
compare decompression speed
have a meeting with Konstantin, Stas, and John later this week
- which algorithm is chosen right now

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

@koivunej

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like: ``` AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas ``` Part of #5431

This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read copmressed data, giving us the option to more easily back out of a compression rollout, should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment) , inspired by Christian's review in #8238 (review) . Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and make it parallelized. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully faster. Running with concurrency of 64 (default): ``` 2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...] [...] 2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...] ``` With concurrency of 1, simulating state before this PR: ``` 2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...] [...] 2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...] ``` In other words, to list 500 shards, speed is increased from 2:08 minutes to 6 seconds. Follow-up of #8257, part of #5431

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and make it parallelized. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully faster. Running with concurrency of 64 (default): ``` 2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...] [...] 2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...] ``` With concurrency of 1, simulating state before this PR: ``` 2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...] [...] 2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...] ``` In other words, to list 500 shards, speed is increased from 2:08 minutes to 6 seconds. Follow-up of #8257, part of #5431

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

@koivunej

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like: ``` AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas ``` Part of #5431

This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read copmressed data, giving us the option to more easily back out of a compression rollout, should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment) , inspired by Christian's review in #8238 (review) . Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and make it parallelized. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully faster. Running with concurrency of 64 (default): ``` 2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...] [...] 2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...] ``` With concurrency of 1, simulating state before this PR: ``` 2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...] [...] 2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...] ``` In other words, to list 500 shards, speed is increased from 2:08 minutes to 6 seconds. Follow-up of #8257, part of #5431

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

@koivunej

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like: ``` AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas ``` Part of #5431

This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read copmressed data, giving us the option to more easily back out of a compression rollout, should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment) , inspired by Christian's review in #8238 (review) . Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and make it parallelized. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully faster. Running with concurrency of 64 (default): ``` 2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...] [...] 2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...] ``` With concurrency of 1, simulating state before this PR: ``` 2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...] [...] 2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...] ``` In other words, to list 500 shards, speed is increased from 2:08 minutes to 6 seconds. Follow-up of #8257, part of #5431

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

@koivunej

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like: ``` AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas ``` Part of #5431

This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read copmressed data, giving us the option to more easily back out of a compression rollout, should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment) , inspired by Christian's review in #8238 (review) . Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and make it parallelized. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully faster. Running with concurrency of 64 (default): ``` 2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...] [...] 2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...] ``` With concurrency of 1, simulating state before this PR: ``` 2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...] [...] 2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...] ``` In other words, to list 500 shards, speed is increased from 2:08 minutes to 6 seconds. Follow-up of #8257, part of #5431

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

@koivunej

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like: ``` AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas ``` Part of #5431

This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read copmressed data, giving us the option to more easily back out of a compression rollout, should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment) , inspired by Christian's review in #8238 (review) . Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and make it parallelized. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully faster. Running with concurrency of 64 (default): ``` 2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...] [...] 2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...] ``` With concurrency of 1, simulating state before this PR: ``` 2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...] [...] 2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...] ``` In other words, to list 500 shards, speed is increased from 2:08 minutes to 6 seconds. Follow-up of #8257, part of #5431

jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver labels Oct 2, 2023

MMeent mentioned this issue Jan 19, 2024

Epic: Safekeeper S3 (WAL segment) compression #6409

Open

jcsp mentioned this issue Mar 28, 2024

Data compression in the pageserver #548

Closed

jcsp changed the title ~~Epic: pageserver compression~~ Epic: pageserver image layer compression Apr 15, 2024

jcsp assigned arpad-m May 20, 2024

arpad-m mentioned this issue May 22, 2024

Warn if a blob in an image is larger than 256 MiB #7852

Merged

arpad-m mentioned this issue May 24, 2024

Add a pagectl tool to recompress image layers #7879

Draft

arpad-m mentioned this issue Jun 19, 2024

Add support for reading and writing compressed blobs #8106

Merged

arpad-m mentioned this issue Jul 2, 2024

Only support compressed reads if the compression setting is present #8238

Merged

This was referenced Jul 3, 2024

Use bool param for round_trip_test_compressed #8252

Merged

Add find-large-objects subcommand to scrubber #8257

Merged

arpad-m added a commit that referenced this issue Jul 4, 2024

Use bool param for round_trip_test_compressed (#8252)

a004d27

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

arpad-m mentioned this issue Jul 4, 2024

Flatten compression algorithm setting #8265

Merged

This was referenced Jul 5, 2024

Improve parsing of ImageCompressionAlgorithm #8281

Merged

Enable zstd in tests #8288

Draft

Add concurrency to the find-large-objects scrubber subcommand #8291

Merged

This was referenced Jul 6, 2024

Remove ImageCompressionAlgorithm::DisabledNoDecompress #8300

Open

Implement decompression for vectored reads #8302

Open

VladLazar pushed a commit that referenced this issue Jul 8, 2024

Use bool param for round_trip_test_compressed (#8252)

a6e8902

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

VladLazar pushed a commit that referenced this issue Jul 8, 2024

Use bool param for round_trip_test_compressed (#8252)

304375a

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

VladLazar pushed a commit that referenced this issue Jul 8, 2024

Use bool param for round_trip_test_compressed (#8252)

2e79292

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

VladLazar pushed a commit that referenced this issue Jul 8, 2024

Use bool param for round_trip_test_compressed (#8252)

0a63bc4

As per @koivunej 's request in #8238 (comment) , use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epic: pageserver image layer compression #5431

Epic: pageserver image layer compression #5431

jcsp commented Oct 2, 2023 •

edited by arpad-m

Loading

problame commented May 27, 2024 •

edited

Loading

koivunej commented Jun 10, 2024

Epic: pageserver image layer compression #5431

Epic: pageserver image layer compression #5431

Comments

jcsp commented Oct 2, 2023 • edited by arpad-m Loading

Background

Implementation

PRs/issues

problame commented May 27, 2024 • edited Loading

koivunej commented Jun 10, 2024

jcsp commented Oct 2, 2023 •

edited by arpad-m

Loading

problame commented May 27, 2024 •

edited

Loading