SSD wear when utilizing compression #92

Open · akseg73 opened this issue Sep 28, 2024 · 17 comments

@akseg73 commented Sep 28, 2024

Is there some way to reduce the "chunk size" for compressed blocks in order to reduce write amplification? "Chunk size" here would mean the smallest block of compressed data that has to be written individually. Depending on the use case at hand, is it possible to reduce the chunk size to something closer to 16kb or less? Having a much larger chunk size can create problems in the presence of frequent modifications and can severely degrade SSD life.

(The chunk size above may not mean the same thing as the option provided to lvcreate.)

@akseg73 (Author) commented Sep 28, 2024

Even if such a feature is not explicitly provided, I would imagine that the change should only require modifying some constant somewhere, rebuilding the kernel modules, and installing those. If you can point me towards the header file that has the said constant, I can make the change, rebuild, and test with it.

@lorelei-sakai (Member) commented

The short answer is no. All compression is done in 4kb chunks and the granularity cannot be changed.

But to go into a little more depth, what write amplification are you talking about that you want to prevent? VDO generally will only write each unique block once (compressed or not). As phrased, the question doesn't seem to make sense for VDO.

The compression handling is mostly in packer.[hc]. That would be the place to start looking if you want to dig into this yourself.

@akseg73 (Author) commented Sep 28, 2024

Thanks for the response. I was guessing at the internal workings of VDO based on how ZFS works. If the file is divided up into contiguous compressed chunks, then modifying 4k within a chunk leads to recompression of the whole chunk, and hence to write amplification. This is the kind of usage in which modifications are expected to occur on compressed data. If you compress the data and it remains read-only thereafter, then it does not matter. On the other hand, if you expect to modify compressed data, then you have to recompress it. It then becomes a matter of what other data is adjacent to the data you are compressing/uncompressing, and what other data ends up sharing space with the data you are recompressing.

So the question I am posing only makes sense in an environment where compressed data is frequently modified. If VDO does not work for such a case, then of course the write amplification question does not arise.

@lorelei-sakai (Member) commented

I'm not familiar with the way ZFS works. VDO really only works on 4k blocks and doesn't use larger chunks for anything, so I think these amplification concerns don't affect VDO. If compressed data is modified and rewritten, VDO does have to compress the new data, but it doesn't rewrite compressed data that hasn't been changed. Newly compressed fragments are combined with other newly compressed data and written as new data blocks.

@akseg73 (Author) commented Sep 28, 2024

Thanks for the response. Then it appears to be quite different; it would also seem that this design may be more aligned with the natural behavior of SSDs, which do not perform update-in-place. On the other hand, this design conceivably requires more bookkeeping to find out what is where, and might end up adding some write overhead. I would have to dig much further into the internals to determine whether this really matters or whether you have some nifty ways to avoid that as well.

@lorelei-sakai (Member) commented

Yes. VDO does have to track a fair amount of metadata, and its write amplification concerns mostly center around making sure the metadata writes are managed efficiently. Generally the I/O saved by deduplication and compression will outweigh the metadata costs, but for workloads that don't have much duplicate data, VDO may not be a good choice.

@akseg73 (Author) commented Sep 28, 2024

The data we have is highly compressible, but we will turn off dedup since we don't have duplicate data. Would a use case where compression is the main goal work with VDO?

@raeburn (Member) commented Sep 28, 2024

It's not the sort of case we test or tune for, but it should work fairly well. At least, that's assuming the data is compressible in small chunks, and not compressible only because, say, a 5kB string of binary data that doesn't compress well by itself is repeated a lot with only minor variations, so that all of your compressibility would come from seeing those long strings of bits repeated a lot. Most compressible data I've seen compresses fine in smaller chunks (e.g., text, or small and highly structured binary records), so yes, I would expect that to work well with VDO.

Depending on your data, you might find it useful to tweak the compression code a little to specify a higher level of compression (and higher CPU utilization) if it improves the space savings. I believe the driver is hard-coded to use the default or fastest level of compression. Unfortunately(?) we don't have controls available for the compression level or alternate compression algorithms.

@akseg73 (Author) commented Sep 28, 2024

What about a case where each 4k block contains 30% contiguous empty space? That is an ideal case for compression. There is also the case where two pages (4k, say) are identical with the exception of 50 bytes; in that case I would imagine you would need dedup to be switched on, since compression alone (lz4) would not be able to figure that out.

Let's focus just on the case where every 4k block written is 30% empty. Would that work reasonably well... just to get a baseline understanding of what to expect?

@raeburn (Member) commented Sep 30, 2024

The 30% empty space certainly provides some compressibility, but similarity between different 4kB blocks does not, the way VDO works.

VDO also needs to pack two or more compressed chunks (plus a little bit of metadata) into one 4kB block for storage; the compressed chunks can't span block boundaries. If the 30% is the only compressibility, each data block would compress to about 70% of its size, and no two would fit together in 4kB. If some of the remaining 70% is compressible as well, it may still be possible to pack 2:1 or better.

As a trial, you could try "split -b 4096" on some sample data files or disk images and "lz4 -1" on the split files, to get a ballpark estimate for how well it might go.
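
Here is a minimal sketch of that trial in Python rather than with `split` and the `lz4` CLI, assuming the third-party `lz4` package (`pip install lz4`) as a rough stand-in for `lz4 -1`. It compresses each 4096-byte block independently and reports how many compress small enough that two could plausibly share a 4kB block; the 100-byte allowance for packing metadata is an assumption for illustration, not VDO's actual overhead.

```python
# Ballpark per-4k-block compressibility estimate, standing in for
# "split -b 4096" + "lz4 -1" on each piece. Illustrative only; this is
# not VDO's real packer, and the metadata allowance is a guess.
import sys
import lz4.frame

BLOCK = 4096
META_ALLOWANCE = 100  # assumed per-block packing overhead (hypothetical)

def main(path):
    sizes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            chunk = chunk.ljust(BLOCK, b"\0")  # pad a short tail block
            sizes.append(len(lz4.frame.compress(chunk)))

    if not sizes:
        print("empty file")
        return
    limit = (BLOCK - META_ALLOWANCE) // 2
    pairable = sum(1 for s in sizes if s <= limit)
    print(f"{len(sizes)} blocks, average compressed size "
          f"{sum(sizes) / len(sizes):.0f} bytes")
    print(f"{pairable} blocks ({100.0 * pairable / len(sizes):.1f}%) compress "
          f"below ~{limit} bytes, i.e. two of them could share one 4k block")

if __name__ == "__main__":
    main(sys.argv[1])
```

If most blocks land above that roughly 2kB threshold, 2:1 packing will rarely happen, which matches the 70% example above.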

@akseg73 (Author) commented Sep 30, 2024

This is sort of what I was trying to understand earlier, namely what the chunk size is; from your description it seems that the chunk size is always 4k. This is where ZFS would allow a different chunk size. If, for example, the chunk size were 8k instead and we had 33% empty space in each page, then we would be able to fit 3 pages into the 8k instead of 2 and save the space of 1 page. There must be a way for me to change the chunk size to 8k or 12k and experiment with that.
If there is an absolute requirement that 2 pages must compress down to 4kb for this to work, then it is probably going to be problematic, because I would typically expect a page to compress down to 60% of its original size.

Also, if we are compressing 2 pages into a single block of 4k, then we come back to the original question: if we modify only 1 of these two pages and issue a write, will that not end up rewriting the whole 4k chunk, recompressing the page that changed while leaving the other page unchanged? Or will it simply forget the previous 4k altogether and club together two previously unrelated but modified pages into a single 4k block again?

If it follows the latter strategy, I imagine that will increase the bookkeeping by a lot.

@raeburn (Member) commented Sep 30, 2024

It's essentially baked into the design that the same size is used throughout. For example, a logical address (say you read the 4kB block at offset 0x12345000) maps to a physical address and an index into the array of compressed chunks (with a special value for "not compressed"). A block of compressed chunks is stored with an array of offsets into the 4kB block showing where each of up to 14 compressed chunks start. Packing 3:2 doesn't really work under this model.
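
As a rough illustration of that lookup model (a toy sketch only; the names here are made up and this is not VDO's actual on-disk format, though the "up to 14 compressed chunks" limit comes from the description above):

```python
# Toy model of the logical-to-physical mapping described above.
# Illustrative only -- not VDO's real data structures or layout.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

BLOCK_SIZE = 4096
MAX_FRAGMENTS = 14     # "up to 14 compressed chunks" per packed 4kB block
NOT_COMPRESSED = None  # stand-in for the "not compressed" special value

@dataclass
class BlockMapEntry:
    """Where one logical 4kB address lives physically."""
    physical_block: int
    fragment_slot: Optional[int]  # index into the packed block, or NOT_COMPRESSED

@dataclass
class PackedBlock:
    """A 4kB physical block holding several compressed fragments."""
    offsets: List[int] = field(default_factory=list)  # start of each fragment
    data: bytes = b"\0" * BLOCK_SIZE

    def fragment(self, slot: int) -> bytes:
        start = self.offsets[slot]
        end = self.offsets[slot + 1] if slot + 1 < len(self.offsets) else len(self.data)
        return self.data[start:end]  # still compressed; the caller decompresses

def read_logical(block_map: Dict[int, BlockMapEntry],
                 packed_blocks: Dict[int, PackedBlock],
                 whole_blocks: Dict[int, bytes],
                 logical_addr: int) -> bytes:
    """Resolve a logical 4kB read: address -> physical block (+ fragment slot)."""
    entry = block_map[logical_addr]
    if entry.fragment_slot is NOT_COMPRESSED:
        return whole_blocks[entry.physical_block]  # stored uncompressed
    return packed_blocks[entry.physical_block].fragment(entry.fragment_slot)
```

Because every fragment is addressed as a (physical block, slot) pair with offsets inside a single 4kB block, "3 pages into 8k" has no place to live in this scheme, which is why 3:2 packing doesn't fit the model.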

We use reference counts to track when a previously allocated block can be recycled. If we pack two compressed blocks into one block for writing, it'll start with a ref count of 2. If one logical address is overwritten with new data, that data is written elsewhere, and the ref count on the first block drops to 1. If the other logical address is overwritten too, then the ref count drops to 0 and we can reallocate the block. In theory it's possible to fill a VDO device with compressed blocks that are down to a ref count of 1 each, wasting a lot of space; we expect that the randomness of access patterns and variation in compressibility will keep that from happening too often. With a highly structured access pattern and data, it's possible to defeat this.
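
And a small sketch of that reference-count lifecycle under the same toy model (again illustrative, not VDO's actual accounting):

```python
# Toy reference-count lifecycle for packed physical blocks (illustrative only).
ref_counts: dict[int, int] = {}  # physical block number -> live logical references
reclaimable: set[int] = set()

def write_packed(pblock: int, fragments_packed: int) -> None:
    """A new physical block written with N compressed fragments packed into it."""
    ref_counts[pblock] = fragments_packed

def overwrite_logical(pblock: int) -> None:
    """A logical address that pointed into pblock is overwritten; the new data
    is written elsewhere, so pblock just loses one reference."""
    ref_counts[pblock] -= 1
    if ref_counts[pblock] == 0:
        reclaimable.add(pblock)  # no fragments referenced; block can be recycled
        del ref_counts[pblock]

write_packed(7, fragments_packed=2)  # two compressed fragments share block 7
overwrite_logical(7)  # one is rewritten: ref count drops to 1, and the dead
                      # fragment's space in block 7 is wasted until...
overwrite_logical(7)  # ...the other is rewritten too; block 7 becomes reclaimable
print(reclaimable)    # {7}
```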

There's currently no manual "garbage collection" or "defrag" mechanism for VDO compressed blocks still partly in use but containing some unreferenced compressed chunks, but it would be possible to put something together, in theory. It would be expensive to run, as it would probably need to scan the whole logical address space mapping, and look at the stored compressed blocks (to see how many index values are used, to figure out if any are no longer referenced anywhere) and their ref counts. If a compressed fragment gets moved, all entries in the address map have to get updated, and there are no back pointers to those map entries.

(And, it's worth noting, we don't aim for "perfect" or "ideal" space efficiency with our deduplication or compression, just "very good" for the most typical use cases. Very long window between two writes of the same data block content? We may miss it, if the index entry expires. None of your data blocks compress by more than 30%? We won't be able to pack them compactly.)

The block size is assumed to be 4kB, because in the past we found that worked best for deduplication, which has always been our main focus. (This was years ago and I don't know if the data are public.) For compression without deduplication, it's possible larger block sizes might work better. The block size was configurable early on in development, as I recall, but fixing the value simplified things greatly and it didn't seem to be an option we expected people to use much, if deduplication would be made worse by it. It might be possible to add it back, but it would probably be a large and subtle piece of work, and I'm not sure it would be worthwhile.

@akseg73 (Author) commented Sep 30, 2024

Thanks for the detailed explanation. It seems that the functionality has been fine-tuned for specific usage. I don't see how we can guarantee that two blocks fit together in 4k; it could happen, but it would be too restrictive to enforce it.

ZFS allows configuring the chunk size, but ZFS is very large and complicated and would require us to overhaul our storage design. And it would add CPU costs of its own.

If you are aware of any other block compression libraries, do let me know.

If you have examined other tools in a similar space that could be useful, please do provide the details.

@akseg73 (Author) commented Sep 30, 2024

For a lot of usages the kinds of things that can help are: a) allowing the chunk size to be flexible (4k, 8k, 12k, or something else); b) providing a mechanism to recover one page from another when they differ in only a few bytes.

And ZFS has something for each kind of usage, except that, as mentioned, ZFS is a large piece of code which comes with its own CPU costs.

@lorelei-sakai (Member) commented

From the way you describe your dataset, I don't think VDO will help very much. VDO is designed around deduplication, while the compression feature was added on later. This is why VDO uses a 4k block size. (The amount of duplication found drops off significantly as the block size increases.)

A few years back we looked into making a compression-only storage target. Such a thing would significantly reduce the overhead associated with VDO, and might allow features like a configurable chunk size. However, we didn't see a lot of demand for such a target, so it never got to the detailed design stage. I mention this mostly because, at the time, we also did not find any other existing storage system focused solely on compression.

@akseg73 (Author) commented Sep 30, 2024

Thanks for the response.

@akseg73 (Author) commented Oct 11, 2024

I am considering the possibility of starting an open source project, under a BSD license, that focuses solely on a very lightweight compression scheme, and contributing it back.

I believe that the lack of demand you saw may have something to do with
a) the audience you sampled
b) a lack of education.

If done right, this kind of utility could become a standard for nearly all data products; we would have to guarantee that CPU performance doesn't degrade by more than ~10%.

If you have testing infrastructure that could be used to validate such software, or would like to collaborate on it, or would like to provide any other input, please chime in.
