Could -U use an SSD as if it was a RAM drive? #214

Open
giovariot opened this issue Dec 28, 2021 · 3 comments

@giovariot

I know very little about the science behind it and never properly understood the mmap concept, but while watching a 1TB disk image being compressed on a 16GB RAM system I noticed that lrzip was using a few GBs of swap space to do an unlimited window compression.

Knowing that disk access is different on an SSD, could lrzip work as if the file to be compressed were already stored in RAM? In simple terms: could lrzip -U use the SSD as if it were a RAM disk? Apart from the huge speed difference, isn't the nice thing about SSDs that one can access addresses just like one can access RAM?

I'm trying to imagine faster ways of compressing whole disk images, and in my current example the disk image is stored on an SSD drive while the compressed file is going to be on an HDD.

Am I talking nonsense or is there any chance to use this to make -U deduplication faster?

Thanks in advance

@pete4abw
Contributor

Put your swap partition on the SSD, and set your TMP directory on the SSD too. By default, lrzip and lrzip-next use the current directory for temporary file space; $TMP overrides that. Putting swap on the SSD will accomplish a lot. To make the most of it, run the app in a directory on the SSD and then use -O dir or -o dir/file.lrz to direct the compressed file output to your HDD.
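A minimal sketch of that setup, wrapped in Python so the pieces are explicit. The paths `/mnt/ssd/tmp` and `/mnt/hdd/disk.img.lrz`, the file name `disk.img`, and the helper `build_lrzip_command` are hypothetical; only the `TMP` variable and the `-U` and `-o` options come from this thread.

```python
import os
import shlex


def build_lrzip_command(image, output, tmp_dir, unlimited=False):
    """Build the argv and environment for an lrzip run (hypothetical helper).

    tmp_dir is exported as TMP so temporary files land on the SSD,
    while output can point at a slower HDD.
    """
    env = dict(os.environ, TMP=tmp_dir)
    argv = ["lrzip"]
    if unlimited:
        argv.append("-U")  # unlimited rzip window, as discussed in this issue
    argv += ["-o", output, image]
    return argv, env


# Hypothetical paths: adjust to your own mount points.
argv, env = build_lrzip_command("disk.img", "/mnt/hdd/disk.img.lrz", "/mnt/ssd/tmp")
print("TMP=" + shlex.quote(env["TMP"]), shlex.join(argv))
```

To actually run it, pass `argv` and `env` to `subprocess.run` with `cwd` set to a working directory on the SSD.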

That said, using -U will always slow things down, and I have yet to see a case where it improves compression enough to make the additional time worth it. Keep in mind that -U only impacts the rzip compression. The backend compression will still operate on chunks and blocks within them. From the manpage:

-U | --unlimited
Unlimited window size. If this option is set, and the file being compressed does not fit into the available ram, lrzip-next will use a moving second buffer as a "sliding mmap" which emulates having infinite ram. This will provide the most possible compression in the first rzip stage which can improve the compression of ultra large files when they're bigger than the available ram. However it runs progressively slower the larger the difference between ram and the file size, so is best reserved for when the smallest possible size is desired on a very large file, and the time taken is not important.
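For readers unfamiliar with mmap, here is a rough Python sketch of the general idea behind a sliding window: only one bounded region of the file is mapped into memory at a time, and the mapping moves as the requested offset moves. This is an illustration under stated assumptions, not lrzip's implementation; the function name `read_through_window` and the default window size are made up for the example.

```python
import mmap
import os


def read_through_window(path, offset, length, window_size=1 << 20):
    """Return `length` bytes starting at `offset`, mapping one window at a time.

    Illustrative only: each window is unmapped before the next one is mapped,
    so memory use stays bounded no matter how large the file is.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    # Round the window down to a multiple of the allocation granularity
    # (at least one unit), because mmap offsets must be aligned to it.
    window_size = max(gran, (window_size // gran) * gran)
    file_size = os.path.getsize(path)
    end = min(offset + length, file_size)
    out = bytearray()
    with open(path, "rb") as f:
        pos = offset
        while pos < end:
            win_start = (pos // window_size) * window_size
            win_len = min(window_size, file_size - win_start)
            with mmap.mmap(f.fileno(), win_len, access=mmap.ACCESS_READ,
                           offset=win_start) as window:
                lo = pos - win_start
                hi = min(end - win_start, win_len)
                out += window[lo:hi]
                pos = win_start + hi
    return bytes(out)
```

Every request that falls outside the current window forces a remap and a disk read, which is why this kind of access gets slower as the file grows relative to RAM, whatever the storage device.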

In lrzip-next, you can tune the rzip pass with the -R# option: regardless of the compression level, the rzip level can be set independently from 1 to 9. This will help pack the file better on the first pass.

@giovariot
Author

giovariot commented Dec 28, 2021

I'm not sure I have understood anything at all right now 😂

Here is the description of the problem: I'm trying to compress, as well as I know how, a before-recovery disk image, so that if my client's data is ruined by the recovery they can go back to the original disk image and try to extract the damaged data directly from the dd image.
Now some info about the disk itself: it was a 1TB SSD with 2 partitions, one with the OS and most of the data and another with two huge, almost duplicate folders coming from 2 different file recovery programs (which seem to be masters at destroying hard disks). So in the disk image there are for sure lots of duplicates between the first and the second partition, and the second partition contains almost only duplicates. I thought the lrzip -U option removed duplicate data across the whole file, even if the copies were far apart: for example, the first in the first GBs of the first partition, the second at the start of the second partition and the third at the end of the second partition, so that the output file would contain only "one" copy of the data.

Isn't -U perfect for infinite-window redundancy removal? As far as I know, only lrzip can do this sort of infinite redundancy removal, while rzip is limited to a 900MB window. I also checked your lrzip-next fork, but not finding the -U option I came back to the standard lrzip and used it like so: lrzip -n -U img -o output.lrz, with the -n option so that I get a "clean" file to run another compression program on. Have I not understood anything at all about how all this works?

What does the -R option in lrzip-next accomplish? I thought that the first rzip pass was only doing redundancy checks and not compression too. Does the -R option also check for hash duplicates at, e.g., 450GB of distance?

Thanks in advance @pete4abw, I'm sorry for being such a noob 😅

@pete4abw
Contributor

Isn't -U perfect for infinite-window redundancy removal? As far as I know, only lrzip can do this sort of infinite redundancy removal, while rzip is limited to a 900MB window. I also checked your lrzip-next fork, but not finding the -U option I came back to the standard lrzip. Have I not understood anything at all about how all this works?

The rzip program itself is not used; rzip functions are. This is the first pass, which creates hash indexes of data over long distances. Your idea of "one copy of the data" is a little simplistic. The rzip pass evaluates data as it comes. It is stream-based, not file-based. Theoretically, your file, or parts of it, may be hashed depending on where in the stream of data it occurs and how large the stream chunk is. I think the largest evaluation block in rzip is 64MB, but I could be wrong.
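As a rough illustration of what "hash indexes of data over long distances" means, here is a toy Python sketch that records a hash for each fixed-size block and reports any later block that repeats an earlier one, however far apart they are. It is deliberately simplified and is not the rzip algorithm: the aligned fixed-size blocks, the `block_size` of 4096 and the function name `find_long_range_duplicates` are assumptions made for the example, and real long-range matchers generally find matches at arbitrary byte positions.

```python
import hashlib


def find_long_range_duplicates(data: bytes, block_size: int = 4096):
    """Return (offset, earlier_offset) pairs for blocks that repeat an earlier block.

    Toy model: only aligned, fixed-size blocks are compared.
    """
    seen = {}      # block digest -> offset of its first occurrence
    matches = []
    for offset in range(0, len(data) - block_size + 1, block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.blake2b(block, digest_size=16).digest()
        first = seen.setdefault(digest, offset)
        # Confirm byte-for-byte so a hash collision can never produce a false match.
        if first != offset and data[first:first + block_size] == block:
            matches.append((offset, first))
    return matches
```

In a scheme like this, a repeated block can be stored as a short reference to `earlier_offset` instead of the block itself. Even in this toy version, a copy that is shifted by a single byte is missed entirely, which is one reason "one copy of the data" does not hold in general.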

The -R# option defines the level of the rzip pass. By default it is the same as the compression level selected for lrzip or lrzip-next. As with any compression method, the higher the number, the better the compression. So you could use lrzip-next with the -L4 option and the -R9 option, meaning the first pass will be more thorough even though the backend compression won't be tasked with a high level.

The rzip pass certainly does compress. Look at the output of lrzip-next -vvi file.lrz. In the output below you will see that the first pass shrinks the original 1GB file by about 9%, to 90.8% of its size. Then the backend, zpaq in this example, compresses the 908MB stream to 17.5% of its size. You can see the overall effect. Going forward, please move this topic to the lrzip-next discussions page. Good luck.

Summary
=======
File: /tmp/enwik9.z9bs11.lrz
lrzip-next version: 0.8 file

  Stats         Percent       Compressed /   Uncompressed
  -------------------------------------------------------
  Rzip:          90.8%       908,203,127 /  1,000,000,000
  Back end:      17.5%       159,020,559 /    908,203,127
  Overall:       15.9%       159,020,559 /  1,000,000,000

  Compression Method: rzip + zpaq -- Compression Level = 5, Block Size = 11, 2,048MB

  Decompressed file size:  1,000,000,000
  Compressed file size:      159,020,651
  Compression ratio:               6.288x


2 participants