Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate par2 format for parity #22

Closed
za3k opened this issue Jul 31, 2021 · 19 comments
Closed

Investigate par2 format for parity #22

za3k opened this issue Jul 31, 2021 · 19 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@za3k
Copy link
Owner

za3k commented Jul 31, 2021

I've been working on this on my own implementation.

There's an easy solution that already exists. Par2 is a recovery file which generates Reed Solomon codes. It's a packetized format, so you can zero pad each packet to fit in a QR code and the programs will still read them.

The downside is that you need to be aware that recovery packets are 64 bytes larger compared to the recovery block size.

If you don't care about exact alignment, you can just use Par2 as is. If you do, I am currently working on Par2Py, which will allow for alignment and for using an optional packet type that stores the data itself inside the Par2 file itself.

The advantage of this approach is if you do one Par2 packet per QR code, then not only can you recover from some codes missing, but you can also recover from codes being scrambled!

Originally posted by @EmperorArthur in #2 (comment)

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

Thanks a lot for sharing! I'll add your project to the list of related projects.

I think I would rather not bring in a dependency on the par2 format, or the par2 command line tool unless it provides some large benefit. In this case, the obvious benefit would be allowing restore with missing QR codes, without 'qr-backup' present (the biggest reason I was avoiding this feature to start with is that it would require qr-backup for restore). I'll have to think about whether I can write a bash oneliner to do what you're describing.

I re-read your comment and I'm still a little lost on what you mean by either "alignment" (just in the sense of padding?) or scrambling. Could you provide a reference on the par2 format? I'm familiar with the tool, but not the actual file format, and staring at it would probably help.

@EmperorArthur
Copy link

EmperorArthur commented Jul 31, 2021

What dependencies are required for restore is always a critical question.

Personally, the approach I've taken* is to throw a tar file header on the first QR code, and use Par2 for the rest.

My par2 implementation is still in progress, but explicitly has the goal of supporting the optional "Input File Slice packet", and zero padding the packets for maximum recoverability in QR codes.

One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled. An intermediate solution is to use a tar file header (512 bytes) in front of the file data, and another header (512 bytes) in front of the par2 file. As long as the QR codes are read in the proper order and those headers survived, this would allow for recovery even if a data QR code was unreadable.

* Different requirements, and design philosophy.

Important information regarding Par2:

@EmperorArthur
Copy link

Oh, another approach is the master branch of par2cmdline supports generating recovery blocks over 100%. I have confirmed that at 100% recovery, the software can restore a file even if it was completely deleted.

The down side is you must use a par2 program to do so. However, my approach ends up with that requirement anyways, so thinking about it, I will probably ignore the whole "Input File Slice packet" thing, and just generate over 100% recovery data. That allows any par2 program to recover the data!

That does put a dependency that you may not be happy with however. Which is another reason I am working on a pure python implementation.

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021 via email

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

OK yeah on the whole I think this seems like a good parity option because of the command-line restore option. Thanks for taking the extra time to explain the format you're using to me.

Obstacles:

  1. Deal with the 64-byte overhead. Not only is this big, but I run into issues if different QR codes are different sizes.
  2. Figure out how to encode/decode on Windows/Android/etc. A par2 dependency might be tough for those.
  3. Verify that par2 can actually deal with 1000s of small packets. When I started using reed-solomon, I ran into a lot of unexpected gotchas.
  4. See if I can remove header packets without making the restore awful. I'm not thrilled with the idea of these messing up "any 19 QR codes let you recover" by being special. This is one reason I started rolling my own.

Let me know if you have any answers here.

@EmperorArthur
Copy link

Sorry, I took so long to respond. Yes I mean if something like the pages of QR codes are mixed up, or the scanner decides to do something like read top to bottom instead of left to right.

Tar format and QR size

I have focused on 512 byte QR codes, and am considering actually bumping up to 1024 bytes. With a full 25% per code redundancy. Just using my phone at this point, 512 scans every time, but 1024 does not.

The tar format works on 512 byte blocks and is a streaming format. Basically, you take a regular file and slap a tar header in front of it. So you can do:

  • 512 byte tar header
  • Actual File data
  • 512 byte tar header
  • par2 recovery file

Python has a tar file parser as part of the standard library, so there are no additional dependencies to do that.

My goals, and Par2

My goal is to be able to make a book of QR codes* that if I ever need to recover I can just throw in an auto document feed scanner, convert the whole thing to one big raw file, and then recover everything with one command. Even if a page is missing, the pages were not in the right order, or water/fire damage destroyed a corner of every page.

Par2, with each packet being on a separate QR code, supports everything I just said. Especially since the format supports something like cat *.par2 > ../big.par2, and then being able to just use one big file, even if everything was done at different times. The downside being that it actually ads almost too much redundancy to the "header" packets. This is because, like tar Par2 supports multiple archives in one file.

Unfortunately, packet size consistency and alignment is explicitly not a goal of the par2 format. Here is what I have found, using the names from my Par2Py:

  • The "RecoveryPacket"s are a fixed size of 64+ 8+<recovery_block_size> bytes.**
  • The "FileVerificationPacket" is 64+16+ciel(<file_size> / <recovery_block_size>) * 16 bytes.
    • There is one of these for each file in the Par2 archive, and par2cmdline deliberated duplicates these to reduce the risk.
  • The "FileDescriptionPacket" is 64+16+16+16 + ("File name length" padded to nearest 4 byte word) bytes
    • There is one of these for each file in the Par2 archive, and par2cmdline deliberated duplicates these to reduce the risk.
  • The "MainPacket" is 64+8+4+16 * <number_of_FileDescription/FileVerification_packets> bytes.
    • There is one of these for each Par2 archive, and par2cmdline deliberated duplicates these to reduce the risk.
  • The "CreatorPacket" is 64+4+variable
    • This is required by the standard, but only holds the name of the program used to create the Par2 archive.

* Even at different times, provided all file names are different.
** Which is an optional parameter to par2cmdline

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

If you haven't checked it out already, I have a much simpler method to "de-scramble" things: sort -u | cut -c5-. It doesn't handle missing codes on its own, though.

@EmperorArthur
Copy link

Thanks! That's a super simple way of handling it, and has far less overhead than the 68 bytes of a Par2 packet header.

Re-reading your last comment, the largest issue is each Par2 archive is limited to 32768 input blocks. Which at a 444 byte block size results in a maximum size of just under 14MB. Here is me discussing the issue with the par2cmdline people, and being informed of this limitation.

However, that's per archive. You can have multiple archives, they just wont share recovery data.

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

As a note to my future self, the backup format would be a file always called 'file' and a file always called 'file.par2' both inside a nameless tarball. Here are the needed qr-backup QR "packet" types:

  • A: Tar header, 'file' metdata [static]
  • N: Normal file
  • B: Tar metadata for 'file.par', par2 header [not quite static, MD5s and length of file]
  • P: Parity chunks (par2 packets, one per QR)
  • C: par2 footer, tar footer [static? OK seems like could be made nonexistent]

Restore process

  • To decode an uncorrupted file without par2 installed, take "N" packets and concatenate them.
  • To decode a corrupted file (needs par2):
    • Concatenate together A, N, B, P, and C packets.
    • Untar them
    • Apply par2 to .par file to check/restore the file 'file'. Delete the .par, outputting only 'file'.

@EmperorArthur
Copy link

EmperorArthur commented Jul 31, 2021

Minor notes:

  • Par2 does not have a footer
  • Tar technically has a footer of 1024 bytes of 0x00, but since the Tar metadata contains file size almost every implementation will happily ignore it and work with it missing.

Edit: More notes:

Here is a spec for Tar headers.

  • A is not quite static, at the least you need to update the checksum and length. You will probably want to update the timestamp as well.

The Tar file format explicitly allows a single Tar file to have two headers with the same file name. This is because it was designed for tape drives where you could be backing up a file multiple times. It is implementation defined how those are unpacked. Most implementations either use the last file in the archive, or the file with the newest timestamp.

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

Revised, there's no reason to invoke tar, what was I thinking?

  • N: Normal file
  • PH: par2 header [not quite static, MD5s and length of file]
  • P: Parity chunks (par2 packets, one per QR)

Restore process

To decode an uncorrupted file without par2 installed, take "N" packets and concatenate them.
To decode a corrupted file (needs par2):

  • Concatenate together N packets into 'file'
  • Concatenate together PH, P packets into 'file.par2'
  • Run par2 r file.par2 file, and delete file.par2

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

OK, looked into whether it's possible to trim the 64 bytes--yes, but not while keeping it as a readable bash oneliner.

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

All right, overall I'm leaning toward not using par2 for now. On qr-backup's default settings, 64 binary bytes overhead for each code would mean something like doubling the number of QR codes, which is quite bad. Also, par2 adds a must-have "header" QR code, which I have been specifically avoiding because it makes things complicated for users to think about. My feeling is that currently the costs outweigh the benefits.

I do think it's still a good idea, so let me know if/when you get it working on your own project please. And I may revisit if I get larger QR codes working, and a system for requesting duplicate copies of QR codes.

@za3k za3k closed this as completed Jul 31, 2021
@za3k za3k reopened this Jul 31, 2021
@EmperorArthur
Copy link

there's no reason to invoke tar, what was I thinking?

You were following one of my early approaches, not yours. Different design goals and considerations. Pros and cons to each. I suspect we have slightly different definitions of "simple".

I'm glad we could discuss this. Thanks :)

@za3k
Copy link
Owner Author

za3k commented Jul 31, 2021

Blergh, all right, given that the alternatives look worse (reed-solomon coding isn't clearly a real option), let's give this a try! I'll see what the blowup looks like in practice. I do like the possibility of CLI restore quite a lot.

@EmperorArthur
Copy link

Hey, don't stress out about it. I'm already working on things to make it easier/aligned. Let me deal with that part, and you can focus on how you want everything to be packed / laid out.

@za3k
Copy link
Owner Author

za3k commented Aug 1, 2021

Heh, I appreciate it. I just want parity working, and I'm frustrated that all the options are bad (IMO anyway). But having something is better than having nothing. I'll probably do it in a couple days.

That said, if you want to give me a PR feel free!

@za3k za3k added enhancement New feature or request question Further information is requested labels Mar 22, 2022
@za3k
Copy link
Owner Author

za3k commented Mar 26, 2022

Investigated a little further. Linux, Windows, and Mac should be able to restore the par2 format. There are no apps to do so on Android

@za3k
Copy link
Owner Author

za3k commented Sep 21, 2022

Closed since I added erasure coding

@za3k za3k closed this as completed Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants