Investigate par2 format for parity #22

za3k · 2021-07-31T16:46:55Z

I've been working on this on my own implementation.

There's an easy solution that already exists. Par2 is a recovery file which generates Reed Solomon codes. It's a packetized format, so you can zero pad each packet to fit in a QR code and the programs will still read them.

The downside is that you need to be aware that recovery packets are 64 bytes larger compared to the recovery block size.

If you don't care about exact alignment, you can just use Par2 as is. If you do, I am currently working on Par2Py, which will allow for alignment and for using an optional packet type that stores the data itself inside the Par2 file itself.

The advantage of this approach is if you do one Par2 packet per QR code, then not only can you recover from some codes missing, but you can also recover from codes being scrambled!

Originally posted by @EmperorArthur in #2 (comment)

za3k · 2021-07-31T16:47:59Z

Thanks a lot for sharing! I'll add your project to the list of related projects.

I think I would rather not bring in a dependency on the par2 format, or the par2 command line tool unless it provides some large benefit. In this case, the obvious benefit would be allowing restore with missing QR codes, without 'qr-backup' present (the biggest reason I was avoiding this feature to start with is that it would require qr-backup for restore). I'll have to think about whether I can write a bash oneliner to do what you're describing.

I re-read your comment and I'm still a little lost on what you mean by either "alignment" (just in the sense of padding?) or scrambling. Could you provide a reference on the par2 format? I'm familiar with the tool, but not the actual file format, and staring at it would probably help.

EmperorArthur · 2021-07-31T17:03:33Z

What dependencies are required for restore is always a critical question.

Personally, the approach I've taken* is to throw a tar file header on the first QR code, and use Par2 for the rest.

My par2 implementation is still in progress, but explicitly has the goal of supporting the optional "Input File Slice packet", and zero padding the packets for maximum recoverability in QR codes.

One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled. An intermediate solution is to use a tar file header (512 bytes) in front of the file data, and another header (512 bytes) in front of the par2 file. As long as the QR codes are read in the proper order and those headers survived, this would allow for recovery even if a data QR code was unreadable.

* Different requirements, and design philosophy.

Important information regarding Par2:

Official Specification
A C++ implementation of required packet types, and signatures. More readable than the spec.
par2cmdline issue regarding "Input File Slice".
My implementation.

EmperorArthur · 2021-07-31T17:10:40Z

Oh, another approach is the master branch of par2cmdline supports generating recovery blocks over 100%. I have confirmed that at 100% recovery, the software can restore a file even if it was completely deleted.

The down side is you must use a par2 program to do so. However, my approach ends up with that requirement anyways, so thinking about it, I will probably ignore the whole "Input File Slice packet" thing, and just generate over 100% recovery data. That allows any par2 program to recover the data!

That does put a dependency that you may not be happy with however. Which is another reason I am working on a pure python implementation.

za3k · 2021-07-31T17:24:47Z

One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled.

Again, what do you mean by "scrambled"? I'll point out that the QR codes I'm using have ~200 bytes, so 64-byte overhead would be a lot. If you're having luck with actually scanning/photographing significantly bigger ones, can you let me know what tools/hardware you're using? I'm taking a look at par2. Hand-rolling par2 IMO seems more complex than using reed-solomon directly--it would only be worthwhile for the restore process. I think we may have the same design requirements, not different ones, if I'm understanding correctly. https://github.com/za3k/qr-backup/blob/master/docs/FAQ.md#what-are-the-design-goals-of-qr-backup Are you decoding to a 2-member tar archive (separate file and .par2)? Or is it a 1-member archive? I'd like the restore to work without par2, which seems doable, but needs a lot of magic knowledge on the encoding end. The main downside is that no one will be able to understand the format any more--only the restore process. P.S. Make sure think about what happens if your headers are the QR codes which are damaged--add redundancy to these! I'm sure you already are :)

za3k · 2021-07-31T17:39:11Z

OK yeah on the whole I think this seems like a good parity option because of the command-line restore option. Thanks for taking the extra time to explain the format you're using to me.

Obstacles:

Deal with the 64-byte overhead. Not only is this big, but I run into issues if different QR codes are different sizes.
Figure out how to encode/decode on Windows/Android/etc. A par2 dependency might be tough for those.
Verify that par2 can actually deal with 1000s of small packets. When I started using reed-solomon, I ran into a lot of unexpected gotchas.
See if I can remove header packets without making the restore awful. I'm not thrilled with the idea of these messing up "any 19 QR codes let you recover" by being special. This is one reason I started rolling my own.

Let me know if you have any answers here.

EmperorArthur · 2021-07-31T17:43:39Z

Sorry, I took so long to respond. Yes I mean if something like the pages of QR codes are mixed up, or the scanner decides to do something like read top to bottom instead of left to right.

Tar format and QR size

I have focused on 512 byte QR codes, and am considering actually bumping up to 1024 bytes. With a full 25% per code redundancy. Just using my phone at this point, 512 scans every time, but 1024 does not.

The tar format works on 512 byte blocks and is a streaming format. Basically, you take a regular file and slap a tar header in front of it. So you can do:

512 byte tar header
Actual File data
512 byte tar header
par2 recovery file

Python has a tar file parser as part of the standard library, so there are no additional dependencies to do that.

My goals, and Par2

My goal is to be able to make a book of QR codes* that if I ever need to recover I can just throw in an auto document feed scanner, convert the whole thing to one big raw file, and then recover everything with one command. Even if a page is missing, the pages were not in the right order, or water/fire damage destroyed a corner of every page.

Par2, with each packet being on a separate QR code, supports everything I just said. Especially since the format supports something like cat *.par2 > ../big.par2, and then being able to just use one big file, even if everything was done at different times. The downside being that it actually ads almost too much redundancy to the "header" packets. This is because, like tar Par2 supports multiple archives in one file.

Unfortunately, packet size consistency and alignment is explicitly not a goal of the par2 format. Here is what I have found, using the names from my Par2Py:

The "RecoveryPacket"s are a fixed size of 64+ 8+<recovery_block_size> bytes.**
The "FileVerificationPacket" is 64+16+ciel(<file_size> / <recovery_block_size>) * 16 bytes.
- There is one of these for each file in the Par2 archive, and par2cmdline deliberated duplicates these to reduce the risk.
The "FileDescriptionPacket" is 64+16+16+16 + ("File name length" padded to nearest 4 byte word) bytes
- There is one of these for each file in the Par2 archive, and par2cmdline deliberated duplicates these to reduce the risk.
The "MainPacket" is 64+8+4+16 * <number_of_FileDescription/FileVerification_packets> bytes.
- There is one of these for each Par2 archive, and par2cmdline deliberated duplicates these to reduce the risk.
The "CreatorPacket" is 64+4+variable
- This is required by the standard, but only holds the name of the program used to create the Par2 archive.

* Even at different times, provided all file names are different.
** Which is an optional parameter to par2cmdline

za3k · 2021-07-31T17:46:38Z

If you haven't checked it out already, I have a much simpler method to "de-scramble" things: sort -u | cut -c5-. It doesn't handle missing codes on its own, though.

EmperorArthur · 2021-07-31T17:50:01Z

Thanks! That's a super simple way of handling it, and has far less overhead than the 68 bytes of a Par2 packet header.

Re-reading your last comment, the largest issue is each Par2 archive is limited to 32768 input blocks. Which at a 444 byte block size results in a maximum size of just under 14MB. Here is me discussing the issue with the par2cmdline people, and being informed of this limitation.

However, that's per archive. You can have multiple archives, they just wont share recovery data.

za3k · 2021-07-31T17:52:32Z

As a note to my future self, the backup format would be a file always called 'file' and a file always called 'file.par2' both inside a nameless tarball. Here are the needed qr-backup QR "packet" types:

A: Tar header, 'file' metdata [static]
N: Normal file
B: Tar metadata for 'file.par', par2 header [not quite static, MD5s and length of file]
P: Parity chunks (par2 packets, one per QR)
C: par2 footer, tar footer [static? OK seems like could be made nonexistent]

Restore process

To decode an uncorrupted file without par2 installed, take "N" packets and concatenate them.
To decode a corrupted file (needs par2):
- Concatenate together A, N, B, P, and C packets.
- Untar them
- Apply par2 to .par file to check/restore the file 'file'. Delete the .par, outputting only 'file'.

EmperorArthur · 2021-07-31T18:15:46Z

Minor notes:

Par2 does not have a footer
Tar technically has a footer of 1024 bytes of 0x00, but since the Tar metadata contains file size almost every implementation will happily ignore it and work with it missing.

Edit: More notes:

Here is a spec for Tar headers.

A is not quite static, at the least you need to update the checksum and length. You will probably want to update the timestamp as well.

The Tar file format explicitly allows a single Tar file to have two headers with the same file name. This is because it was designed for tape drives where you could be backing up a file multiple times. It is implementation defined how those are unpacked. Most implementations either use the last file in the archive, or the file with the newest timestamp.

za3k · 2021-07-31T19:08:32Z

Revised, there's no reason to invoke tar, what was I thinking?

N: Normal file
PH: par2 header [not quite static, MD5s and length of file]
P: Parity chunks (par2 packets, one per QR)

Restore process

To decode an uncorrupted file without par2 installed, take "N" packets and concatenate them.
To decode a corrupted file (needs par2):

Concatenate together N packets into 'file'
Concatenate together PH, P packets into 'file.par2'
Run par2 r file.par2 file, and delete file.par2

za3k · 2021-07-31T19:23:39Z

OK, looked into whether it's possible to trim the 64 bytes--yes, but not while keeping it as a readable bash oneliner.

za3k · 2021-07-31T19:31:22Z

All right, overall I'm leaning toward not using par2 for now. On qr-backup's default settings, 64 binary bytes overhead for each code would mean something like doubling the number of QR codes, which is quite bad. Also, par2 adds a must-have "header" QR code, which I have been specifically avoiding because it makes things complicated for users to think about. My feeling is that currently the costs outweigh the benefits.

I do think it's still a good idea, so let me know if/when you get it working on your own project please. And I may revisit if I get larger QR codes working, and a system for requesting duplicate copies of QR codes.

EmperorArthur · 2021-07-31T19:33:19Z

there's no reason to invoke tar, what was I thinking?

You were following one of my early approaches, not yours. Different design goals and considerations. Pros and cons to each. I suspect we have slightly different definitions of "simple".

I'm glad we could discuss this. Thanks :)

za3k · 2021-07-31T20:21:17Z

Blergh, all right, given that the alternatives look worse (reed-solomon coding isn't clearly a real option), let's give this a try! I'll see what the blowup looks like in practice. I do like the possibility of CLI restore quite a lot.

EmperorArthur · 2021-07-31T22:00:45Z

Hey, don't stress out about it. I'm already working on things to make it easier/aligned. Let me deal with that part, and you can focus on how you want everything to be packed / laid out.

za3k · 2021-08-01T05:36:45Z

Heh, I appreciate it. I just want parity working, and I'm frustrated that all the options are bad (IMO anyway). But having something is better than having nothing. I'll probably do it in a couple days.

That said, if you want to give me a PR feel free!

za3k · 2022-03-26T18:16:26Z

Investigated a little further. Linux, Windows, and Mac should be able to restore the par2 format. There are no apps to do so on Android

za3k · 2022-09-21T00:06:53Z

Closed since I added erasure coding

za3k closed this as completed Jul 31, 2021

za3k reopened this Jul 31, 2021

za3k added enhancement New feature or request question Further information is requested labels Mar 22, 2022

za3k closed this as completed Sep 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate par2 format for parity #22

Investigate par2 format for parity #22

za3k commented Jul 31, 2021

za3k commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021 via email •

edited

Loading

za3k commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021 •

edited

Loading

za3k commented Jul 31, 2021 •

edited

Loading

za3k commented Jul 31, 2021

za3k commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021

EmperorArthur commented Jul 31, 2021

za3k commented Aug 1, 2021

za3k commented Mar 26, 2022

za3k commented Sep 21, 2022

Investigate par2 format for parity #22

Investigate par2 format for parity #22

Comments

za3k commented Jul 31, 2021

za3k commented Jul 31, 2021 • edited Loading

EmperorArthur commented Jul 31, 2021 • edited Loading

Important information regarding Par2:

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021 via email • edited Loading

za3k commented Jul 31, 2021 • edited Loading

EmperorArthur commented Jul 31, 2021

Tar format and QR size

My goals, and Par2

za3k commented Jul 31, 2021

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021 • edited Loading

EmperorArthur commented Jul 31, 2021 • edited Loading

za3k commented Jul 31, 2021 • edited Loading

za3k commented Jul 31, 2021

za3k commented Jul 31, 2021 • edited Loading

EmperorArthur commented Jul 31, 2021

za3k commented Jul 31, 2021

EmperorArthur commented Jul 31, 2021

za3k commented Aug 1, 2021

za3k commented Mar 26, 2022

za3k commented Sep 21, 2022

za3k commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021 •

edited

Loading

za3k commented Jul 31, 2021 via email •

edited

Loading

za3k commented Jul 31, 2021 •

edited

Loading

za3k commented Jul 31, 2021 •

edited

Loading

EmperorArthur commented Jul 31, 2021 •

edited

Loading

za3k commented Jul 31, 2021 •

edited

Loading

za3k commented Jul 31, 2021 •

edited

Loading