Flag for automatic Blocksize in commandline #171

Open
the123blackjack opened this issue Mar 20, 2022 · 18 comments

the123blackjack commented Mar 20, 2022

Hi @BlackIkeEagle @mdnahas
I mainly use par2 for creating parity for family photos and videos
This is an enhancement request for par2cmdline to choose the best possible block size and count based on an analysis of file structure and sizes, since I (and other users) may be unsure of what should be set.

Command I currently use (without the -s or -b switches):
par2 create -r10 "parchivename" "FilesToCreateParity*.extension"

What I hope for is:

  • par2cmdline to select the best possible block count and size if unspecified
  • Documentation / Guide for choosing Block Count and Size
  • An explanation of how par2cmdline currently chooses the block count and
    size: is it based on an analysis of file sizes or a default value?

Hoping for your valuable feedback and thoughts...

@animetosho (Contributor)

the best possible block size and count

What is the "best" though?

Generally the ideal size/count depends on your preference for performance, error resilience and size of recovery (if a percentage wasn't specified).

Documentation / Guide for choosing Block Count and Size

A better explanation in the help may be useful.

For a general overview, PAR2 is block based, in the sense that the data is broken up into blocks. A single error renders the entire block it falls within corrupt, meaning that larger blocks are less resilient against random errors.
Increasing the block count improves recoverability, at the expense of performance and size of recovery files. Note that the percentage option is just a convenience function to help you calculate one from the other - in reality, PAR2 only cares about block size and count.

How par2cmdline currently chooses the block count

From memory, there's no analysis; it just picks 2000 as the count. Personally I think it should be removed, as there's no reasoning behind the '2000' figure, and it can make people think that it's a good value to use (when there's really no reason to believe so).

the123blackjack (Author) commented Mar 20, 2022

@animetosho Yeah, the absolute best scenario is for par2cmdline to select the appropriate block count / size based on the file structure and size.

But yes, enhanced documentation in the man pages, with a formula to come up with this value for block count or size, would definitely help.

Or even a flag for Low, Medium, or High values.

Since I am not specifying either switch (-s or -b), my block count would be 2000. Is there any formula that you use to come up with a value for block count / size?

I also would like to know if 2000 was chosen as the best figure, and whether there is a specific reason that we are not seeing @BlackIkeEagle @mdnahas

@animetosho (Contributor)

Yeah, the absolute best scenario is for par2cmdline to select the appropriate block count / size based on the file structure and size

Firstly, I don't know what you mean by "file structure", but I think you're assuming that an "ideal" setting exists, when there's no such thing.
If you really do think such a thing exists, would you be able to explain how it would work in detail, and provide exact examples?

Or even a flag for Medium , Low or High values

Like above, you'd have to define what these actually mean.

is there any formula that you use to come up with a value for block count / size

Amount of recovery data = recovery block count * block size

You need to supply two of the values above to work out the third.
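
As a rough sketch of that relationship (my own illustration in Python, not anything par2cmdline does internally):

    # Any two of {recovery data, recovery block count, block size} determine the third.
    def recovery_data(block_count, block_size):
        return block_count * block_size

    def recovery_block_count(recovery_bytes, block_size):
        return -(-recovery_bytes // block_size)          # ceiling division

    def recovery_block_size(recovery_bytes, block_count):
        return -(-recovery_bytes // block_count)

    # e.g. 100 recovery blocks of 1MB each -> 100MB of recovery data
    print(recovery_data(100, 1_000_000))                 # 100000000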

I also would like to know if 2000 was chosen as the best figure

It's not.
2000 has no particular meaning, and I think it should be removed because it makes people incorrectly believe it's a sensible default.

the123blackjack (Author) commented Mar 21, 2022

@animetosho Many thanks for the insights and guidance

Amount of recovery data = recovery block count * block size

That means for a 10GB file, if I am setting a recovery record percentage of 10%, the error that can be handled is 1GB. So the block size would be 1000000000 bytes (1GB) / 2000 (default block count) = 500000 bytes or 500KB.
Or
If the file is 1GB, with my recovery record of 10% the error that can be handled would be 100MB, so the block size would be 100000000 bytes (100MB) / 2000 (default block count) = 50000 bytes or 50KB.

  1. Is my above understanding correct?
  2. Referring to “Block count (2000) cannot be smaller than the number of files (20711)” from https://bbs.archlinux.org/viewtopic.php?id=237708:
    a) Why did the user end up with a 29GB parity file for 31GB of data at 1% redundancy?
    b) What is the rationale for the block count not being allowed to be smaller than the number of files?
    c) Should I be using block counts or block sizes?
    d) Any guidance for me to set the tight value while targeting a hefty 10% recovery?
    e) What is the -m memory switch for? The default seems to be 16MB. Is there any recommended value?

@animetosho (Contributor)

Is my above understanding correct?

Yes, that's correct.

What is the rationale for the block count not being allowed to be smaller than the number of files?

Each input block cannot contain more than one file, which means that you'll need at least one block for each file.

Why did the user end up with a 29GB parity file for 31GB of data at 1% redundancy?

I can't really say with the given details. They have a lot of files, and if their files vary greatly in size, you'll get a lot of inefficiency due to files needing to be padded to block lengths.

For example, say you have two files: a 1GB file and a 1B file. The minimum number of input blocks is 2, but let's say you choose to have 3 blocks instead. The 1GB file will be broken up into two blocks, each 500MB, whilst the 1B file will consume the third block. Since the block size chosen here is 500MB, the 1B file will consume a whole 500MB block, despite only being 1B in size.
Now as far as recovery goes, if you chose to have one recovery block (i.e. 33% redundancy), it would be 500MB in size, even though one would think 33% of 1GB+1B should be ~333MB.
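
Here's a quick sketch of that arithmetic in Python (just an illustration of the padding effect, not par2cmdline's actual code):

    import math

    def blocks_consumed(file_sizes, block_size):
        # every file is padded up to a whole number of blocks
        return sum(math.ceil(size / block_size) for size in file_sizes)

    files = [1_000_000_000, 1]                  # a 1GB file and a 1B file
    block_size = 500_000_000                    # 500MB blocks

    print(blocks_consumed(files, block_size))   # 3 input blocks
    print(1 * block_size)                       # one recovery block = 500MB,
                                                # not the ~333MB one might expect from 33% of the data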

Should I be using block counts or block sizes?

Assuming you specify a percentage, it doesn't matter, because the one you don't specify (count or size) will be calculated from the other.

Any guidance for me to set the tight value while targeting a hefty 10% recovery?

As previously mentioned, it depends on your preference for performance and error resilience.

What is the -m memory switch for? The default seems to be 16MB. Is there any recommended value?

I recall the default in current versions is actually based on the amount of memory available on your system (which can be problematic if you have too much RAM available).
The value refers to the maximum amount of memory that par2cmdline is allowed to use for holding recovery data whilst processing. If it isn't enough to hold all the recovery data, it'll make multiple passes over the data to compute the recovery.

What's ideal depends on the RAM you have available and how much you're willing to dedicate to par2cmdline. I probably wouldn't go past several GB though, as the benefit you get from reduced I/O quickly diminishes.
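
As a back-of-the-envelope sketch of that behaviour (my own approximation, not the exact algorithm):

    import math

    # If the -m limit can't hold all the recovery data, the input gets read multiple times.
    def approx_passes(recovery_data_bytes, memory_limit_bytes):
        return math.ceil(recovery_data_bytes / memory_limit_bytes)

    # e.g. 1GB of recovery data under a 256MB memory limit -> roughly 4 passes over the input
    print(approx_passes(1_000_000_000, 256 * 1024 * 1024))   # 4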

@the123blackjack (Author)

@animetosho What I am specifying is the -r switch with a value of 10, and I am not specifying either -s (size) or -b (block count).

Then it would default to a block count of 2000, and the size gets derived from the 10% (-r) and the default block count of 2000.

Is there any way for me to come up with the block count or size?
Basically, I am unable to decide myself.

@animetosho (Contributor)

I really don't understand what you're asking there.
If you're asking what an appropriate value for size/count is, then as said, there's no "best value". Consider the type of errors you're trying to protect against (bitflips at random locations, blocks of errors, failed sectors?) and how long you're willing to wait for create/repair to run.
For example, if errors come in 64KB chunks (because you're using 64KB sectors on disk and you're trying to protect against bad sectors), there's little reason to have a block size smaller than 64KB. Or, if errors are very small but spread out randomly across your data, then large blocks are going to be less flexible than smaller blocks. Then consider the speed of processing - if it's too slow for your liking, you may want to use larger blocks.
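
For instance, for the 64KB-sector case, you could set the block size explicitly with the -s switch (which takes the block size in bytes), reusing the placeholder names from your earlier command:

par2 create -s65536 -r10 "parchivename" "FilesToCreateParity*.extension"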

@the123blackjack (Author)

@animetosho Many thanks, I have learned quite a lot from this thread and your engagement.

By any chance, through the command line or by other means, will we be able to analyze already-created parity files to find out what block count and size they were created with?

@animetosho (Contributor)

I don't recall par2cmdline offering any feature to show info of a PAR2.
par2j does have a 'list' command which shows such info. There may be other tools out there which do a similar thing.

@Joshfindit

If I may interject from a 'help desk' perspective:

@the123blackjack may be coming from the position of the archetypal (intelligent and domain-inexperienced) new user.

From that perspective, par2cmdline is somewhat difficult to interact with. One cannot reliably do par2 create <multiple files> because of all the knowledge required (as shown above).

From that perspective:

the best possible block size and count

What is the "best" though?

Can be answered by "whatever is a (non-optimized) value that successfully creates the parity set without using an unexpected amount of disk space".

Users in this category are not usually looking for the optimal number. They are looking for the "right now" number.

In my experience this is normally because they are in step 2 of the journey that goes "I hopefully know enough to find a tool that does what I want" > running the command the first time and seeing what happens > checking whether it did what is wanted > starting a (probably endless) loop of { reading documentation > tweaking the options > refining the result > asking questions }.

They know that there are an almost infinite number of tools that just will not accomplish what they need, and so those first couple of steps are to find out whether a specific tool is worth looking into more.

This type of user can actually be a big benefit because they can call attention to the challenges that potential users face before silently giving up (opening up an opportunity for the project maintainers to address them), and as they get more experience they can 'grow into' users who add significant value to the project itself.

animetosho (Contributor) commented Oct 19, 2022

Not trying to be mean, but that's a lot of talk without providing any workable solution.

I wouldn't describe 'best' as a "right now" number, but even if I use the latter, I have no clue how to approach that.

Don't get me wrong - I'm all for making things friendlier to users and seriously value it myself. And yeah, I get that understanding these parameters does require some effort - something I'd wish wasn't necessary.
But unless you put forward a sensible solution, you're just shouting into the wind.

@Joshfindit

I wasn’t trying to shout at all, but instead to first build a bridge for understanding.

Until they reply I’m not sure whether @the123blackjack actually agrees with the perspective I’m sharing, but in my experience these kinds of discussions do help reach the solution.

I’m all for helping provide a workable solution, so let me add this suggestion for a non-optimized auto mode:

number of blocks = number of files
block size = size of the smallest file unless it’s smaller than (some minimum value)

The “minimum block size” would be at or above the threshold where processing times get significantly higher. I have no expertise here, but I suspect that there’s a reasonable number that’s already known.

Notes:

  • Filesystem block sizes should be ignored for auto mode since the user might move the files to a different system anyway
  • Edge cases can be ignored. In the example of a 1GB file and a 1B file the block size would be very low and processing would take longer than an optimized block size, but it would still work while providing parity protection.
  • A “reasonable amount of disk space” has very subjective and fuzzy definitions, but:
    • Don’t fill the drive (estimate and fail if it might. Provide an override if the user knows the math and wants to push through anyway)
    • Not more than (double?) the percentage (eg: If the user asks for 10% in recovery files, don’t use more than ~20% on-disk)

@animetosho (Contributor)

Well firstly, appreciate the suggestion.

number of blocks = number of files
block size = size of the smallest file unless it’s smaller than (some minimum value)

This almost certainly won't work because the settings likely conflict with each other.
'Number of blocks = number of files' means that the block size must be >= the largest file. Choosing a block size smaller than the largest file will mean that there's more blocks than files (as long as files aren't exactly the same size).

Note that 'number of blocks' and 'block size' are related, so you generally only specify one of those values and the other is derived.
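
A quick sketch of why those two rules pull in opposite directions (illustration only):

    import math

    files = [1_000_000_000, 1]      # 1GB file and 1B file
    block_size = min(files)         # "block size = size of the smallest file" -> 1 byte
    block_count = sum(math.ceil(f / block_size) for f in files)

    print(block_count)              # 1000000001 blocks, vastly more than the 2 files
    # To actually get "number of blocks = number of files" (2 blocks here),
    # the block size would have to be at least max(files) = 1GB.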

Not more than (double?) the percentage (eg: If the user asks for 10% in recovery files, don’t use more than ~20% on-disk)

Are you expecting the user to supply a percentage? Your earlier example excluded this.

@Joshfindit

'Number of blocks = number of files' means that the block size must be >= the largest file. Choosing a block size smaller than the largest file will mean that there's more blocks than files (as long as files aren't exactly the same size).

Aha. My understanding needs some refinement. I was going in the wrong direction based on the fact that each file needs its own block.

Instead: block size as I mentioned, and then number of blocks is derived.

Are you expecting the user to supply a percentage? Your earlier example excluded this.

You’re right. Apologies for polluting the conversation. I’m not expecting a user to specify a percentage at first, though it is something that I expect they will tweak fairly soon in the learning process. I don’t know what par2cmdline currently does to determine the amount of parity data it adds, but whatever that is, is fine. Mainly I brought that point up to highlight the likely user expectation about disk space used.

The example above:

Why did the user end up with a 29GB parity file for 31GB of data at 1% redundancy?

I think automatic values should check to see if a “reasonable” amount of disk space will be used and adjust accordingly, thus avoiding a “wtf?” user reaction.

@animetosho (Contributor)

Instead: block size as I mentioned, and then number of blocks is derived.

Assuming you mean "block size = size of the smallest file", and ignoring the minimum value part for now.
If the user tries PAR2 on a single 1GB file, this would result in a 1GB block size, or a minimum recovery amount of 100%, which is likely undesirable.

I don’t know what par2cmdline currently does to determine the amount of parity data it adds, but whatever that is is fine

It requires the user to specify the amount.

I think automatic values should check to see if a “reasonable” amount of disk space will be used and adjust according, thus avoiding a “wtf?” user reaction.

I do think some warning could be added if efficiency falls below a certain threshold, although explaining it concisely may be a challenge.
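
Something along these lines, perhaps (purely a sketch of the idea; the 50% threshold and the message wording are made up):

    import math

    def padding_efficiency(file_sizes, block_size):
        # fraction of the allocated input blocks that is real data rather than padding
        data = sum(file_sizes)
        allocated = sum(math.ceil(f / block_size) for f in file_sizes) * block_size
        return data / allocated

    def maybe_warn(file_sizes, block_size, threshold=0.5):
        eff = padding_efficiency(file_sizes, block_size)
        if eff < threshold:
            print(f"Warning: only {eff:.0%} of the chosen blocks would hold real data; "
                  "recovery files will be correspondingly oversized. Consider a smaller block size.")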

@Joshfindit

Assuming you mean "block size = size of the smallest file", and ignoring the minimum value part for now.
If the user tries PAR2 on a single 1GB file, this would result in a 1GB block size, or a minimum recovery amount of 100%, which is likely undesirable.

It requires the user to specify the amount.

I'm comparing the experience of par2 create <single file> with par2 create <multiple files>.

A first time user can quite easily and simply call par2 create <single file> without having to specify any additional flags, without having to understand block size, and without having to understand any of the trade-offs for various block counts. That user is likely to end up with PAR2 file sizes and redundancy that they are happy with.

In the example of par2 create <multiple files>: the user can easily get halted by having more than 2000 files, so that's a simple check: if (number of files > automatic block count), increase block count.
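
In code, that simple check might look something like this (sketch only, keeping the current 2000 default):

    DEFAULT_BLOCK_COUNT = 2000

    def automatic_block_count(num_files):
        # never pick a block count smaller than the number of files,
        # since each file needs at least one block of its own
        return max(DEFAULT_BLOCK_COUNT, num_files)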

As mentioned above: the 2000 block default is arbitrary, but I suspect that there is a formula that can be found that will make a good enough estimate for 80% of use-cases.

In the example of running create on one large file with a bunch of smaller ones: the user can run into a block size that causes large PAR2 files with minimal protection. I'm not sure if the answer to this is simple (I suspect less simple than the 2000 check), but it could even be a warning of "This will result in 29GB of parity files and 1% protection. Proceed?".

@animetosho (Contributor)

That user is likely to end up with PAR2 file sizes and redundancy that they are happy with.

How so?

but I suspect that there is a formula that can be found

As mentioned earlier, you're going to have to be specific if you want to get anywhere. Unfortunately, you can't code based on wishful thinking alone.
Ideally, give a complete set of logic, with exact thresholds, rules etc, that someone could actually code up without ambiguity, or even better, submit a pull request.

@Joshfindit

[when running par2 create on single files] That user is likely to end up with PAR2 file sizes and redundancy that they are happy with.

How so?

It already works. In my own use of it to learn more as part of this thread, the combined sizes of the PAR2 files have been <10% of the file they were created from while still providing meaningful protection (I did not track the percentage of protection for each, but minor intentional errors were corrected without issue).

As mentioned earlier, you're going to have to be specific if you want to get anywhere.

While some might see this response as frustrated, I appreciate that you’re willing to take the hard stances to avoid wasting time.

To be clear: I don’t yet have the facts that someone would need to make a ruleset. I have been asking direct questions and proposing ideas as a part of uncovering those facts. It also does not seem like there is such a thing as a “complete set of logic, with exact thresholds” because of how much variability there is for new users who wish to create parity files from their files.

In this thread I do see progress towards finding a set of logic and formula for 80% of use-cases. At that point it could be proposed or submitted as a PR.

We have already identified some edge cases, identified the necessary variables, and identified at least two rules that would be part of the final calculation.
