
For large files, big difference to ParPar and MultiPar? #23

Closed · prdp19 opened this issue Sep 23, 2023 · 5 comments

prdp19 commented Sep 23, 2023

I have noticed an inconsistency that I do not understand.

Test 1
1 GB .rar files, ~3.07 GB total

::%MULTIPAR% c /lc10 /rr10 /rd1 "%mainDir%\filename.par2" *.rar
17 sec

%PARPAR% -s4M -S -r10%% --threads 10 -o %mainDir%\filename.par2 -R %mainDir%\
2.9 sec

::%PAR2TURBO% create -t10 -r10 %mainDir%\filename.par2 %mainDir%\*.rar
7 sec

Test 2
1 GB .rar files, ~70 GB total

::%MULTIPAR% c /lc10 /rr10 /rd1 "%mainDir%\filename.par2" *.rar
6.9 min

%PARPAR% -s4M -S -r10%% --threads 10 -o %mainDir%\filename.par2 -R %mainDir%\
14.4 min

::%PAR2TURBO% create -t10 -r10 %mainDir%\filename.par2 %mainDir%\*.rar
2.98 min (can this be true?)

Where does this discrepancy come from?

Another test (even more confusing):

Test 3
1 GB .rar files, ~70 GB total

::%PAR2TURBO% create -t10 -r10 %mainDir%\filename.par2 %mainDir%\*.rar
2.98 min (can this be true?)

::%PAR2TURBO% create -s3000000 -t10 -r10 %mainDir%\filename.par2 %mainDir%\*.rar
20.13 min (yes, over 20 min)

6 cores, 12 threads at 4 GHz, 32 GB RAM, NVMe SSD.
CPU load while creating PAR2 files: ~90%.

Do you have any performance tips for me?
My source files vary from 50 MB to 100 GB. Should I make -s dynamic instead, or is that not necessary?

animetosho commented Sep 23, 2023

As you may have discovered, the block count has a significant effect on the performance (as well as how recovery works), so for a comparison that actually makes sense, you need to ensure the block count/size is consistent across all applications.

Since you're using a 4MB block for ParPar (first two tests), you'll need to add /ss4194304 to par2j and -s4194304 to par2cmdline to match.
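
For example, matching the 4 MiB block size across all three tools would look roughly like this (your own commands from the tests above, with only the block-size flags added):

%MULTIPAR% c /ss4194304 /lc10 /rr10 /rd1 "%mainDir%\filename.par2" *.rar
%PARPAR% -s4M -S -r10%% --threads 10 -o %mainDir%\filename.par2 -R %mainDir%\
%PAR2TURBO% create -s4194304 -t10 -r10 %mainDir%\filename.par2 %mainDir%\*.rar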

(the significance of this option is a reason why I'm reluctant to automatically choose some default, as it leads to people not understanding how much of an impact it has)

Another option that can affect performance would be the allowed memory usage (-m parameter in par2cmdline/ParPar, /m in par2j). The defaults are different between the applications, which could affect how well they perform.

> Do you have any performance tips for me?

Other than what's mentioned above, the defaults are usually sensible.

If you want to experiment, you can try playing around with the number of threads, as a lower number (e.g. threads = number of physical CPU cores) might work slightly better or worse.
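
With the 6 physical cores mentioned above, that would mean, for example (same command as in your tests, only the thread count changed):

%PARPAR% -s4M -S -r10%% --threads 6 -o %mainDir%\filename.par2 -R %mainDir%\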

MultiPar and ParPar have a number of tunables if you want to experiment further. In particular, enabling GPU/OpenCL processing might help if you have a powerful GPU.

prdp19 commented Sep 23, 2023

> As you may have discovered, the block count has a significant effect on the performance (as well as how recovery works), so for a comparison that actually makes sense, you need to ensure the block count/size is consistent across all applications.
>
> Since you're using a 4MB block for ParPar (first two tests), you'll need to add /ss4194304 to par2j and -s4194304 to par2cmdline to match.
>
> (the significance of this option is a reason why I'm reluctant to automatically choose some default, as it leads to people not understanding how much of an impact it has)
>
> Another option that can affect performance would be the allowed memory usage (-m parameter in par2cmdline/ParPar, /m in par2j). The defaults are different between the applications, which could affect how well they perform.

Thank you very much!

I think ParPar is theoretically the fastest option of all.
For example, I have 2 GB parts and count how many there are.

Then I wrote this script for myself - that's the gist of what I had in mind.
I would just like to always choose the right block size. What do you think about that?

:: count the archive parts under the target folder (recursively)
set FILECOUNT=0
for /R "%TARGETFOLDER%" %%i in (*.zip) do (
    set /A FILECOUNT+=1
)

echo Parts: %FILECOUNT%

:: pick a PAR2 block size (in bytes) based on the number of parts
if %FILECOUNT% lss 10 (
    set BLOCKSIZE=1000000
) else if %FILECOUNT% lss 50 (
    set BLOCKSIZE=2000000
) else if %FILECOUNT% lss 100 (
    set BLOCKSIZE=3000000
) else (
    set BLOCKSIZE=5000000
)

What effect does a block size that's too small have on very large files or many parts, and vice versa?

I would always like to pick the best block size for 10% recovery, based on the source file/part size.

animetosho commented:

There's no "right" or "best" block size, which is why it's exposed as an option.
You can find some discussion on the topic here.

The general gist is:

  • fewer blocks (i.e. larger block size) = better performance
  • more blocks (smaller block size) = better recoverability / resilience against corruption
  • you may want to align the size to some multiple, depending on how you expect corruption to occur (e.g. Usenet posts typically use a multiple of the article size; if storing on disk, perhaps a multiple of the disk sector size)
  • there's a limit on how small it can be due to PAR2 allowing a maximum of 32768 input blocks (plus more blocks incurs more metadata overhead)

So size is generally a tradeoff between performance and recoverability.
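
As a rough worked example with the ~70 GB from Test 2 (illustrative arithmetic only):

  70 GB / 4 MiB per block  ≈ 17,000 input blocks   (comfortably under the 32768 limit)
  70 GB / 32768 blocks     ≈ 2.1 MB                (about the smallest block size that still fits the limit)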

prdp19 commented Sep 24, 2023

> Another option that can affect performance would be the allowed memory usage (-m parameter in par2cmdline/ParPar, /m in par2j). The defaults are different between the applications, which could affect how well they perform.

--memory 24G
has no effect on ParPar:

> Input data        : 67.78 GiB (8677 slices from 68 files)
> Recovery data     : 6944 MiB (868 * 8192 KiB slices)
> Input pass(es)    : 2, processing 868 * 4096 KiB chunks per pass
> Read buffer size  : 4096 KiB * max 8 buffers
> Hash method       : Scalar+PCLMUL (input), AVX2 (recovery)
> 
> Intel(R) Xeon(R) E-XXX CPU @ 4.00GHz (6 Cores, 12 Threads)
>   Multiply method : Shuffle (AVX2) with 8256 B loop tiling, 10 threads
>   Input batching  : 12 chunks, 2 batches
>   Memory Usage    : 3568.05 MiB (868 * 4096.06 KiB chunks + 96 MiB transfer buffer)

Memory Usage : 3568.05 MiB
But I don't know why.

I have also noticed that the larger the block size, the faster the PAR2 files are created, but I think that's expected.
Benchmarks:

Block size 0.5M: 22:26 min
Block size 1M: 18:48 min
Block size 2M: 14:00 min
Block size 4M: 14:10 min
Block size 8M: 07:29 min

animetosho commented:

> has no effect on ParPar:

The option is an upper limit - ParPar can choose to use less.

More specifically, the memory is used to hold recovery data, so since you're generating 6944MB of recovery in the above example, it'll never use more than that.
There's also currently a limit where it won't use more than the read request size per block. If you really want to push it, you can specify something like --seq-read-size=8M, though it's not expected to have that much of an effect.
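
If you do want to try it, something like this (a variation on your earlier command, using the options mentioned above) would raise both limits, though as noted, don't expect a large difference:

%PARPAR% -s4M -S -r10%% --threads 10 --memory 24G --seq-read-size=8M -o %mainDir%\filename.par2 -R %mainDir%\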
