warcio recompress will silently add a WARC-Payload-Digest field to records that don't already have a payload digest field. This appears to only happen if the record already has a WARC-Block-Digest field.
In my testing, I've seen this happen to both "metadata" records and non-HTTP "request" records. This is strange: warcio doesn't know what subset of these records' content block constitutes a "payload", so how could it calculate a payload digest? The created Payload-Digest appears to just be a hash of the entire block. (This issue is in many ways an inverse of #156.)
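One way to check the "hash of the entire block" observation is to hash a record's block yourself and compare it with the added header value. This is a minimal sketch, assuming the digests are SHA-1 in the `sha1:<base32>` form commonly seen in WARC files; the function names are hypothetical and not part of warcio.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Format a SHA-1 hash as 'sha1:<base32>' (assumed digest form)."""
    encoded = base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")
    return "sha1:" + encoded


def payload_digest_is_block_hash(block: bytes, payload_digest: str) -> bool:
    """True if the given WARC-Payload-Digest value is just the hash of the whole block."""
    return warc_sha1_digest(block) == payload_digest.strip()
```

If this returns True for the "metadata" and Gemini "request" records in the recompressed file, the added payload digest carries no information beyond the block digest.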
Attached is a ZIP file with orig.warc and warc-recompress.warc which was created by: example-warcs.zip
The "metadata" record in orig.warc contains an X.509 certificate and uses a Content-Type field of application/x-pem-file. The original has a block digest field and no payload digest, since this metadata does not have a meaningful payload beyond the block itself. However, if you look at warc-recompress.warc you will see that a WARC-Payload-Digest header has been added to the "metadata" record at the end. Additionally, the "request" record is for the Gemini protocol, not HTTP. Gemini requests do not have a meaningful payload, so the request record in orig.warc does not have a WARC-Payload-Digest field. However, warc-recompress.warc shows one has been added.
While similar to #161, I believe this is higher severity. Payload digests have meaning and are used in other tool chains, such as CDX indexes. However, warcio is adding payload digests to records that don't have them, without any concept of what the payload is or what it means for these records. This is in addition to the strangeness documented in #161, such as:
- I would not expect a recompression operation to alter the records in the WARC.
- This behavior isn't documented.
- It (very slightly) increases the size of the WARC.
My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.
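To see which records were altered, the header names of each record can be compared before and after recompression. The following is a rough sketch, not a definitive check: it assumes both files hold the same records in the same order, and the helper names (`added_fields`, `read_header_names`, `report`) are hypothetical. Only `ArchiveIterator`, `rec_type`, and `rec_headers.headers` come from warcio.

```python
from typing import Iterable, List, Tuple


def added_fields(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Header names present after recompression but absent before (case-insensitive)."""
    seen = {name.lower() for name in before}
    return [name for name in after if name.lower() not in seen]


def read_header_names(path: str) -> List[Tuple[str, List[str]]]:
    """Return (record type, WARC header names) for each record in a WARC file."""
    # Third-party dependency, imported here so the comparison helper
    # above can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    records = []
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            names = [name for name, _ in record.rec_headers.headers]
            records.append((record.rec_type, names))
    return records


def report(orig_path: str, recompressed_path: str) -> None:
    """Print each record type that gained header fields (assumes matching record order)."""
    pairs = zip(read_header_names(orig_path), read_header_names(recompressed_path))
    for (rec_type, before), (_, after) in pairs:
        extra = added_fields(before, after)
        if extra:
            print(rec_type, "gained:", ", ".join(extra))
```

Calling `report("orig.warc", "warc-recompress.warc")` on the attached files would list any record that gained a header field; if recompression left records untouched, it would print nothing.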