warcio recompress adds "WARC-Payload-Digest" to records without understanding them #162

Open
acidus99 opened this issue Jan 8, 2024 · 0 comments

warcio recompress will silently add a WARC-Payload-Digest field to records that don't already have a payload digest field. This appears to only happen if the record already has a WARC-Block-Digest field.

In my testing, I've seen this happen to both "metadata" records and non-HTTP "request" records. This is strange, since warcio doesn't know what subset of these records' content block constitutes a "payload", so how could it calculate a digest? The created WARC-Payload-Digest appears to just be a hash of the entire block. (This issue is in many ways an inverse of #156.)
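One way to check the "hash of the entire block" observation is to hash a record's content block yourself and compare the result against the added header value. The sketch below assumes the digest is written as `sha1:` followed by the Base32-encoded SHA-1, which is the form I see in these files; the function names are hypothetical and not part of warcio.

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a 'sha1:<base32>' label for the whole content block.

    Assumption: the WARC under test labels digests in this form.
    """
    raw = hashlib.sha1(block).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def payload_digest_is_block_hash(payload_digest: str, block: bytes) -> bool:
    """True if the given WARC-Payload-Digest value is just a hash of the full block."""
    return payload_digest.strip() == block_digest(block)
```

If `payload_digest_is_block_hash` returns True for a "metadata" or non-HTTP "request" record, the added payload digest carries no information beyond the block digest.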

Attached is a ZIP file (example-warcs.zip) containing orig.warc and warcio-recompress.warc. The latter was created by:

warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz

The "metadata" record in orig.warc contains an X.509 certificate and uses a Content-Type field of application/x-pem-file. The original has a block digest field and no payload digest, since this metadata does not have a meaningful payload beyond the block itself. However, if you look at warcio-recompress.warc, you will see that a WARC-Payload-Digest header has been added to the "metadata" record at the end. Additionally, the "request" record is for the Gemini protocol, not HTTP. Gemini requests do not have a meaningful payload, so the request record in orig.warc does not have a WARC-Payload-Digest field. However, warcio-recompress.warc shows that one has been added.
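To make the difference easy to reproduce, here is a minimal stdlib-only sketch that lists which header fields appear in the recompressed file but not in the original. It deliberately does not use warcio, so the tool under test is not also doing the reading. Assumptions: both files are uncompressed, records keep the same WARC-Record-ID across recompression, and headers have no folded continuation lines. The names `read_warc_headers` and `added_headers` are hypothetical.

```python
def read_warc_headers(stream):
    """Yield one dict of header fields per record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        # 'line' is now the version line, e.g. b"WARC/1.1\r\n"
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the content block without interpreting it.
        stream.read(int(headers["Content-Length"]))
        yield headers


def added_headers(orig_stream, new_stream):
    """Map WARC-Record-ID to the header names present only in the new file."""
    orig = {h["WARC-Record-ID"]: h for h in read_warc_headers(orig_stream)}
    report = {}
    for h in read_warc_headers(new_stream):
        before = orig.get(h["WARC-Record-ID"])
        if before is None:
            continue  # record has no counterpart in the original
        extra = sorted(set(h) - set(before))
        if extra:
            report[h["WARC-Record-ID"]] = extra
    return report
```

Opening orig.warc and warcio-recompress.warc in binary mode and passing both to `added_headers` should report WARC-Payload-Digest for the affected "metadata" and "request" records if the behavior described above is present.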

While this is similar to #161, I believe it is higher severity. Payload digests have meaning and are used in other tool chains, such as CDX indexes. However, warcio is adding payload digests to records that don't have them, without any concept of what the payload is or what it means for these records. This is in addition to the strangeness documented in #161, such as:

  • I would not expect a recompression operation to alter the records in the WARC.
  • This behavior isn't documented
  • It (very slightly) increases the size of the WARC

My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.
