Submitting pages to the wayback machine #166

Open
8W9aG opened this issue Dec 3, 2024 · 4 comments

Comments

@8W9aG
Contributor

8W9aG commented Dec 3, 2024

Is there a way for this package to submit pages to the Wayback Machine? Either in the form of a request done and signed by this package, or by submitting the request to the Wayback Machine to do on the client's behalf?

@Mr0grog
Member

Mr0grog commented Dec 4, 2024

I’d be happy to accept a pull request for this!

It’s a natural fit, but I haven’t prioritized it since there are a huge number of other tools out there for this (in Python, I generally recommend savepagenow), and I haven’t even had time this year to finish out the other big-deal stuff that is already half-done, like #58. An ideal implementation that supports the v2 API is also a little complicated to do (see an example here that was abandoned because of complexity: palewire/savepagenow#31).

@Mr0grog
Member

Mr0grog commented Dec 4, 2024

> Either in the form of a request done and signed by this package or submitting the request to the wayback machine to do on the client's behalf?

Also worth noting:

  1. Members of the public cannot upload a WARC (or any other format for archived web pages) that will actually be displayed in the Wayback Machine (too many issues around proving that your content is really what was hosted somewhere, and not something you just invented yourself), although you can upload a WARC for other people to download as a collection item (using the internetarchive package).

  2. BUT you can use the “save page now” API to ask the Wayback Machine to archive the live page itself (what I was talking about in my first reply). So that’s what we’d be doing here.

  3. The Internet Archive also has a for-pay service called Archive-It you can use to crawl and save large websites (and do so repeatedly on a regular basis). If your needs are large-scale, this is probably the best thing to do.
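
For illustration, here is a minimal sketch of what the “save page now” request from item 2 might look like using only the Python standard library. It is not part of this package. The endpoint (`https://web.archive.org/save`), the `LOW accesskey:secret` authorization scheme, and the `url` form field are assumptions based on the publicly documented v2 API, and the function names `build_save_request` and `request_capture` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed Save Page Now v2 endpoint; check the current API docs before relying on it.
SPN2_ENDPOINT = "https://web.archive.org/save"


def build_save_request(url, access_key, secret_key):
    """Build (but do not send) a request asking the Wayback Machine to capture `url`.

    `access_key` and `secret_key` are the caller's archive.org S3-style API keys.
    """
    # The page to capture is sent as a form-encoded `url` field.
    body = urllib.parse.urlencode({"url": url}).encode("ascii")
    return urllib.request.Request(
        SPN2_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            # Assumed auth scheme: "LOW <access key>:<secret key>".
            "Authorization": f"LOW {access_key}:{secret_key}",
        },
    )


def request_capture(url, access_key, secret_key, timeout=30):
    """Send the capture request and return the decoded JSON response."""
    request = build_save_request(url, access_key, secret_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Captures in the v2 API are asynchronous, so the response to this request is expected to describe a queued job rather than a finished snapshot. Polling for completion, rate limits, and error handling are left out here, and they are a large part of why a full implementation is more involved.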

@8W9aG
Contributor Author

8W9aG commented Dec 4, 2024

I wonder if technologies like SXG might go a long way to solving the problem of whether the content is manipulated by a middleman? Perhaps that is a bit orthogonal to the conversation here.

Either way, good to know that this package is keen on a “save page now” solution. I'll see if I can create a PR soon.

@Mr0grog
Member

Mr0grog commented Dec 4, 2024

> Perhaps that is a bit orthogonal to the conversation here.

Yeah, I don’t work at the Internet Archive, so we are restricted to their current policies and tools as far as this stuff goes. 🙂

Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

2 participants