Does, or should, ipwb support recursive pinning? #636
Comments
This is a fair thing to consider. So far, we have been thinking along the lines of this problem of batch replication being solved independently by the IPFS community (and there has been some progress in this direction), instead of every application inventing the same solution. That said, perhaps a replay-time CLI flag can be introduced to spin up a detached thread that traverses all the index records and pulls/pins each record locally, if not already present. Another option would be to add UI elements in the Admin interface to perform selective or batch pinning of records. Most of our tests so far were performed on data present in the local IPFS store, but there are tickets to allow attaching the replay to any IPFS node, or simply relying on the global resolver (the latter is only suitable for replay, not for indexing).
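The "traverse all the index records and pin them" idea above could be sketched as a small helper. This is a hypothetical sketch, not ipwb's actual code: the function names are invented, and it assumes each CDXJ record's JSON block carries a `locator` field of the form `urn:ipfs/<header_hash>/<payload_hash>`, as ipwb's indexer writes it.

```python
import json
import subprocess

def hashes_from_cdxj(lines):
    """Extract IPFS hashes from CDXJ index records.

    Assumes each record's JSON block has a "locator" field of the
    form "urn:ipfs/<header_hash>/<payload_hash>".
    """
    hashes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # skip blanks and metadata lines
            continue
        # The JSON block starts at the first "{" in the record.
        block = json.loads(line[line.index("{"):])
        locator = block.get("locator", "")
        if locator.startswith("urn:ipfs/"):
            hashes.extend(locator[len("urn:ipfs/"):].split("/"))
    return hashes

def pin_all(cdxj_path):
    """Pin every hash referenced by a CDXJ index (requires a local ipfs CLI)."""
    with open(cdxj_path) as f:
        for h in hashes_from_cdxj(f):
            subprocess.run(["ipfs", "pin", "add", h], check=True)
```

Run as a detached thread at replay time, this would gradually make every referenced record locally available.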
I would suggest a separate CLI command, something like this: … I can imagine running this command on a VPS that I create specifically to broadcast my archives to the network at any moment, regardless of whether my home machine is available. That was one of my motivations to propose factoring out …
In that case, we really do not need any additional …
That may be the case, but:
Sure, there are good reasons to add a utility sub-command, but I was suggesting a quick workaround even before we add any additional code to the repo.
I can do that if we create an issue for the purpose (or use this one); I would need pinning for my purposes anyway.
I like this idea, and involving ipwb in the workflow might make this a simple, intuitive process for users who don't want to futz with long command-line arguments. For the most part, ipwb does not explicitly interact with pinning payloads in IPFS, only the implicit pinning that comes with adding data to IPFS, as exhibited by the indexer. This could provide an interesting demo of supplementing one's local "archive" with the payloads from the hashes of another. For example, if userA has a CDXJ index of their captures and userB happens to have an index referencing an embedded resource that gives a more temporally coherent composite representation to userA's captures, then userA adding references to userB's captures through the explicit pinning procedure would allow userA to observe a different composite memento when replaying their own index. This use case would hinge on userB's CDXJ (as pinned through …).

After userA pins userB's CDXJ index, should userB's CDXJ somehow be retained for use as above? If not, the result would be metadata being lost and userA pinning data that is unusable in their local ipwb replay system.
I must confess I do not understand this point.
I did not mean to imply that the resources are identical, just that they may have the same original URI (URI-R). Thus, Bob's representation of the HTTP response for the URI would have a different IPFS hash than Alice's representation of an HTTP response. For example, Alice's index has an entry for the URI https://example.com (U1), which, when dereferenced, contains …

When all resources are dereferenced, because of ΔtA and all of the Δt's of the base representation (the HTML page) and each respective embedded resource, it can occur that Bob's representation of U2 has a Δt < ΔtA.
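The Δt comparison above amounts to picking, for each embedded resource, the capture whose timestamp is nearest the target datetime. A minimal illustration (the function name is invented; timestamps are the 14-digit strings used in CDXJ keys):

```python
from datetime import datetime

def closest_capture(target, captures):
    """Pick the capture whose timestamp minimizes |capture_time - target|.

    `target` and `captures` are 14-digit timestamps (YYYYMMDDhhmmss),
    as used in CDXJ record keys.
    """
    t = datetime.strptime(target, "%Y%m%d%H%M%S")
    return min(
        captures,
        key=lambda c: abs(datetime.strptime(c, "%Y%m%d%H%M%S") - t),
    )
```

If Bob's capture of U2 is an hour from the target while Alice's is days away, Bob's wins regardless of whose index it came from.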
Hm. So there are two timelines of archiving: alice.cdxj and bob.cdxj.
The replay system would know to prefer records tagged with the same Crawl ID over ones tagged with a different Crawl ID.
Metadata like crawl source can be present in the WARC, but that is a feature of the crawler itself and not guaranteed. I am unsure whether most crawlers give a unique identifier to the crawl instance/source. We could attribute metadata to the source of indexing, and thus the archive, in the JSON block within the CDXJ, which would allow the client to, for example, give precedence to their own captures when replaying composite mementos, but that can quickly get complicated.
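If such a per-record identifier existed, the "give precedence to your own captures" policy is a one-line sort. This is a sketch only: `crawl_id` is a hypothetical field in the CDXJ JSON block, not something ipwb or any crawler is guaranteed to emit.

```python
def prefer_own_crawl(records, own_crawl_id):
    """Order candidate records so those from one's own crawl come first.

    `records` are parsed CDXJ JSON blocks; "crawl_id" is a hypothetical
    metadata field identifying the crawl instance/source.
    Python's sort is stable, so ties keep their original (e.g. temporal)
    ordering within each group.
    """
    return sorted(records, key=lambda r: r.get("crawl_id") != own_crawl_id)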
@machawk1 the potential temporal inconsistency issue you are describing is a result of index merging, not of IPFS record pinning. The replay system will request embedded resources based on the closest matching records in the index, irrespective of whether or not the corresponding data is pinned/cached in the local/primary IPFS store. If an IPFS record is locally available, it does not automatically jump into the replay until requested, and if a record is locally missing, that does not make it a do-not-replay-and-fall-back-to-another-match case; instead, an attempt will be made to discover it from peers, or it will fail to resolve.
Source locking would lose many of the advantages of merging collections from different sources to enrich the archive and patch pages with missing embedded resources. That said, I remember discussing a new model of archival replay system built on a resource dependency graph index, which would allow replay of pages with prespecified versions of embedded resources unless explicitly updated. That model could be implemented in IPWB someday after giving it more thought, but it would be a big change to the system.
I agree with this. The origin of the payload ought to be agnostic of the method or creator, though there could be a case for maintaining provenance to ensure that the composite memento you view is made up solely of your own captures. There could be a need for this, or a desire to be liberal and assemble a composite memento based solely on the temporally closest embedded resources.
This sounds related to your Web Bundling work, @ibnesayeed.
Right, but if a user explicitly pins the captures from an external CDXJ but does not have references to them in their own CDXJ, it seems that their local IPFS node could accumulate a lot of garbage without having a basis for its significance, despite the payload still being accessible.
That's the kind of gap a transactional, on-demand, per-page archiving model serves well; archive.today would be a good example. Crawler-based archives, on the other hand, capture atomic resources with the help of a frontier queue and a recently-seen list to minimize repeated downloading of shared resources. To leverage crawl-based archiving while supporting a more coherent and fixity-preserving replay, I proposed the dependency-graph-style indexing mentioned above.
Yes, that was the context when I discussed it with @phonedude, but the model can be devised to work in a non-bundled environment too.
Resources in an IPFS store are unaware of their application. Pin management is something one can perform independently by identifying resources they do not want, so those can be unpinned and garbage collected. Even at the time of pinning, you may want to add optional filters so that you do not pin every hash in every index you have, if that's a concern. However, the simplified assumption would be to pin everything that is present in any of your indexes, in case it is needed. If you have something in your index, it means you are willing to replay it.
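The optional-filters idea could look like the following. This is a hypothetical sketch: the function name is invented, and the field names (`mime_type`, `status`) are assumed to match the JSON block that ipwb's indexer writes into CDXJ records.

```python
import json

def select_records(lines, mime_prefix=None, status=None):
    """Yield CDXJ JSON blocks matching optional filters, so a user can
    pin, e.g., only text/html 200 responses rather than every record.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # skip blanks and metadata lines
            continue
        block = json.loads(line[line.index("{"):])
        if mime_prefix and not block.get("mime_type", "").startswith(mime_prefix):
            continue
        if status and block.get("status") != status:
            continue
        yield block
```

With no filters given, it degenerates to the simplified assumption of selecting every record in the index.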
I believe #637 is a prerequisite to implementing this.
To ensure constant availability of every file loaded into IPFS from a WARC archive, I would like to pin those files. I can see that this could be rather straightforward: I only have to parse the CDXJ file and pin every hash from it, but that seems tedious and requires extra code.
Would it be possible instead to add not one file but a whole directory with all the files from the archive plus, say, index.cdxj to provide navigation and metadata? Thus, every node that wishes to provide persistence for the data in question would only have to pin the directory itself.
Would you mind sharing your view on this?
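The directory idea above maps onto two standard ipfs CLI invocations: `ipfs add -r` to add the directory (printing only the root hash with `--quieter`), then `ipfs pin add -r` on that root so everything beneath it is pinned. A minimal wrapper sketch (the function name is invented; it requires a running local ipfs daemon):

```python
import subprocess

def pin_archive_directory(archive_dir):
    """Add a whole archive directory (WARC payloads plus index.cdxj) to
    IPFS and recursively pin its root, so mirrors need only one hash."""
    # `ipfs add -r --quieter` prints only the root hash of the directory.
    root = subprocess.run(
        ["ipfs", "add", "-r", "--quieter", archive_dir],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Recursive pinning keeps the root and everything beneath it.
    subprocess.run(["ipfs", "pin", "add", "-r", root], check=True)
    return root
```

A mirroring node would then only need `ipfs pin add -r <root-hash>` to persist the entire archive.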