I just realised that a feature we've been talking about for years in #archivebot still isn't filed here: bulk ignore handling.
The issue at hand is that wpull is fairly slow at handling ignores, at least in the context of large AB jobs with tens to hundreds of millions of URLs. For example, job 8ln624q16o9eghqd8rl6x7lq7 has processed only around 24 million URLs (virtually all ignored) in roughly 12 days. This is because every queue entry has to be checked out from the database, processed, and checked back in; the first and last step further involve SQLite transactions and syncing to disk, which makes this very inefficient. (Also, ArchiveTeam/wpull#427.)
A more efficient solution would be to directly run a database query like UPDATE queued_urls SET status = 'skipped' WHERE <url matches ignores> AND status IN ('todo', 'error').
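For illustration, here is a minimal sketch of such a bulk skip run directly against the job's SQLite database. It assumes that queued_urls references url_strings through a url_string_id column (the real column name may differ) and uses GLOB as a stand-in for the regex-based ignores, which are discussed under the challenges below; the database path and pattern are made up.

```python
import sqlite3

def bulk_skip(db_path: str, pattern: str) -> int:
    """Mark all still-pending queue entries matching `pattern` as skipped."""
    con = sqlite3.connect(db_path)
    with con:  # one transaction for the whole bulk update
        cur = con.execute(
            """
            UPDATE queued_urls
            SET status = 'skipped'
            WHERE status IN ('todo', 'error')
              AND url_string_id IN (
                  SELECT id FROM url_strings WHERE url GLOB ?
              )
            """,
            (pattern,),
        )
    con.close()
    return cur.rowcount  # entries skipped by a single statement

# e.g. bulk_skip('wpull.db', '*/calendar/*')  -- path and pattern are illustrative
```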
Advantages:
- Performance
- This would allow unignoring a pattern and having the job still process those ignored URLs (whereas now those URLs would stay skipped). It could also easily be extended to requeueing URLs that were not retrieved correctly (e.g. a temporary IP ban resulting in 403s); a requeue sketch follows this list.
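A hedged sketch of what the requeue direction could look like, under the same schema assumptions as above; the status_code column and the 403 filter are illustrative assumptions, not the actual AB/wpull schema.

```python
import sqlite3

def bulk_requeue(con: sqlite3.Connection, pattern: str) -> int:
    """Put previously skipped or errored entries matching `pattern` back into the queue."""
    with con:
        cur = con.execute(
            """
            UPDATE queued_urls
            SET status = 'todo', try_count = 0
            WHERE (status = 'skipped'
                   OR (status = 'error' AND status_code = 403))  -- status_code column is assumed
              AND url_string_id IN (
                  SELECT id FROM url_strings WHERE url GLOB ?
              )
            """,
            (pattern,),
        )
    return cur.rowcount
```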
Challenges:
- To keep the matching semantics unchanged, the match criterion would have to be WHERE url_strings.url REGEXP 'pattern' with an implementation of regexp() calling out to Python (example; a minimal version is sketched after this list). This may be a significant performance hit compared to sqlite3-pcre or similar implementations that don't require constantly switching between SQLite and Python.
- {primary_netloc} and {primary_url} handling
- Logging of ignored URLs becomes a bit more complicated.
- The UPDATE could block other database access for a long time, since it can't be broken into smaller batches with a LIMIT clause. Cf. ArchiveTeam/wpull#397 (Slow processing of completed downloads can break connections of concurrent downloads).
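To make the first challenge concrete: SQLite's REGEXP operator has no built-in implementation, so sqlite3 lets one be registered as a user function named regexp; every row the query examines then calls back into Python. A minimal sketch, with the database path, pattern, and url_string_id join being illustrative assumptions as before:

```python
import re
import sqlite3

def add_regexp(con: sqlite3.Connection) -> None:
    # Registering a user function named "regexp" makes the REGEXP operator work;
    # `url REGEXP ?` then calls regexp(pattern, url) once per candidate row.
    def regexp(pattern: str, value: str) -> bool:
        return re.search(pattern, value) is not None
    con.create_function('regexp', 2, regexp)

con = sqlite3.connect('wpull.db')  # illustrative path
add_regexp(con)
with con:
    con.execute(
        """
        UPDATE queued_urls
        SET status = 'skipped'
        WHERE status IN ('todo', 'error')
          AND url_string_id IN (
              SELECT id FROM url_strings WHERE url REGEXP ?
          )
        """,
        (r'/calendar/\d{4}/',),  # example ignore pattern
    )
```

Caching the compiled pattern (e.g. wrapping re.compile in functools.lru_cache) would help, but the per-row round trip into Python remains.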
Notes:
- The query would have to be rerun from time to time, namely whenever the part of the queue that existed when it was last executed has been processed. So the MAX(id) at the time of execution would have to be kept somewhere; as AB does not support resumption from files, this could just be done in memory (a small sketch follows this list).
- Implementing this in wpull itself is probably not worth it.
- The other URL filters in wpull (e.g. --no-parent or --span-hosts) would not benefit from this, though at least some could be implemented in a similar way with more complex queries.
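A small sketch of that in-memory bookkeeping, under the same schema assumptions as above; run_bulk_skip stands in for whatever executes the UPDATE sketched earlier.

```python
import sqlite3

last_max_id = 0  # highest queued_urls id seen when the bulk skip last ran (per job, in memory)

def maybe_rerun_bulk_skip(con: sqlite3.Connection, run_bulk_skip) -> None:
    """Rerun the bulk skip once every entry queued before the last run has been processed."""
    global last_max_id
    remaining = con.execute(
        "SELECT COUNT(*) FROM queued_urls WHERE id <= ? AND status IN ('todo', 'error')",
        (last_max_id,),
    ).fetchone()[0]
    if remaining == 0:
        run_bulk_skip(con)  # executes the bulk-skip UPDATE
        row = con.execute("SELECT MAX(id) FROM queued_urls").fetchone()
        last_max_id = row[0] or 0
```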