I just realised that a feature we've been talking about for years in #archivebot still isn't filed here: bulk ignore handling.
The issue at hand is that wpull is fairly slow at handling ignores, at least in the context of large AB jobs with tens to hundreds of millions of URLs. For example, job 8ln624q16o9eghqd8rl6x7lq7 has processed only around 24 million URLs (virtually all ignored) in roughly 12 days. This is because every queue entry has to be checked out from the database, processed, and checked back in; the first and last step further involve SQLite transactions and syncing to disk, which makes this very inefficient. (Also, ArchiveTeam/wpull#427.)
A more efficient solution would be to directly run a database query like UPDATE queued_urls SET status = 'skipped' WHERE <url matches ignores> AND status IN ('todo', 'error').
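For illustration, here is a minimal sketch of such a bulk skip run directly against the job's SQLite database. It assumes that queued_urls references url_strings through a url_string_id column (the real column name may differ) and uses GLOB as a stand-in for the regex-based ignores, which are discussed under the challenges below; the database path and pattern are made up.

```python
import sqlite3

def bulk_skip(db_path: str, pattern: str) -> int:
    """Mark all still-pending queue entries matching `pattern` as skipped."""
    con = sqlite3.connect(db_path)
    with con:  # one transaction for the whole bulk update
        cur = con.execute(
            """
            UPDATE queued_urls
            SET status = 'skipped'
            WHERE status IN ('todo', 'error')
              AND url_string_id IN (
                  SELECT id FROM url_strings WHERE url GLOB ?
              )
            """,
            (pattern,),
        )
    con.close()
    return cur.rowcount  # entries skipped by a single statement

# e.g. bulk_skip('wpull.db', '*/calendar/*')  -- path and pattern are illustrative
```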
Advantages:
- Performance
- This would allow unignoring a pattern and having the job still process those ignored URLs (whereas now those URLs would stay skipped). It could also easily be extended to requeueing URLs that were not retrieved correctly (e.g. a temporary IP ban resulting in 403s); a requeue sketch follows this list.
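A hedged sketch of what the requeue direction could look like, under the same schema assumptions as above; the status_code column and the 403 filter are illustrative assumptions, not the actual AB/wpull schema.

```python
import sqlite3

def bulk_requeue(con: sqlite3.Connection, pattern: str) -> int:
    """Put previously skipped or errored entries matching `pattern` back into the queue."""
    with con:
        cur = con.execute(
            """
            UPDATE queued_urls
            SET status = 'todo', try_count = 0
            WHERE (status = 'skipped'
                   OR (status = 'error' AND status_code = 403))  -- status_code column is assumed
              AND url_string_id IN (
                  SELECT id FROM url_strings WHERE url GLOB ?
              )
            """,
            (pattern,),
        )
    return cur.rowcount
```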
Challenges:
- To keep the matching semantics unchanged, the match criterion would have to be WHERE url_strings.url REGEXP 'pattern' with an implementation of regexp() calling out to Python (example; a minimal version is sketched after this list). This may be a significant performance hit compared to sqlite3-pcre or similar implementations that don't require constantly switching between SQLite and Python.
- {primary_netloc} and {primary_url} handling
- Logging of ignored URLs becomes a bit more complicated.
- The UPDATE could block other database access for a long time, since it can't be broken into smaller batches with a LIMIT clause. Cf. ArchiveTeam/wpull#397 (Slow processing of completed downloads can break connections of concurrent downloads).
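To make the first challenge concrete: SQLite's REGEXP operator has no built-in implementation, so sqlite3 lets one be registered as a user function named regexp; every row the query examines then calls back into Python. A minimal sketch, with the database path, pattern, and url_string_id join being illustrative assumptions as before:

```python
import re
import sqlite3

def add_regexp(con: sqlite3.Connection) -> None:
    # Registering a user function named "regexp" makes the REGEXP operator work;
    # `url REGEXP ?` then calls regexp(pattern, url) once per candidate row.
    def regexp(pattern: str, value: str) -> bool:
        return re.search(pattern, value) is not None
    con.create_function('regexp', 2, regexp)

con = sqlite3.connect('wpull.db')  # illustrative path
add_regexp(con)
with con:
    con.execute(
        """
        UPDATE queued_urls
        SET status = 'skipped'
        WHERE status IN ('todo', 'error')
          AND url_string_id IN (
              SELECT id FROM url_strings WHERE url REGEXP ?
          )
        """,
        (r'/calendar/\d{4}/',),  # example ignore pattern
    )
```

Caching the compiled pattern (e.g. wrapping re.compile in functools.lru_cache) would help, but the per-row round trip into Python remains.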
Notes:
- The query would have to be rerun from time to time, namely whenever the part of the queue that existed when it was last executed has been processed. So the MAX(id) at the time of execution would have to be kept somewhere; as AB does not support resumption from files, this could just be done in memory (a small sketch follows this list).
- Implementing this in wpull itself is probably not worth it.
- The other URL filters in wpull (e.g. --no-parent or --span-hosts) would not benefit from this, though at least some could be implemented in a similar way with more complex queries.
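A small sketch of that in-memory bookkeeping, under the same schema assumptions as above; run_bulk_skip stands in for whatever executes the UPDATE sketched earlier.

```python
import sqlite3

last_max_id = 0  # highest queued_urls id seen when the bulk skip last ran (per job, in memory)

def maybe_rerun_bulk_skip(con: sqlite3.Connection, run_bulk_skip) -> None:
    """Rerun the bulk skip once every entry queued before the last run has been processed."""
    global last_max_id
    remaining = con.execute(
        "SELECT COUNT(*) FROM queued_urls WHERE id <= ? AND status IN ('todo', 'error')",
        (last_max_id,),
    ).fetchone()[0]
    if remaining == 0:
        run_bulk_skip(con)  # executes the bulk-skip UPDATE
        row = con.execute("SELECT MAX(id) FROM queued_urls").fetchone()
        last_max_id = row[0] or 0
```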