Bulk ignore handling #446

Open

JustAnotherArchivist (Contributor) opened this issue Jun 6, 2020 · 0 comments

I just realised that a feature we've been talking about for years in #archivebot still isn't filed here: bulk ignore handling.

The issue at hand is that wpull is fairly slow at handling ignores, at least in the context of large AB jobs with tens to hundreds of millions of URLs. For example, job 8ln624q16o9eghqd8rl6x7lq7 has processed only around 24 million URLs (virtually all ignored) in roughly 12 days. This is because every queue entry has to be checked out from the database, processed, and checked back in; the first and last steps further involve SQLite transactions and syncing to disk, which makes this very inefficient. (See also ArchiveTeam/wpull#427.)

A more efficient solution would be to directly run a database query like UPDATE queued_urls SET status = 'skipped' WHERE <url matches ignores> AND status IN ('todo', 'error').
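For illustration, a minimal sketch of what that query could look like in full. It assumes that queued_urls references url_strings through a url_string_id column (the column names are assumptions, not checked against the actual schema) and that a regexp() SQL function has been registered by the application, since SQLite does not ship one (see the challenges below):

```sql
-- Hypothetical bulk-skip query; column names id and url_string_id are assumptions.
-- REGEXP only works after the application registers a regexp() function.
UPDATE queued_urls
SET status = 'skipped'
WHERE status IN ('todo', 'error')
  AND url_string_id IN (
      SELECT id FROM url_strings WHERE url REGEXP :pattern
  );
```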

Advantages:

  • Performance
  • This would also make it possible to unignore a pattern and have the job still process the previously ignored URLs (currently, those URLs stay skipped permanently). It could easily be extended to requeueing URLs that were not retrieved correctly (e.g. a temporary IP ban resulting in 403s).

Challenges:

  • To keep the matching semantics unchanged, the match criterion would have to be WHERE url_strings.url REGEXP 'pattern' with an implementation of regexp() calling out to Python (example; see the sketch after this list). This may be a significant performance hit compared to sqlite3-pcre or similar implementations that don't require constantly switching from the DB to Python and back.
  • {primary_netloc} and {primary_url} handling
  • Logging of ignored URLs becomes a bit more complicated
  • Jobs may block for a very long time. The update should probably be done in chunks of e.g. 1k URLs (i.e. with a LIMIT clause), as in the sketch after this list. Cf. wpull#397 (slow processing of completed downloads can break connections of concurrent downloads).
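Putting the first and last points together, a minimal Python sketch of how the AB backend might run this against wpull's SQLite database. The regexp() callback into Python's re module is the slow-but-semantics-preserving variant mentioned above; the LIMIT sits in a subquery because a plain UPDATE ... LIMIT requires a non-default SQLite build option; table and column names are the same assumptions as in the query sketch above.

```python
import re
import sqlite3


def add_regexp_function(conn):
    """Provide the REGEXP operator by calling back into Python's re module.

    SQLite evaluates 'X REGEXP Y' as regexp(Y, X), i.e. pattern first.
    """
    cache = {}

    def regexp(pattern, value):
        if value is None:
            return False
        compiled = cache.get(pattern)
        if compiled is None:
            compiled = cache[pattern] = re.compile(pattern)
        return compiled.search(value) is not None

    conn.create_function('regexp', 2, regexp)


def bulk_skip(conn, pattern, chunk_size=1000):
    """Mark queued URLs matching an ignore pattern as skipped, one chunk
    per transaction, so concurrent queue operations are not blocked for long."""
    add_regexp_function(conn)
    while True:
        with conn:  # one transaction (and one fsync) per chunk
            cur = conn.execute(
                """
                UPDATE queued_urls
                SET status = 'skipped'
                WHERE id IN (
                    SELECT qu.id
                    FROM queued_urls AS qu
                    JOIN url_strings AS us ON us.id = qu.url_string_id
                    WHERE qu.status IN ('todo', 'error')
                      AND us.url REGEXP :pattern
                    LIMIT :chunk
                )
                """,
                {'pattern': pattern, 'chunk': chunk_size},
            )
        if cur.rowcount < chunk_size:
            break
```

A call like bulk_skip(conn, some_pattern) would then sweep the existing backlog for one pattern; multiple patterns could be handled by looping or by combining them into a single alternation.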

Notes:

  • The query would have to be rerun from time to time, namely once the part of the queue that existed at the time of the last execution has been processed. So the MAX(id) at execution time would have to be kept somewhere; as AB does not support resumption from files, this could just be kept in memory (see the sketch after this list).
  • Implementing this in wpull is probably not worth it.
  • The other URL filters in wpull (e.g. --no-parent or span-hosts) would not benefit from this, though at least some could be implemented in a similar way with more complex queries.
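As a rough illustration of that bookkeeping, a sketch of an incremental rerun, again with the assumed column names and with regexp() registered as above; the returned high-water mark is simply kept in memory by the caller, and in practice this would be combined with the chunking shown earlier:

```python
def rerun_bulk_skip(conn, pattern, last_max_id):
    """Apply the bulk skip only to queue rows added since the previous run.

    Returns the new high-water mark to remember (in memory) for the next rerun.
    Assumes add_regexp_function(conn) has already been called.
    """
    new_max_id = conn.execute(
        'SELECT COALESCE(MAX(id), 0) FROM queued_urls').fetchone()[0]
    with conn:
        conn.execute(
            """
            UPDATE queued_urls
            SET status = 'skipped'
            WHERE id > :low AND id <= :high
              AND status IN ('todo', 'error')
              AND url_string_id IN (
                  SELECT id FROM url_strings WHERE url REGEXP :pattern
              )
            """,
            {'low': last_max_id, 'high': new_max_id, 'pattern': pattern},
        )
    return new_max_id
```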