-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ignores sometimes not being applied correctly due to thread-related race conditions #493
Fix ignores sometimes not being applied correctly due to thread-related race conditions #493
Conversation
…ed race conditions `set_patterns` is called by `ListenerWorkerThread` and modifies variables used by the main wpull thread. This introduces several race conditions. For example, this means that `_compiled` may rarely get cleared during the iteration in `ignores`, and `patterns` might change during compilation. For performance reasons, it's undesirable to use a lock on every call to `ignores`, especially as that method might get called many millions of times on large jobs. This implements an intermediate solution that avoids such locking. `_compiled` is only changed by the main thread, therefore the actual ignoring can't be affected by the updates. The updater thread might change `_primary` and `patterns` during or after the recompilation condition, however, such a change would always also require a recompilation anyway, and that is guaranteed to happen either during that `ignores` call or on the next one. The same applies if the change occurs after the recompilation. Although it doesn't completely prevent them, this mitigates possible race conditions.
I've tested the particular race condition we experienced on job 1mnfkzd070ichqvewbzx62kmy manually with some minor modifications to artificially trigger the race condition. Ignoracle patchdiff --git a/pipeline/archivebot/wpull/ignoracle.py b/pipeline/archivebot/wpull/ignoracle.py
index d50c411..f5763c7 100644
--- a/pipeline/archivebot/wpull/ignoracle.py
+++ b/pipeline/archivebot/wpull/ignoracle.py
@@ -80,6 +80,7 @@ class Ignoracle(object):
patterns.append(string)
with self._lock:
+ print('resetting patterns')
self.patterns = patterns
self._primary = None
# Don't replace _compiled here; _primary acts as a trigger for the recompilation.
@@ -111,11 +112,15 @@ class Ignoracle(object):
self._primary = (primaryUrl, primaryNetloc)
for pattern, compiled in self._compiled:
+ print(f'checking {pattern}')
match = compiled.search(url_record.url)
if match:
return pattern
+ import time
+ time.sleep(1)
+
return False
# vim: ts=4:sw=4:et:tw=78 test.pyimport ignoracle
import threading
import time
import wpull.pipeline.item
patterns = ['/a/', '/b/']
class UpdaterThread(threading.Thread):
def __init__(self, ignoracle, *args, **kwargs):
self._ignoracle = ignoracle
super().__init__(*args, **kwargs)
def run(self):
time.sleep(1.5)
self._ignoracle.set_patterns(patterns)
r = wpull.pipeline.item.URLRecord()
r.url = 'https://example.org/b/'
r.parent_url = 'https://example.org/'
r.level = 2
i = ignoracle.Ignoracle()
i.set_patterns(patterns)
t = UpdaterThread(i)
t.start()
print('checking after starting thread')
print(i.ignores(r))
# 1 second will have passed here
print('checking again')
print(i.ignores(r)) # After checking /a/, UpdaterThread will reset the patterns
t.join()
print('checking after thread join')
print(i.ignores(r)) Output
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable. I don't see any possibility for deadlock and it looks like atomic operations are minimized in time. Should prevent temporary clearing of ignore patterns during job runs.
Things, generally about the original code:
- I'm unsure of the purpose of setting
self._primary = None
in the original code - I don't actually know whether
self.patterns = patterns
is atomic. If it is, we could likely avoid the lock entirely, if we took a reference of self.patterns in ignores before processing it into the compiled list. That could be much slower though, depending on whether copy on write is used. - If the data in _compiled actually has to vary according to the current base URL (does it?), it seems very silly to keep a single one in global state. Alternating base URLs would be a worst case for performance. We may consider storing a hashmap of primaryUrl -> _patterns.
- The logic for recompiling all the ignore patterns is fragile. If we mean to recompile them whenever the compiled array is empty, we should say that instead of relying on
self._primary = None
to force recompilation. Too clever by half.
Thanks! Will merge shortly.
That still serves the same purpose: triggering a compilation on the first call to
Well, it's complicated...
The actual assignment happens within a single bytecode, and due to the GIL, only one bytecode can be executed at the same time in CPython. So in that sense, it is atomic. However, this isn't actually specified anywhere as far as I know. Nothing about thread-safety is, I believe, except for the code that's explicitly thread-safe like in the
Yeah, patterns may need to be recompiled if the base URL changes because the values of I actually thought about using a dict mapping primary URLs to patterns when I originally implemented the compilation a few years ago (#262). The main downside is that you then need to keep the compiled pattern objects once per primary URL, andhat will quickly blow up memory consumption for jobs of large URL lists. For example, one Also, your worst case actually only applies on the initial iteration over the URL list due to wpull's breadth-first recursion. Once it's done with that, it will have larger blocks of URLs discovered on each of the initial URLs, which obviously share the There is definitely room for improvement though. In particular, only patterns that use those special placeholders need to be recompiled, so the other objects should be kept. That said, it appears that all of this isn't actually as bad as it seemed at the time when I implemented #262. I just looked with py-spy at Twitter scrape job 2x0k6405fu19mms9y28ymvdlb while it was running through its ~130k URL list, i.e. it should represent the worst case you mentioned.
We can't clear the compiled list because it might be iterated over by the other thread at the same time. We'd have to use a lock or make a copy of the list. I tried the latter but found that messier. Also, we'd still have the Overall, I don't really want to spend more time on the existing ignoracle except for fixing bugs like in this PR. It will always have pretty horrible performance due to wpull checking each URL individually out of and back into the DB, URL parsing, and other things. #446 and the idea of silent ignores that are kept out of DB and log entirely would be a better approach. |
set_patterns
is called byListenerWorkerThread
and modifies variables used by the main wpull thread. This introduces several race conditions. For example, this means that_compiled
may rarely get cleared during the iteration inignores
, andpatterns
might change during compilation.For performance reasons, it's undesirable to use a lock on every call to
ignores
, especially as that method might get called many millions of times on large jobs. This implements an intermediate solution that avoids such locking._compiled
is only changed by the main thread, therefore the actual ignoring can't be affected by the updates. The updater thread might change_primary
andpatterns
during or after the recompilation condition, however, such a change would always also require a recompilation anyway, and that is guaranteed to happen either during thatignores
call or on the next one. The same applies if the change occurs after the recompilation. Although it doesn't completely prevent them, this mitigates possible race conditions.