
Fix ignores sometimes not being applied correctly due to thread-related race conditions #493

Conversation

JustAnotherArchivist
Contributor


`set_patterns` is called by `ListenerWorkerThread` and modifies variables used by the main wpull thread. This introduces several race conditions. For example, this means that `_compiled` may rarely get cleared during the iteration in `ignores`, and `patterns` might change during compilation.

For performance reasons, it's undesirable to use a lock on every call to `ignores`, especially as that method might get called many millions of times on large jobs. This implements an intermediate solution that avoids such locking. `_compiled` is only changed by the main thread, therefore the actual ignoring can't be affected by the updates. The updater thread might change `_primary` and `patterns` during or after the recompilation condition, however, such a change would always also require a recompilation anyway, and that is guaranteed to happen either during that `ignores` call or on the next one. The same applies if the change occurs after the recompilation. Although it doesn't completely prevent them, this mitigates possible race conditions.
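The scheme described above can be sketched as follows. This is a simplified model with hypothetical names, not the actual ignoracle code: the updater thread only assigns `patterns` and resets `_primary` under the lock, while `_compiled` is rebuilt and swapped exclusively by the main thread inside `ignores`, so the hot loop never takes the lock and never iterates a half-cleared list.

```python
import re
import threading

class IgnoracleModel:
    """Simplified model of the scheme in this PR: set_patterns (updater
    thread) only assigns under the lock; _compiled is rebuilt and swapped
    exclusively by the main thread inside ignores."""

    def __init__(self):
        self._lock = threading.Lock()
        self.patterns = []
        self._primary = None   # sentinel: forces compilation on the first ignores call
        self._compiled = []

    def set_patterns(self, patterns):
        # Called from the updater thread.
        with self._lock:
            self.patterns = list(patterns)
            # Don't touch _compiled here; resetting _primary triggers the
            # recompilation in the main thread instead.
            self._primary = None

    def ignores(self, url, primary_url='', primary_netloc=''):
        # Called from the main thread only.
        if self._primary != (primary_url, primary_netloc):
            with self._lock:
                patterns = list(self.patterns)
                self._primary = (primary_url, primary_netloc)
            # Build the new list first, then swap it in with a single
            # assignment; an iteration over the old list stays valid.
            # (Real patterns containing regex braces would need escaping
            # before .format(); this sketch assumes plain placeholders.)
            self._compiled = [
                (p, re.compile(p.format(primary_url=primary_url,
                                        primary_netloc=primary_netloc)))
                for p in patterns
            ]
        for pattern, compiled in self._compiled:
            if compiled.search(url):
                return pattern
        return False
```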

JustAnotherArchivist commented Jan 13, 2021

I've manually tested the particular race condition we experienced on job 1mnfkzd070ichqvewbzx62kmy, with some minor modifications to trigger it artificially.

Ignoracle patch
diff --git a/pipeline/archivebot/wpull/ignoracle.py b/pipeline/archivebot/wpull/ignoracle.py
index d50c411..f5763c7 100644
--- a/pipeline/archivebot/wpull/ignoracle.py
+++ b/pipeline/archivebot/wpull/ignoracle.py
@@ -80,6 +80,7 @@ class Ignoracle(object):
             patterns.append(string)
 
         with self._lock:
+            print('resetting patterns')
             self.patterns = patterns
             self._primary = None
             # Don't replace _compiled here; _primary acts as a trigger for the recompilation.
@@ -111,11 +112,15 @@ class Ignoracle(object):
                 self._primary = (primaryUrl, primaryNetloc)
 
         for pattern, compiled in self._compiled:
+            print(f'checking {pattern}')
             match = compiled.search(url_record.url)
 
             if match:
                 return pattern
 
+            import time
+            time.sleep(1)
+
         return False
 
 # vim: ts=4:sw=4:et:tw=78
test.py
import ignoracle
import threading
import time
import wpull.pipeline.item


patterns = ['/a/', '/b/']


class UpdaterThread(threading.Thread):
	def __init__(self, ignoracle, *args, **kwargs):
		self._ignoracle = ignoracle
		super().__init__(*args, **kwargs)

	def run(self):
		time.sleep(1.5)
		self._ignoracle.set_patterns(patterns)


r = wpull.pipeline.item.URLRecord()
r.url = 'https://example.org/b/'
r.parent_url = 'https://example.org/'
r.level = 2

i = ignoracle.Ignoracle()
i.set_patterns(patterns)

t = UpdaterThread(i)
t.start()
print('checking after starting thread')
print(i.ignores(r))
# 1 second will have passed here
print('checking again')
print(i.ignores(r)) # After checking /a/, UpdaterThread will reset the patterns
t.join()
print('checking after thread join')
print(i.ignores(r))
Output
resetting patterns
checking after starting thread
checking /a/
checking /b/
/b/
checking again
checking /a/
resetting patterns
checking /b/
/b/
checking after thread join
checking /a/
checking /b/
/b/

@falconkirtaran (Contributor) left a comment:
This looks reasonable. I don't see any possibility for deadlock, and the critical sections are kept short. It should prevent the ignore patterns from being temporarily cleared during job runs.

Some general notes about the original code:

  • I'm unsure of the purpose of setting self._primary = None in the original code
  • I don't actually know whether self.patterns = patterns is atomic. If it is, we could likely avoid the lock entirely by taking a reference of self.patterns in ignores before processing it into the compiled list. That could be much slower though, depending on whether copy-on-write is used.
  • If the data in _compiled actually has to vary according to the current base URL (does it?), it seems very silly to keep a single one in global state. Alternating base URLs would be a worst case for performance. We may consider storing a hashmap of primaryUrl -> _patterns.
  • The logic for recompiling all the ignore patterns is fragile. If we mean to recompile them whenever the compiled array is empty, we should say that instead of relying on self._primary = None to force recompilation. Too clever by half.

@JustAnotherArchivist

Thanks! Will merge shortly.

* I'm unsure of the purpose of setting `self._primary = None` in the original code

That still serves the same purpose: triggering a compilation on the first call to `ignores`, since `None != (primaryUrl, primaryNetloc)` is always true. Yeah, it's a hack. The advantage is that there's only one condition to check inside `ignores`; that probably doesn't matter performance-wise, though.
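The trigger works because the sentinel never compares equal to a real tuple; a minimal illustration:

```python
# The sentinel None never equals a real (primaryUrl, primaryNetloc) tuple,
# so the very first ignores call always recompiles.
_primary = None
primary = ('https://example.org/', 'example.org')
assert _primary != primary          # triggers recompilation
_primary = primary
assert not (_primary != primary)    # subsequent calls skip recompilation
```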

* I don't actually know whether `self.patterns = patterns` is atomic.  If it is, we could likely avoid the lock entirely, if we took a reference of self.patterns in ignores before processing it into the compiled list.  That could be much slower though, depending on whether copy on write is used.

Well, it's complicated...

`self.patterns = patterns` compiles to bytecode like this in CPython 3.9:

  5           0 LOAD_FAST                0 (patterns)
              2 LOAD_GLOBAL              0 (self)
              4 STORE_ATTR               1 (patterns)

The actual assignment happens in a single bytecode instruction, and due to the GIL, only one bytecode can execute at a time in CPython. So in that sense, it is atomic.
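The listing above can be reproduced with the `dis` module (the exact output varies across CPython versions):

```python
import dis

def assign(self, patterns):
    self.patterns = patterns

# On CPython 3.9, the attribute store compiles to a single STORE_ATTR
# opcode; under the GIL, no other thread's bytecode can interleave with it.
dis.dis(assign)
```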

However, as far as I know, this isn't actually specified anywhere. I believe nothing about thread-safety is, except for code that's explicitly thread-safe, like the threading module. So if something releases the GIL in CPython, or if you use a different implementation that doesn't have a GIL (like Jython or IronPython), it could break down, I guess.

* If the data in _compiled actually has to vary according to the current base URL (does it?), it seems very silly to keep a single one in global state.  Alternating base URLs would be a worst case for performance.  We may consider storing a hashmap of primaryUrl -> _patterns.

Yeah, patterns may need to be recompiled if the base URL changes because the values of {primary_url} and {primary_netloc} change, which are substituted before the compilation. That's used in the singletumblr igset, for example.
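The substitution happens before compilation, so a change of primary URL yields a different compiled regex. A sketch with a hypothetical pattern mimicking the singletumblr-style use (not the actual igset pattern):

```python
import re

# Hypothetical pattern: ignore tumblr URLs on any subdomain other than the
# primary netloc. The placeholder is substituted before compilation, so a
# different primary URL requires recompiling this pattern.
pattern = r'^https?://(?!{primary_netloc})[^/]*\.tumblr\.com/'
compiled = re.compile(pattern.format(primary_netloc=re.escape('example.tumblr.com')))

assert compiled.search('https://other.tumblr.com/post/1')       # ignored
assert not compiled.search('https://example.tumblr.com/post/1')  # kept
```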

I actually thought about using a dict mapping primary URLs to patterns when I originally implemented the compilation a few years ago (#262). The main downside is that you then need to keep the compiled pattern objects once per primary URL, and that will quickly blow up memory consumption for jobs with large URL lists. For example, one re.compile(r'^https?://example\.org/') object consumes 308 bytes in a quick test on CPython 3.9.1. Every job currently has at least 212 ignores (from the global igset), so that's somewhere around 65 kB of pattern objects. Even a tiny Twitter scrape job list of a few thousand URLs would then mean a couple hundred MB of RAM, and jobs with millions of URLs in the list aren't rare. So that isn't feasible.
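The per-object cost can be checked directly; note that `sys.getsizeof` reports only the top-level pattern object, not the sub-objects it references, so this is a lower bound, and exact numbers vary by CPython version:

```python
import re
import sys

# Measure one compiled pattern object (lower bound: getsizeof excludes
# referenced sub-objects such as the compiled code buffer).
compiled = re.compile(r'^https?://example\.org/')
size = sys.getsizeof(compiled)
print(f'one compiled pattern: {size} bytes')
print(f'~212 patterns: ~{212 * size / 1024:.0f} kB per cached primary URL')
```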

Also, your worst case actually only applies on the initial iteration over the URL list due to wpull's breadth-first recursion. Once it's done with that, it will have larger blocks of URLs discovered on each of the initial URLs, which obviously share the primary_url and primary_netloc values, so no recompilation takes place then until it gets to the block of URLs from the next initial URL.

There is definitely room for improvement though. In particular, only patterns that use those special placeholders need to be recompiled, so the other objects should be kept.

That said, it appears that all of this isn't actually as bad as it seemed at the time when I implemented #262. I just looked with py-spy at Twitter scrape job 2x0k6405fu19mms9y28ymvdlb while it was running through its ~130k URL list, i.e. it should represent the worst case you mentioned. re._compile only accounted for 0.13 s in a sampling across about 31 s... Maybe that depends somewhat on the exact circumstances of a job, though I don't see how right now. I remember that it showed up as a major slow point in some performance measurements at the time but have no idea what the details were.

* The logic for recompiling all the ignore patterns is fragile.  If we mean to recompile them whenever the compiled array is empty, we should say that instead of relying on `self._primary = None` to force recompilation.  Too clever by half.

We can't clear the compiled list because it might be iterated over by the other thread at the same time. We'd have to use a lock or make a copy of the list. I tried the latter but found that messier. Also, we'd still have the _primary condition anyway.
Perhaps a cleaner implementation would be to have a self._mustRecompile bool that is set to True in set_patterns and OR'd with the self._primary comparison. I didn't think this through regarding thread safety though.
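That alternative could look roughly like this. Sketch only, with hypothetical names; as said, the flag handoff between threads isn't fully thought through:

```python
import re
import threading

class IgnoracleFlagSketch:
    """Hypothetical variant: an explicit _mustRecompile flag replaces the
    _primary = None hack, OR'd with the primary-URL comparison."""

    def __init__(self):
        self._lock = threading.Lock()
        self.patterns = []
        self._primary = None
        self._compiled = []
        self._mustRecompile = True  # first ignores call always compiles

    def set_patterns(self, patterns):
        with self._lock:
            self.patterns = list(patterns)
            self._mustRecompile = True

    def ignores(self, url, primary_url='', primary_netloc=''):
        if self._mustRecompile or self._primary != (primary_url, primary_netloc):
            # Clear the flag before reading the patterns: if set_patterns
            # runs in between, it sets the flag again and the next call
            # picks up the newer patterns.
            self._mustRecompile = False
            with self._lock:
                patterns = list(self.patterns)
            self._primary = (primary_url, primary_netloc)
            self._compiled = [(p, re.compile(p)) for p in patterns]
        for pattern, compiled in self._compiled:
            if compiled.search(url):
                return pattern
        return False
```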

Overall, I don't really want to spend more time on the existing ignoracle except for fixing bugs like in this PR. It will always have pretty horrible performance due to wpull checking each URL individually out of and back into the DB, URL parsing, and other things. #446 and the idea of silent ignores that are kept out of DB and log entirely would be a better approach.

@JustAnotherArchivist JustAnotherArchivist merged commit 3fbf32b into ArchiveTeam:master Feb 21, 2021