Use a single spider queue, for deterministic priority across all projects #187

Open · Digenis opened this issue Nov 2, 2016 · 5 comments

Digenis (Member) commented Nov 2, 2016

Scheduled jobs are not run in FIFO+priority order.
Instead, there are multiple queues (one per project)
that are themselves polled in a fixed sequence,
not round-robin or anything similar,
just the "arbitrary but constant" iteration order of a Python dict.

A scheduled job with the currently highest priority
has to wait for all jobs in every queue
that the dict order places before its own queue.

The poller therefore shows a "preference" for some projects over others
and lets them monopolize process slots.

The following code in poller.py
polls projects in whatever order iteritems() yields them,
which is arbitrary but constant.

for p, q in self.queues.iteritems():
    c = yield maybeDeferred(q.count)
    if c:
        # Pop from the first non-empty queue and stop polling;
        # projects later in the dict order never get a chance.
        msg = yield maybeDeferred(q.pop)
        returnValue(self.dq.put(self._message(msg, p)))

This becomes obvious when multiple projects
have several spiders queued and waiting for slots.
The first (project, queue) pair returned from iteritems()
that has queued spiders
prevents any further projects from being polled.
If its queue is constantly full,
only this project will have its spiders run.
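
For illustration, here is a tiny standalone demo of the effect (Python 3; the project names and pending-job counts are hypothetical, standing in for the real per-project queues):

# Hypothetical pending-job counts per project, in dict order.
queues = {'big_project': 100, 'small_project': 2}

# Mimic the poller: take a job from the first non-empty queue and stop.
for project, pending in queues.items():
    if pending:
        print('polled', project)  # always 'big_project' while it has jobs
        break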

Possible solutions:

One way to address this is suggested by @luc4smoreira in #140.
However, it is not a complete solution (for this issue)
because it makes total slot utilization vary a lot.
E.g. with 5 slots per project and 10 projects,
when only 1 project is active just 5 slots are used
(even if that single active project needs more),
but when all 10 are active there are 50 slots in use.
As a workaround for that, one can set the total slot limit
just above the slots per project
multiplied by the number of projects expected to monopolize slots,
so that a few slots always remain for the others.

Another incomplete solution is to use the spider priority earlier in the poller,
before picking a project. When I noticed this problem,
I mistakenly tried lowering the priority
of spiders belonging to the project monopolizing the slots.
Of course it didn't work: the priority only applies within the project's queue,
not across the set of queues.
Update 2019-06-26:
A for-loop in the poller that queries all the queues
will make race conditions more frequent
if more processes (or threads) read and write the queues.

We can also stop using multiple queues
and keep everything in a single database.
Although it sounds like a big change,
it would let us drop a lot of code from scrapyd,
and we could even implement both of the suggestions above
more cleanly, faster and with less code.
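
As a rough sketch of the single-database idea (illustrative only; the table name, schema and helper functions below are assumptions, not scrapyd's actual code), a single SQLite table ordered by priority and then by insertion id gives deterministic FIFO+priority across all projects:

import json
import sqlite3

# Illustrative single-table queue shared by all projects.
conn = sqlite3.connect('spider_queue.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order, the FIFO tie-breaker
        priority REAL NOT NULL DEFAULT 0,
        project TEXT NOT NULL,
        spider TEXT NOT NULL,
        args TEXT NOT NULL                     -- JSON-encoded job arguments
    )
""")

def push(project, spider, priority=0.0, **args):
    with conn:
        conn.execute(
            'INSERT INTO queue (priority, project, spider, args) VALUES (?, ?, ?, ?)',
            (priority, project, spider, json.dumps(args)),
        )

def pop():
    # Highest priority first; FIFO (lowest id) among equal priorities,
    # regardless of which project the job belongs to.
    with conn:
        row = conn.execute(
            'SELECT id, project, spider, args FROM queue '
            'ORDER BY priority DESC, id ASC LIMIT 1'
        ).fetchone()
        if row is None:
            return None
        conn.execute('DELETE FROM queue WHERE id = ?', (row[0],))
        return {'_project': row[1], '_spider': row[2], **json.loads(row[3])}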

Digenis (Member, Author) commented Nov 2, 2016

The easiest way to fix this
is to pick the queue to poll at random
instead of following the iteritems() order,
as in the following dirty workaround:

--- a/scrapyd/poller.py
+++ b/scrapyd/poller.py
@@ -1,3 +1,4 @@
+from random import sample
 from zope.interface import implements
 from twisted.internet.defer import DeferredQueue, inlineCallbacks, maybeDeferred, returnValue

@@ -17,11 +18,12 @@ class QueuePoller(object):
     def poll(self):
         if self.dq.pending:
             return
-        for p, q in self.queues.iteritems():
-            c = yield maybeDeferred(q.count)
-            if c:
-                msg = yield maybeDeferred(q.pop)
-                returnValue(self.dq.put(self._message(msg, p)))
+
+        for proj, queue in sample(self.queues.items(), len(self.queues)):
+            if (yield maybeDeferred(queue.count)):
+                msg = (yield maybeDeferred(queue.pop)).copy()
+                msg = dict(msg, _project=proj, _spider=msg.pop('name'))
+                returnValue(self.dq.put(msg))

     def next(self):
         return self.dq.get()

This is not the kind of fix that we'll release,
but I provide it here for reference and for users in need of a patch.
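
Continuing the toy example from the issue description (Python 3; project names and counts are hypothetical), random.sample() over the dict items is what breaks the fixed ordering. Note that under Python 3 the items view has to be converted to a list first:

from random import sample

# Hypothetical per-project queues with pending-job counts.
queues = {'big_project': 100, 'small_project': 2}

# sample() with the full length returns the items in a random order,
# so each poll starts from a different project instead of always
# following the same dict iteration order.
for project, pending in sample(list(queues.items()), len(queues)):
    if pending:
        print('would pop a job from', project)
        break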

This also makes me wonder
whether a queue can be deleted
while the reactor loops between sample() and returnValue().
If so, it would be a race condition.

Tarliton commented Dec 6, 2016

Hello,

Moving away from multiple queues sounds good. What would the requirements of an acceptable solution be?

Thanks

Digenis (Member, Author) commented Dec 7, 2016

Any solution is more or less backwards incompatible,
so it can't make it into 1.2.

The requirements are:
the poller and/or spider queue should return jobs
in the same order they entered the queue(s) (from the scheduler).
In a single-database approach (which I prefer),
one should also demonstrate that there are no concurrency issues.
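
To give a flavour of the concurrency requirement (a minimal sketch only, reusing the illustrative queue table from the single-database sketch above, not scrapyd's real schema): an immediate transaction makes the select-then-delete pop atomic, so two pollers cannot grab the same job.

import sqlite3

def atomic_pop(db_path='spider_queue.db'):
    # isolation_level=None gives explicit transaction control.
    conn = sqlite3.connect(db_path, isolation_level=None, timeout=5.0)
    try:
        # BEGIN IMMEDIATE takes the write lock up front, so concurrent
        # pollers serialize here instead of both selecting the same row.
        conn.execute('BEGIN IMMEDIATE')
        row = conn.execute(
            'SELECT id, project, spider FROM queue '
            'ORDER BY priority DESC, id ASC LIMIT 1'
        ).fetchone()
        if row is None:
            conn.execute('ROLLBACK')
            return None
        conn.execute('DELETE FROM queue WHERE id = ?', (row[0],))
        conn.execute('COMMIT')
        return {'_project': row[1], '_spider': row[2]}
    finally:
        conn.close()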

Digenis (Member, Author) commented Jun 26, 2019

Similar:
scrapy/scrapy#3666
scrapy/scrapy#3679

@Digenis Digenis changed the title Polling order / Scheduled job priorities and queue priorities Polling order / Scheduled job priorities vs queue priorities Jun 26, 2019
@Digenis Digenis changed the title Polling order / Scheduled job priorities vs queue priorities Polling order / Scheduled jobs priorities vs queue priorities Jun 26, 2019
@Digenis Digenis mentioned this issue Apr 9, 2021
@jpmckinney jpmckinney mentioned this issue Sep 23, 2021
@jpmckinney jpmckinney changed the title Polling order / Scheduled jobs priorities vs queue priorities Use a single spider queue, for deterministic priority across all projects Jul 15, 2024
jpmckinney (Contributor) commented Jul 15, 2024

The tests in #344 and #349 might be useful.

We can also consider #343.

Idea from #197 (comment):

sqlite priority queue [...] may also be eventually replaced by https://github.com/scrapy/queuelib

Edit: Removing priority milestone as I don’t know how many users actually use priorities. Also, the ideal way to do this is a breaking change.

@jpmckinney jpmckinney removed this from the 1.x milestone Jul 16, 2024