-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Task: Re-examine the event locks implementation #233
Comments
An additional problem worth noting / solving: 4) The same exact event can be run multiple times concurrently.At the moment we complete/reschedule an event right before running it. This is what core WP does, but problem is that core WP doesn't do concurrent event running. So this causes problems where the same exact event is run concurrently multiple times. Two examples: Example 1: Say an event has a 60 second recurring schedule but can run for up to say 2 minutes. So cron control will pick up the event, reschedule it for +60 seconds, and start processing it. In a minute the event is due again, but the first event is still running. Cron control will pick it up again and run it as well. Now we have the same event running twice. Example 2: The application repeatedly schedules a single event to run if it's not already scheduled with something like this:
At the moment, Solution: Set events to the |
Did a lot of testing and thinking about all the above. And I think we can solve pretty much all of the problems by setting events to the Mark events as "running" during event processingPretty straightforward idea. When an event is picked up to be run, we can do a query like this:
Event/action-level concurrency locks could be simplified to either allowed or disallowed concurrency.Instead of whitelisting actions with a number of how many allowed concurrent runs it can have, just have a boolean to signify "allowed_concurrency" or "disallowed_concurrency". We can keep backwards compatibly by checking the current whitelist filter and if the number is greater than 1 then the action has "allowed_concurrency". This keeps the promise that there is no same-action event concurrency by default, but events marked as "concurrent safe" no longer need the lock.
Site-level concurrency can also be simplified (or removed)With events marked as
Ultimately though, I still think we should remove the site-level lock entirely. The runner is already responsible for how many runWorkers are available. |
1) Prevent OOM requests from perpetually consuming a lock slot.
Problem example: There are 10 lock slots for concurrent event running. Event A OOMs and the process is killed off higher up the chain than at the request-level. The plugin never gets to free up the lock, so now the site-level lock is perpetually stuck with just 9 slots. This slowly gets worse and worse if OOMing events keep running every once in a while. Eventually, maybe, all 10 slots would be deadlocked and then the cache key will expire since it is no longer being updated. But this can take a while, and cron becomes very backed up in the process.
Possible solution: Mentioned here, instead of the lock being 1 cache key with incr/decr, each lock slot could be it's own key w/ it's own timestamp. This way we can free up locks individually when problems happen.
Another possible solution: Use a DB table for locks instead - #84
Another possible solution: Have the Go runner catch a kill signal and it could send a subsequent
cron-control cleanup event lock
in those cases. Downside being it's an extra request, and only solves for the go runner, but is a way to at least be notified about when this happens (vs relying on a cache timeout that may or may not be an accurate representation of how long an event can run successfully).2) Increase lock defaults
3) Drop locks completely?
Following the chain of thought from the point above, maybe we just drop the site-level concurrency locks completely. And event-level concurrency would default to "unlimited", but keep a simplified implementation that is used for events where concurrency is specifically disallowed.
The text was updated successfully, but these errors were encountered: