Make the system self-recoverable after connection errors #319

roll · 2018-09-06T14:13:39Z

This PR addresses several issues:

fix the database connection inside the celery worker after the database restarts and drops the connection (this issue, "The celery worker can stuck after a SQL error" #317, is even reproducible locally)
add a 10-minute soft time limit for all celery tasks, which ensures that there are no stale tasks at the celery level
mark all stale jobs/internal_jobs in the database as errored with the message "time limit exceeded" if a job has been unfinished for more than 10 minutes (we run this cleanup on SQL errors), which ensures that there are no stale jobs at the database level (e.g. the "UI stuck in syncing repos" bug)

Applied together, these steps should guarantee eventual integrity for the system.
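For illustration, the database-level cleanup could look roughly like the sketch below. It is not the code from this PR: it uses the standard-library `sqlite3` module instead of the project's own database layer, and the `jobs` table, its columns, the `'error'` status value, and the `mark_stale_jobs` helper are all hypothetical names chosen for the example.

```python
import sqlite3
import time

# Same 10-minute budget as the celery soft time limit, in seconds.
TIME_LIMIT_SECONDS = 10 * 60


def mark_stale_jobs(conn, now=None):
    """Mark unfinished jobs older than the time limit as errored.

    Returns the number of rows that were updated.
    The table and column names are assumptions made for this sketch.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE jobs SET status = 'error', error = 'time limit exceeded', "
        "finished = ? WHERE finished IS NULL AND created < ?",
        (now, now - TIME_LIMIT_SECONDS),
    )
    conn.commit()
    return cur.rowcount


# Small in-memory demonstration with a fixed "current time".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, error TEXT, "
    "created REAL, finished REAL)"
)
now = 1_000_000.0
conn.executemany(
    "INSERT INTO jobs (status, created, finished) VALUES (?, ?, ?)",
    [
        ("running", now - 3600, None),        # unfinished for an hour: stale
        ("running", now - 60, None),          # unfinished, but within the limit
        ("success", now - 3600, now - 3500),  # already finished
    ],
)
stale_count = mark_stale_jobs(conn, now=now)
statuses = [row[0] for row in conn.execute("SELECT status FROM jobs ORDER BY id")]
```

Only the first job is touched; the recent job and the finished job keep their status.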

roll · 2018-09-07T08:18:21Z

amercader

Looks good. The only potential concern would be if the cleanup_session took a long time in the context of a web request (i.e. when used in the Flask error handler), but given that it only updates unfinished jobs, I don't think that will be an issue.
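To make the concern concrete, here is a minimal sketch of the "roll back, then clean up" pattern, assuming the cleanup runs inline with the failing request or task. It is not the PR's `cleanup_session`: it substitutes the standard-library `sqlite3` module for the project's session, and `run_with_recovery`, `failing_work`, `cleanup`, and the `jobs` table are hypothetical names.

```python
import sqlite3


def run_with_recovery(conn, work, cleanup):
    """Run work(conn); on a database error roll back, run cleanup, re-raise."""
    try:
        return work(conn)
    except sqlite3.Error:
        conn.rollback()  # discard the failed transaction so the connection is usable
        cleanup(conn)    # runs inline, so it should stay cheap
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs (status) VALUES ('running')")
conn.commit()


def failing_work(c):
    c.execute("UPDATE jobs SET status = 'half-done'")  # uncommitted change
    c.execute("SELECT * FROM missing_table")           # raises OperationalError


def cleanup(c):
    # Stand-in for the real cleanup: flag jobs that never finished.
    c.execute("UPDATE jobs SET status = 'error' WHERE status = 'running'")
    c.commit()


recovered = False
try:
    run_with_recovery(conn, failing_work, cleanup)
except sqlite3.OperationalError:
    recovered = True

final_status = conn.execute("SELECT status FROM jobs").fetchone()[0]
```

The rollback undoes the uncommitted `'half-done'` update, and the cleanup then flags the job as errored. Because the cleanup is a single `UPDATE` on unfinished rows, its cost in the request path stays small.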

roll · 2018-09-12T10:19:51Z

@amercader
Thanks!

I agree. It should not be a problem as long as we have only 1 or 2 simple requests there (and in any case they run inside the already errored session).

roll · 2018-09-12T10:20:46Z

Having regular clean-up jobs would probably be a better solution, but the current approach should work for now.
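As a rough idea of what a regular clean-up job might involve, the fragment below uses Celery's `beat_schedule` setting next to the `task_soft_time_limit` setting added in this PR. The schedule entry name and the task path `tasks.cleanup_stale_jobs` are hypothetical; no such task is part of this PR, and a celery beat process would have to be running for the schedule to fire.

```python
# Celery configuration fragment, written as module-level settings.

# Soft time limit for all tasks, in seconds (10 minutes).
task_soft_time_limit = 10 * 60

# Periodic clean-up, run by celery beat every 10 minutes.
beat_schedule = {
    "cleanup-stale-jobs": {                  # hypothetical entry name
        "task": "tasks.cleanup_stale_jobs",  # hypothetical task path
        "schedule": 10 * 60.0,               # interval in seconds
    },
}
```

This would move the cleanup out of the error path, at the cost of running one more scheduled process.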

roll added 3 commits September 6, 2018 17:09

Added database rollback on unhandled worker SQL error

a19cc36

Added celery config: task_soft_time_limit

6efc91b

Implemented cleanup_session for rollback/fix stale jobs

1fd3bde

roll changed the title ~~[WIP] Fix SQL connection for workers and clean stale jobs~~ Make the system self-recoverable after connection errors Sep 7, 2018

Fixed linting

442bf31

roll requested a review from amercader September 7, 2018 08:09

amercader approved these changes Sep 12, 2018

roll merged commit 36782b8 into master Sep 12, 2018

roll deleted the worker-sqlalchemy-rollback branch September 12, 2018 10:21
