Skip to content
This repository has been archived by the owner on Feb 1, 2023. It is now read-only.

Make the system self-recoverable after connection errors #319

Merged
merged 4 commits into from
Sep 12, 2018

Conversation

roll
Copy link
Member

@roll roll commented Sep 6, 2018


We solve here various issues:

  • fix the database connection inside the celery worker after a database restarted the connection (actually this issue The celery worker can stuck after a SQL error #317 is even reproducible locally)
  • add 10 minutes soft time limit for all celery tasks - we are ensuring that there is no stale tasks at celery level
  • we mark all stale jobs/internal_jobs in the database as errored with a message time limit exceeded if the job is unfinished for more than 10 minutes (we run this cleanup job on SQL errors) - we are ensuring that there is no stale jobs at the database level (e.g. the UI stuck in syncing repos bug)

All these steps applied should guarantee eventual integrity for the system.

@roll roll changed the title [WIP] Fix SQL connection for workers and clean stale jobs Make the system self-recoverable after connection errors Sep 7, 2018
@roll roll requested a review from amercader September 7, 2018 08:09
@roll
Copy link
Member Author

roll commented Sep 7, 2018

@amercader
Please review

Copy link
Member

@amercader amercader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The only potential concern would be if the cleanup_session took a long time in the context of a web request (ie when used in the Flask error handler) but given that it only updates unfinished jobs, I don't thing that will be an issue.

@roll
Copy link
Member Author

roll commented Sep 12, 2018

@amercader
Thanks!

I agree - it should not be a problem while we have only 1 or 2 simple requests there (and anyway inside the already errored session).

@roll
Copy link
Member Author

roll commented Sep 12, 2018

Probably having regular clean-up jobs could be a better solution. But current should work for now.

@roll roll merged commit 36782b8 into master Sep 12, 2018
@roll roll deleted the worker-sqlalchemy-rollback branch September 12, 2018 10:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The celery worker can stuck after a SQL error Add time limits for celery tasks
2 participants