
Handle sudden container failures without data loss/duplication (incl. support running crawls with preemptible nodes) #21

Open
motin opened this issue Aug 2, 2019 · 2 comments


motin commented Aug 2, 2019

Sometimes containers get evicted due to out-of-memory issues or other random crashes. Running crawls on preemptible nodes is also preferable since costs are much lower, but it comes with the risk of nodes being shut down suddenly. There is currently a risk of data loss and data duplication if a container is shut down before the S3 Aggregator has had a chance to write its in-memory contents to S3.

Data loss risk
If a container is shut down before the S3 Aggregator has had a chance to write its in-memory contents to S3, data loss occurs.

Data duplication risk
Even if we make sure that a new worker re-visits the site the terminated container was in the middle of processing, the previous container may already have written part of its data to S3, so the re-visit can produce duplicated data.

One way to mitigate this is to have the containers write the write queue to a persistent store (such as Google Cloud Memorystore for Redis) and run a separate job that drains the queue and writes to S3 atomically, as in the sketch below. Other solutions such as BigQuery could also be investigated, but might require more engineering to adopt.
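
A rough sketch of what that could look like (redis and boto3 are used here for illustration; QUEUE_KEY, enqueue_record, and drain_queue are hypothetical names, not existing OpenWPM code):

```python
import json

import boto3
import redis

# Hypothetical names for illustration only.
QUEUE_KEY = "s3_write_queue"
BUCKET = "my-crawl-bucket"

r = redis.Redis(host="memorystore-host", port=6379)
s3 = boto3.client("s3")


def enqueue_record(key: str, record: dict) -> None:
    """Workers call this instead of buffering in memory: the record
    survives a sudden container shutdown because it lives in Redis."""
    r.rpush(QUEUE_KEY, json.dumps({"key": key, "record": record}))


def drain_queue() -> None:
    """Run as a separate (non-preemptible) job: upload each record to
    S3 and remove it from Redis only after the upload succeeds, so a
    crash mid-drain never loses a record."""
    while True:
        item = r.lindex(QUEUE_KEY, 0)  # peek without removing
        if item is None:
            break
        payload = json.loads(item)
        # put_object for a fixed key is idempotent: re-running the
        # drain after a crash overwrites instead of duplicating.
        s3.put_object(
            Bucket=BUCKET,
            Key=payload["key"],
            Body=json.dumps(payload["record"]).encode(),
        )
        r.lpop(QUEUE_KEY)  # delete only after the write succeeded
```

With deterministic S3 keys this gives at-least-once draining with idempotent writes, which would address the duplication risk as well.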


vringar commented Nov 19, 2019

Apparently we might still be forced to use non-preemptible nodes: per the GKE docs, any node on a preemptible instance may be terminated without notice (including without the preStop hook being run),

or we take on this complexity.


vringar commented Nov 19, 2019

For normal pod-level shutdown we have registered a preStop hook, so we have 30 s to save everything before we get SIGTERMed.
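
A rough sketch of flushing within that window (S3Aggregator and flush_to_s3 here are hypothetical stand-ins, not the actual aggregator code):

```python
import signal
import sys


class S3Aggregator:
    """Hypothetical stand-in for the aggregator's in-memory batch."""

    def __init__(self):
        self.pending = []

    def flush_to_s3(self):
        # Upload self.pending to S3 here (omitted for brevity).
        self.pending.clear()


aggregator = S3Aggregator()


def handle_sigterm(signum, frame):
    # Kubernetes runs the preStop hook, then sends SIGTERM; SIGKILL
    # follows once the grace period (30 s by default) expires, so
    # flush the in-memory contents now, while we still can.
    aggregator.flush_to_s3()
    sys.exit(0)


signal.signal(signal.SIGTERM, handle_sigterm)
```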
