Wanted: reaping old tasks #45

rektide · 2021-08-17T18:32:57Z

I admit it, we have some containers that have slow memory leaks.

We'd love to have some good proven solutions for reaping old tasks. Something that will kill tasks over a week old, say. Slowly so as not to disrupt general service availability.

Was hoping to find something here, did not.

cristim · 2021-08-17T19:29:54Z

Thanks, that's actually a good idea for a new tool. I'll try to implement it once I'm done with my current work, and I'll let you know when I have something ready for you.

cristim · 2021-08-17T19:42:34Z

Have you considered running the application on Spot instances? They're more likely to be interrupted, which would reduce task uptime. Or just bounce the EC2 instances from time to time using something like chaos-lambda.

nathanpeck · 2021-08-17T20:49:57Z

Hey @rektide. I don't have a specific link for this, but I do have a solution for you:

ECS task definitions allow you to specify multiple containers, and these containers can be marked as "essential". If an "essential" container exits, the entire task is stopped and replaced. My suggestion is to run a tiny busybox container alongside your application container, mark that busybox container as essential, and configure its command to just be a sleep for however long you want the task to stay up. When the sleep ends, the busybox sidecar will stop, and because it is marked as essential the entire task will be stopped and replaced by the ECS service.

This will effectively put a kill timer on your tasks and force them to restart on a schedule.

Alternatively, if you want something a bit less brute-force, with more logic about when to restart the tasks, I'd suggest building a small Lambda function. The Lambda function can be configured to run on a schedule during your off-peak hours. It can use the AWS SDK to list the tasks in the cluster and issue a StopTask API call for any that are older than a certain threshold.
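A rough sketch of what such a Lambda function might look like in Python with boto3. The environment variable names (`CLUSTER`, `MAX_AGE_DAYS`, `MAX_STOPS_PER_RUN`) and the helper `select_old_tasks` are made up for illustration. Capping the number of stops per run is one way to get the "slowly" behavior asked for above, not the only one.

```python
import os
from datetime import datetime, timedelta, timezone


def select_old_tasks(tasks, now, max_age, limit):
    """Return ARNs of at most `limit` tasks older than `max_age`, oldest first."""
    # Tasks that have not started yet carry no "startedAt" and are skipped.
    old = [t for t in tasks if "startedAt" in t and now - t["startedAt"] > max_age]
    old.sort(key=lambda t: t["startedAt"])
    return [t["taskArn"] for t in old[:limit]]


def handler(event, context):
    # Imported here so select_old_tasks can be exercised without boto3 installed.
    import boto3

    # Hypothetical configuration; adjust names and defaults to your setup.
    cluster = os.environ["CLUSTER"]
    max_age = timedelta(days=int(os.environ.get("MAX_AGE_DAYS", "7")))
    limit = int(os.environ.get("MAX_STOPS_PER_RUN", "1"))

    ecs = boto3.client("ecs")
    tasks = []
    pages = ecs.get_paginator("list_tasks").paginate(
        cluster=cluster, desiredStatus="RUNNING"
    )
    for page in pages:
        if page["taskArns"]:
            described = ecs.describe_tasks(cluster=cluster, tasks=page["taskArns"])
            tasks.extend(described["tasks"])

    stopped = select_old_tasks(tasks, datetime.now(timezone.utc), max_age, limit)
    for arn in stopped:
        # The ECS service scheduler starts a replacement for each stopped task.
        ecs.stop_task(cluster=cluster, task=arn, reason="Reaped: exceeded max task age")
    return {"stopped": stopped}
```

With `MAX_STOPS_PER_RUN` left at 1 and the function scheduled every few minutes during off-peak hours, old tasks would be replaced one at a time rather than all at once.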

Edit: One more caveat/suggestion from Jon Wood on Twitter is to add a bit of jitter: timeout plus a random number of seconds. That way all your tasks don't die at once and cause an outage before they can be replaced.
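For the sidecar approach, the jitter could be folded into the sleep command itself. Below is a minimal sketch that builds the sidecar's container definition as a Python dict, ready to be added to the `containerDefinitions` list of a task definition. The function name `reaper_sidecar` and the container name `task-reaper` are invented here, and the command assumes the busybox image's `sh` supports `$RANDOM`, which is worth verifying for the image you use.

```python
import json


def reaper_sidecar(lifetime_seconds, jitter_seconds):
    """Build an essential busybox container that exits after lifetime plus jitter."""
    # $RANDOM yields 0..32767, so larger jitter windows would not be fully covered.
    if not 1 <= jitter_seconds <= 32768:
        raise ValueError("jitter_seconds must be between 1 and 32768")
    # The jitter is evaluated by the shell inside each container at startup,
    # so every task picks its own random offset.
    command = f"sleep $(( {lifetime_seconds} + RANDOM % {jitter_seconds} ))"
    return {
        "name": "task-reaper",  # hypothetical container name
        "image": "busybox",
        "essential": True,  # when this container exits, ECS stops the whole task
        "command": ["sh", "-c", command],
    }


if __name__ == "__main__":
    # One week of uptime plus up to an hour of jitter.
    print(json.dumps(reaper_sidecar(7 * 24 * 3600, 3600), indent=2))
```

Computing the random offset in the container's shell rather than in the task definition matters: a fixed number baked into the definition would give every task the same lifetime.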
