Containers-roadmap: Support task execution timeout (maximum lifetime for a container) in ECS Fargate

Created on 12 Nov 2019 · 10 Comments · Source: aws/containers-roadmap

Summary

ECS does not currently support a task execution timeout, i.e. a way to stop a task automatically once it has run longer than a set period, similar to job timeouts in AWS Batch. The task definition has no parameter that enforces a task or container execution timeout and automatically stops the container after the configured time.

Use-case example from a customer:
I have an NLP model training job that I want to run in a Fargate container triggered by a Lambda function. At some point, a bug might be introduced in the training code that causes it to run indefinitely. I don't want those tasks to pile up accidentally, with 50 tasks running for a couple of weeks before we notice; that would have cost implications. Is there a native way to kill a container if it hasn't exited on its own before a certain time?

Can this be considered a feature request?
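Until ECS offers a native timeout, one possible workaround for the use case above is to enforce the deadline inside the container itself. The sketch below is a minimal, unofficial Python wrapper, assuming the training command can be launched as a child process of the container's entrypoint. `run_with_deadline`, `main`, and exit code 124 (borrowed from the GNU `timeout` convention) are hypothetical choices, not part of any AWS API. When the wrapper exits, the essential container exits, and ECS then stops the task.

```python
import subprocess
import sys

# Hypothetical exit code meaning "killed by our deadline"; 124 mirrors the
# convention of GNU coreutils `timeout` and is chosen only for familiarity.
TIMEOUT_EXIT_CODE = 124


def run_with_deadline(cmd: list, max_seconds: float) -> int:
    """Run cmd as a child process and kill it if it outlives max_seconds.

    Returns the child's exit code, or TIMEOUT_EXIT_CODE if the deadline hit.
    """
    try:
        completed = subprocess.run(cmd, timeout=max_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child and waited for it.
        return TIMEOUT_EXIT_CODE


def main(argv: list) -> int:
    """Hypothetical entrypoint: argv is [max_seconds, command, arg, ...]."""
    if len(argv) < 2:
        print("usage: run_with_deadline.py MAX_SECONDS COMMAND [ARGS...]",
              file=sys.stderr)
        return 2
    return run_with_deadline(argv[1:], float(argv[0]))
```

An entrypoint script would end with `sys.exit(main(sys.argv[1:]))`, so that, for example, `python run_with_deadline.py 21600 python train.py` gives a hypothetical `train.py` six hours. Because `subprocess.run` kills the child outright (SIGKILL on POSIX), the training code gets no chance to checkpoint; a gentler variant could send SIGTERM first and wait briefly.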

ECS


nitheesha-amzn

👍53

All 10 comments

Thanks @nitheesha-amzn for submitting this for me!
As we discussed in the ticket, a more native approach would be to have AWS Batch support the Fargate launch type. This seems to be kind of a force-fit edge case for ECS.

danieladams456 on 12 Nov 2019

👍1

Moving this over to the containers roadmap as an ECS feature request.

adnxn on 12 Nov 2019

👍1

I can see another use case here, as mentioned in https://github.com/aws/containers-roadmap/issues/232

Applies to both the Fargate and EC2 launch types for ECS, not just Fargate.
When scheduling tasks using cron-style syntax with CloudWatch Events/EventBridge, you want to ensure tasks don't run indefinitely. If they did, and you have them set to spawn regularly, you would eventually exhaust cluster resources or service limits, effectively DoS'ing your AWS account.

CpuID on 17 Jan 2020

👍19
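The resource-exhaustion scenario in the comment above can be bounded, though not solved, by capping concurrency. A minimal sketch, assuming the schedule invokes a small function (for example, a Lambda function) instead of calling RunTask directly: it counts tasks from the same task-definition family with `ListTasks` and skips `RunTask` once a cap is reached. `count_running_tasks`, `run_if_under_cap`, and the cap value are hypothetical; `ecs` is assumed to be a boto3 ECS client (`boto3.client("ecs")`).

```python
from typing import Any, Dict


def count_running_tasks(ecs: Any, cluster: str, family: str) -> int:
    """Count tasks in a task-definition family whose desired status is RUNNING.

    Tasks still provisioning or pending also have desired status RUNNING,
    so they are included in the count.
    """
    paginator = ecs.get_paginator("list_tasks")
    total = 0
    for page in paginator.paginate(cluster=cluster, family=family,
                                   desiredStatus="RUNNING"):
        total += len(page["taskArns"])
    return total


def run_if_under_cap(ecs: Any, cluster: str, family: str, max_concurrent: int,
                     run_task_kwargs: Dict[str, Any]) -> bool:
    """Call RunTask only if fewer than max_concurrent tasks of `family` exist.

    run_task_kwargs carries the usual RunTask parameters (taskDefinition,
    launchType, networkConfiguration, ...), minus `cluster`.
    """
    if count_running_tasks(ecs, cluster, family) >= max_concurrent:
        return False  # skip this scheduled run instead of piling up another task
    ecs.run_task(cluster=cluster, **run_task_kwargs)
    return True
```

This does not stop a stuck task; it only keeps a stuck schedule from stacking up new ones. Two invocations racing each other could both pass the check, so the cap is approximate.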

A problem I'm seeing: a task that is expected to be relatively short-lived (a few hours at most, but typically minutes) gets 'stuck' because of some bug and is still running days later.

It would be great to have a backstop that kills any job after X hours. With hundreds of tasks in the console, it is hard to find the problem ones.
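A backstop like the one requested can be approximated today with a periodic watchdog, for example a scheduled Lambda function. The sketch below assumes a boto3 ECS client: it lists tasks whose desired status is `RUNNING`, reads each task's `startedAt` from `DescribeTasks`, and calls `StopTask` on anything older than `max_age`. The function names, the optional `family` filter, and the stop-reason text are hypothetical choices, not an official AWS feature.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, List, Optional

MAX_DESCRIBE_BATCH = 100  # DescribeTasks accepts at most 100 task ARNs per call


def find_overdue(tasks: Iterable[dict], max_age: timedelta,
                 now: datetime) -> List[str]:
    """Return ARNs of described tasks that started more than max_age ago."""
    overdue = []
    for task in tasks:
        started = task.get("startedAt")  # absent until the task has started
        if started is not None and now - started > max_age:
            overdue.append(task["taskArn"])
    return overdue


def stop_overdue_tasks(ecs: Any, cluster: str, max_age: timedelta,
                       family: Optional[str] = None) -> List[str]:
    """Stop tasks in `cluster` (optionally one family) older than max_age."""
    list_args = {"cluster": cluster, "desiredStatus": "RUNNING"}
    if family is not None:
        list_args["family"] = family
    now = datetime.now(timezone.utc)
    stopped = []
    for page in ecs.get_paginator("list_tasks").paginate(**list_args):
        arns = page["taskArns"]
        for i in range(0, len(arns), MAX_DESCRIBE_BATCH):
            batch = arns[i:i + MAX_DESCRIBE_BATCH]
            described = ecs.describe_tasks(cluster=cluster, tasks=batch)["tasks"]
            for arn in find_overdue(described, max_age, now):
                ecs.stop_task(cluster=cluster, task=arn,
                              reason=f"Exceeded maximum lifetime of {max_age}")
                stopped.append(arn)
    return stopped
```

A call might look like `stop_overdue_tasks(boto3.client("ecs"), "my-cluster", timedelta(hours=6), family="nlp-training")`, where the cluster and family names are placeholders. Without the `family` filter the watchdog would also stop tasks that belong to ECS services, which the service scheduler would then replace, so restricting it to batch-style families seems advisable.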