Currently ECS utilises ELB/ALB health checks to verify when a task is ready to accept traffic, and also when it is safe to terminate additional tasks as part of a rolling replacement/upgrade when bumping a task definition revision (to align with the service deployment configuration parameters).
Would it be possible when an ELB is not in use for an ECS service, to also look at the Docker health check status? There are some scenarios when you may not want an ELB in use, but need to gracefully rolling replace the containers as part of an upgrade.
Details on the feature introduced in Docker 1.12.x:
https://docs.docker.com/engine/reference/builder/#/healthcheck
https://docs.docker.com/engine/reference/run/#/healthcheck
@samuelkarp any feedback re this request?
@CpuID Thanks for the feature request! I'm interested in as much detail as you (or anyone else also interested in this) can provide, as that helps us both prioritize and make sure we're actually addressing the use-cases appropriately.
Some thoughts on what you've provided already:
@samuelkarp thanks for the response.
Replies inline below:
Docker does not implement a health check policy yet (it was rejected in their initial discussions), but defining policies (what to do when the health check fails) is required for the behavior that you're looking for. Are the types of settings that ELB/ALB health checks already support (timeout, interval, healthy threshold, unhealthy threshold) sufficient, or are you looking for something different?
I think those thresholds that exist for an ALB/ELB currently are perfectly sufficient for most requirements for now.
You're asking about services here, but ECS also supports one-off tasks. Would you also want a health check to be performed/enforced for those tasks?
I think initially supporting this functionality on ECS services only would be a fair judgement call. When you run an ECS one-off task, there is no service supervising the desired count (to replace an unhealthy instance). I think just allowing ECS one-off tasks to die off on their own as they do currently feels right. If you did do health checking on one-off tasks, then it would come down to what action to take if unhealthy (do you respawn the task? at this point you are entering service with a desired count 1 territory).
With ECS services, when a task is detected as unhealthy it would be terminated, then the ECS scheduler would realise the desired count is not met, and attempt to replace it with a new task. That feels like the real win here.
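To make the replacement behaviour described above concrete, here is a minimal sketch of one scheduler pass. It is only an illustration of the idea, not how the ECS scheduler is implemented; the `Task` type and the `reconcile` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str   # hypothetical identifier
    healthy: bool  # result of the container health check


def reconcile(tasks, desired_count):
    """Return (tasks_to_stop, number_of_tasks_to_start) for one pass."""
    # Unhealthy tasks are terminated first.
    to_stop = [t for t in tasks if not t.healthy]
    remaining = len(tasks) - len(to_stop)
    # The scheduler then notices the desired count is not met and fills the gap.
    to_start = max(0, desired_count - remaining)
    return to_stop, to_start
```

A one-off task has no desired count, so under this model nothing would ever replace it, which is why limiting the feature to services seems reasonable.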
For a service that does have a load balancer, would you want both the load balancer health check and the Docker health check to be enforced? If you want both and one health check fails but the other succeeds, how important is it to know which health check failed?
Good question... a few valid approaches below (with varying engineering complexity):
_Option A_
_Option B_
Another consideration is how many container defs on an ECS task def need to have health checks attached, to use which method. As there is that extra abstraction layer between ECS tasks and underlying containers, that could get interesting. Especially since HEALTHCHECK definitions in a Dockerfile effectively run a CMD, as opposed to hitting something on a TCP port (like ELB checks do now, either TCP/HTTP/HTTPS). One option would be to say if at least 1 of the essential container definitions has a HEALTHCHECK attribute attached, it is considered that the entire ECS task definition is covered by Docker health checks.
Another note, Docker does have the ability to handle check intervals, retries and timeouts natively now:
https://docs.docker.com/engine/reference/run/#/healthcheck
--health-cmd Command to run to check health
--health-interval Time between running the check
--health-retries Consecutive failures needed to report unhealthy
--health-timeout Maximum time to allow one check to run
--no-healthcheck Disable any container-specified HEALTHCHECK
There are equivalents of these in the Dockerfile HEALTHCHECK definition, at least for interval/retries/timeout. I think using these is probably the safest approach, health-retries could replace both healthy and unhealthy threshold which exist separately on an ELB/ALB.
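As a rough illustration of how `retries` behaves as a threshold, the sketch below models the health status as a small state machine. This is a simplified model I am assuming for discussion (it ignores timeouts, intervals and any start period); `health_status` is a hypothetical helper, not a Docker API.

```python
def health_status(results, retries=3):
    """Model the reported status after a sequence of check results.

    results: iterable of bools in order (True means the check passed).
    """
    status = "starting"
    failing_streak = 0
    for passed in results:
        if passed:
            # A single passing check resets the streak and reports healthy.
            failing_streak = 0
            status = "healthy"
        else:
            failing_streak += 1
            # Only `retries` consecutive failures flip the status.
            if failing_streak >= retries:
                status = "unhealthy"
    return status
```

In this model `retries` acts as an unhealthy threshold only: one success is enough to report healthy again, so there is no separate healthy threshold as there is on an ELB/ALB.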
I'd like to see support for this, because I'd like to be able to monitor the memory usage of my container. If it climbs above a certain threshold (i.e. due to a memory leak), I'd like the system to gracefully restart my container. The Docker health check seems like a good way to check this.
IMO the Docker health check should take precedence over a LB health check, but they should both be used. The LB health check is useful for checking external indicators (container accepts HTTP requests on port 80), but in the scenario where a memory leak has caused memory usage to jump, the container may still be working fine (for now...)
I'd say that if either health check fails, the container should be considered unhealthy and gracefully restarted.
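A memory-based check along these lines could be a small script that the health check command runs inside the container. The sketch below is only an assumption of how one might write it: the cgroup file path depends on the host (for example `/sys/fs/cgroup/memory.current` on cgroup v2 or `/sys/fs/cgroup/memory/memory.usage_in_bytes` on cgroup v1), and the 512 MiB limit is an arbitrary example value.

```python
def read_usage_bytes(usage_path):
    """Read the container's current memory usage in bytes from a cgroup file."""
    with open(usage_path) as f:
        return int(f.read().strip())


def health_exit_code(usage_bytes, limit_bytes):
    """Docker treats exit code 0 as healthy and exit code 1 as unhealthy."""
    return 1 if usage_bytes > limit_bytes else 0


def main(usage_path="/sys/fs/cgroup/memory.current",  # assumed cgroup v2 path
         limit_bytes=512 * 1024 * 1024):              # assumed example limit
    return health_exit_code(read_usage_bytes(usage_path), limit_bytes)
```

When used as a real health check script, the process would need to exit with the value returned by `main()`, for example via `sys.exit(main())`.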
This functionality would potentially be useful for us as well. We're using ECS to power a rails app. We have a service that runs an nginx container and a rails app container. The nginx container's port is registered on the load balancer and this container proxies requests to the rails container.
Unfortunately, the rails app is very slow to start. The ECS service tries to register the containers on the load balancer almost immediately, and while the nginx container is ready to service requests, the rails container is not. Hence, we start failing load balancer health checks immediately while the rails app is still starting. If enough load balancer checks fail, the service is deregistered, the task is stopped and relaunched again, lather, rinse, repeat.
We can somewhat mask the problem by increasing the load balancer health check interval. This helps give the rails container time to initialize before being marked unhealthy. However, this also increases the amount of time it takes the load balancer to detect and deregister _genuinely unhealthy_ containers, which is undesirable.
If this feature were implemented in such a way that the docker health checks were the primary signal of health, and ECS only registered containers on the load balancer that pass the local docker health checks, then we'd be able to deterministically signal when our rails container is ready to service requests. We'd then be able to keep the load balancer health check interval and thresholds tight so that end-users are not exposed to failing containers for a long period of time.
That would be our potential use case for this feature, though we'd be happy with any means to deterministically signal when to register our containers with the load balancer.
This is a bit of a ditto but I want to share our use case.
Our backend services running on the JVM are often not healthy by the time ECS has done a full replacement because we are not using ALBs for these backend services and ECS does not know they are unhealthy. We are using Consul for service discovery and client side load balancing at each service including health checks. Using ALBs for every backend service would over complicate things for us.
This feature would help immensely as it would allow us to perform a proper in place replacement of a service and not have to spin up a new service, check Consul for healthy services then remove the old service.
@samuelkarp bump just wanted to check if you have any further feedback re implementing this? Is it something on your roadmap based on what you know so far? Did you need any further info from others? Seems there is a bit of interest from the community on getting this functionality...
Thank you everyone who has provided feedback so far! The descriptions here are exactly the kind of thing that we look at when trying to figure out what the right UX for a feature would be.
Some of the things I'm understanding from the use-cases here:
essential containers should likely cause the task to be torn down.
From my own investigation:
interval after the container starts.
docker inspect will report the 5 most recent statuses and a FailingStreak indicating the number of consecutive failures.
docker events does not provide any indication of a failing health check, but does report an exec_start event when the health check command starts.
I'd appreciate more feedback on the following areas:
Health check failures for non-essential containers is undefined; would we want to kill those containers or let them continue to run as the rest of the containers in the task are unaffected?
I haven't heard whether configuration via the HEALTHCHECK command in a Dockerfile is sufficient, or whether this should be part of the task definition or service. Feedback on this would be greatly appreciated.
Should all containers with a HEALTHCHECK defined in the Dockerfile have this behavior? Should it be enabled in the task definition? Note that if you inherit (use FROM) from an image that has a HEALTHCHECK defined in its Dockerfile, your image will inherit that setting as well.
In terms of engineering work, it sounds like at a minimum the following things would need to be done:
docker inspect, this will increase the number and frequency of inspect calls that the agent will need to make; we'll need to balance this increase against whatever effect the increased load will have on the Docker daemon. Since there is no event provided for a health check, we'll either need to inspect on a predetermined interval or find the defined timeout (both in the inspect output) and trigger an inspect after seeing an exec_start event for that container.
reason describing the failure.
Unfortunately I don't have anything more to share at this time; as a general rule, we don't comment on our future plans. However, we will keep this issue updated as more information comes available.
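For anyone who wants to see what reading the health state from inspect output could look like, here is a minimal sketch. The sample JSON is hand-written for illustration (it is not captured from a real container), and `summarize_health` is a hypothetical helper; the assumption is that the health data sits under `State.Health` with `Status`, `FailingStreak` and `Log` fields.

```python
import json

# Hand-written sample shaped like `docker inspect <container>` output (assumption).
SAMPLE_INSPECT_OUTPUT = """
[
  {
    "State": {
      "Health": {
        "Status": "unhealthy",
        "FailingStreak": 3,
        "Log": [
          {"ExitCode": 0, "Output": "ok"},
          {"ExitCode": 1, "Output": "connection refused"},
          {"ExitCode": 1, "Output": "connection refused"},
          {"ExitCode": 1, "Output": "connection refused"}
        ]
      }
    }
  }
]
"""


def summarize_health(inspect_json):
    """Extract the health status, failing streak and recent exit codes."""
    containers = json.loads(inspect_json)
    health = containers[0].get("State", {}).get("Health")
    if health is None:
        # No health check is configured for this container.
        return None
    return {
        "status": health["Status"],
        "failing_streak": health["FailingStreak"],
        "recent_exit_codes": [entry["ExitCode"] for entry in health.get("Log") or []],
    }


print(summarize_health(SAMPLE_INSPECT_OUTPUT))
```

Polling something like this on an interval is the simple option; the trade-off mentioned above is the extra inspect load on the Docker daemon.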
Health check failures for non-essential containers is undefined; would we want to kill those containers or let them continue to run as the rest of the containers in the task are unaffected?
essential, they should probably be left to run? essential containers on the other hand should force marking the entire task as bad or something.
I haven't heard whether configuration via the HEALTHCHECK command in a Dockerfile is sufficient, or whether this should be part of the task definition or service. Feedback on this would be greatly appreciated.
I think just the HEALTHCHECK attribute in a Dockerfile would be sufficient. I guess one option would be a toggle/feature flag in the ECS task definition to say "honor Docker Health Check" which defaults to true. And if there are any edge cases to deal with, these could be feature flagged in the task definition/s also.
Should all containers with a HEALTHCHECK defined in the Dockerfile have this behavior? Should it be enabled in the task definition? Note that if you inherit (use FROM) from an image that has a HEALTHCHECK defined in its Dockerfile, your image will inherit that setting as well.
See above, feature flagged, default on I think feels like the right answer :) To cover that scenario of an upstream FROM giving unintended consequences or something, allowing you to disable the behaviour.
Because the Docker health check settings are missing thresholds and grace periods, I'd appreciate feedback on whether these settings are important for Docker health checks.
Having a grace period might be useful for initial startup. I suspect the only real option there would be to let Docker do its checks natively, but have the ECS agent ignore the results for the duration of the grace period after container start. Once the startup grace period has passed, the ECS agent would act on the results of the health check.
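A minimal sketch of that idea, assuming the agent knows how long the container has been running: failures reported during the grace period are masked, and everything else passes through. `effective_status` and its parameters are hypothetical names, not part of the ECS agent.

```python
def effective_status(docker_status, seconds_since_start, grace_period_seconds):
    """Return the status the agent would act on.

    docker_status: "starting", "healthy" or "unhealthy" as reported by Docker.
    """
    in_grace_period = seconds_since_start < grace_period_seconds
    if docker_status == "unhealthy" and in_grace_period:
        # Ignore failures while the application is still booting.
        return "starting"
    # A healthy result is honoured immediately, even inside the grace period.
    return docker_status
```

With a 120 second grace period, `effective_status("unhealthy", 30, 120)` yields `"starting"`, while the same failure at 150 seconds is acted on as `"unhealthy"`.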
The retries logic feels sufficient in terms of thresholds, open to further feedback from others if that needs to be more detailed though.
At least for our use case, adding grace-period support to ECS's existing integration with ELB/ALB health checks would definitely work.
Failing that, the docker health checks would also work for us as long as containers aren't registered on the ELB/ALB until the docker health check succeeds. I agree with @CpuID that the retries logic would cover our use case as well in terms of thresholds/etc.
I guess one option would be a toggle/feature flag in the ECS task definition to say "honor Docker Health Check" which defaults to true.
The risk with defaulting to true here is that it would represent a behavioral change for anyone whose images have HEALTHCHECK defined today. We'd likely need to make this an opt-in change for that reason.
The retries logic feels sufficient in terms of thresholds, open to further feedback from others if that needs to be more detailed though.
I missed this, thanks! It looks like retries covers an unhealthy threshold, but not a healthy threshold; it'd be good to understand if that gap is meaningful to anyone.
At least for my team's use case, the healthy threshold is not very meaningful. The types of health checks we'd be using at a container level wouldn't flap. Our ALB/ELB checks, on the other hand, are where we consider other external service availability and since those could possibly flap, the healthy threshold is more useful for us. So in short, retries would be sufficient for us and even there we'd effectively be using it as an initial startup grace period instead. So maybe I should say a grace period is most meaningful for our use case, but we could achieve something similar with retries.
We are basically running the same setup as @gunzy83 and would also benefit from this feature.
My question is whether any of you have workarounds for the time being. We have Java services that take more than 60s to be up and running. ECS reports them as steady as soon as the container runs.
We integrated https://github.com/knatsakis/tc-init-health-check-listener to get a clean Tomcat shutdown if the application doesn't start.
@samuelkarp responses inline below:
The risk with defaulting to true here is that it would represent a behavioral change for anyone whose images have HEALTHCHECK defined today. We'd likely need to make this an opt-in change for that reason.
That's fair enough, avoiding breakage for existing tasks is reasonable to make it an opt-in change.
I missed this, thanks! It looks like retries covers an unhealthy threshold, but not a healthy threshold; it'd be good to understand if that gap is meaningful to anyone.
Non-issue here I think, waiting before marking an instance unhealthy (to avoid excessive flapping) feels more important than waiting to mark it healthy for an excessive period. Chances are the only times this would have an impact is if a service is not available more than it's available, and as such would get a healthy check through but then fail multiple checks thereafter, causing degradation potentially (depending on the timeouts).
So in short, retries would be sufficient for us and even there we'd effectively be using it as an initial startup grace period instead. So maybe I should say a grace period is most meaningful for our use case, but we could achieve something similar with retries.
I would agree having a startup grace period would be advantageous, either separately or in addition to startup health checks. I have a use case upcoming which may benefit from startup grace periods...
+1 for startup grace period. But could also use an initial health check separated from the healthy health check (a bit more confusing than the grace period).
I'm going to solve this now by tweaking the health check values before deploy, and put it back after deploy is done.
@samuelkarp bump any further feedback on the above? hoping to keep the conversation moving so we get to a point where you have a complete enough spec to add it to your implementation roadmap, or someone else can attack it and PR :)
Docker 17.05 already implemented healthcheck grace periods
https://github.com/moby/moby/pull/28938
Our use case is ECS services using gRPC, which runs on HTTP/2 and so can't be attached to an ELB/ALB. I feel this functionality is essential and very important during deployments in a no-LB setup to ensure that the new version is actually running and serving responses before turning the old version off. It's also generally important for maintaining a healthy system, by killing unhealthy tasks that would otherwise keep running forever despite not serving any responses.
Our use case is ECS services using gRPC, which runs on HTTP/2 and so can't be attached to an ELB/ALB.
@dario-simonetti Since you specifically mentioned a lack of HTTP/2 support as being problematic, I think it would be useful to note that Application Load Balancers (ALB) do support HTTP/2. You can see the feature list and comparison between Application Load Balancers and Classic Load Balancers here.
@samuelkarp they do, but they forward the request to the origin converting it to HTTP/1.1
You can send up to 128 requests in parallel using one HTTP/2 connection. The load balancer converts these to individual HTTP/1.1 requests and distributes them across the healthy targets in the target group using the round robin routing algorithm
(from http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html)
@samuelkarp Bump from our end as well. This would be very useful. I'm not committing to anything, but if we were to submit a PR that implemented this on the agent side, would that have any chance of being accepted and/or helping to expedite things? It seems like this would also require support on the Amazon service side to specify the health check attributes, so I don't know whether it helps to receive PRs that cover the agent side implementation or not.
@dario-simonetti Thanks, that clarifies the request a good deal.
[If] we were to submit a PR that implemented this on the agent side, would that have any chance of being accepted and/or helping to expedite things? It seems like this would also require support on the Amazon service side to specify the health check attributes, so I don't know whether it helps to receive PRs that cover the agent side implementation or not.
@bobziuchkovski I think we'd be open to a contribution for the agent-side of the feature, but as you've identified it does require work on the ECS service side as well to fully implement the feature. If you're still interested in a contribution, would you mind opening a new issue so we can talk through the design before writing code?
@samuelkarp Thanks. No promises on my end, but given how much this would help us, I'm inclined to pursue it. I'm tied up with other work for a couple days, but I'll take a look through the codebase this weekend to familiarize myself. After I do that, if it's something I feel I can tackle, I'll open that separate issue to discuss design.
+1 on having a configurable grace period for ECS deploys (initial healthcheck time)
+1
+1
(both docker healthcheck and on having a configurable grace period for ECS deploys (initial healthcheck time))
We have a similar problem to @gunzy83 and would benefit from the following feature:
"
The ECS backend will need to consider the health status of newly-started tasks during a service update prior to stopping the old tasks.
"
We don't need atm the docker healthcheck feature after the healthcheck status reports that the container is healthy once the application was completely initialized. I should also mention that we don't use an ELB/ALB.
+1 yes please... has there been any movement on this yet?
Like @EugeniuZ and others have said, just having some mechanism to delay putting a container in the "RUNNING" state would be extremely useful. This could effectively be the "grace period" concept that's been discussed here or the "start period" now built into Docker (or are we supposed to call it Moby now?).
@samuelkarp Perhaps we should first support this delay (#894), which will solve most people's problems, and then reconsider supporting Docker health checks.
Can we put the start period on each container? Then we can add a new "Starting" ContainerStatus between "Created" and "Running". For now this "Starting" container status would literally only be active while the start period is ticking. If docker health checks are implemented later, though, it could also move to "Running" when the docker health check first passes. For now, this "Starting" container status could map to the "Created" task status, which should greatly limit the scope of this change.
Assuming the ECS service will magically accept this new "Starting" container status, you could conceivably make this change in just the ecs-agent by having it use the docker api to get the container's Healthcheck.StartPeriod, right? We'd obviously rather have this be definable in the task-definition, but this is still one way to get it working.
I find it really disturbing that this issue is not prioritized as I see it. We are going live with ECS soon, and I really hoped that AWS picks this up. This is an essential feature, and there is not even a workaround for it yet.
This is actually one of our bigger reasons switching to Kubernetes, finally, after a few years with ECS.
This has caused some big issues for us in prod. Can you please prioritize this?
bump @samuelkarp any feedback on where this stands?
We have the same problem here.
We also have the same problem. This is a really critical one; I don't understand why it's still not done after a year.
Really do need a grace period configuration please. As a workaround I'm having to slow-down the interval for our health checks and increase the unhealthy count so that the instances do not fail during startup ... which is not ideal. What's even worse, however, is that if I tune this wrong and the instances do fail during startup they keep failing during startup over and over again, and the stopped containers do not get removed so the disk fills up, and the whole instance goes broke.
+1 for this feature. We have Java apps which on start up bootstrap their caches from a rabbitmq server and it takes a while until the container is actually ready to serve traffic. grace period will definitely help with this.
This issue has been open for a year, and an initial grace period for health checks is critical for us. Is there any other issue tracking the initial grace period alone? Is a fix even planned? If it is not on the roadmap, we will look at alternatives.
This is absolutely critical for java based services whose startup time is high.
looks like it might be never resolved since AWS announced Kubernetes support.
https://aws.amazon.com/blogs/aws/amazon-elastic-container-service-for-kubernetes/
@ozonni To set the record straight, EKS is by no means intended to replace ECS. In fact, the AWS Fargate service, also announced at re:Invent, is essentially a managed ECS. ECS will continue to see investment, and we do have this item on our backlog. Unfortunately, it's Amazon's policy to avoid discussing roadmap details publicly, so I can't provide a date for when we might implement it.
Thank you all for commenting on the details, we are actively working on this feature request in #1141.
Oh wow! 😱
So sad to find out that there's no way to define an initial grace period before a container is ready (aka readiness in Kubernetes). I came from the Kubernetes world to a customer who is using ECS and was shocked that this is not available. It's really common to run migrations, warm up caches, etc. at boot, and playing with long health checks is not an option (ugly hack, please stop suggesting that!). It also looks like #1141 doesn't cover this.
Sad day... saaad saad day for me 😢
It is now possible to set a health check grace period when using ELB health checks. Please see the announcement and documentation for further details.
I'm keeping this issue open to track Docker health checks, which are currently in progress in #1141.
Having Load Balancer health checks is far from being a good enough solution. It's just a band-aid.
It partially solves the issue, in the case when you have a monolithic deployment, or if you only use server side load balancing.
We have microservices which are using service discovery and client side load balancing. This means that most of our containers do not register themselves into load balancers.
In fact only our API gateways are directly receiving traffic from ELBs.
How are we supposed to update our services without downtime if the ECS itself does not check whether a process in a container has indeed started, connected to downstream resources, and fit to receive traffic?
I think ECS without this feature is good for nothing. Schedulers are supposed to ease deployments and make operating containers safe and easy. ECS in its current state only adds uncertainty to the ops mix (unresponsive/disconnecting agent, zombie tasks, no support for swap, no support for Docker Health API).
We are seriously considering migrating to something else for the lack of this feature. :angry:
@richardpen now that https://github.com/aws/amazon-ecs-agent/pull/1141 is merged and released as of 1.17.0, can you confirm if the public documentation has been updated to reflect how to use the feature? Does this require any special configuration?
Thanks again for knocking this feature out by the way :) Much appreciated.
@richardpen now that #1141 is merged and released as of 1.17.0, can you confirm if the public documentation has been updated to reflect how to use the feature? Does this require any special configuration?
@CpuID, thanks for checking in on this feature. So far the ECS agent side changes are out with the 1.17.0 release, but we are still waiting on this feature to be supported on the ECS service side.
@adnxn do we know a timeline on when this feature will be fully supported?
@matelang, sorry we don't have a publicly available timeline for this.
thanks for checking in on this feature. So far the ECS agent side changes are out with the 1.17.0 release, but we are still waiting on this feature to be supported on the ECS service side.
@adnxn ah ok thx, hopefully soon :)
@CpuID, We've added task def support for the health check command and associated configuration parameters for the container. This parameter maps to HealthCheck in the Create a container section of the Docker Remote API and the HEALTHCHECK parameter of docker run.
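As a sketch of what such a container definition might look like, the snippet below builds one as a Python dict and prints it as JSON. The container name, image, port, URL path and all numeric values are made-up examples; only the `healthCheck` field names (`command`, `interval`, `timeout`, `retries`, `startPeriod`) are meant to reflect the task definition parameter.

```python
import json

container_definition = {
    "name": "app",                  # hypothetical container name
    "image": "example/app:latest",  # hypothetical image
    "essential": True,
    "healthCheck": {
        # CMD-SHELL runs the string with the container's default shell.
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,     # seconds between checks (example value)
        "timeout": 5,       # seconds before a single check counts as failed
        "retries": 3,       # consecutive failures before reporting unhealthy
        "startPeriod": 60,  # seconds of startup during which failures are not counted
    },
}

print(json.dumps(container_definition, indent=2))
```

The health check command runs inside the container, so whatever it calls (here `curl`) is assumed to exist in the image.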
@adnxn does CloudFormation already support the HealthCheck property for container definitions? The documentation at https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-containerdefinitions.html has not been updated, yet.
I've had a look at this today and it doesn't look like ECS observes the health status during deployments. Is this by design?
For e.g. create a service with one healthy container and perform a deployment (min 100% max 200%) that's broken and goes to UNHEALTHY, the healthy container (old version) is stopped, the deployment completes and the unhealthy container (new version) remains and is then terminated over and over as it's unhealthy.
Further (and I don't mean to go too off topic), but ECS doesn't even observe DRAINING status. It regularly shuts down active instances and leaves draining ones up when I resize clusters.
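To spell out what I would expect for the deployment scenario above, here is a rough sketch of a gate that only releases old tasks once enough healthy tasks exist. It is my assumption of the desired behaviour, not a description of how ECS works; `old_tasks_safe_to_stop` is a hypothetical function, and rounding the minimum healthy count up is also an assumption.

```python
import math


def old_tasks_safe_to_stop(desired_count, minimum_healthy_percent,
                           healthy_old, healthy_new):
    """Number of old tasks that can be stopped without dropping below the minimum."""
    # Smallest number of healthy tasks the service must keep running.
    min_healthy = math.ceil(desired_count * minimum_healthy_percent / 100)
    # Healthy tasks beyond that minimum are spare capacity.
    spare = healthy_old + healthy_new - min_healthy
    return max(0, min(healthy_old, spare))
```

With a desired count of 1 and a minimum of 100%, `old_tasks_safe_to_stop(1, 100, 1, 0)` is 0 while the new task is still unhealthy, and becomes 1 only once the new task reports healthy.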
@hwatts Thanks for bringing this to us, we are currently working on fixing this, will let you know when there is an update.
Since the container health check feature has been released, I'm closing this feature request. For other related issues, I have created #1298 and #1297 for tracking purpose. Feel free to create a new issue if it's not tracked anywhere in the future, thanks.
+1