We have already opened a support ticket, but I am posting here for visibility and to gauge how widely this feature is used.
With Fargate 1.4.0, container health checks no longer work.
Thanks for reporting the issue @lifeofguenter. We are aware of the problem (also on StackOverflow) and are working on fixing it.
Hi Günter,
The Fargate team has recently identified and fixed a container health check issue where if your container health check command outputs to stderr at all, we would incorrectly classify it as a health check failure. This was most apparent for customers who use curl in their health check commands and did not specify the -s flag.
Can you please verify if this was the issue you were encountering and whether or not it has been resolved on your end?
Thank you!
Derek
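The stderr behavior described above can be illustrated with a small Python sketch. This is not Fargate's actual implementation: the function names `healthy_by_exit_code` and `healthy_if_stderr_empty` are hypothetical, and the second one only mimics the faulty rule as it is described in this comment. The sketch assumes a POSIX `sh` is available.

```python
import subprocess


def run_check(command: str) -> subprocess.CompletedProcess:
    """Run a health check command through the shell, capturing stdout and stderr."""
    return subprocess.run(["sh", "-c", command], capture_output=True, text=True)


def healthy_by_exit_code(result: subprocess.CompletedProcess) -> bool:
    """Intended rule: only the exit code decides the outcome."""
    return result.returncode == 0


def healthy_if_stderr_empty(result: subprocess.CompletedProcess) -> bool:
    """Faulty rule (as described above): any stderr output counts as a failure."""
    return result.returncode == 0 and result.stderr == ""


# A check that succeeds but writes to stderr, similar to curl without -s.
noisy = run_check("echo 'progress output' >&2; exit 0")
# A check that succeeds silently, similar to curl -s.
quiet = run_check("exit 0")

print(healthy_by_exit_code(noisy), healthy_if_stderr_empty(noisy))
print(healthy_by_exit_code(quiet), healthy_if_stderr_empty(quiet))
```

The `-s` flag silences curl's progress meter and error messages, both of which curl writes to stderr, which is why commands without `-s` were the ones most visibly affected.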
@lifeofguenter - checking back again to see if your issue has been addressed based on @ddyzhang's comment above?
Thank you @SaloniSonpal, and apologies for the late reply. We upgraded to 1.4.0 _without_ changing anything on our side and it's working now.
So the initial bug has been resolved :)
@ddyzhang Do you know if this fix was applied in all AZs across all regions? We seem to be seeing something that looks a lot like this behavior in eu-central-1. Services with 1.4.0 set only ever start up successfully in eu-central-1c; A and B never succeed.
Hi @dekimsey,
This fix should have been rolled out to all regions and availability zones where Fargate is available. Can you please email me at [email protected] with some more details? I'd be happy to look into this for you.
Providing a quick update on this issue from some offline discussions with @dekimsey:
After further investigation, I think we have an idea of what's going on here. There is a known bug in Fargate PV 1.4 where the startTimeout for container dependencies is not being handled correctly. This bug causes premature task termination even when customer containers come online before the configured timeout. The fix for this issue is currently pending production release.
As to why this issue is isolated to eu-central-1a and 1b, our best theory right now is that perhaps eu-central-1c exhibits slightly lower network latency than the other two availability zones.
In the meantime, those affected by this issue can temporarily mitigate the problem by bumping up any startTimeout values in the ECS task definition.
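As a rough sketch of that mitigation, the Python helper below raises every existing `startTimeout` in a task definition that has been loaded as a dictionary. The helper name `bump_start_timeouts`, the container names, and the chosen values are hypothetical; only `containerDefinitions`, `dependsOn`, and `startTimeout` are real task definition fields. Registering the updated definition with ECS is left out.

```python
import copy


def bump_start_timeouts(task_definition: dict, extra_seconds: int) -> dict:
    """Return a copy of the task definition with each existing startTimeout raised."""
    updated = copy.deepcopy(task_definition)
    for container in updated.get("containerDefinitions", []):
        # Only touch containers that already set a startTimeout.
        if "startTimeout" in container:
            container["startTimeout"] += extra_seconds
    return updated


# Hypothetical task definition: "app" waits for "db-proxy" to become healthy.
example = {
    "family": "example-task",
    "containerDefinitions": [
        {"name": "db-proxy", "startTimeout": 30},
        {
            "name": "app",
            "dependsOn": [{"containerName": "db-proxy", "condition": "HEALTHY"}],
        },
    ],
}

bumped = bump_start_timeouts(example, 60)
print(bumped["containerDefinitions"][0]["startTimeout"])
```

The input dictionary is copied rather than modified, so the original values remain available for reverting once the fix is released.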
@ddyzhang : I expected this healthCheck command to work with Fargate platform version 1.4.0:
["CMD-SHELL", "curl -s http://localhost/health || exit 1"]
All other healthCheck attributes use the default values.
Am I correct that this command should work?
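For anyone generating task definitions programmatically, the command above could be built with a small helper like the following. This is only a sketch: `curl_health_check` and its `fail_on_http_error` parameter are hypothetical names, and the sketch does not confirm how any particular Fargate platform version treats the command.

```python
def curl_health_check(url: str, fail_on_http_error: bool = False) -> list:
    """Build a CMD-SHELL health check command that probes the given URL with curl."""
    # -s silences the progress meter and error messages;
    # -f makes curl return a non-zero exit code on HTTP error responses.
    flags = "-sf" if fail_on_http_error else "-s"
    return ["CMD-SHELL", f"curl {flags} {url} || exit 1"]


command = curl_health_check("http://localhost/health")
print(command)
```

Note that without `-f`, curl exits with 0 even when the server answers with an HTTP error status such as 500, so `|| exit 1` is reached only on connection-level failures. The command also requires curl to be present in the container image.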