[Fargate] [bug/regression]: Container health checks not working anymore with 1.4.0

Created on 15 May 2020 · 8 Comments · Source: aws/containers-roadmap

We have already opened a support ticket; I am also posting here for visibility and to gauge how widely this feature is used.

With Fargate 1.4.0 container health checks do not work anymore.

Labels: Fargate, Fargate PV1.4, Work in Progress

lifeofguenter

👍9

All 8 comments

Thanks for reporting the issue @lifeofguenter. We are aware of the problem (also on StackOverflow) and are working on fixing it.

SaloniSonpal on 18 May 2020

❤1

Hi Günter,

The Fargate team has recently identified and fixed a container health check issue where if your container health check command outputs to stderr at all, we would incorrectly classify it as a health check failure. This was most apparent for customers who use curl in their health check commands and did not specify the -s flag.
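As a rough illustration of a health check that keeps curl quiet on stderr, here is a minimal sketch that builds a `healthCheck` block and prints it as JSON. The `/health` endpoint on localhost and the `interval`, `timeout`, and `retries` values are assumptions for the example, not settings confirmed in this thread.

```python
import json

# Hypothetical endpoint; substitute whatever your container actually serves.
HEALTH_URL = "http://localhost/health"

# -s puts curl in silent mode (no progress meter, no error messages),
# so the command does not write to stderr. "|| exit 1" maps any curl
# failure to exit code 1.
health_check = {
    "command": ["CMD-SHELL", f"curl -s {HEALTH_URL} || exit 1"],
    # Illustrative values only (assumptions, tune for your service).
    "interval": 30,
    "timeout": 5,
    "retries": 3,
}

print(json.dumps({"healthCheck": health_check}, indent=2))
```

The printed fragment is meant to be pasted into a container definition in the task definition.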

Can you please verify if this was the issue you were encountering and whether or not it has been resolved on your end?

Thank you!

Derek

ddyzhang on 29 May 2020

@lifeofguenter - checking back again to see if your issue has been addressed based on @ddyzhang's comment above?

SaloniSonpal on 2 Jun 2020

👍1

Thank you @SaloniSonpal and apologies for the late reply. We upgraded to 1.4.0 _without_ changing anything on our side and it's working now.

So the initial bug has been resolved :)

lifeofguenter on 30 Jun 2020

@ddyzhang Do you know if this fix was applied in all AZs across all regions? We seem to be seeing something that looks a lot like this behavior in eu-central-1. Services with 1.4.0 set will only ever successfully start up in eu-central-1c. Tasks in 1a and 1b never succeed.

dekimsey on 17 Jul 2020

Hi @dekimsey,

This fix should have been rolled out to all regions and availability zones where Fargate is available. Can you please email me at [email protected] with some more details? I'd be happy to look into this for you.

ddyzhang on 17 Jul 2020

Providing a quick update on this issue from some offline discussions with @dekimsey:

After further investigation, I think we have an idea of what's going on here. There is a known bug in Fargate PV 1.4 where the startTimeout for container dependencies is not being handled correctly. This bug causes premature task termination even when customer containers come online before the configured timeout. The fix for this issue is currently pending production release.

As to why this issue is isolated to eu-central-1a and 1b, our best theory right now is that perhaps eu-central-1c exhibits slightly lower network latency than the other two availability zones.

In the meantime, those affected by this issue can temporarily mitigate the problem by bumping up any startTimeout values in the ECS task definition.
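The mitigation can be sketched roughly as follows: a small helper that raises every existing `startTimeout` in a list of container definitions. The container names, the original 60-second timeout, and the 120-second increase are all made up for the example; only the field names (`startTimeout`, `dependsOn`, `containerName`, `condition`) come from the ECS task definition format.

```python
import copy
import json


def bump_start_timeouts(container_definitions, extra_seconds):
    """Return a copy in which every existing startTimeout is raised by extra_seconds."""
    bumped = copy.deepcopy(container_definitions)
    for container in bumped:
        # Only touch containers that already set a startTimeout.
        if "startTimeout" in container:
            container["startTimeout"] += extra_seconds
    return bumped


# Hypothetical task definition excerpt: names and timeout values are assumptions.
container_definitions = [
    {"name": "sidecar", "startTimeout": 60},
    {
        "name": "app",
        "dependsOn": [{"containerName": "sidecar", "condition": "HEALTHY"}],
    },
]

bumped = bump_start_timeouts(container_definitions, extra_seconds=120)
print(json.dumps(bumped, indent=2))
```

How much headroom is needed depends on the task, so the increase here is only a placeholder; check the ECS documentation for the range of values allowed on your platform version.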

ddyzhang on 17 Jul 2020

👍2

@ddyzhang: I expected this healthCheck command to work with Fargate platform version 1.4.0:
["CMD-SHELL", "curl -s http://localhost/health || exit 1"]
All other healthCheck attributes use the default values.
Am I correct that this command should work?