Today, our ECS clusters have been having issues with deployments hanging, tasks taking a long time to start before being randomly killed, and eventually reaching a steady state where they claim to be running but the load balancers are returning 503.
It seems like this is the same issue as #515, and we just upgraded to the latest ECS optimized AMI, so I wonder if this is a bug in Docker 18.03.1-ce?
We have the DataDog agent (v6) running in daemon mode, and it talks to docker.sock, so that might be related (are daemon instances forcibly terminated)?
A new task should go from pending to running fairly quickly and then stay that way.
New tasks are pending for several minutes and then get killed for failing ELB health checks.
[ec2-user@ip-10-1-5-182 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 988M 176K 988M 1% /dev
tmpfs 997M 0 997M 0% /dev/shm
/dev/xvda1 7.8G 793M 6.9G 11% /
docker info just hangs
We are using t2 instances, but I confirmed that we have plenty of CPU and IO credits. We're also only using < 25% of the memory on the instances.
The ecs agent logs look like:
2018-07-30T22:50:36Z [INFO] DockerGoClient: Unable to retrieve stats for container 3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b: inactivity time exceeded timeout
2018-07-30T22:55:23Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/3f0f4b6b-6920-457a-a5c2-5ff6f6635b02]: task at steady state: RUNNING
2018-07-30T22:55:28Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/2e279aad-9bb4-4719-8f58-c16565cc6e33]: task at steady state: RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/3f0f4b6b-6920-457a-a5c2-5ff6f6635b02]: redundant container state change. https-redirect to RUNNING, but already RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/3f0f4b6b-6920-457a-a5c2-5ff6f6635b02]: task at steady state: RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/2e279aad-9bb4-4719-8f58-c16565cc6e33]: redundant container state change. datadog-agent to RUNNING, but already RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/2e279aad-9bb4-4719-8f58-c16565cc6e33]: task at steady state: RUNNING
2018-07-30T22:56:05Z [INFO] DockerGoClient: Unable to retrieve stats for container 3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b: inactivity time exceeded timeout
2018-07-30T22:58:39Z [INFO] DockerGoClient: Unable to retrieve stats for container 2d2821c2e8a5fdba130aeeac70f29513e5638311f68bf4ee8e56545dc7768451: inactivity time exceeded timeout
2018-07-30T22:58:40Z [INFO] DockerGoClient: Unable to retrieve stats for container 3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b: inactivity time exceeded timeout
/var/log/docker:
time="2018-07-30T22:57:55.177362569Z" level=error msg="stream copy error: reading from a closed fifo"
time="2018-07-30T22:57:56.412984925Z" level=warning msg="Health check for container 2d2821c2e8a5fdba130aeeac70f29513e5638311f68bf4ee8e56545dc7768451 error: context deadline exceeded: unknown"
time="2018-07-30T22:58:05.032282275Z" level=error msg="Handler for GET /v1.25/containers/c6c45502b3ac7d25ad44a78b5175a0de5d446f84227b6fd4f6cba4fd634ef8e4/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:05.032337 I | http: multiple response.WriteHeader calls
time="2018-07-30T22:58:11.969442928Z" level=error msg="Handler for GET /v1.25/containers/3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:11.969502 I | http: multiple response.WriteHeader calls
time="2018-07-30T22:58:23.832327176Z" level=error msg="Handler for GET /v1.25/containers/3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:23.833074 I | http: multiple response.WriteHeader calls
time="2018-07-30T22:58:41.180305863Z" level=error msg="Handler for GET /v1.25/containers/c6c45502b3ac7d25ad44a78b5175a0de5d446f84227b6fd4f6cba4fd634ef8e4/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:41.180355 I | http: multiple response.WriteHeader calls
This person seems to be having the same issue so I don't think it's just me: https://stackoverflow.com/questions/51470492/aws-ecs-starting-a-container-is-very-slow-after-1-day
I'm reverting our instances to 2018.03.a and I'll let you know if I see the same problem there.
+1
I ran into the same problem with 2018.03.a so now I'm trying 2017.09.i since we know it worked for us before.
Seems to happen with older versions of the ECS AMI too. I wonder if there's something weird going on with DataDog with scheduling=DAEMON. I'm going to try just running it outside of Docker on the EC2 instance :\
@brendanlong: Thanks for the updates! Please let us know as you collect more data.
Also fwiw,
We are using t2 instances
we've seen issues where Docker will hang on arbitrary API calls during a spike in the CPU usage and I suspect one could be in that situation without actually exhausting all CPU credits.
Hm we were going to upgrade to m4 instances at some point soon anyway so I'll try doing that and see if it helps.
I'm using m4.large instance with the latest ECS AMI and having the same issue.
It's possible I'm running into this, although CloudWatch says I have plenty of IO credits: https://github.com/DataDog/datadog-agent/issues/1052
I'm fairly certain we were running into an issue with DataDog's collect_container_size. We turned that option off and things seem to be fine now. If people are seeing similar problems without DataDog, another thing to try would be using a bigger EBS image so you have more IO credits, or maybe try an instance with local storage (m5d instances).
I don't think this is an ecs-agent bug, but there are things you could do to make this bug less painful for ECS customers:
@brendanlong: Thanks for doing the investigation on this! Much appreciated.
When errors like this occur, show some sort of error message in ECS instead of having everything inexplicably time out
Do you mind further explaining what an ideal situation would be for you here? Did you mean that if Docker gets into a state where API calls are timing out due to IO credits, then ECS should somehow communicate that?
Showing DataDog-specific errors may be overkill, but it would be nice to:
I don't think the ECS agent necessarily needs to be able to explain what the error is, but just letting the user know (1) you probably need to debug the EC2 instance configuration, not the ECS task and (2) how to find the logs.
It would definitely be nice if you detect certain kinds of errors (docker.sock is broken -> the issue might be IO or DataDog config), but I realize that may be too deep of a rabbit hole.
(I realize the console is a different piece of software than this app, but presumably the agent is what's sending it the info it displays)
Indicate that an error is occurring in the console (show a ! icon by the EC2 instance ID with hover text saying the ECS agent is having issues?)
The prerequisite for that would be defining some concept of instance or agent health (to differentiate between an unhealthy task and an unhealthy agent or instance). We've been exploring some concept of instance health, but it is still poorly defined. You're more than welcome to submit a proposal to give us more ideas though! 馃榾
It would definitely be nice if you detect certain kinds of errors (docker.sock is broken -> the issue might be IO or DataDog config), but I realize that may be too deep of a rabbit hole.
Yea, this is roughly the conclusion we've arrived at as well wrt bucketing docker errors.
Make the ECS agent logs easily accessible in the console
Have you looked at cloudwatch logging for the container instances?
closing this issue for now, feel free to update if you'd like.