Amazon-ecs-agent: [datadog] Tasks pending/restarting and write unix /var/run/docker.sock->@: write: broken pipe"

Created on 31 Jul 2018 · 14Comments · Source: aws/amazon-ecs-agent

Summary

Today, our ECS clusters have been having issues with deployments hanging, tasks taking a long time to start before being randomly killed, and eventually reaching a steady state where they claim to be running but the load balancers are returning 503.

Description

It seems like this is the same issue as #515, and we just upgraded to the latest ECS optimized AMI, so I wonder if this is a bug in Docker 18.03.1-ce?

We have the DataDog agent (v6) running in daemon mode, and it talks to docker.sock, so that might be related (are daemon instances forcibly terminated)?

Expected Behavior

A new task should go from pending to running fairly quickly and then stay that way.

Observed Behavior

New tasks are pending for several minutes and then get killed for failing ELB health checks.

Environment Details

[ec2-user@ip-10-1-5-182 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        988M  176K  988M   1% /dev
tmpfs           997M     0  997M   0% /dev/shm
/dev/xvda1      7.8G  793M  6.9G  11% /

docker info just hangs

We are using t2 instances, but I confirmed that we have plenty of CPU and IO credits. We're also only using < 25% of the memory on the instances.

Supporting Log Snippets

The ecs agent logs look like:

2018-07-30T22:50:36Z [INFO] DockerGoClient: Unable to retrieve stats for container 3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b: inactivity time exceeded timeout
2018-07-30T22:55:23Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/3f0f4b6b-6920-457a-a5c2-5ff6f6635b02]: task at steady state: RUNNING
2018-07-30T22:55:28Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/2e279aad-9bb4-4719-8f58-c16565cc6e33]: task at steady state: RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/3f0f4b6b-6920-457a-a5c2-5ff6f6635b02]: redundant container state change. https-redirect to RUNNING, but already RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/3f0f4b6b-6920-457a-a5c2-5ff6f6635b02]: task at steady state: RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/2e279aad-9bb4-4719-8f58-c16565cc6e33]: redundant container state change. datadog-agent to RUNNING, but already RUNNING
2018-07-30T22:55:38Z [INFO] Managed task [arn:aws:ecs:us-east-1:327068243994:task/2e279aad-9bb4-4719-8f58-c16565cc6e33]: task at steady state: RUNNING
2018-07-30T22:56:05Z [INFO] DockerGoClient: Unable to retrieve stats for container 3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b: inactivity time exceeded timeout
2018-07-30T22:58:39Z [INFO] DockerGoClient: Unable to retrieve stats for container 2d2821c2e8a5fdba130aeeac70f29513e5638311f68bf4ee8e56545dc7768451: inactivity time exceeded timeout
2018-07-30T22:58:40Z [INFO] DockerGoClient: Unable to retrieve stats for container 3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b: inactivity time exceeded timeout

/var/log/docker:

time="2018-07-30T22:57:55.177362569Z" level=error msg="stream copy error: reading from a closed fifo"
time="2018-07-30T22:57:56.412984925Z" level=warning msg="Health check for container 2d2821c2e8a5fdba130aeeac70f29513e5638311f68bf4ee8e56545dc7768451 error: context deadline exceeded: unknown"
time="2018-07-30T22:58:05.032282275Z" level=error msg="Handler for GET /v1.25/containers/c6c45502b3ac7d25ad44a78b5175a0de5d446f84227b6fd4f6cba4fd634ef8e4/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:05.032337 I | http: multiple response.WriteHeader calls
time="2018-07-30T22:58:11.969442928Z" level=error msg="Handler for GET /v1.25/containers/3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:11.969502 I | http: multiple response.WriteHeader calls
time="2018-07-30T22:58:23.832327176Z" level=error msg="Handler for GET /v1.25/containers/3c97fb747b7578b3ceaf4d02f47971bce9909dd4b07b0b0fbbf49ad9fd05952b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:23.833074 I | http: multiple response.WriteHeader calls
time="2018-07-30T22:58:41.180305863Z" level=error msg="Handler for GET /v1.25/containers/c6c45502b3ac7d25ad44a78b5175a0de5d446f84227b6fd4f6cba4fd634ef8e4/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-07-30 22:58:41.180355 I | http: multiple response.WriteHeader calls

more info needed

Source

brendanlong

👍1

All 14 comments

This person seems to be having the same issue so I don't think it's just me: https://stackoverflow.com/questions/51470492/aws-ecs-starting-a-container-is-very-slow-after-1-day

brendanlong on 31 Jul 2018

👍1

I'm reverting our instances to 2018.03.a and I'll let you know if I see the same problem there.

brendanlong on 31 Jul 2018

👍1

aaronlifton on 31 Jul 2018

I ran into the same problem with 2018.03.a so now I'm trying 2017.09.i since we know it worked for us before.

brendanlong on 31 Jul 2018

Seems to happen with older versions of the ECS AMI too. I wonder if there's something weird going on with DataDog with scheduling=DAEMON. I'm going to try just running it outside of Docker on the EC2 instance :\

brendanlong on 31 Jul 2018

@brendanlong: Thanks for the updates! Please let us know as you collect more data.

Also fwiw,

We are using t2 instances

we've seen issues where Docker will hang on arbitrary API calls during a spike in the CPU usage and I suspect one could be in that situation without actually exhausting all CPU credits.

adnxn on 31 Jul 2018

Hm we were going to upgrade to m4 instances at some point soon anyway so I'll try doing that and see if it helps.

brendanlong on 1 Aug 2018

I'm using m4.large instance with the latest ECS AMI and having the same issue.

shakyaabiral on 1 Aug 2018

It's possible I'm running into this, although CloudWatch says I have plenty of IO credits: https://github.com/DataDog/datadog-agent/issues/1052

brendanlong on 1 Aug 2018

I'm fairly certain we were running into an issue with DataDog's collect_container_size. We turned that option off and things seem to be fine now. If people are seeing similar problems without DataDog, another thing to try would be using a bigger EBS image so you have more IO credits, or maybe try an instance with local storage (m5d instances).

I don't think this is an ecs-agent bug, but there are things you could do to make this bug less painful for ECS customers:

Upload ECS agent logs to CloudWatch and expose them in the console, like you do with tasks
When errors like this occur, show some sort of error message in ECS instead of having everything inexplicably time out

brendanlong on 2 Aug 2018

👍1

@brendanlong: Thanks for doing the investigation on this! Much appreciated.

When errors like this occur, show some sort of error message in ECS instead of having everything inexplicably time out

Do you mind further explaining what an ideal situation would be for you here? Did you mean that if Docker gets into a state where API calls are timing out due to IO credits, then ECS should somehow communicate that?

adnxn on 2 Aug 2018

Showing DataDog-specific errors may be overkill, but it would be nice to:

Indicate that an error is occurring in the console (show a ! icon by the EC2 instance ID with hover text saying the ECS agent is having issues?)
Make the ECS agent logs easily accessible in the console

I don't think the ECS agent necessarily needs to be able to explain what the error is, but just letting the user know (1) you probably need to debug the EC2 instance configuration, not the ECS task and (2) how to find the logs.

It would definitely be nice if you detect certain kinds of errors (docker.sock is broken -> the issue might be IO or DataDog config), but I realize that may be too deep of a rabbit hole.

(I realize the console is a different piece of software than this app, but presumably the agent is what's sending it the info it displays)

brendanlong on 3 Aug 2018

Indicate that an error is occurring in the console (show a ! icon by the EC2 instance ID with hover text saying the ECS agent is having issues?)

The prerequisite for that would be defining some concept of instance or agent health (to differentiate between an unhealthy task and an unhealthy agent or instance). We've been exploring some concept of instance health, but it is still poorly defined. You're more than welcome to submit a proposal to give us more ideas though! 😀

It would definitely be nice if you detect certain kinds of errors (docker.sock is broken -> the issue might be IO or DataDog config), but I realize that may be too deep of a rabbit hole.

Yea, this is roughly the conclusion we've arrived at as well wrt bucketing docker errors.