Amazon-ecs-agent: 1.14.2 causing container instances to grind to a halt

Created on 8 Jun 2017 · 30Comments · Source: aws/amazon-ecs-agent

1.14.2, which was just released a few hours ago, appears to be causing major issues in one of our ECS clusters. We have a large number of instances running 1.14.1 and 1.14.2. Many of the instances running 1.14.2 are exhibiting an issue where the instance accepts as many tasks as possible but is never able to transition them from the "PENDING" state to the "RUNNING" state.

We've opened an AWS Support ticket as well but I wanted to post this here in case anyone else is encountering the same issue. We're digging through logs to see what's going on now.

kinbug scopECS Agent

Source

tysonmote

👍2

Most helpful comment

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock

The new AMIs can be found here

vsiddharth on 14 Jun 2017

👍4

All 30 comments

Our first clue is that we see "Pulling container ... (NONE->RUNNING) serially" log lines but not "Finished pulling container" log lines for all of the affected tasks. This means that we're being blocked somewhere in this method:

https://github.com/aws/amazon-ecs-agent/blob/35acca340e91f15ec3b0679064e31ad21f0380d7/agent/engine/docker_task_engine.go#L512-L524

Specifically, we're blocked on ImagePullDeleteLock.Lock(). To confirm this, we killed the ECS agent with the ABRT signal to get a full dump of all goroutines, which showed that we were blocked on that lock.

Fortunately restarting the ECS agent appears to fix the issue (tasks go from PENDING to RUNNING successfully), but the issue will likely just crop up again because it's a race / deadlock.

tysonmote on 8 Jun 2017

(I'm investigating this issue with @tysonmote)

See the stack of one of our (many) ECS agents facing the issue here.

tejasmanohar on 8 Jun 2017

Thanks for sending us the report and stacktraces. It's interesting that agent is reporting that it's pulling serially. What OS/AMI are you using? What docker version?

If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send me the archive directly at nmeyerha at amazon.com We'll also dig in to your stacktrace, which may provide enough detail on its own.

nmeyerhans on 8 Jun 2017

Here's some details about the machines we're running:

Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.10.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.308 GiB
...

We're using AMI CoreOS-stable-1068.10.0-hvm (ami-7d11c51d). I'll email you the logs from one of our instances that ran in to the problem, as well.

Thanks!

tysonmote on 8 Jun 2017

(Unfortunately, CoreOS is not supported by the log collector.)

Edit: my bad, there was previously an operating system check that would fail.

tejasmanohar on 8 Jun 2017

Thanks, I've received the logs.

nmeyerhans on 8 Jun 2017

@tejasmanohar I was able to run it. I've sent that log archive as well.

tysonmote on 8 Jun 2017

I just realized that I had a typo in the original issue message that said the problem was in 1.14.1, not 1.14.2. I fixed that.

tysonmote on 8 Jun 2017

Hi @tysonmote and @tejasmanohar,

We've confirmed that a race condition exists in the ECS agent that can lead to a deadlock. ~We believe that the race condition is more likely in 1.14.2, but exists in 1.14.1 as well.~ We're actively working to fix this race condition and deadlock.

The specific race condition involves two locks that both relate to images: the ImagePullDeleteLock and imageManager.updateLock. The ImagePullDeleteLock is used for serializing image pulls (on older versions of Docker) and for ensuring that we don't delete an image at the same time we're pulling it (this is important for the task launch workflow, since we don't want to successfully pull the image and then fail to create the container because it has gone missing between when pull finished and container creation started). The imageManager.updateLock is used for synchronization of the internal data structure of the imageManager, which tracks images and selects images for deletion. When we're pulling an image, the ImagePullDeleteLock is acquired first for the actual pull and then the imageManager.updateLock is acquired when recording the metadata about the image. When we're deleting an image, the reverse lock acquisition order occurs. This mismatch in lock acquisition order can create a deadlock, since one goroutine can hold ImagePullDeleteLock and block while trying to acquire imageManager.updateLock while the other goroutine can hold imageManager.updateLock and block while trying to acquire ImagePullDeleteLock. This became significantly more likely to occur in 1.14.2 as the scope of imageManager.updateLock was modified to fix a different race condition involving concurrent writes to the imageManager's internal data structure (see https://github.com/aws/amazon-ecs-agent/issues/707 and https://github.com/aws/amazon-ecs-agent/pull/743).

We're working on addressing the lock acquisition order mismatch now and will keep this issue open until that fix is released.

Thanks,
Sam

samuelkarp on 8 Jun 2017

👍1

Correction: This race condition was introduced in 1.14.2. 1.14.1 does not have this deadlock, but does have a potential panic as described in https://github.com/aws/amazon-ecs-agent/issues/707.

samuelkarp on 8 Jun 2017

@samuelkarp Got it, thanks! We'll stay on 1.14.1 for now. FWIW, if this is not going to be fixed ASAP, I'd recommend reverting the code that introduced the deadlock and re-releasing because a panic (that the agent is probably restarted on by process manager) is much, much better than a mysterious deadlock. Also, https://github.com/aws/amazon-ecs-agent/issues/834 may be helpful.

tejasmanohar on 8 Jun 2017

@tejasmanohar Yep, we're exploring both options right now.

samuelkarp on 8 Jun 2017

👍1

amazon/amazon-ecs-agent:latest on DockerHub now points to 1.14.1 again. We have a fix forward in progress here: https://github.com/aws/amazon-ecs-agent/pull/836.

samuelkarp on 9 Jun 2017

🎉2

Good to hear. Thanks for acting quickly, @samuelkarp & team!

tejasmanohar on 9 Jun 2017

Hello @samuelkarp ,

Any ETA for that fix be available and new version is released?

Thanks!

sergiofigueras on 12 Jun 2017

@sergiofigueras We already have a fix merged on the dev branch.
The release process is under way and the new version should be out as soon as our testing and release process is complete.

vsiddharth on 12 Jun 2017

I am still getting 1.14.2 while fetching latest

[ec2-user@ip-172-31-91-192 ~]$ docker run amazon/amazon-ecs-agent:latest version
2017-06-12T23:59:05Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.2 (35acca3)

sharmaansh21 on 13 Jun 2017

@optimisticanshul You would need to re-pull it:

ec2-user@ip-172-30-83-142 ~ » docker pull amazon/amazon-ecs-agent:latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest
ec2-user@ip-172-30-83-142 ~ » docker run -it amazon/amazon-ecs-agent:latest                                                                                          
2017-06-13T00:01:46Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

nehalrp on 13 Jun 2017

@nehalrp It is a new ec2 instance

sharmaansh21 on 13 Jun 2017

@optimisticanshul

I am still getting 1.14.2 while fetching latest

Also me too. So I've added the following workaround to userdata.

ECS_AGENT_VERSION=v1.14.1
docker pull amazon/amazon-ecs-agent:$ECS_AGENT_VERSION
docker tag amazon/amazon-ecs-agent:$ECS_AGENT_VERSION amazon/amazon-ecs-agent:latest

I've avoided automatically update ecs-agent.

masutaka on 13 Jun 2017

We have tagged v1.14.1 as latest on docker hub.

If you pull amazon/amazon-ecs-agent:latest you should get v1.14.1

docker pull amazon/amazon-ecs-agent                                                                                                                                                                                                    
Using default tag: latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest

docker run amazon/amazon-ecs-agent                                                                                                                                                                                                     
2017-06-13T15:57:32Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

vsiddharth on 13 Jun 2017

I'm also still getting v1.14.2 when I autoscale using the designated newest ECS AMI (ami-62745007 in us-east-2). Our ECS clusters autoscale all the time so we're not pulling the agent container manually ourselves. When the AMI starts, in ecs/ecs-init.log:

[root@ip-10-0-86-11 ecs]# more ecs-init.log
2017-06-13T14:30:03Z [INFO] pre-start
2017-06-13T14:30:03Z [INFO] Downloading Amazon EC2 Container Service Agent
2017-06-13T14:30:03Z [DEBUG] Downloading published md5sum from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar.md5

I guess whatever the AMI does isn't just "pull latest."

ernestm on 13 Jun 2017

I guess whatever the AMI does isn't just "pull latest."

That's correct. Each of the ECS-optimized AMIs has a specific version of the agent that is pre-cached. Similarly, when you install ecs-init on a regular Amazon Linux instance, it'll grab a specific version of the agent rather than "latest".

samuelkarp on 13 Jun 2017

Right... But on
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI_launch_latest.html and http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html it says this AMI should be getting me 1.14.1, and it's not not, so I'm not sure as an ECS AMI consumer what I'm supposed to do to downgrade agent versions. Though maybe some of these recent commits are for that?

ernestm on 13 Jun 2017

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock

The new AMIs can be found here

vsiddharth on 14 Jun 2017

👍4

@vsiddharth @samuelkarp even in 1.14.3 my tasks are stuck in pending state for hours.

sharmaansh21 on 29 Jun 2017

😕2

@optimisticanshul The issue discovered with agent versionv1.14.2 should be mitigated by v1.14.3. Would it be possible for you to share the logs for further investigation ?

Please have a look here for the logs we require.

vsiddharth on 29 Jun 2017

Definitely where i send them ?

sharmaansh21 on 30 Jun 2017

Steps to reproduce:
1) ECS Cluster 2 instances.
2) Deploy a sample service.
3) Redeploy

In ecs agent log

./ecs-agent.log.2017-06-30-03:2017-06-30T03:55:43Z [INFO] Stopping container module="TaskEngine" task="prod-whatever:108 arn:aws:ecs:us-east-1:ACCOUNT-ID:task/1730060e-66e6-4795-ac20-551f4d09dd7a, Status: (RUNNING->STOPPED) Containers: [prod-whatever (RUNNING->STOPPED),]" container="prod-whatever(ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME) (RUNNING->STOPPED)"

But output of docker ps

274ba55bfca5        ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME   "/bin/sh -c 'bundl..."   25 minutes ago      Up 25 minutes       80/tcp, 0.0.0.0:10283->8080/tcp   ecs-prod-whatever-108-prod-whatever-feb2cdae8894f389d001

See new tasks are stuck in pending state. If i manually run docker stop 274ba55bfca5 it deploys new tasks.

I am not sure whether it's related to this issue or something else.

sharmaansh21 on 30 Jun 2017

@optimisticanshul you could send them over to sidvin at amazon.com.

vsiddharth on 5 Jul 2017

Was this page helpful?

0 / 5 - 0 ratings