Amazon-ecs-agent: 1.14.2 causing container instances to grind to a halt

Created on 8 Jun 2017  路  30Comments  路  Source: aws/amazon-ecs-agent

1.14.2, which was just released a few hours ago, appears to be causing major issues in one of our ECS clusters. We have a large number of instances running 1.14.1 and 1.14.2. Many of the instances running 1.14.2 are exhibiting an issue where the instance accepts as many tasks as possible but is never able to transition them from the "PENDING" state to the "RUNNING" state.

We've opened an AWS Support ticket as well but I wanted to post this here in case anyone else is encountering the same issue. We're digging through logs to see what's going on now.

kinbug scopECS Agent

Most helpful comment

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock

The new AMIs can be found here

All 30 comments

Our first clue is that we see "Pulling container ... (NONE->RUNNING) serially" log lines but not "Finished pulling container" log lines for all of the affected tasks. This means that we're being blocked somewhere in this method:

https://github.com/aws/amazon-ecs-agent/blob/35acca340e91f15ec3b0679064e31ad21f0380d7/agent/engine/docker_task_engine.go#L512-L524

Specifically, we're blocked on ImagePullDeleteLock.Lock(). To confirm this, we killed the ECS agent with the ABRT signal to get a full dump of all goroutines, which showed that we were blocked on that lock.

Fortunately restarting the ECS agent appears to fix the issue (tasks go from PENDING to RUNNING successfully), but the issue will likely just crop up again because it's a race / deadlock.

(I'm investigating this issue with @tysonmote)

See the stack of one of our (many) ECS agents facing the issue here.

Thanks for sending us the report and stacktraces. It's interesting that agent is reporting that it's pulling serially. What OS/AMI are you using? What docker version?

If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send me the archive directly at nmeyerha at amazon.com We'll also dig in to your stacktrace, which may provide enough detail on its own.

Here's some details about the machines we're running:

Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.10.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.308 GiB
...

We're using AMI CoreOS-stable-1068.10.0-hvm (ami-7d11c51d). I'll email you the logs from one of our instances that ran in to the problem, as well.

Thanks!

(Unfortunately, CoreOS is not supported by the log collector.)

Edit: my bad, there was previously an operating system check that would fail.

Thanks, I've received the logs.

@tejasmanohar I was able to run it. I've sent that log archive as well.

I just realized that I had a typo in the original issue message that said the problem was in 1.14.1, not 1.14.2. I fixed that.

Hi @tysonmote and @tejasmanohar,

We've confirmed that a race condition exists in the ECS agent that can lead to a deadlock. ~We believe that the race condition is more likely in 1.14.2, but exists in 1.14.1 as well.~ We're actively working to fix this race condition and deadlock.

The specific race condition involves two locks that both relate to images: the ImagePullDeleteLock and imageManager.updateLock. The ImagePullDeleteLock is used for serializing image pulls (on older versions of Docker) and for ensuring that we don't delete an image at the same time we're pulling it (this is important for the task launch workflow, since we don't want to successfully pull the image and then fail to create the container because it has gone missing between when pull finished and container creation started). The imageManager.updateLock is used for synchronization of the internal data structure of the imageManager, which tracks images and selects images for deletion. When we're pulling an image, the ImagePullDeleteLock is acquired first for the actual pull and then the imageManager.updateLock is acquired when recording the metadata about the image. When we're deleting an image, the reverse lock acquisition order occurs. This mismatch in lock acquisition order can create a deadlock, since one goroutine can hold ImagePullDeleteLock and block while trying to acquire imageManager.updateLock while the other goroutine can hold imageManager.updateLock and block while trying to acquire ImagePullDeleteLock. This became significantly more likely to occur in 1.14.2 as the scope of imageManager.updateLock was modified to fix a different race condition involving concurrent writes to the imageManager's internal data structure (see https://github.com/aws/amazon-ecs-agent/issues/707 and https://github.com/aws/amazon-ecs-agent/pull/743).

We're working on addressing the lock acquisition order mismatch now and will keep this issue open until that fix is released.

Thanks,
Sam

Correction: This race condition was introduced in 1.14.2. 1.14.1 does not have this deadlock, but does have a potential panic as described in https://github.com/aws/amazon-ecs-agent/issues/707.

@samuelkarp Got it, thanks! We'll stay on 1.14.1 for now. FWIW, if this is not going to be fixed ASAP, I'd recommend reverting the code that introduced the deadlock and re-releasing because a panic (that the agent is probably restarted on by process manager) is much, much better than a mysterious deadlock. Also, https://github.com/aws/amazon-ecs-agent/issues/834 may be helpful.

@tejasmanohar Yep, we're exploring both options right now.

amazon/amazon-ecs-agent:latest on DockerHub now points to 1.14.1 again. We have a fix forward in progress here: https://github.com/aws/amazon-ecs-agent/pull/836.

Good to hear. Thanks for acting quickly, @samuelkarp & team!

Hello @samuelkarp ,

Any ETA for that fix be available and new version is released?

Thanks!

@sergiofigueras We already have a fix merged on the dev branch.
The release process is under way and the new version should be out as soon as our testing and release process is complete.

I am still getting 1.14.2 while fetching latest

[ec2-user@ip-172-31-91-192 ~]$ docker run amazon/amazon-ecs-agent:latest version
2017-06-12T23:59:05Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.2 (35acca3)

@optimisticanshul You would need to re-pull it:

ec2-user@ip-172-30-83-142 ~ 禄 docker pull amazon/amazon-ecs-agent:latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest
ec2-user@ip-172-30-83-142 ~ 禄 docker run -it amazon/amazon-ecs-agent:latest                                                                                          
2017-06-13T00:01:46Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

@nehalrp It is a new ec2 instance

@optimisticanshul

I am still getting 1.14.2 while fetching latest

Also me too. So I've added the following workaround to userdata.

ECS_AGENT_VERSION=v1.14.1
docker pull amazon/amazon-ecs-agent:$ECS_AGENT_VERSION
docker tag amazon/amazon-ecs-agent:$ECS_AGENT_VERSION amazon/amazon-ecs-agent:latest

I've avoided automatically update ecs-agent.

We have tagged v1.14.1 as latest on docker hub.

If you pull amazon/amazon-ecs-agent:latest you should get v1.14.1

docker pull amazon/amazon-ecs-agent                                                                                                                                                                                                    
Using default tag: latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest

docker run amazon/amazon-ecs-agent                                                                                                                                                                                                     
2017-06-13T15:57:32Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

I'm also still getting v1.14.2 when I autoscale using the designated newest ECS AMI (ami-62745007 in us-east-2). Our ECS clusters autoscale all the time so we're not pulling the agent container manually ourselves. When the AMI starts, in ecs/ecs-init.log:

[root@ip-10-0-86-11 ecs]# more ecs-init.log
2017-06-13T14:30:03Z [INFO] pre-start
2017-06-13T14:30:03Z [INFO] Downloading Amazon EC2 Container Service Agent
2017-06-13T14:30:03Z [DEBUG] Downloading published md5sum from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar.md5

I guess whatever the AMI does isn't just "pull latest."

I guess whatever the AMI does isn't just "pull latest."

That's correct. Each of the ECS-optimized AMIs has a specific version of the agent that is pre-cached. Similarly, when you install ecs-init on a regular Amazon Linux instance, it'll grab a specific version of the agent rather than "latest".

Right... But on
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI_launch_latest.html and http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html it says this AMI should be getting me 1.14.1, and it's not not, so I'm not sure as an ECS AMI consumer what I'm supposed to do to downgrade agent versions. Though maybe some of these recent commits are for that?

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock

The new AMIs can be found here

@vsiddharth @samuelkarp even in 1.14.3 my tasks are stuck in pending state for hours.

@optimisticanshul The issue discovered with agent versionv1.14.2 should be mitigated by v1.14.3. Would it be possible for you to share the logs for further investigation ?

Please have a look here for the logs we require.

Definitely where i send them ?

Steps to reproduce:
1) ECS Cluster 2 instances.
2) Deploy a sample service.
3) Redeploy

In ecs agent log

./ecs-agent.log.2017-06-30-03:2017-06-30T03:55:43Z [INFO] Stopping container module="TaskEngine" task="prod-whatever:108 arn:aws:ecs:us-east-1:ACCOUNT-ID:task/1730060e-66e6-4795-ac20-551f4d09dd7a, Status: (RUNNING->STOPPED) Containers: [prod-whatever (RUNNING->STOPPED),]" container="prod-whatever(ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME) (RUNNING->STOPPED)"

But output of docker ps

274ba55bfca5        ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME   "/bin/sh -c 'bundl..."   25 minutes ago      Up 25 minutes       80/tcp, 0.0.0.0:10283->8080/tcp   ecs-prod-whatever-108-prod-whatever-feb2cdae8894f389d001

See new tasks are stuck in pending state. If i manually run docker stop 274ba55bfca5 it deploys new tasks.

I am not sure whether it's related to this issue or something else.

@optimisticanshul you could send them over to sidvin at amazon.com.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

soumyasmruti picture soumyasmruti  路  5Comments

pspanchal picture pspanchal  路  3Comments

melo picture melo  路  5Comments

cjbottaro picture cjbottaro  路  4Comments

dcosson picture dcosson  路  3Comments