Amazon-ecs-agent: Tasks get frequently stuck in "PENDING" state since updating from v1.17.0 to v1.17.2

Created on 13 Mar 2018  路  15Comments  路  Source: aws/amazon-ecs-agent

Summary


Tasks are occasionally stuck (~1/7 or ~1/8 roughly) in a PENDING state on the ECS console.

Description


Ever since updating our ECS-optimized AMI from amzn-ami-2017.09.h-amazon-ecs-optimized to amzn-ami-2017.09.j-amazon-ecs-optimized (and therefore, the ECS agent from v1.17.0 to v1.17.2), tasks apparently randomly get stuck in a PENDING state on the ECS console. The task itself that is trying to start does not appear to be relevant, as it happens on multiple unrelated services that were previously running smoothly, and on multiple AWS accounts where the only change in common was the AMI update.

Going onto any instance where a PENDING task is supposed to be starting on and running a docker ps -a shows no sign of the task having ever started, though other tasks that started up successfully are there.

Manually stopping the task and letting ECS re-create a new task does usually fix the issue, though sometimes the new task can also get stuck, even if it is on a different instance.

Expected Behavior

Tasks are able to move from a "PENDING" state to a "RUNNING" state consistently

Observed Behavior

They do not.

Environment Details

AWS region - eu-west-1
Spot fleet instances of various sizes.

Supporting Log Snippets

The following aws agent logs are related to a stuck task with id: 0341c73f-d125-4920-bbe2-2ceb7b3fe00a, which the console suggests was created at 2018-03-13 14:45:52 +0000.

A look into other instances with the new AMI show similar sets of warnings and errors, but I'm happy to provide more if it helps figure out what is going on!

2018-03-13T14:45:42Z [INFO] Connected to TCS endpoint
2018-03-13T14:45:42Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:45:42Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:45:42Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:45:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:45:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:45:56Z [INFO] Saving state! module="statemanager"
2018-03-13T14:45:56Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/0341c73f-d125-4920-bbe2-2ceb7b3fe00a]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2018-03-13T14:45:56Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/0341c73f-d125-4920-bbe2-2ceb7b3fe00a]: waiting for any previous stops to complete. Sequence number: 1200
2018-03-13T14:45:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:46:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:02Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:46:02Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:46:02Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:46:02Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:46:06Z [INFO] Saving state! module="statemanager"
2018-03-13T14:46:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:22Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:46:22Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:46:22Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:46:22Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:46:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:46:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:46:27Z [INFO] Saving state! module="statemanager"
2018-03-13T14:46:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: Cgroup resource set up for task complete
2018-03-13T14:46:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: stopping container [FelixContainer]
2018-03-13T14:46:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:37Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-03-13T14:46:37Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-03-13T14:46:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:47:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:47:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:47:27Z [INFO] Saving state! module="statemanager"
2018-03-13T14:47:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: Cgroup resource set up for task complete
2018-03-13T14:47:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: stopping container [FelixContainer]
2018-03-13T14:47:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:39Z [INFO] Connected to TCS endpoint
2018-03-13T14:47:39Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:47:39Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:47:39Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:47:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:48Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/a58439f2-352b-4e7e-b99d-01a0a17521aa]: redundant container state change. ArticleMetadataAPI to RUNNING, but already RUNNING
2018-03-13T14:47:48Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/1beac7a2-93d6-4665-8fb4-5220f654fbc6]: redundant container state change. ElasticContainer to RUNNING, but already RUNNING
2018-03-13T14:47:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:47:59Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:47:59Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:47:59Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:47:59Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:48:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:19Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:48:19Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:48:19Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:48:19Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:48:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:48:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:48:27Z [INFO] Saving state! module="statemanager"
2018-03-13T14:48:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: Cgroup resource set up for task complete
2018-03-13T14:48:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: stopping container [FelixContainer]
2018-03-13T14:48:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:34Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-03-13T14:48:34Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-03-13T14:48:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:49:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:49:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:49:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:49:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:49:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:49:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:49:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:49:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:49:27Z [INFO] Saving state! module="statemanager"
kinbug scopDependency workaround available

Most helpful comment

Are there any plans to release a new, official AMI that downgrades the docker version but keeps the new ecs agent version until a proper fix can be found?

All 15 comments

Just to update, when I roll back all my instances to amzn-ami-2017.09.h-amazon-ecs-optimized (v1.17.0), the problem immediately goes away again.

@JeffDownie We're sorry to hear that you're running into this issue.
From the logs, it looks like agent is trying to stop a previous task before starting the task 0341c73f-d125-4920-bbe2-2ceb7b3fe00a. But it gets Docker timeout errors when trying to inspect or stop containers from the previous task.
To debug more, could you please enable debug level logs for ECS agent and Docker, collect the logs and send them over to sharanyd at amazon.com?
Thanks.

Also, it looks like inspect hang error is a known issue in Docker 17.12.0: https://github.com/docker/for-mac/issues/2378

It looks like that the docker inspect hang error could be right - I tried manually upgrading the ecs agent (using sudo yum update -y ecs-init && sudo ecs stop && sleep 5 && sudo ecs start) without updating docker on the instances using the AMI amzn-ami-2017.09.h-amazon-ecs-optimized (v1.17.0) to v1.17.2, and the issue does not return.

Do you know of any other changes between the AMI's - I'm not sure how I'd go about finding any other differences.

having this issue with docker 17.02 and latest agent. trying upgrading to the docker that supposed to match the new agent

The latest ECS Optimized AMI (2017.09.j) includes an update to the Docker daemon bumping it to 17.12.0-ce. The Docker upstream maintainers are aware of the issue and are tracking it in https://github.com/moby/moby/issues/35933 and they are cutting a release to address the bug.

Edit: to clarify, the upstream has what looks like a partial fix, we will need to test and confirm.

I'm adding a 'workaround available' label for this issue as downgrading to 17.09.1ce version of Docker should make this issue go away till we figure out a fix for this.

I hope that won't down prioritize this because the official Amazon images come with a set version of Docker. It doesn't really make sense to ask users to start baking their own images to work around Amazon's own problem. Doesn't Fargate use the ECS images too?

I've been waiting until this is fixed before updating the ECS images we use as this sounds like a show-stopper really.

We have reported the issue to upstream in an issue and are exploring the problem now: https://github.com/moby/moby/issues/36661

Cool, thanks. My concern is just that Amazon is shipping ECS AMIs with known issues with the Docker version. Releasing new AMIs with a lower-pinned version could be messy I suppose. I hope that this can be resolved fairly quickly from the Docker side but that would still depend on a release on both sides.

Are there any plans to release a new, official AMI that downgrades the docker version but keeps the new ecs agent version until a proper fix can be found?

Hi @JeffDownie, we've released a new AMI, which should mitigate this issue. Please refer to https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html for more details.

@aaithal I am having the same issue with the latest AMI (amzn-ami-2017.09.l-amazon-ecs-optimized (ami-69677709)), in that the task is continually PENDING.

Here is the /var/log/ecs/ecs-agent.log.2018-04-26-01 of the continually PENDING task:

2018-04-26T01:33:51Z [INFO] Managed task [arn:aws:ecs:us-west-1:689450420564:task/5bbad62b-c7eb-4f1d-96e3-9feb96f4d816]: wait over; ready to move towards status: RUNNING
2018-04-26T01:33:51Z [INFO] Creating cgroup /ecs/a7f5c0f3-a535-417c-b567-e2d60d296ba8
2018-04-26T01:33:51Z [INFO] Creating cgroup /ecs/5bbad62b-c7eb-4f1d-96e3-9feb96f4d816
2018-04-26T01:33:51Z [INFO] Managed task [arn:aws:ecs:us-west-1:689450420564:task/a7f5c0f3-a535-417c-b567-e2d60d296ba8]: Cgroup resource set up for task complete
2018-04-26T01:33:51Z [INFO] Managed task [arn:aws:ecs:us-west-1:689450420564:task/5bbad62b-c7eb-4f1d-96e3-9feb96f4d816]: Cgroup resource set up for task complete
2018-04-26T01:33:51Z [INFO] Task engine [arn:aws:ecs:us-west-1:689450420564:task/5bbad62b-c7eb-4f1d-96e3-9feb96f4d816]: pulling container stratos_engine_reprocess-ro concurrently
2018-04-26T01:33:51Z [INFO] Task engine [arn:aws:ecs:us-west-1:689450420564:task/5bbad62b-c7eb-4f1d-96e3-9feb96f4d816]: Recording timestamp for starting image pulltime: 2018-04-26 01:33:51.925469694 +0000 UTC m=+4.414380268
2018-04-26T01:33:51Z [INFO] Task engine [arn:aws:ecs:us-west-1:689450420564:task/a7f5c0f3-a535-417c-b567-e2d60d296ba8]: pulling container stratos_engine_reprocess-ro concurrently
2018-04-26T01:33:51Z [INFO] Task engine [arn:aws:ecs:us-west-1:689450420564:task/a7f5c0f3-a535-417c-b567-e2d60d296ba8]: Recording timestamp for starting image pulltime: 2018-04-26 01:33:51.92560392 +0000 UTC m=+4.414514492
2018-04-26T01:33:58Z [INFO] Saving state! module="statemanager"
2018-04-26T01:45:53Z [WARN] DockerGoClient: failed to pull image 689450420564.dkr.ecr.us-west-1.amazonaws.com/boulder-stratos-engine:v4.2.6: inactivity time exceeded timeout
2018-04-26T01:45:53Z [WARN] DockerGoClient: failed to pull image 689450420564.dkr.ecr.us-west-1.amazonaws.com/boulder-stratos-engine:v4.2.6: inactivity time exceeded timeout
2018-04-26T01:46:48Z [INFO] TCS Websocket connection closed for a valid reason
2018-04-26T01:46:48Z [INFO] Connected to TCS endpoint
2018-04-26T01:56:50Z [INFO] ACS Websocket connection closed for a valid reason
2018-04-26T01:56:51Z [INFO] Connected to ACS endpoint
2018-04-26T01:56:54Z [INFO] Saving state! module="statemanager"
2018-04-26T01:56:54Z [INFO] Managed task [arn:aws:ecs:us-west-1:689450420564:task/a7f5c0f3-a535-417c-b567-e2d60d296ba8]: Cgroup resource set up for task complete
2018-04-26T01:56:54Z [INFO] Managed task [arn:aws:ecs:us-west-1:689450420564:task/5bbad62b-c7eb-4f1d-96e3-9feb96f4d816]: Cgroup resource set up for task complete
2018-04-26T01:57:04Z [INFO] Saving state! module="statemanager"
2018-04-26T01:57:48Z [INFO] TCS Websocket connection closed for a valid reason
2018-04-26T01:57:48Z [INFO] Connected to TCS endpoint

Any advice?

Hi @timduly4, are you still seeing the same symptoms described earlier in this thread, where you see docker inspect command for a particular container just 'hanging' (See https://github.com/aws/amazon-ecs-agent/issues/1294#issuecomment-372750965)? It might be a different issue that you're running into. Can you please create a new issue for this with relevant log files and the task id that's stuck in PENDING? You can send these to aithal at amazon dot com. Instructions for gathering logs can be found here.

Thanks,
Anirudh

@aaithal , I am not seeing this issue this morning, although I wrestled with it all of yesterday (2018-04-25).

Regarding:

where you see docker inspect command for a particular container just 'hanging' (See #1294 (comment))?

I don't quite understand-- can you detail more of what I am supposed to provide?

If this problem appears again, I will start a new issue and provide the logs that you described. Thanks.

Was this page helpful?
0 / 5 - 0 ratings