Amazon-ecs-agent: Agent 1.20.0 not starting after upgrade from 1.19.1

Created on 8 Aug 2018 · 6 Comments · Source: aws/amazon-ecs-agent

Summary


Agent 1.20.0 does not start with state saved by 1.19.1.

Description


Restarting Docker doesn't help.
Removing all running containers and issuing `start ecs` doesn't help.
The only thing that works is to purge /var/lib/ecs/data/ecs_agent_data.json
and then run
start ecs
However, the agent then registers as a new ECS container instance and loses the previous container instance ARN.

Expected Behavior

After the upgrade, ecs-agent is expected to start.

Observed Behavior

After updating ecs-agent from 1.19.1 to 1.20.0,
the ecs-agent Docker container fails to start.

Environment Details

Supporting Log Snippets

1.19.1 shutdown
2018-08-07T20:36:07Z [INFO] Saving state! module="statemanager"
2018-08-07T20:37:42Z [INFO] Saving state! module="statemanager"
2018-08-07T20:37:43Z [INFO] Loading configuration
2018-08-07T20:37:43Z [INFO] Amazon ECS agent Version: 1.20.0, Commit: cd331230
2018-08-07T20:37:43Z [INFO] Creating root ecs cgroup: /ecs
2018-08-07T20:37:43Z [INFO] Creating cgroup /ecs
2018-08-07T20:37:43Z [INFO] Loading state! module="statemanager"
2018-08-07T20:37:43Z [INFO] Event stream ContainerChange start listening...
2018-08-07T20:37:43Z [CRITICAL] Error loading previously saved state: invalid Volume: must include a type

1.20.0 startup

docker logs ecs-agent
2018-08-08T12:33:52Z [INFO] Loading configuration
2018-08-08T12:33:52Z [INFO] Amazon ECS agent Version: 1.20.0, Commit: cd331230
2018-08-08T12:33:52Z [INFO] Creating root ecs cgroup: /ecs
2018-08-08T12:33:52Z [INFO] Creating cgroup /ecs
2018-08-08T12:33:52Z [INFO] Loading state! module="statemanager"
2018-08-08T12:33:52Z [INFO] Event stream ContainerChange start listening...
2018-08-08T12:33:52Z [CRITICAL] Error loading previously saved state: invalid Volume: must include a type

after purging /var/lib/ecs/data/ecs_agent_data.json

docker logs ecs-agent
2018-08-08T13:30:49Z [INFO] Loading configuration
2018-08-08T13:30:49Z [INFO] Amazon ECS agent Version: 1.20.0, Commit: cd331230
2018-08-08T13:30:49Z [INFO] Creating root ecs cgroup: /ecs
2018-08-08T13:30:49Z [INFO] Creating cgroup /ecs
2018-08-08T13:30:49Z [INFO] Loading state! module="statemanager"
2018-08-08T13:30:49Z [INFO] Event stream ContainerChange start listening...
2018-08-08T13:30:49Z [INFO] Registering Instance with ECS
2018-08-08T13:30:49Z [INFO] Registered container instance with cluster!
2018-08-08T13:30:49Z [INFO] Registration completed successfully. I am running as ''
2018-08-08T13:30:49Z [INFO] Saving state! module="statemanager"
2018-08-08T13:30:49Z [INFO] Beginning Polling for updates
2018-08-08T13:30:49Z [INFO] Event stream DeregisterContainerInstance start listening...
2018-08-08T13:30:49Z [INFO] Initializing stats engine
2018-08-08T13:30:49Z [INFO] NO_PROXY set:169.254.169.254,169.254.170.2,/var/run/docker.sock
2018-08-08T13:30:49Z [INFO] Connected to TCS endpoint
2018-08-08T13:30:49Z [INFO] Connected to ACS endpoint

kind/bug · workaround available

All 6 comments

We are facing the same issue here on prod. We managed to start the agent on version 1.20.0, but it doesn't report the proper info. The ECS instances page shows 0 tasks running, but when we run `docker container list`, it shows 18 containers running.

tailf /var/log/ecs/ecs-agent.log.2018-08-08-14:

2018-08-08T14:48:24Z [DEBUG] No container health metrics to report
2018-08-08T14:48:24Z [DEBUG] Instance is idle. No task metrics to report
2018-08-08T14:48:24Z [DEBUG] TCS client sending payload: {"type":"PublishMetricsRequest","message":{"metadata":{"cluster":"production-ecs","containerInstance":"arn:aws:ecs:eu-west-1:[removed]:container-instance/[removed]","fin":true,"idle":true,"messageId":"[removed]"},"timestamp":1533739704}}
2018-08-08T14:48:24Z [DEBUG] Received message of type: AckPublishMetric
2018-08-08T14:48:24Z [DEBUG] Received AckPublishMetric from tcs

Same issue with the upgrade. After deleting /var/lib/ecs/data it works; we're not experiencing the other issues that @manoelhc describes.

@marksullivancrowd, @vad: We're looking into the issue related to the data file.

@manoelhc: This appears to be unrelated. Do you mind cutting a new issue for this?

@adnxn nope, will create a ticket for my issue.

This issue surfaces when upgrading an agent to 1.20.0 on an instance that is managing tasks that use volumes.

Right now we have workarounds available while we wait for the fix to be released:

  1. Roll back to the older version of the agent.

  2. Terminate the instance and launch a new one.

  3. Modify the state file to include additional fields expected by the agent.

For the third workaround, on an instance where the agent ran into this issue, there should be multiple "volumes" entries in the state file `/var/lib/ecs/data/ecs_agent_data.json`.

e.g.: ` "volumes":[{"host":{"sourcePath":"/home/ec2-user"},"name":"volume1"},{"host":{"sourcePath":"/home/ec2-user"},"name":"volume2"}]`

Add `, "type": "host"` to each of the volume blobs, like this:
`"volumes":[{"host":{"sourcePath":"/home/ec2-user"},"name":"volume1","type":"host"},{"host":{"sourcePath":"/home/ec2-user"},"name":"volume2","type":"host"}]`

Then running `sudo start ecs` should bring the agent back up on 1.20.0.
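
If the state file has many volume entries, the manual edit can be scripted. The following is only a rough sketch, not an official tool: the function names `add_host_volume_type` and `patch_state_file` are made up for illustration, and it assumes that every "volumes" array in the file contains objects shaped like the example above (a "host" key and no "type" key). It keeps a `.bak` copy of the original file before writing anything.

```python
import json
import shutil
from pathlib import Path

# Path taken from this issue; adjust it if your agent stores state elsewhere.
STATE_FILE = Path("/var/lib/ecs/data/ecs_agent_data.json")


def add_host_volume_type(node):
    """Recursively add "type": "host" to volume entries that lack a type.

    Only entries inside a "volumes" list that already have a "host" key are
    touched (an assumption based on the example in this issue).
    Returns the number of entries patched.
    """
    patched = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "volumes" and isinstance(value, list):
                for volume in value:
                    if (isinstance(volume, dict)
                            and "host" in volume
                            and "type" not in volume):
                        volume["type"] = "host"
                        patched += 1
            # Keep walking: "volumes" arrays may be nested anywhere.
            patched += add_host_volume_type(value)
    elif isinstance(node, list):
        for item in node:
            patched += add_host_volume_type(item)
    return patched


def patch_state_file(path=STATE_FILE):
    """Back up the state file, patch it in place, return the patch count."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # keep the original in case something goes wrong
    state = json.loads(path.read_text())
    patched = add_host_volume_type(state)
    if patched:
        # Compact separators keep the output close to the original layout.
        path.write_text(json.dumps(state, separators=(",", ":")))
    return patched
```

With the agent stopped, calling `patch_state_file()` as a user that can write to `/var/lib/ecs/data` should update the file, after which `sudo start ecs` can be tried again. If the agent still fails to load state, restoring the `.bak` copy puts things back where they were.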

We've released agent v1.20.1 that includes a fix for this issue.
