Amazon-ecs-agent: ECS agent disconnected when a container without a hard memory limit uses all available memory

Created on 5 Apr 2017 · 11 Comments · Source: aws/amazon-ecs-agent

We've noticed that the ECS agent on our instances gets disconnected permanently (and new tasks cannot be assigned to it) when a running container (with only a memoryReservation set) uses up all the available memory on the instance. Our workaround will be to set hard memory limits on each container, and I plan to write a monitoring process that can detect the disconnected instances and restart them, though it would be great to have a built-in solution.
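
A monitoring process like the one described could start from the sketch below. It assumes Python with boto3 and credentials that allow `ecs:ListContainerInstances` and `ecs:DescribeContainerInstances`; the function names are hypothetical and not part of the agent. It relies on the `agentConnected` field that `DescribeContainerInstances` returns for each container instance. What to do with a disconnected instance (reboot, terminate, alert) is deliberately left out.

```python
from typing import Dict, Iterable, List


def find_disconnected(container_instances: Iterable[Dict]) -> List[str]:
    """Return EC2 instance IDs whose ECS agent is not reported as connected.

    An entry without an agentConnected field is treated as disconnected.
    """
    return [
        ci["ec2InstanceId"]
        for ci in container_instances
        if not ci.get("agentConnected", False)
    ]


def describe_cluster_instances(cluster: str) -> List[Dict]:
    """Fetch container instance descriptions for one cluster (needs boto3)."""
    import boto3  # imported lazily so find_disconnected works without it

    ecs = boto3.client("ecs")
    described: List[Dict] = []
    paginator = ecs.get_paginator("list_container_instances")
    for page in paginator.paginate(cluster=cluster):
        arns = page["containerInstanceArns"]
        if not arns:
            continue
        resp = ecs.describe_container_instances(
            cluster=cluster, containerInstances=arns
        )
        described.extend(resp["containerInstances"])
    return described
```

Calling `find_disconnected(describe_cluster_instances("my-cluster"))` (the cluster name is a placeholder) would list the affected instances. Because an agent can be disconnected briefly during normal operation, a real monitor would probably act only after an instance shows up in several consecutive checks.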

Labels: kind/bug, scope/ECS Agent

All 11 comments

@krishan-carbon Just to clarify, does the agent recover after the container that uses all available memory is stopped?

Additionally, if you can collect logs from both Docker and the ECS agent when this is happening, we can debug more effectively. The easiest way to collect the relevant information is with the ECS logs collector. If you're not comfortable sharing this information publicly, please feel free to send it to me directly at skarp at amazon.com.

@samuelkarp No, the agent does not recover; it stays registered as disconnected and new containers can't be allocated to it. We have to restart the instance to get it back. I will work on getting logs to you. We have now set hard memory limits on all containers, so I will have to set up a new cluster and container to simulate this.

@krishan-carbon Were you able to collect any logs?

We've run into this a couple of times. The agent disconnects permanently and the instance becomes unresponsive. With the latest occurrence, the task was still shown as running in the console.

@samuelkarp I have sent you some logs.

@OscarBarrett Thanks! We'll look at those and update this as we have more information.

@krishan-carbon Sorry for the late response. After looking through the logs, it appears that the agent had a problem connecting to the internet, as indicated by the error below. Around the time the agent was disconnected, did the instance have any problem connecting to the internet?

./ecs-agent.log.2017-07-17-19:2017-07-17T19:16:11Z [WARN] Error creating a websocket client: dial tcp: i/o timeout

I also tried to reproduce the issue by using the stress command to consume all the available memory on the instance, with only a soft memory limit set for the container. After that, every command I ran hung, but the agent still showed as connected in the ECS console and was able to stop the task.

@OscarBarrett @krishan-carbon If possible, could you share the steps/images to reproduce this issue? Alternatively, you could reproduce it on your side and send us the debug-level logs and stack trace. The detailed steps to collect them are:

  1. Set ECS_LOGLEVEL=debug in /etc/ecs/ecs.config.
  2. Start the agent and try to reproduce the scenario.
  3. Wait for the agent to stay disconnected for at least 3 minutes. The agent can periodically connect to and disconnect from the backend, but a normal disconnection shouldn't last longer than 2 minutes.
  4. Send SIGUSR1 to the agent with docker kill -s USR1 ecs-agent; this generates a stack trace in the agent logs.
  5. Collect the agent logs and Docker logs, and record the disconnection timestamps shown in the console; this will help us root-cause the problem. The logs can be collected with the ECS logs collector.
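
Steps 1 and 4 above can be scripted. The following is a minimal sketch, assuming Python 3 is available on the instance and the script runs with enough privileges to edit the config file and talk to Docker; the helper names are hypothetical. Only the config path, the ECS_LOGLEVEL setting, and the docker kill command come from the steps themselves.

```python
import subprocess
from typing import List

ECS_CONFIG_PATH = "/etc/ecs/ecs.config"
AGENT_CONTAINER = "ecs-agent"


def set_loglevel(config_text: str, level: str = "debug") -> str:
    """Return config text with exactly one ECS_LOGLEVEL=<level> line."""
    kept = [
        line
        for line in config_text.splitlines()
        if not line.strip().startswith("ECS_LOGLEVEL=")
    ]
    kept.append(f"ECS_LOGLEVEL={level}")
    return "\n".join(kept) + "\n"


def stack_trace_command(container: str = AGENT_CONTAINER) -> List[str]:
    """Build the docker command that sends SIGUSR1 to the agent container."""
    return ["docker", "kill", "-s", "USR1", container]


def enable_debug_logging(path: str = ECS_CONFIG_PATH) -> None:
    """Step 1: rewrite the config file in place."""
    with open(path) as f:
        updated = set_loglevel(f.read())
    with open(path, "w") as f:
        f.write(updated)


def dump_agent_stack_trace() -> None:
    """Step 4: ask the agent to write a stack trace into its own log."""
    subprocess.run(stack_trace_command(), check=True)
```

The agent reads its configuration at startup, so `enable_debug_logging()` would need to run before the agent is started in step 2, and `dump_agent_stack_trace()` only once the agent has stayed disconnected as described in step 3.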

Below is the task definition I used to try to reproduce this, but the agent seems to work fine with it:

{
    "family": "memory-stress",
    "containerDefinitions": [
    {
        "name": "ubuntu",
        "image": "ubuntu",
        "cpu": 300,
        "memoryReservation": 1000,
        "essential": true,
        "command": ["bash", "-c", "apt-get update && apt-get install stress && stress -c 2 -i 1 -m 1 --vm-bytes 1200M -v --vm-keep"]
    }
    ]
}

Thanks,
Peng

I don't have any spare time to reproduce this right now unfortunately.

RE steps to reproduce, our task was running a periodic cronjob several times a day that could hit the memory limit (some inherited code that we've now improved). Sometimes it would OOM and be killed prematurely, without the agent disconnecting. Over a period of weeks the agent would eventually disconnect (assumedly after one of these events).
Looking at some of the logs I have, one of these disconnections was triggered after ~75 memory allocation failures.

@OscarBarrett Are you still seeing this issue? The repro done by @richardpen with the soft memory limit shows that the agent behaves as expected.

Over a period of weeks the agent would eventually disconnect (assumedly after one of these events). Looking at some of the logs I have, one of these disconnections was triggered after ~75 memory allocation failures.

Does the agent stay disconnected?

Hi @OscarBarrett, I'm closing this issue for now since it seems that the agent was behaving as expected. Please let us know if you run into this again and we can re-open the issue.

Thanks,
Anirudh

I noticed this in my own clusters recently. I figured that this was what was happening, and am glad I found this to confirm it's "behaving as expected."

It's concerning that the instance becomes unresponsive, leaving termination the only recovery path. I've since added hard limits to the task definitions, since they're not really optional.
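
For anyone applying the same workaround, here is a rough sketch of adding hard limits to an existing task definition. In an ECS container definition, `memory` is the hard limit in MiB (the container is killed if it exceeds it) and `memoryReservation` is the soft limit; when both are set, `memory` must be greater than `memoryReservation`. The helper name and the 256 MiB headroom are assumptions for illustration only, not a recommendation from the ECS team.

```python
import copy
from typing import Dict


def with_hard_limit(task_def: Dict, headroom_mib: int = 256) -> Dict:
    """Return a copy of task_def where each container has a hard limit.

    Containers that already define `memory` are left alone. Containers with
    only `memoryReservation` get memory = memoryReservation + headroom_mib.
    Containers with neither setting are left unchanged.
    """
    if headroom_mib <= 0:
        raise ValueError("headroom_mib must be positive so memory > memoryReservation")
    result = copy.deepcopy(task_def)
    for container in result.get("containerDefinitions", []):
        if "memory" in container:
            continue
        soft = container.get("memoryReservation")
        if soft is None:
            continue  # nothing to derive a hard limit from
        container["memory"] = soft + headroom_mib
    return result
```

Applied to the memory-stress definition from earlier in the thread (memoryReservation of 1000), this would yield a hard limit of 1256 MiB with the default headroom.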
