Amazon-ecs-agent: Agent randomly disconnected from ECS, flapping (>= v1.11)

Created on 19 Aug 2016 · 36Comments · Source: aws/amazon-ecs-agent

After upgrade to v1.11.0 (from docker hub) ECS cluster is unstable, with agents disconnecting and reconnecting after a while, while still running, similar to #316.
Running tasks are fine but agent connection flapping prevents scheduling of new tasks.
Cluster is not under big load, with only 2 single-container tasks running besides ecs-agent.

Same behaviour happens with v1.12.0.
Hosts have been upgraded to Docker 1.11.2 at the same time, but work fine with agent v1.10.

Here is the agent log file, with a stacktrace while disconnected: https://gist.github.com/josqu4red/fc1bc11b2822b5050b7d4850c5e18f71

And agent config:

ECS_LOGFILE = "/log/ecs-agent.log" 
ECS_DATADIR ="/data"
ECS_LOGLEVEl = "debug"
ECS_AVAILABLE_LOGGING_DRIVERS = ["json-file","fluentd"]
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION = "10m"

more info needed

Source

josqu4red

All 36 comments

@josqu4red Thank you for providing logs!

I've taken a look at your logs and also at our server logs. What I'm seeing in the server logs is that the agent is successfully connecting each time it attempts (requests get through, correctly upgrade to a websocket), but after our server attempts to send a payload down over the websocket the connection appears to be interrupted.

In the agent logs you provided, I see Error from acs; backing off, err: unexpected EOF which seems to indicate that the agent did not intentionally disconnect. I also see that you have an HTTP proxy configured (proxy:3128).

Do you still see this behavior without using a proxy? It's possible that your proxy is interrupting the websocket connection that the agent needs.

samuelkarp on 20 Aug 2016

I see this behavior as well, but don't have any of the same precursors as OP, and suspect it's related to #488 since I haven't seen it after changing the default container nofiles ulimit to an excessively high number (65,536) via /etc/sysconfig/docker.

ziggythehamster on 22 Aug 2016

@ziggythehamster We've just released 1.12.1. Can you let us know if you continue to see this problem on the new version?

samuelkarp on 22 Aug 2016

I will - but since I also changed the ulimits, it may be some time until the problem reoccurs, if it even does.

ziggythehamster on 22 Aug 2016

Thanks for the analysis, @samuelkarp.
I'll try to set up a cluster in a public subnet to compare behaviors. I've not mentionned that the disconnection does not seem to happen on an idle cluster (e.g with only agent container running).
Please note that the ECS cluster always ran behind proxy without a problem.

josqu4red on 23 Aug 2016

@josqu4red If you're not seeing this happen on an idle cluster, @ziggythehamster's suggestion that this might be related to #488 could be right. Can you try with the new 1.12.1 version we just released?

samuelkarp on 23 Aug 2016

👍1

@samuelkarp Problem indeed seems to come from the proxy, as agents on public subnets' container instances look stable.
I'm curious about what change from v1.10 could have caused this, since it worked well before.
Do you have suggestion on how to fix this, is there anything doable agent-side ?
Because squid still don't support websocket apparently...
Thanks.

EDIT: tested with v1.12.1

josqu4red on 24 Aug 2016

Hi,

I'm seeing this issue too. I've updated to agent v1.12.1 and am still seeing issues. I'm also running behind a proxy.

Here's some logs

2016-08-24T23:18:05Z [INFO] Creating poll dialer, proxy: proxy.<XXXX>:8080
2016-08-24T23:19:05Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-08-24T23:21:11Z [INFO] Creating poll dialer, proxy: proxy.<XXXX>:8080
2016-08-24T23:22:11Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-08-24T23:24:18Z [INFO] Creating poll dialer, proxy: proxy.<XXXX>:8080

I see the agent coming and going in step with the errors in the log.

As above I've noticed that if I increate the nofiles ulimit it still gives similar errors but seems to not be flagged as disconnected so often. This is purely subjective.

All this was working perfectly well until we moved the the 1.12.X agent.

gazmania on 25 Aug 2016

@josqu4red @gazmania We haven't changed anything in the code to use a proxy in recent versions. Can you re-test with 1.12.1 behind your proxy to determine if it's related to #488?

@gazmania In order for me to correlate anything with our server logs I'd need the full copy of your logs. If you're not comfortable posting publicly, you can send them to me at skarp (at) amazon.com. However, if you're seeing the exact same behavior as @josqu4red you might also look at your proxy to see if it's causing the disconnection.

samuelkarp on 25 Aug 2016

@samuelkarp Thanks. I'll chase up the proxy guys first as on the surface it does look to be something to do with that. I'll let you know what the outcome is

gazmania on 25 Aug 2016

@samuelkarp Sorry if I wasnt clear.
The ECS cluster used for my previous test spans on 2 ASG - one behind proxy, one with direct access, all agents are 1.12.1. Agents behind proxy are still experiencing the problem.
I don't see anything relating to #488.
Here is a new log extract with a stacktrace, if you want to look into it: https://gist.github.com/josqu4red/4a21f118e576fb9e98cebb780bfc59e0
Thanks!

josqu4red on 25 Aug 2016

Thanks for clarifying! I don't see anything in the logs/stacktrace that indicate why it's failing. Can you check into the proxy you're using to see if it has logs for the disconnects? We have a test for proxying using Squid that has been passing with all versions since we added proxy support.

samuelkarp on 25 Aug 2016

Unfortunately, nothing interesting in proxy logs. Also I did not mention there's an ELB in TCP mode between agents and proxies.
I'm capturing network traffic to locate origin of disconnection

josqu4red on 30 Aug 2016

It appears that ELB is closing the agent's connection. Raising the idle timeout from 60 to 120 seconds fixes the issue.
We still think that a change from v1.10 introduced the problem, since we didn't change anything in our stack, and v1.10 is running fine. Maybe a change in go standard library, between 1.4 and 1.6 ?

josqu4red on 30 Aug 2016

Hi,

I'm still having issues but have talked to our proxy guys and they think it's to do with TCP packet sizes being bigger than the allowed MTU. I dont really understand this stuff but that was the gist. They are looking to make some changes to not kick these packets out.

Cheers....

gazmania on 1 Sep 2016

I am having the same issue - happened twice again today. I've had it happen with container instances running t2.medium and c4.xlarge. Versions are 1.11.0 for agent and 1.11.2 for Docker.

I happened to be watching the cluster earlier today when it happened, and the cluster failed to launch a new service for about an hour before the instance finally showed "Agent Connected = False". And that seems to be typical when having this agent connectivity issue - i.e. new tasks/services get "stuck" for a while before the agent finally shows disconnected. The service will show as pending while nothing new shows up in the Events tab.

The second time it happened today I noticed that the instance was alternating between connected and not connected for a while, before I finally terminated the instance. Once the instance was replaced, the service started right up.

All of my container instances are running in private subnets - i.e. behind EC2 NAT Gateways. In the past I've had network issues with public vs. private subnets in the way of MTU size, where the private subnet traffic would be very hit-or-miss until I forced MTU size to "normal" - i.e. not the jumbo packet size. In other words, when routed through an EC2 Internet Gateway, traffic from an EC2 instance was fine. When I moved the instance to a private subnet, I started having network issues that went away after forcing MTU size (these were Ubuntu instances) to 1500-ish. So while not directly experienced in an ECS cluster, I am wondering if the same is true here?? Seems like some of the comments above mention public versus private subnets, MTU size, and proxy vs. no proxy.

So my from-the-hip guess is that something is happening with the agent network traffic and MTU when routed through a NAT Gateway, proxy, ELB, etc.

I am more than happy to fetch logs, config, whatever is needed to help. I haven't done so yet, so feel free to point me in the right direction. I also have ECS clusters currently available for testing/experimenting if needed.

EDIT: all my container instances were launched from amzn-ami-2016.03.e-amazon-ecs-optimized (ami-55870742)

jamiekurtz on 7 Sep 2016

As another datapoint, we think we're encountering this as well, behind a EC2 NAT gateway, hosts running CoreOS, Docker version 1.11.2

suan on 9 Sep 2016

Same behaviour as OP described. I'm behind a corporate proxy using agent version 1.12.1.
Any suggestions/workarounds?

justjulian on 21 Sep 2016

@jamiekurtz that's certainly interesting to note. Do you know why jumbo frames mess up the proxy configuration? Can you also let us know the MTU size configured on your instance where you saw this issue?

I went through our release notes again and noticed that for the 1.11.0 of the ECS Init release (corresponds to ECS Agent v1.11.0), we modified the ECS Agent config so that its launched with --net=host. Before that, ECS Agent, which runs as a container itself would be launched with --net=bridge and would get its own net namespace.

The interesting thing here is that docker sets the MTU to 1500 for containers launched using the bridge mode. Here's a dump from such a container:

$ ip netns exec 182de67f30e7 ifconfig eth0
eth0      Link encap:Ethernet  HWaddr XX
          inet addr:XX  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: XX Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:102 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:8236 (8.0 KiB)  TX bytes:648 (648.0 b)

Where as, the MTU size on the host is set to 9001

$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr XX
          inet addr:XX  Bcast:XX  Mask:XX
          inet6 addr: XX Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:129676 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27794 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:169771443 (161.9 MiB)  TX bytes:5998804 (5.7 MiB)

You could validate this by stopping the ECS Agent and manually running the ECS Agent container with --net=host param on a test instance.

Here's an example for doing the same on the instance launched with the ECS Optimized AMI:

$ stop ecs
$ docker rm -f ecs-agent
$ docker run \
        --name ecs-agent \
        -v /var/run:/var/run \
        -v /var/log/ecs:/log \
        -v /var/lib/ecs/data:/data \
        -v /etc/ecs:/etc/ecs \
        -v /var/cache/ecs:/var/cache/ecs \
        -p 127.0.0.1:51678:51678 \
        -p 127.0.0.1:51679:51679 \
        -e ECS_UPDATES_ENABLED=true \
        -e ECS_AVAILABLE_LOGGING_DRIVERS='["json-file","syslog","awslogs"]' \
        -e ECS_ENABLE_TASK_IAM_ROLE=true \
        -e ECS_UPDATE_DOWNLOAD_DIR=/var/cache/ecs \
        -e ECS_LOGFILE=/log/ecs-agent.log \
        -e ECS_DATADIR=/data \
        -e ECS_LOGLEVEL=debug \
        -e ECS_CLUSTER=proxy \
        -e ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true \
        -e ECS_AGENT_CONFIG_FILE_PATH=/etc/ecs/ecs.config.json \
        -d amazon/amazon-ecs-agent:latest

I am also trying to repro this issue by running the ECS Agent behind a squid proxy.

cc @justjulian @suan @josqu4red

aaithal on 21 Sep 2016

I've just reconfigured out Amz Optimised ECS instances with an MTU of 1500. Still getting the same issue.

2016-09-21T19:32:48Z [INFO] Creating poll dialer, proxy: proxy.####:8080
2016-09-21T19:33:48Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-09-21T19:34:20Z [INFO] Error from tcs; backing offerrunexpected EOF
2016-09-21T19:34:21Z [INFO] Creating poll dialer, proxy: proxy.####:8080
2016-09-21T19:35:56Z [INFO] Creating poll dialer, proxy: proxy.####:8080
2016-09-21T19:36:55Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-09-21T19:39:13Z [INFO] Creating poll dialer, proxy: proxy.####:8080

gazmania on 21 Sep 2016

@josqu4red I took a look at your logs. The acs is disconnected exactly every 1 minute for the first few disconnection. And I found this issue on SO, not sure if you also has the related configuration for your Proxy, this will cause the connection between agent and backend disconnected every minute, please take a look and also we have released v1.13.0 which fixed a potential deadlock that will cause acs disconnected forever, could you also try the new version and see if the problem still exists. thanks.

Syntax: proxy_read_timeout time;
Default: proxy_read_timeout 60s;
Context: http, server, location

richardpen on 5 Oct 2016

@josqu4red @gazmania @aaithal have any of you tried out v1.13.0 in regards to this issue? Very curious to know the results if so

suan on 2 Nov 2016

@suan fyi... I have been using v1.13.0 since it came out and have not seen any agent disconnects yet. And I believe we didn't see the issue with agent v1.12.x either. We'll keep our fingers crossed!

jamiekurtz on 2 Nov 2016

I am using v1.13.1 and was hitting continuous disconnects (every minute) using a proxy with ELB and standard 60 second idle timeout, and this likely was the cause of a deployment that took 30 minutes to complete.

Increasing the idle timeout on the ELB to 120 seconds seems to have resolved the disconnects, early days yet though and will monitor and advise if the problem re-occurs.

mixja on 1 Dec 2016

I am using v1.13.1 and was hitting continuous disconnects (every minute) using a proxy with ELB and standard 60 second idle timeout, and this likely was the cause of a deployment that took 30 minutes to complete.

Increasing the idle timeout on the ELB to 120 seconds seems to have resolved the disconnects, early days yet though and will monitor and advise if the problem re-occurs.

@mixja Thanks for sharing the details of your setup.

There are heartbeat messages sent by ECS when the connection is idle (the current setting is for 60s). You'll want to have an idle timeout on your load balancer/proxy longer than 60s. Please let us know if you continue to run into problems.

samuelkarp on 11 Dec 2016

Raising the ELB idle connection timeout to 120 seconds has definitely resolved the issue

mixja on 12 Dec 2016

@mixja Thanks for confirming, and it sounds like that fixed @josqu4red's problem as well.

@ziggythehamster @gazmania @jamiekurtz @suan @justjulian Are you still running into issues?

samuelkarp on 12 Dec 2016

The 1.12.1 build fixed the problem for me. I'm not using a proxy like the others are.

ziggythehamster on 12 Dec 2016

We are still experiencing this issue. We are going to ask the proxy guys to raise the idle timeout above 60 seconds as suggested. However, I fear this might not be possible due to company restrictions. Is is planned to export the heartbeat interval as config setting?

justjulian on 12 Dec 2016

@justjulian The heartbeat is initiated by ECS rather than by the agent, so it's not configurable right now. Please let us know if you are unable to get your company to modify the idle timeout.

samuelkarp on 12 Dec 2016

@samuelkarp Thanks for asking... since upgrading to 1.12.1 (and now on 1.13.1) I have not observed the issue. I am not using a proxy. My ECS instances are in a private subnet in EC2, if that matters.

jamiekurtz on 13 Dec 2016

fwiw, we had this issue occur again, four times in the last 24 hours... all with agent version 1.31.1, and Docker version 1.12.6. (though I'm not sure if all four failures were on that version of Docker - it seemed to go from 1.11.2 to 1.12.6 somewhere in the midst of those failures). Each time I had to deregister the cluster instance and then terminate it in EC2.

I would sure like to know what's going on, and how to fix it.

As I mentioned in an earlier post, all of my cluster instances are in a private subnet, behind an EC2 NAT Gateway, they are all c4.xlarge, and the failures seem to not prefer any one AZ - i.e. happening in both the AZs were using.

jamiekurtz on 12 Jan 2017

yet another follow-up... one our engineers noticed that the ECS instance most recently losing agent connectivity was also suffering from a docker container that was consuming all available memory (i.e. a bug somewhere in our code). Without a hard memory limit on that particular container/task, it seems to be running the entire instance of memory, at which time the agent disconnects. Eventually the task restarts, but the agent does NOT reconnect. At least that's what it looks like. We're continuing to investigate.

jamiekurtz on 12 Jan 2017

👍1

@jamiekurtz Thanks for the information. Were you able to find out more about what was happening when your task was consuming all available memory?

samuelkarp on 1 Feb 2017

@samuelkarp No, we haven't looked into it much further. We ended up setting hard memory limits on the problematic containers (i.e. that are running code with memory issues). So now at least the container instance and its agent remain online - because the container just get killed/restarted.

jamiekurtz on 3 Feb 2017

@jamiekurtz Thanks for letting us know. Since this isn't causing you problems anymore, I'm going to close this issue.

If anyone else is experiencing problems, please open a new issue. This issue has multiple separate problems; it's easier for us to track each problem separately.

samuelkarp on 12 Feb 2017

👍1

Was this page helpful?

0 / 5 - 0 ratings