Amazon-ecs-agent: v1.26.0 broke the ability to use task endpoint when running agent on bridge network

Created on 8 Mar 2019 · 21 Comments · Source: aws/amazon-ecs-agent

Summary

It is no longer possible to connect to the task endpoint in v1.26.0 when running the amazon/amazon-ecs-agent image on a bridge network.

Description

Now, we know that the instructions have stated for a few years that the way to run the Agent is:

$ docker run --name ecs-agent \
    --detach=true \
    --restart=on-failure:10 \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --volume=/var/log/ecs:/log \
    --volume=/var/lib/ecs/data:/data \
    --net=host \
    --env-file=/etc/ecs/ecs.config \
    --env=ECS_LOGFILE=/log/ecs-agent.log \
    --env=ECS_DATADIR=/data/ \
    --env=ECS_ENABLE_TASK_IAM_ROLE=true \
    --env=ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true \
    amazon/amazon-ecs-agent:latest

yet we decided to disregard --net=host and instead we followed our Docker instincts and did:

$ docker run \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --publish=127.0.0.1:51679:51679 \
    ... \
    amazon/amazon-ecs-agent:latest

That's because we are pro Docker users and we know better than to trust the manual, don't we? :)

This worked without problems up to and including v1.25.3. Unfortunately in v1.26.0 there's a commit:

a3d87cb45ad6ae62191e563336ea70ee1edb7091 Change task server endpoint to 127.0.0.1

Because the task endpoint is now bound to 127.0.0.1, and on a bridge network connections from the host look external from the container's perspective (they arrive via the gateway), it stopped responding.
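
One way to check this kind of explanation is to look at which address the endpoint listens on inside the container's network namespace. The following is a minimal sketch, not part of the original report: it assumes a Linux host with `docker`, `nsenter` and `ss` available, root privileges, and a container named `ecs-agent`. The function names are hypothetical.

```shell
# Classify a listening address as printed in the "Local Address:Port"
# column of `ss -ltn`. A loopback-only listener inside a bridge-network
# container cannot receive traffic forwarded from a published port,
# because that traffic arrives on the container's bridge interface.
classify_listener() {
    case "$1" in
        127.*:*|"[::1]":*)        echo "loopback-only" ;;
        0.0.0.0:*|"*":*|"[::]":*) echo "all-interfaces" ;;
        *)                        echo "specific-interface" ;;
    esac
}

# Print the listeners on port 51679 inside the agent container's
# network namespace. Assumes a container named "ecs-agent".
show_agent_listeners() {
    local pid
    pid="$(docker inspect --format '{{.State.Pid}}' ecs-agent)" || return 1
    nsenter --target "$pid" --net ss -ltn | awk '$4 ~ /:51679$/ { print $4 }'
}
```

Feeding each line printed by `show_agent_listeners` to `classify_listener` would show whether a given agent version is reachable only over loopback.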

Now still disregarding the README, I consider it more natural when I see EXPOSE in Dockerfile as in:

scripts/dockerfiles/Dockerfile.release:30

to be able to publish the ports with Docker's default network mode. So it would be nice if you could go back to the way it was before. And if you believe that host network mode is the only one that Agent is designed for then at least it would be great to stress it in documentation with some short explanation why.

Expected Behavior

$ docker run \
    --detach \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --publish=127.0.0.1:51679:51679  \
    amazon/amazon-ecs-agent:v1.26.0
...
$ curl http://localhost:51679
404 page not found

The 404 error can be ignored; it just shows that we are getting an actual reply. And this is exactly what we get with v1.25.3:

$ docker run \
    --detach \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --publish=127.0.0.1:51679:51679  \
    amazon/amazon-ecs-agent:v1.25.3
...
$ curl http://localhost:51679
404 page not found

Observed Behavior

$ docker run \
    --detach \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --publish=127.0.0.1:51679:51679  \
    amazon/amazon-ecs-agent:v1.26.0
...
$ curl http://localhost:51679
curl: (52) Empty reply from server
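
The two outcomes above can be told apart in a script by curl's exit code: 0 for any HTTP response (including the 404) and 52 for an empty reply. This is a small sketch under that assumption. The function name `probe_task_endpoint` is hypothetical, and the default URL is the one used in the report.

```shell
# Probe the task endpoint and report what kind of answer came back.
# Exit codes as documented in the curl man page:
#   0  = an HTTP response was received (any status, since --fail is not used)
#   7  = failed to connect to host
#   52 = the server returned nothing (empty reply)
probe_task_endpoint() {
    local url status rc
    url="${1:-http://localhost:51679}"
    status="$(curl --silent --output /dev/null \
        --write-out '%{http_code}' --max-time 5 "$url")"
    rc=$?
    case "$rc" in
        0)  echo "reachable (HTTP $status)" ;;
        7)  echo "connection refused" ;;
        52) echo "empty reply" ;;
        *)  echo "curl failed with exit code $rc" ;;
    esac
}
```

Calling `probe_task_endpoint` with no arguments probes `http://localhost:51679`.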

Labels: feature request, more info needed

lkslawek

👍2

Most helpful comment

@lkslawek: We realise that the change you referenced broke the task endpoint under bridge network mode. Though, as you called out, starting the agent without host networking is a departure from our documentation, and I feel this is a good opportunity to hear about use cases that would require the agent be started with bridge networking mode.

> And if you believe that host network mode is the only one that Agent is designed for then at least it would be great to stress it in documentation with some short explanation why.

Agreed. We'll update our documentation with more details regarding why we need the agent started in host mode.

We document that the agent should be started in host mode to block access to the instance metadata endpoint for containers managed by the ECS Agent. This ensures that containers can't access role credentials from the instance profile and enforces that tasks use only the task role credentials.

But as I was saying - this is a good opportunity to explore the use cases for starting the agent in bridge networking mode.

adnxn on 8 Mar 2019

👍4

All 21 comments

I also experienced this issue in 1.26.0, and not in 1.25.3. Thank you @lkslawek for opening such a detailed issue.

eric-johnson on 8 Mar 2019

> I feel this is a good opportunity to hear about use cases that would require the agent be started with bridge networking mode.

I don't have any that would strictly require it in my case. But as the default network mode in Docker, it is a tempting one to use by... default. In general I also like having control over which ports are published by a container. That is pretty much how we ended up using it, and it ended a bit badly for us with the sudden change in v1.26.0. So for me the question is rather: why not run it in bridge mode?

> We document that the agent should be started in host mode to block access to the instance metadata endpoint for containers managed by the ECS Agent. This ensures that containers can't access role credentials from the instance profile and enforces that tasks use only the task role credentials.

Do you mean retrieving, for example, http://169.254.169.254/latest/meta-data/iam/security-credentials/ROLE-NAME? So running the Agent in host mode is for the case where someone follows the iptables recommendation from the IAM Roles for Tasks documentation and blocks forwarding to this endpoint, so that the ecs-agent container itself can still access it (because in that case its traffic won't go out via the FORWARD chain, so it won't be blocked)? In that case I think it would be very good to have such a detailed explanation in documentation, because it took me a moment to understand how this relates.

As a side note I've checked that this isn't set up by default in ECS Optimized AMI 2018.03.o (the latest one, with ecs-agent v1.26.0), so by default containers have this access. This makes me wonder then that if this isn't even a standard behaviour wouldn't it be better to instruct (with explanation) to use host mode for case with access blocked and still allow bridge mode to be operational.

By the way the instructions on Docker Hub are not up to date, i.e. there's no host mode there, so maybe you could limit them to sole link to the maintained documentation?

lkslawek on 11 Mar 2019

👍2
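
The iptables recommendation discussed in the comment above is, to the best of our knowledge, a single FORWARD rule that drops container traffic to the instance metadata address. The sketch below prints the command by default instead of running it. The `run` helper and the `DRY_RUN` variable are hypothetical, and the exact rule should be checked against the current IAM Roles for Tasks documentation before use.

```shell
# Run a command, or only print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drop traffic that containers on docker bridges forward to the instance
# metadata service. Locally generated traffic from host-network processes
# (such as an agent started with --net=host) traverses the OUTPUT chain,
# not FORWARD, so this rule does not affect it.
block_imds_for_bridge_containers() {
    run iptables --insert FORWARD 1 \
        --in-interface docker+ \
        --destination 169.254.169.254/32 \
        --jump DROP
}
```

Running `DRY_RUN=0 block_imds_for_bridge_containers` as root would apply the rule; with the default it is only echoed for review.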

> I don't have any that would strictly require it in my case

gotcha. so switching over to host would unblock your use case?

> In that case I think it would be very good to have such a detailed explanation in documentation, because it took me a moment to understand how this relates.

> This makes me wonder then that if this isn't even a standard behaviour wouldn't it be better to instruct (with explanation) to use host mode for case with access blocked

agreed. we'll be updating the aws docs for this specific set of operating requirements.

> By the way the instructions on Docker Hub are not up to date, i.e. there's no host mode there, so maybe you could limit them to sole link to the maintained documentation?

and thanks for this, good catch for documentation that we've overlooked.

adnxn on 12 Mar 2019

> > I don't have any that would strictly require it in my case
>
> gotcha. so switching over to host would unblock your use case?

Yes, in my case this is enough.

lkslawek on 13 Mar 2019

👍1

For the record, I think this bit us as well. We use bridge mode to send data to a Datadog process running on our ECS instances. It was a very odd situation where none of our ECS service IAM roles worked, so they couldn't access resources like Kinesis, etc.

hylaride on 28 Mar 2019

@adnxn FYI, my above comment as another use case. We use bridge mode networking so that our containers can access Datadog and Hashicorp's Consul (which listen on dummy interfaces on the ecs instance). Upgrading our ecs-agent to 1.26.x broke our IAM rules set on the containers.

hylaride on 29 Mar 2019

Oh! Also, this can break a lot of people that use sidecars for various tasks. They're a very common feature in many docker architectures.

hylaride on 29 Mar 2019

@adnxn Just wondering if there are any plans to change the behavior on bridge mode back to how it was?

hylaride on 16 Apr 2019

> @adnxn Just wondering if there are any plans to change the behavior on bridge mode back to how it was?

we have no immediate plans to switch it back, but we are looking at use cases that would be unblocked by maybe making this a configurable option.

adnxn on 18 Apr 2019

@adnxn Can you clarify something? You noted in a comment above that "starting the agent without host networking is a departure from our documentation"; however, on this page, the description for the config variable ECS_ENABLE_TASK_IAM_ROLE is "Enables IAM roles for tasks for containers with the bridge and default network modes". We followed this documentation when rolling out ECS.

hylaride on 7 May 2019

Any comment here? This is kind of frustrating. Support for bridge mode is documented and now broken.

hylaride on 6 Jun 2019

> Any comment here? This is kind of frustrating. Support for bridge mode is documented and now broken.

@hylaride hey just catching up with this.

so the bit you referenced is a typo and thank you for catching that. i realise this isn't a satisfying answer. but that part of the doc should reflect the README's description of ECS_ENABLE_TASK_IAM_ROLE, which is "Whether to enable IAM Roles for Tasks on the Container Instance". we'll be updating the docs shortly.

however to reiterate my earlier point -

> we have no immediate plans to switch it back, but we are looking at use cases that would be unblocked by maybe making this a configurable option.

we're still looking at use cases that would require this network mode. we've tagged this as a feature request and will update the issue accordingly.

adnxn on 6 Jun 2019

@adnxn OK. :-(

I still do take issue that this _was a documented feature_ that was removed without any reference in the CHANGELOG.

Bridge mode networking, as I mentioned earlier, allows use of docker sidecars, which can be heavily used in a lot of microservice environments. Also, we run Datadog and Consul agents on the host systems for stats and service discovery. I believe that the option should be present.

hylaride on 7 Jun 2019

@hylaride could I get a quick clarification?

We didn’t remove the ability to run tasks with bridge networking + IAM roles. We only changed the network mode the agent itself is designed to run in. If you can’t access IAM roles from a bridge network task, that’s a different problem that we can absolutely troubleshoot and fix.

Which version of ecs-init is running on your instances?

petderek on 7 Jun 2019

@petderek OH! Interesting.

What happened is we did a rolling update to 1.26.0 and IAM roles for our tasks broke, so our containers couldn't access AWS services like S3, etc.

We're using a custom ECS host based off of ubuntu (maybe there's an issue there). We create an image with the ecs-agent (currently running 1.25.3) and various internal support software installed (like Datadog, consul etc). My understanding is that ecs-init is part of the Amazon Linux ecosystem.

FYI, here's our ECS config

ECS_CLUSTER=REMOVED
ECS_LOGLEVEL=info
ECS_LOGFILE=/log/ecs-agent.log
ECS_DATADIR=/data
ECS_APPARMOR_CAPABLE=true
ECS_RESERVED_MEMORY=512
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","syslog"]
ECS_IMAGE_PULL_BEHAVIOR=always
ECS_ENGINE_AUTH_TYPE=dockercfg
ECS_ENGINE_AUTH_DATA={"https://REMOVED":{"auth":"REMOVED=","email":"ecs-agent@REMOVED"},"REMOVED":{"auth":"REMOVED =","email":"ecs-agent@REMOVED"}}
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true

hylaride on 7 Jun 2019
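
For context on why ecs-init matters here: on hosts without it, the agent README (as we recall it) asks for the redirect from the task credentials address 169.254.170.2:80 to the agent's port 51679 to be set up by hand. The sketch below prints those commands by default. The `run` helper and `DRY_RUN` variable are hypothetical, and the rules should be compared with the current README before being applied. Because the redirect targets 127.0.0.1:51679 on the host, a bridge-mode agent is only reachable through its published port, which would explain task IAM roles failing once that port stops answering.

```shell
# Run a command, or only print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Redirect task credential requests (169.254.170.2:80) to the agent's
# credentials endpoint on the host's loopback address, port 51679.
setup_task_credentials_redirect() {
    # Allow packets to be routed to 127.0.0.0/8 after DNAT.
    run sysctl -w net.ipv4.conf.all.route_localnet=1
    # Requests coming from containers (bridge traffic enters via PREROUTING).
    run iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 \
        -j DNAT --to-destination 127.0.0.1:51679
    # Requests generated on the host itself (host-network tasks).
    run iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 \
        -j REDIRECT --to-ports 51679
}
```

With the default `DRY_RUN=1`, calling `setup_task_credentials_redirect` only lists the three commands so they can be reviewed.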

@petderek Also, here's how we're starting the ECS Agent in our systemd file.

ExecStart=/usr/bin/docker run \
    --name ecs-agent \
    --log-driver json-file \
    --env-file=/etc/ecs/ecs.config \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --volume=/var/lib/ecs-data:/data \
    --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
    --volume=/run/runc:/var/lib/docker/execdriver/native:ro \
    --publish=127.0.0.1:51678:51678 \
    --publish=127.0.0.1:51679:51679 \
    ${ECS_AGENT_DOCKER_IMAGE}:${ECS_AGENT_DOCKER_TAG}

hylaride on 7 Jun 2019

Try running agent with ‘--net=host’ in that systemd unit. Let us know if that works.

petderek on 7 Jun 2019
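
A sketch of what the suggested change could look like, based on the systemd unit posted above: `--net=host` is added and the two `--publish` options are dropped, since Docker discards published ports in host network mode. The function name and the default image and tag values are assumptions. The command is printed rather than executed so it can be reviewed before being copied into `ExecStart=`.

```shell
# Hypothetical defaults; the original unit takes these from its environment.
ECS_AGENT_DOCKER_IMAGE="${ECS_AGENT_DOCKER_IMAGE:-amazon/amazon-ecs-agent}"
ECS_AGENT_DOCKER_TAG="${ECS_AGENT_DOCKER_TAG:-latest}"

# Print the docker command line for running the agent on the host network.
agent_run_command() {
    printf '%s ' /usr/bin/docker run \
        --name ecs-agent \
        --log-driver json-file \
        --net=host \
        --env-file=/etc/ecs/ecs.config \
        --volume=/var/run/docker.sock:/var/run/docker.sock \
        --volume=/var/lib/ecs-data:/data \
        --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
        --volume=/run/runc:/var/lib/docker/execdriver/native:ro \
        "${ECS_AGENT_DOCKER_IMAGE}:${ECS_AGENT_DOCKER_TAG}"
    printf '\n'
}
```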

@petderek I'll try that and let you know. It'll take about 2-3 hours to build the new AMI and roll it out to our development environment.

hylaride on 7 Jun 2019

@petderek It looks like it's working! THANK YOU. Re-reading this thread, I realize that I misunderstood the original problem and see what you guys actually did (which makes sense).

Apologies to @adnxn for my wrong comments (RE undocumented changes) above. In my defense I was staring down having to move over a ton of containers someplace else. We quite like ECS for its overall simplicity! :-)

hylaride on 7 Jun 2019

🎉1

Closing the issue, as the customer's question has been resolved.

yumex93 on 16 Aug 2019
