Amazon-ecs-agent: ECS agent fails to degrade gracefully if Cloudwatch Logs down

Created on 17 Nov 2017 · 14 comments · Source: aws/amazon-ecs-agent

Summary

During a CloudWatch Logs outage in us-east-1 on Nov 16th, we found that our ECS cluster was unable to launch any containers because the logging driver could not be initialized.

Description

If the CloudWatch Logs API is unavailable due to an outage, it is impossible to launch ECS containers that use the awslogs driver, because the logging driver cannot be initialized when the container starts.

Expected Behaviour

If the logging service is non-functional, the container should degrade gracefully and launch without logging, in order to ensure services are still delivered to users.

Observed Behaviour

All containers on ECS failed to launch, resulting in production outages that impacted a large portion of our customer base.

Environment Details

ECS agent version 1.14.5

Supporting Log Snippets

The following error is from the ECS agent:

2017-11-16T13:51:01Z [DEBUG] Container change event passed on module="TaskEngine" event="{TaskArn:arn:aws:ecs:us-east-1:123456:task/abcdef123-abc1-abc1-abc1-abcdef123 ContainerName:myfirstcloudservice Status:STOPPED Reason:CannotStartContainerError: API error (500): Failed to initialize logging driver: RequestError: send request failed
caused by: Post https://logs.us-east-1.amazonaws.com/: read tcp 192.168.20.97:57288->54.239.25.60:443: read: connection reset by peer

kind/enhancement · scope/ECS Agent · scope/ECS Service

jcarr-sailthru

👍9 😕4

Most helpful comment

"If the logging service is non-functional, the container should degrade gracefully and launch without logging, in order to ensure services are still delivered to users."

Agreed. However, I'm not sure that the ECS Agent should be the entity that makes these decisions. It would mean that the ECS Agent gets to override options set in the task definition, which can be problematic, especially since there are no good solutions for surfacing these “in-flight” changes. Some application developers might not want their containers to run if they cannot get application logs, as these might be business critical. There are also use cases where the absence of logs, or logs being sent to the wrong destination (like the disk), could have security implications.

It seems more appealing instead to let the log driver handle these failures in whatever way it sees fit (as these failure modes can also vary by the choice of log driver). --log-opt mode=non-blocking seems like an acceptable solution here, as it gives application developers control over how to deal with these failures. We still have work to do on our end, as this log option needs to be supported in the task definition (we've already started looking into this).
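To illustrate what non-blocking delivery means for the application, here is a minimal Python sketch of a bounded buffer that never blocks the writer. It is a toy model, not Docker's implementation: the class name, the byte-based size limit, and the choice to drop the newest message when the buffer is full are all assumptions made for illustration.

```python
from collections import deque


class NonBlockingLogBuffer:
    """Toy model of a bounded, non-blocking log buffer (not Docker's code)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.size_bytes = 0
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, message):
        """Queue a message without ever blocking the caller.

        Returns True if the message was queued, False if it was dropped
        because the buffer is full (an assumed drop policy).
        """
        size = len(message)
        if self.size_bytes + size > self.max_bytes:
            self.dropped += 1
            return False
        self.queue.append(message)
        self.size_bytes += size
        return True

    def drain(self, send):
        """Forward queued messages with `send`; stop at the first failure."""
        sent = 0
        while self.queue:
            message = self.queue[0]
            try:
                send(message)
            except ConnectionError:
                break  # backend still down; keep the rest buffered
            self.queue.popleft()
            self.size_bytes -= len(message)
            sent += 1
        return sent
```

The trade-off is visible in `dropped`: the application keeps running during an outage, but anything that does not fit in the buffer is lost.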

To the metapoint about ECS Agent choosing to override options, it might make more sense for us to support something like this field in the task definition: use-none-logdriver-on-initialization-failure=true, where we provide fallbacks for use-cases where it's still fine for your container to start even when it cannot send logs to the desired destination.
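The opt-in fallback proposed here could look roughly like the following Python sketch. Everything in it is hypothetical: `use-none-logdriver-on-initialization-failure` is only the field name floated in this comment, not an existing ECS option, and `choose_log_driver` and `init_driver` are illustrative names rather than agent code.

```python
class LogDriverInitError(Exception):
    """Raised when a log driver cannot be initialized."""


def choose_log_driver(log_config, init_driver):
    """Return the name of the log driver the container should start with.

    `log_config` mirrors a task definition's logConfiguration. The option
    "use-none-logdriver-on-initialization-failure" is the hypothetical field
    proposed in this thread; it is not an existing ECS option.
    `init_driver(name, options)` is a caller-supplied function that raises
    LogDriverInitError when the driver cannot be set up.
    """
    options = dict(log_config.get("options", {}))
    fallback = options.pop("use-none-logdriver-on-initialization-failure", "false")
    driver = log_config["logDriver"]
    try:
        init_driver(driver, options)
        return driver
    except LogDriverInitError:
        if fallback == "true":
            return "none"  # opted in: start the container without logging
        raise  # default: keep the current behaviour and refuse to start
```

Because the fallback is read from the task definition itself, the agent would not be overriding the user's choice; a task that must never run without logs simply leaves the field unset.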

Please let us know if we've captured your concerns here and your thoughts on the same.

aaithal on 23 Nov 2017

👍3

All 14 comments

You can handle this situation with the option --log-opt mode=non-blocking.
Note that logs might be lost during a CloudWatch outage.
See moby issue #33803.

kssaril on 19 Nov 2017

"If the logging service is non-functional, the container should degrade gracefully and launch without logging, in order to ensure services are still delivered to users."

@jcarr-sailthru Thanks for bringing this to our attention and you're right, graceful degradation is a reasonable action to take in this situation.

I'm tagging this as feature work to add to our roadmap. I suspect the changes required for this would only be in the agent code, and wouldn't need any additional work on the service side. So in the meantime, proposals for a path forward (such as defining graceful degradation in this context) or code contributions to the repo are welcome. =]

adnxn on 20 Nov 2017

thanks @aaithal , @adnxn , @kssaril

I think this is a very fair point regarding logs and their role in security. As per your suggestion, offering a user-selectable option such as use-none-logdriver-on-initialization-failure=true seems sensible. It allows both user needs to be met by a single host, as opposed to having a host enforce one approach or the other but not permit both.

To confirm the suggestion regarding --log-opt mode=non-blocking: this isn't something usable right now unless ECS adds support for passing this argument to the container at launch, right? Or is there a way I can set it for all containers at the ECS host level right now?

regards,
Jethro

jcarr-sailthru on 24 Nov 2017

Ran some tests today to answer my question.

I had to upgrade the AMI to amzn-ami-2017.09.c-amazon-ecs-optimized to get a Docker version new enough to support the --log-opt mode=non-blocking argument (ECS Agent 1.15.2, Docker version 17.06.2-ce).

I was able to add the non-blocking argument to the options line in /etc/sysconfig/docker and restart the Docker daemon. By setting up an iptables rule to DROP connections to logs.us-east-1.amazonaws.com, I was able to simulate a failure of the CloudWatch Logs service.
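The exact iptables rule is not shown in this comment. As a rough sketch of how such a simulated outage might be set up, the Python helper below builds iptables argument lists that drop outbound HTTPS traffic to a set of endpoint addresses. The function name and the choice of the OUTPUT chain are assumptions, and the addresses in the usage note are simply the ones that appear in the logs in this thread; the endpoint's real addresses change over time.

```python
def drop_rules(ip_addresses, port=443):
    """Build iptables argument lists that DROP outbound TCP traffic to each IP.

    This only constructs the commands; running them requires root, for
    example via subprocess.run(args, check=True) on the container instance.
    """
    return [
        ["iptables", "-A", "OUTPUT", "-p", "tcp", "-d", ip,
         "--dport", str(port), "-j", "DROP"]
        for ip in ip_addresses
    ]


if __name__ == "__main__":
    # Addresses taken from the log lines in this thread (assumed, may be stale).
    for args in drop_rules(["54.239.25.60", "54.239.25.71"]):
        print(" ".join(args))
```

Replacing `-A` with `-D` in each argument list removes the corresponding rule again, which ends the simulated outage.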

During this simulated failure, existing containers were not blocked and continued to run as expected. Once the log service recovered, they uploaded their buffered logs.

Unfortunately, I was still unable to launch new containers due to the log service being unavailable:

2017-11-27T18:23:03Z [INFO] TaskHandler: batching container event: arn:aws:ecs:us-east-1:12345678:task/37f49cab-2f09-e3e2-8c6f-79905003202d myfirstcloudservice -> STOPPED, Reason CannotStartContainerError: API error (500): failed to initialize logging driver: RequestError: send request failed
caused by: Post https://logs.us-east-1.amazonaws.com/: dial tcp 54.239.25.71:443: i/o timeout

I suspect there's some initial network call that does not respect the non-blocking mode argument, probably something like the API call to create the log stream for the task.
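The suspicion above can be made concrete with a small Python model. It contrasts a driver that creates its log stream during initialization with one that defers stream creation until logs are flushed. This is only an illustration of the suspected behaviour, not the awslogs driver's code; all class and function names are hypothetical.

```python
class LogsUnavailable(Exception):
    """Stands in for a failed call to the logging backend."""


class EagerDriver:
    """Creates the log stream during initialization, so an outage blocks startup."""

    def __init__(self, create_stream):
        create_stream()  # raises LogsUnavailable during an outage


class DeferredDriver:
    """Postpones stream creation until logs are flushed, buffering in the meantime."""

    def __init__(self, create_stream):
        self._create_stream = create_stream
        self._stream_ready = False
        self.pending = []

    def log(self, message):
        self.pending.append(message)  # never touches the network

    def flush(self, send):
        """Try to create the stream and send buffered messages; return count sent."""
        try:
            if not self._stream_ready:
                self._create_stream()
                self._stream_ready = True
        except LogsUnavailable:
            return 0  # still down; keep messages buffered
        sent = len(self.pending)
        for message in self.pending:
            send(message)
        self.pending.clear()
        return sent
```

With the eager variant, the container start fails at construction time no matter how the later writes are buffered, which matches the error seen in the test. The deferred variant starts successfully and catches up once the backend returns.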

jcarr-sailthru on 27 Nov 2017

👍2

@jcarr-sailthru Thanks for checking this for your use case and your feedback. We're tracking this internally and we will update this issue when we have a path forward.

adnxn on 27 Nov 2017

Any update?

jason-riddle on 5 Mar 2018

Submitted a PR to update the awslogs driver in Docker to respect the non-blocking mode.

IRCody on 8 Mar 2018

@jcarr-sailthru @jason-riddle, here's an update: @IRCody's pull request has been merged upstream in the Moby project. The next step is getting this change pulled into the docker package for Amazon Linux, and then we can release it in the ECS-optimized AMI.

samuelkarp on 1 May 2018

👍2

This is awesome news - thanks @IRCody !

jcarr-sailthru on 2 May 2018

@samuelkarp I've just noticed that ECS also fails to start tasks that use non-blocking log options with the fluentd logging driver when the fluentd endpoint is unavailable. Should I raise a new issue for this, or is it related closely enough that it needs fixing before this issue can be considered fixed?

I have the following logging driver config:

...
      "logConfiguration": {
        "logDriver": "fluentd",
        "options": {
          "mode": "non-blocking",
          "max-buffer-size": "4m",
          "tag": "unparsed.frontend.staging"
        }
      },
...

but the task refuses to start when fluentd isn't running yet:

2018-05-08T10:18:52Z [INFO] Task engine [arn:aws:ecs:eu-west-1:340242599322:task/21ec9330-8aef-4c2d-9695-f5be6c9ae415]: starting container: frontend
2018-05-08T10:18:52Z [INFO] Task engine [arn:aws:ecs:eu-west-1:340242599322:task/21ec9330-8aef-4c2d-9695-f5be6c9ae415]: error transitioning container [frontend] to [RUNNING]: API error (500): failed to initialize logging driver: dial tcp 127.0.0.1:24224: getsockopt: connection refused

2018-05-08T10:18:52Z [WARN] Managed task [arn:aws:ecs:eu-west-1:340242599322:task/21ec9330-8aef-4c2d-9695-f5be6c9ae415]: Error starting/provisioning container[frontend]; marking its desired status as STOPPED: API error (500): failed to initialize logging driver: dial tcp 127.0.0.1:24224: getsockopt: connection refused

For reference, I'm running fluentd as a makeshift daemon set: I schedule 1000 tasks across the cluster, each using host networking and binding to port 24224, so only one such task can run per ECS container instance. Unfortunately, ECS throttles the rescheduling of these tasks, so it can take a while before the log shipper task is scheduled onto new nodes, and other tasks take a while to become schedulable on those container instances.
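The "connection refused" errors above come from a plain TCP dial to 127.0.0.1:24224. As a hedged sketch, a check like the following Python helper could be used on an instance to see whether the local fluentd endpoint is accepting connections before placing tasks that depend on it. The function name is hypothetical, and this is a diagnostic aid, not a fix for the driver behaviour.

```python
import socket


def endpoint_accepts_connections(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 24224 is the port used by the fluentd tasks described in this comment.
    print(endpoint_accepts_connections("127.0.0.1", 24224))
```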

tomelliff on 8 May 2018

@tomelliff Please open a different issue; this issue is only about the awslogs driver. I'm not familiar with the fluentd driver's codebase, so I don't know whether the approach we took for the awslogs driver would be applicable to the fluentd driver.

samuelkarp on 8 May 2018

The fix for this is in the Docker 18.06 release.

IRCody on 19 Jul 2018