Containers-roadmap: [ECS] [Instance Health]: ECS Instance Health Status

Created on 19 Feb 2020 · 2 Comments · Source: aws/containers-roadmap

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
A mechanism to surface errors at the Docker daemon level.

Which service(s) is this request for?
ECS

ECS Proposed

Most helpful comment

Hello everyone, we have a rough high-level plan for what we are going to do for this feature, described below:

Background/Introduction

ECS Container Instances can become unsuitable to host new tasks for a variety of reasons, some the fault of AWS and others the fault of customers. From the ECS agent’s perspective, this usually manifests as unexpected errors from the Docker API, such as when trying to inspect or list containers. This is often the result of I/O throttling on the instance, but that is not the only possible cause.

When the ECS agent encounters an issue with Docker, there is currently no way to surface it to our customers. From the customer’s perspective, the agent is still connected and the ECS instance state is ACTIVE, so they expect that the overall system is healthy. From the ECS control plane’s perspective, the instance is connected, so new tasks can and will be scheduled onto an instance where Docker is unresponsive.

To alert customers and the ECS task placement engine to Docker problems on a host, we will add a new object called ContainerInstanceHealth to the ContainerInstance object. For now, the ContainerInstanceHealth object will have only one field, “containerRuntimeStatus”, with three possible values: HEALTHY, UNHEALTHY, and UNKNOWN. This state will be determined by ECS based on data sent to it from the ECS agent. To be clear, this feature will not include ECS taking actions to remedy the health of an instance, such as killing or replacing tasks, or killing instances. It will, however, include ECS changing its task placement behavior to ignore unhealthy instances when placing new tasks. It will also allow customers to decide for themselves whether they can tolerate an unhealthy instance in their cluster or want to terminate it.

Customers will be able to determine the health of their instance by using the DescribeContainerInstances API.
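As a rough illustration of how a customer might consume this, the sketch below scans a DescribeContainerInstances-style response for instances whose container runtime is reported as unhealthy. The `health` / `containerRuntimeStatus` fields follow the proposed (not final) model in this plan, the sample response is hand-written, and `runtime_status` and `unhealthy_instances` are hypothetical helper names. Treating a missing `health` block as UNKNOWN is an assumption, not something the plan specifies.

```python
from typing import Any, Dict, List


def runtime_status(instance: Dict[str, Any]) -> str:
    """Return the proposed containerRuntimeStatus, or UNKNOWN if absent (assumption)."""
    return instance.get("health", {}).get("containerRuntimeStatus", "UNKNOWN")


def unhealthy_instances(response: Dict[str, Any]) -> List[str]:
    """Collect the ARNs of container instances reported as UNHEALTHY."""
    return [
        ci["containerInstanceArn"]
        for ci in response.get("containerInstances", [])
        if runtime_status(ci) == "UNHEALTHY"
    ]


# Hand-written sample shaped like a DescribeContainerInstances response.
# The "health" block is the proposed addition and does not exist in the API today.
sample_response = {
    "containerInstances": [
        {
            "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/example-a",
            "health": {"containerRuntimeStatus": "HEALTHY"},
        },
        {
            "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/example-b",
            "health": {"containerRuntimeStatus": "UNHEALTHY"},
        },
        {
            # No health block reported: treated as UNKNOWN by runtime_status.
            "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/example-c",
        },
    ],
    "failures": [],
}

if __name__ == "__main__":
    for arn in unhealthy_instances(sample_response):
        print(arn)
```

In practice the response dictionary would come from an SDK or CLI call to DescribeContainerInstances; the filtering is done client-side here because the plan does not describe any server-side filter.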

Statement of Scope

In Scope:

  • Health indicators that are specific to ECS and the container runtime.
  • Changing the ContainerInstance object to include health.
  • Creating a new ContainerInstanceHealth object.
  • Adding logic to the agent for tracking Docker API errors.
  • Adding docs that provide recommendations and explanations around unhealthy instances (see the EC2 docs for reference).
  • Task Placement.
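To make the agent-side item above more concrete, here is a minimal sketch of one way Docker API errors could be tracked and turned into a status. Everything in it is an assumption for illustration: the sliding-window approach, the `DockerErrorTracker` and `classify` names, and the 50% threshold are invented, and the real agent is written in Go rather than Python. It keeps the split described in the plan, where the agent only reports data and the final state is determined separately.

```python
from collections import deque
from typing import Deque, Tuple


class DockerErrorTracker:
    """Hypothetical agent-side tracker over the most recent Docker API calls."""

    def __init__(self, window: int = 20) -> None:
        # Only the last `window` call outcomes are kept; older ones fall off.
        self._outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        """Record the outcome of one Docker API call (True means success)."""
        self._outcomes.append(ok)

    def snapshot(self) -> Tuple[int, int]:
        """Return (total_calls, failed_calls) for the current window."""
        total = len(self._outcomes)
        failed = sum(1 for ok in self._outcomes if not ok)
        return total, failed


def classify(total: int, failed: int, max_error_ratio: float = 0.5) -> str:
    """Hypothetical rule mapping reported counts to a status.

    No data yields UNKNOWN; an error ratio above the (invented) threshold
    yields UNHEALTHY; anything else is HEALTHY.
    """
    if total == 0:
        return "UNKNOWN"
    return "UNHEALTHY" if failed / total > max_error_ratio else "HEALTHY"


if __name__ == "__main__":
    tracker = DockerErrorTracker(window=4)
    for outcome in (True, False, False, False):
        tracker.record(outcome)
    print(classify(*tracker.snapshot()))
```

A windowed ratio is only one option; consecutive-failure counts or time-based windows would fit the same interface.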

Out of Scope:

  • EC2- or Linux-level health indicators.
  • Health indicators that can be monitored by the CloudWatch agent (disk usage, CPU usage, etc.).
  • Actions to remedy an unhealthy instance (e.g. restarting Docker, rebooting the container instance, draining the instance, killing tasks, etc.).

Model changes

NOTE: Please keep in mind that the model changes described here are in a rough state, and their exact format is likely to change.

We will create a new ContainerInstanceHealth object:

{
    "containerRuntimeStatus": "HEALTHY" | "UNHEALTHY" | "UNKNOWN"
}
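For readers who want to model this object in client code, the following is a small sketch of the proposed shape as a typed Python structure. The class and method names (`ContainerRuntimeStatus`, `ContainerInstanceHealth`, `from_dict`, `to_dict`) are illustrative only, and defaulting a missing field to UNKNOWN is an assumption; the model itself is still subject to change.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class ContainerRuntimeStatus(str, Enum):
    """The three values listed for the proposed containerRuntimeStatus field."""

    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class ContainerInstanceHealth:
    """Illustrative client-side mirror of the proposed health object."""

    container_runtime_status: ContainerRuntimeStatus = ContainerRuntimeStatus.UNKNOWN

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ContainerInstanceHealth":
        # A missing field falls back to UNKNOWN (assumption).
        # An unrecognized value raises ValueError from the Enum lookup.
        raw = data.get("containerRuntimeStatus", "UNKNOWN")
        return cls(ContainerRuntimeStatus(raw))

    def to_dict(self) -> Dict[str, str]:
        return {"containerRuntimeStatus": self.container_runtime_status.value}


if __name__ == "__main__":
    health = ContainerInstanceHealth.from_dict({"containerRuntimeStatus": "UNHEALTHY"})
    print(health.to_dict())
```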

The ContainerInstance object will be updated to include the ContainerInstanceHealth object.

{
    "agentConnected": true,
+    "health": {
+        "containerRuntimeStatus": "HEALTHY"
+    },
    "attributes": [ ... ],
    "clusterArn": "arn:aws:ecs:us-east-1:111122223333:cluster/default",
    "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/b54a2a04-046f-4331-9d74-3f6d7f6ca315",
    "ec2InstanceId": "i-f3a8506b",
    "registeredResources": [ ... ],
    "remainingResources": [ ... ],
    "status": "ACTIVE",
    "version": 14801,
    "versionInfo": { ... },
    "updatedAt": "2016-12-06T16:41:06.991Z"
}
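Given a ContainerInstance shaped like the one above, the placement change described earlier (ignoring unhealthy instances when placing new tasks) could look roughly like the filter below. This is a sketch, not the actual placement engine: `placement_candidates` and its `allow_unknown` switch are hypothetical, and the plan does not say how UNKNOWN instances are treated, so that choice is left to the caller here.

```python
from typing import Any, Dict, Iterable, List


def placement_candidates(
    instances: Iterable[Dict[str, Any]], allow_unknown: bool = True
) -> List[Dict[str, Any]]:
    """Return the instances that remain eligible for new task placement.

    An instance is skipped if it is not ACTIVE, if its agent is disconnected,
    or if its proposed containerRuntimeStatus is UNHEALTHY. Whether UNKNOWN
    is acceptable is controlled by `allow_unknown` (an assumption).
    """
    candidates = []
    for ci in instances:
        if ci.get("status") != "ACTIVE" or not ci.get("agentConnected", False):
            continue
        status = ci.get("health", {}).get("containerRuntimeStatus", "UNKNOWN")
        if status == "UNHEALTHY":
            continue
        if status == "UNKNOWN" and not allow_unknown:
            continue
        candidates.append(ci)
    return candidates


if __name__ == "__main__":
    fleet = [
        {"ec2InstanceId": "i-aaa", "status": "ACTIVE", "agentConnected": True,
         "health": {"containerRuntimeStatus": "HEALTHY"}},
        {"ec2InstanceId": "i-bbb", "status": "ACTIVE", "agentConnected": True,
         "health": {"containerRuntimeStatus": "UNHEALTHY"}},
    ]
    print([ci["ec2InstanceId"] for ci in placement_candidates(fleet)])
```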

All 2 comments

@sparrc this seems like a great first step. Quick question about:

In order to alert customers and the ECS task placement engine to docker problems on a host, we will add a new object called ContainerInstanceHealth to the ContainerInstance object.

In the current scope, will we be able to list container instances that are failing this health check? For example (describe call also works):

aws ecs list-container-instances --container-runtime-status UNHEALTHY