Tell us about your request
A mechanism to surface errors at the docker daemon level.
Which service(s) is this request for?
ECS
Hello everyone, we have a rough high-level plan for what we are going to do for this feature, described below:
Background/Introduction
ECS Container Instances can become unsuitable to host new tasks for a variety of reasons, some the fault of AWS and others the fault of customers. From the ECS agent’s perspective, this usually manifests as unexpected errors from the docker API, such as when trying to inspect or list containers. It is often the result of IO throttling on the instance, but that is not necessarily the only cause.
When the ECS agent encounters an issue with docker, there is currently no way to surface it to our customers. From the customer’s perspective, the agent is still connected and the ECS instance state is ACTIVE, so they expect that the overall system is healthy. From the ECS control plane’s (CP’s) perspective, the instance is connected, so new tasks can and will be scheduled to run on an instance where docker is unresponsive.
In order to alert customers and the ECS task placement engine to docker problems on a host, we will add a new object called ContainerInstanceHealth to the ContainerInstance object. For now, the ContainerInstanceHealth object will have only one field, “containerRuntimeStatus”, with three options: HEALTHY, UNHEALTHY, and UNKNOWN. This state will be determined by ECS based on data sent to it from the ECS agent. To be clear, this feature will not include ECS taking actions to attempt to remedy the health, such as killing or replacing tasks, or killing instances. It will, however, include ECS changing its task placement behavior to ignore unhealthy instances when placing new tasks. It will also allow customers to determine for themselves if they can tolerate an unhealthy instance in their cluster, or if they want to terminate the instance.
Customers will be able to determine the health of their instance by using the DescribeContainerInstances API.
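As a rough illustration of the placement change described above, here is a minimal Python sketch of filtering out unhealthy instances. It is not the ECS placement engine: the function names are hypothetical, the `health.containerRuntimeStatus` field name follows the draft model in this comment (which may change), and treating UNKNOWN as still eligible is an assumption, since the plan only mentions ignoring unhealthy instances.

```python
from typing import Dict, Iterable, List

# Status values taken from the three options proposed in this comment.
HEALTHY, UNHEALTHY, UNKNOWN = "HEALTHY", "UNHEALTHY", "UNKNOWN"


def runtime_status(instance: Dict) -> str:
    """Return the proposed containerRuntimeStatus, or UNKNOWN if it is absent."""
    return instance.get("health", {}).get("containerRuntimeStatus", UNKNOWN)


def placement_candidates(instances: Iterable[Dict]) -> List[Dict]:
    """Drop instances whose container runtime is reported UNHEALTHY.

    Assumption: UNKNOWN instances remain eligible for placement.
    """
    return [i for i in instances if runtime_status(i) != UNHEALTHY]
```

With this assumption, an instance that has not reported any health data yet is treated the same way as it is today, and only an explicit UNHEALTHY report removes it from consideration.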
Statement of Scope
In Scope:
Out of Scope:
Model changes
NOTE: Please keep in mind that the model changes described here are in a rough state and are likely to change in their exact format.
We will create a new ContainerInstanceHealth object:
{
  "containerRuntimeStatus": "HEALTHY" | "UNHEALTHY" | "UNKNOWN"
}
The ContainerInstance object will be updated to include the ContainerInstanceHealth object.
{
  "agentConnected": true,
+ "health": {
+   "containerRuntimeStatus": "HEALTHY"
+ },
  "attributes": [ ... ],
  "clusterArn": "arn:aws:ecs:us-east-1:111122223333:cluster/default",
  "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/b54a2a04-046f-4331-9d74-3f6d7f6ca315",
  "ec2InstanceId": "i-f3a8506b",
  "registeredResources": [ ... ],
  "remainingResources": [ ... ],
  "status": "ACTIVE",
  "version": 14801,
  "versionInfo": { ... },
  "updatedAt": "2016-12-06T16:41:06.991Z"
}
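To make the DescribeContainerInstances use case concrete, here is a small Python sketch of how a customer might read the proposed field from a response. It assumes the draft `health.containerRuntimeStatus` shape shown above, which is likely to change. The `group_by_runtime_status` helper is hypothetical, and the second ARN in the sample is a made-up placeholder. In practice the response dict could come from an SDK call such as boto3's `describe_container_instances`, but the sketch works on a plain dict so it runs without AWS access.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_runtime_status(response: Dict) -> Dict[str, List[str]]:
    """Map each proposed status value to the container instance ARNs reporting it."""
    groups = defaultdict(list)
    for instance in response.get("containerInstances", []):
        # Instances without the proposed "health" object are counted as UNKNOWN.
        status = instance.get("health", {}).get("containerRuntimeStatus", "UNKNOWN")
        groups[status].append(instance["containerInstanceArn"])
    return dict(groups)


# Trimmed-down stand-in for a DescribeContainerInstances response.
sample_response = {
    "containerInstances": [
        {
            "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/b54a2a04-046f-4331-9d74-3f6d7f6ca315",
            "health": {"containerRuntimeStatus": "HEALTHY"},
        },
        {
            # Placeholder ARN, for illustration only.
            "containerInstanceArn": "arn:aws:ecs:us-east-1:111122223333:container-instance/example-unhealthy",
            "health": {"containerRuntimeStatus": "UNHEALTHY"},
        },
    ],
    "failures": [],
}

unhealthy_arns = group_by_runtime_status(sample_response).get("UNHEALTHY", [])
print(unhealthy_arns)
```

A customer who cannot tolerate unhealthy instances could feed the resulting ARNs into their own drain or terminate workflow, since ECS itself will not take remedial action under this plan.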
@sparrc this seems like a great first step. Quick question about:
In order to alert customers and the ECS task placement engine to docker problems on a host, we will add a new object called ContainerInstanceHealth to the ContainerInstance object.
In the current scope, will we be able to list container instances that are failing this health check? For example (describe call also works):
aws ecs list-container-instances --container-runtime-status UNHEALTHY