Amazon-ecs-agent: CPU Percent for containers set to 1 on Windows when CPU not defined in task

Created on 4 Dec 2017  ·  11 Comments  ·  Source: aws/amazon-ecs-agent

Summary

CPU Percent for containers set to 1 on Windows when CPU not defined in task

Description

Prior to version 1.16.0, a Windows task definition with CPU not set or set to 0 would receive a CpuPercent value of 0.

Now those values seem to be getting translated to a percentage of 1 which is causing existing tasks not to launch correctly on that version (fine on versions <=1.15.2).

Expected Behavior

If CPU is not set in the definition of a container or set to 0, then CpuPercent should not be set either.

Observed Behavior

When CPU is not set in the definition of a container or set to 0, CpuPercent is set to 1.

Environment Details

AMI: Windows_Server-2016-English-Full-ECS_Optimized-2017.11.24 (ami-0b60bc73)

user data for container instance:

<powershell>
[Environment]::SetEnvironmentVariable("ECS_DISABLE_IMAGE_CLEANUP", "true", "Machine")
[Environment]::SetEnvironmentVariable("ECS_IMAGE_MINIMUM_CLEANUP_AGE", "6h", "Machine")
[Environment]::SetEnvironmentVariable("ECS_ENABLE_TASK_CPU_MEM_LIMIT", "false", "Machine")
[Environment]::SetEnvironmentVariable("ECS_ENABLE_CONTAINER_METADATA", "true", "Machine")
Import-Module ECSTools
Initialize-ECSAgent -Cluster 'windows-nonprod-v3' -EnableTaskIAMRole -LogLevel debug
</powershell>

docker info:

Containers: 2
 Running: 0
 Paused: 0
 Stopped: 2
Images: 3
Server Version: 17.06.2-ee-5
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 14393 (14393.1794.amd64fre.rs1_release.171008-1615)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 1
Total Memory: 2GiB
Name: EC2AMAZ-N8320N0
ID: WED3:ICYW:F56U:OR3Y:HS7E:N42U:Y4CP:RTLR:QR7V:S23I:GPEN:CBT7
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Instance Metadata:

PS C:\Users\Administrator> Invoke-RestMethod http://localhost:51678/v1/metadata

Cluster            ContainerInstanceArn                                                                       Version
-------            --------------------                                                                       -------
windows-nonprod-v3 arn:aws:ecs:us-west-2:108110855944:container-instance/054fad8f-f648-4a06-b1fa-63b8342f189a Amazon ECS Agent - v1.16.0 (1ca656c)

Supporting Log Snippets

snippet of task definition:

        "containerDefinitions": [
            {
                "name": "app",
                "image": "108110855944.dkr.ecr.us-west-2.amazonaws.com/collectionsach/dev:2.1.0.4",
                "cpu": 0,
                "memory": 256,
                "portMappings": [],
                "essential": true,
                "environment": [
                    {
                        "name": "EXIT_DELAY",
                        "value": "60"
                    },
                    {
                        "name": "ENVIRONMENT",
                        "value": "dev"
                    }
                ],
                "mountPoints": [],
                "volumesFrom": [],
                "dnsServers": []
            }
        ]

snippet of log showing conversion for CPU shares:

2017-12-04T17:31:13Z [INFO] Creating container module="TaskEngine" task="CollectionsACH-dev:26 arn:aws:ecs:us-west-2:108110855944:task/ae7d60f0-1d51-449d-a8e4-83ade6cadce4, TaskStatus: (NONE->RUNNING) Containers: [app (PULLED->RUNNING),]" container="app(108110855944.dkr.ecr.us-west-2.amazonaws.com/collectionsach/dev:2.1.0.4) (PULLED->RUNNING)"
2017-12-04T17:31:13Z [DEBUG] Converting CPU shares to allowed minimum of 2 for task arn: [arn:aws:ecs:us-west-2:108110855944:task/ae7d60f0-1d51-449d-a8e4-83ade6cadce4] and cpu shares: 0

snippet of docker inspect for created container:

            "CpuShares": 0,
            "Memory": 268435456,
            "NanoCpus": 0,
...
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
...
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": -1,
...
            "CpuCount": 0,
            "CpuPercent": 1,

All 11 comments

@msheppard-nf The behavior changed in the latest agent version for Windows: the ECS Agent now converts CPU shares to CPU percent, because CPU shares don't work as expected with Docker for Windows; see this issue for a detailed explanation.
The ECS Agent calculates the CPU percent as CPUShares * 100 / TotalCPUResource, with a minimum of 1. For example, if the cpu is set to 1024 and the instance has 2 CPU cores, the CPU percent will be set to 50 (1024 * 100 / (2 * 1024)). If the cpu isn't set in the task definition or is too low, it will be set to the minimum of 1.
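As a rough illustration of the calculation described above (not the agent's actual code), the conversion could be sketched as follows. The names cpuPercent and sharesPerCore are hypothetical, and the sketch assumes integer arithmetic and at least one core.

```go
package main

import "fmt"

// sharesPerCore is the number of CPU shares counted per core (1024),
// as described in the comment above.
const sharesPerCore = 1024

// cpuPercent is a hypothetical helper, not taken from the agent source.
// It converts CPU shares to a CPU percent of the whole instance and
// clamps the result to a minimum of 1. It assumes cores >= 1.
func cpuPercent(cpuShares, cores int64) int64 {
	total := cores * sharesPerCore
	percent := cpuShares * 100 / total // integer division truncates
	if percent < 1 {
		return 1
	}
	return percent
}

func main() {
	fmt.Println(cpuPercent(1024, 2)) // 1024 shares on a 2-core instance: 50
	fmt.Println(cpuPercent(2, 1))    // very low shares on 1 core: clamped to 1
}
```

With these assumptions, any value that works out to less than 1% of the instance ends up at exactly 1%.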

Now those values seem to be getting translated to a percentage of 1 which is causing existing tasks not to launch correctly on that version (fine on versions <=1.15.2).

Besides the output of docker inspect showing the CPU percent, can you explain how this impacts your task at runtime? If you want to enforce a CPU resource limit, you may need to set the cpu field in the task definition based on the resources of the instance. TotalCPUResource is the total CPU shares available on the instance, which is 1024 * core_of_instance.

We may change the behavior back if CPU shares work as expected with Docker for Windows. The current workaround is to convert CPU shares to CPU percent so that the CPU limit is actually enforced on Windows. Let me know if you have more questions about this.

My concern is that a task with no value defined for CPU is being run with --cpu-percent=1 when I'd expect that to simply not be set at all if not defined.

I think what is happening is that this function in agent/api/task.go is converting the 0 value to 2:

// dockerCPUShares converts containerCPU shares if needed as per the logic stated below:
// Docker silently converts 0 to 1024 CPU shares, which is probably not what we
// want.  Instead, we convert 0 to 2 to be closer to expected behavior. The
// reason for 2 over 1 is that 1 is an invalid value (Linux's choice, not Docker's).
func (task *Task) dockerCPUShares(containerCPU uint) int64 {
    if containerCPU <= 1 {
        seelog.Debugf(
            "Converting CPU shares to allowed minimum of 2 for task arn: [%s] and cpu shares: %d",
            task.Arn, containerCPU)
        return 2
    }
    return int64(containerCPU)
}

Which then in turn gets converted to 1% CPU (2 / 1024 shares in this case).
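To make that chain concrete, here is a minimal standalone trace of what I believe is happening. clampShares mirrors the function quoted above without the logging; percentFromShares is a hypothetical stand-in for the Windows conversion, assuming it is shares * 100 / (cores * 1024) with a floor of 1.

```go
package main

import "fmt"

// clampShares mirrors the quoted dockerCPUShares logic, minus logging:
// values of 0 or 1 become 2.
func clampShares(containerCPU uint) int64 {
	if containerCPU <= 1 {
		return 2
	}
	return int64(containerCPU)
}

// percentFromShares is a hypothetical stand-in for the Windows conversion.
// It returns the raw integer percent and the value after the minimum of 1
// is applied. It assumes cores >= 1.
func percentFromShares(shares, cores int64) (raw, final int64) {
	raw = shares * 100 / (cores * 1024)
	final = raw
	if final < 1 {
		final = 1
	}
	return raw, final
}

func main() {
	shares := clampShares(0) // "cpu": 0 in the task definition
	raw, final := percentFromShares(shares, 1)
	// On this 1-CPU instance: shares=2 raw=0 final=1
	fmt.Printf("shares=%d raw=%d final=%d\n", shares, raw, final)
}
```

So the 0 never reaches the percent step as 0, which would explain why an unset CPU ends up as "CpuPercent": 1.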

It seems based on all of my testing that a container cannot use more than the cpu percent it was allocated, so when my containers are started at cpu percent 1 the processes in those containers hang and the task gets a docker timeout error after several minutes. I've verified that the same task with CPU: 256 is able to launch and run just fine, but it is not desirable for me to give all tasks hard limits.

Hey @msheppard-nf,

We have had some internal discussion about the best way to address this. Ideally, we would be able to use the same burst capabilities that Linux CPU shares offer us.

An alternative approach would be to let Windows CPU usage be unbounded if the “cpu” parameter isn't set or is zero. This departs from the current behavior in Linux, where 0 implies the lowest CPU setting available, but it will avoid the problem described in this issue. The downside to this approach is that you can now have unbounded tasks consuming CPU in your cluster alongside tasks with properly-bounded CPU, meaning that your properly-bounded tasks may get less effective CPU than you've specified in your task definition.

We could start working on the second approach to mitigate this sooner, and open a separate issue for making burst work properly. Would this solution work for you?

Hi @petderek That approach would definitely be preferable to the current behavior in 1.16.0

I'm pretty concerned about mixing unbounded and bounded workloads in the same cluster by default. I think that this is a recipe for confusing behavior and robbing properly-bounded tasks of their intended resources. If we do enable something like this in the short term, I'd prefer that it be opt-in with an environment variable and clearly documented as to the impact.

My preference here is to get burst working properly. For some context as to why we moved away from burst: our experiments were not showing CPU resources being allocated the way we expected under contention with specified weights (CPU "shares"). We'll need to ensure that we're setting the parameters appropriately in order to get the desired weight/priority behavior, and build a test that verifies this going forward.

If we can't get burst working properly on Windows (and I much prefer burst working to this), another approach would be to let the container "boot" (start in-container Windows services and launch the first console process) at higher CPU limit (or unlimited) and then impose limits after the "boot" completes; we theoretically should be able to do this with Docker's "update" API (and some work in the agent task engine to model a new transition).

Has there been any further conversation on this topic? In our use-case we have 0 bounded workloads deployed to this cluster so even the ability to override the default behavior (maybe via an agent configuration setting?) would be a suitable short-term work-around until burst can be solved. I currently have my clusters locked to agent version 1.15.2 in order to avoid having to over-provision the cluster.

The latest release 1.17.1 includes a fix for this issue.

 "CpuPercent": 1,

Probable regression with ECS Agent 1.43.0.

@bulebuk can you detail how your Environment Details match those in the summary?

From the fix description at https://github.com/aws/amazon-ecs-agent/pull/1227:
if CPU is not set or set to zero, CPU percent and shares of the container will also be zero, else set to 1 if CPU percent calculates to zero
We calculate the CPU percent for Windows from the CPU specified in the task definition. If it calculates to 0, we don't set it to 0; we set the minimum value instead, which is 1.
Have you set the CPU field in your task definition?
If the CPU is set and calculates to 0, then this is the expected behavior.
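Read literally, the rule in that fix description could be sketched like this. This is only an interpretation of the quoted text, not the code from the pull request; fixedCPUPercent is a hypothetical name, and the percent formula (cpu * 100 / (cores * 1024)) is assumed from earlier in this thread.

```go
package main

import "fmt"

// fixedCPUPercent is a hypothetical helper illustrating the rule quoted
// above. It assumes cores >= 1.
func fixedCPUPercent(taskCPU uint, cores int64) int64 {
	// CPU not set or set to zero: leave the CPU percent at zero (no limit set).
	if taskCPU == 0 {
		return 0
	}
	percent := int64(taskCPU) * 100 / (cores * 1024)
	// CPU was set but rounds down to zero: use the minimum value of 1.
	if percent < 1 {
		return 1
	}
	return percent
}

func main() {
	fmt.Println(fixedCPUPercent(0, 1))   // unset or zero CPU: 0
	fmt.Println(fixedCPUPercent(5, 1))   // set, but rounds down to zero: 1
	fmt.Println(fixedCPUPercent(256, 1)) // 256 shares on 1 core: 25
}
```

Under this reading, only a non-zero CPU value that is too small to reach 1% would be raised to 1.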

The task definition CPU was unset or null. If this is where the issue landed, then it is working as designed. I misread the issue to conclude that the default would change to unbound and not 1%.

In my experience, the container was inoperable at 1% and could not reliably produce logs. In the end, I discovered my configuration worked when run directly through the docker cli. After diffing docker inspect for the docker run container vs the ECS run container, the "CpuPercent": 1, line jumped out at me. It feels like in this case, the defaults do not work.

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
this is all I found on these defaults in the AWS public docs:

Agent versions <= 1.1.0: Null and zero CPU values are passed to Docker as 0, which Docker then converts to 1,024 CPU shares. CPU values of 1 are passed to Docker as 1, which the Linux kernel converts to two CPU shares.

Agent versions >= 1.2.0: Null, zero, and CPU values of 1 are passed to Docker as two CPU shares.

On Windows container instances, the CPU limit is enforced as an absolute limit, or a quota. Windows containers only have access to the specified amount of CPU that is described in the task definition.

I'll see if we can get clarification added about the defaults for windows for Null and zero CPU values.
