Autoscaler: Scale up windows on AWS EKS cluster

Created on 14 May 2020 · 34 Comments · Source: kubernetes/autoscaler

Hi,
I'm running Kubernetes on EKS 1.15 with a Windows node group, the VPC controller and webhook, and the cluster autoscaler:
cluster-autoscaler v1.15.6

The problem I have is similar to https://github.com/kubernetes/autoscaler/issues/2888
When the ASG needs to be scaled from 0 to 2 instances after a couple of days of inactivity, the autoscaler doesn't trigger a scale-up.

The workaround is to set the minimum size of the ASG to 1. In that case, the autoscaler has no problem scaling up and down.
After updating to v1.15.6, the problem still occurs.
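
For anyone who wants to script the workaround above, here is a minimal sketch using boto3. The ASG name is a placeholder and the helper names are hypothetical; the only AWS call assumed is `update_auto_scaling_group` on the boto3 Auto Scaling client. It only pins the minimum size and does not address the underlying scale-from-zero problem.

```python
def min_size_update(asg_name, min_size=1):
    """Build the request parameters that pin the ASG minimum size."""
    return {"AutoScalingGroupName": asg_name, "MinSize": min_size}


def apply_min_size(asg_name, min_size=1):
    """Send the update to AWS (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the helper above runs without it

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(**min_size_update(asg_name, min_size))


# "my-windows-asg" is a placeholder, not a name from this thread.
print(min_size_update("my-windows-asg"))
```

Calling `apply_min_size("my-windows-asg")` would keep at least one Windows node running, which costs more than a true scale-to-zero setup.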

Here is the pod description:

Name:                 job-038erq28k
Namespace:            default
Priority:             10000
Priority Class Name:  low-priority
Node:                 <none>
Labels:               app=my-eks-job
                      platform=WINDOWS
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        Job/job-038e11d2
Init Containers:
  init-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     250m
      memory:  300Mi
    Requests:
      cpu:     250m
      memory:  300Mi
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Containers:
  main-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Requests:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  mytoken:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mytoken
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=windows
Tolerations:     dedicated=WINDOWS:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   41m (x14 over 60m)      default-scheduler   0/31 nodes are available: 21 Insufficient memory, 31 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Warning  FailedScheduling   31m (x19 over 65m)      default-scheduler   0/31 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  16m (x1672 over 13h)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu, 2 max limit reached
  Normal   NotTriggerScaleUp  6m31s (x1720 over 13h)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
  Normal   NotTriggerScaleUp  89s (x301 over 13h)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 max limit reached, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu
  Warning  FailedScheduling   60s (x22 over 62m)      default-scheduler   0/30 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 30 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 30 node(s) didn't match node selector.
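
The event messages above pack several reasons into one line. A small sketch for tallying them, assuming the message layout shown in this output (a comma-separated list of `<count> <reason>` after the last colon); `tally_reasons` is a hypothetical helper, not part of kubectl or the autoscaler. In the scheduler messages the counts are nodes; in the autoscaler messages they appear to be node groups.

```python
import re


def tally_reasons(message):
    """Map each reason in an event message to its count."""
    # Everything after the last ": " is the comma-separated reason list.
    reasons = message.rstrip(". ").rsplit(": ", 1)[-1]
    tally = {}
    for part in reasons.split(", "):
        match = re.fullmatch(r"(\d+) (.+)", part.strip())
        if match:
            tally[match.group(2)] = int(match.group(1))
    return tally


autoscaler_msg = (
    "pod didn't trigger scale-up (it wouldn't fit if a new node is added): "
    "8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu, "
    "2 max limit reached"
)
for reason, count in tally_reasons(autoscaler_msg).items():
    print(count, reason)
```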

and some logs from the autoscaler:

I0513 06:49:01.806775       1 utils.go:229] Pod job-038erq28k can't be scheduled on linux-node-asg-20191203023958042900000018, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient cpu, Insufficient vpc.amazonaws.com/PrivateIPv4Address, 
I0513 06:49:01.806973       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"job-038erq28k", UID:"<removed>", APIVersion:"v1", ResourceVersion:"73451570", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
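
The `PodFitsResources` failure in these logs can be pictured with a simplified fit check. This is a sketch, not the Cluster Autoscaler's actual code: it only compares requested amounts against what a node template advertises. The pod requests come from the description above; the node capacities are illustrative assumptions. It shows why a template that does not advertise `vpc.amazonaws.com/PrivateIPv4Address` is reported as insufficient for that resource even when CPU and memory fit.

```python
def insufficient_resources(pod_requests, node_allocatable):
    """Return the resources the node template cannot satisfy for the pod."""
    return sorted(
        name
        for name, wanted in pod_requests.items()
        # A resource missing from the template counts as zero available.
        if node_allocatable.get(name, 0) < wanted
    )


# Requests of main-container (cpu in millicores, memory in MiB).
pod = {"cpu": 7000, "memory": 15000, "vpc.amazonaws.com/PrivateIPv4Address": 1}

# Hypothetical node templates; the capacities are assumptions for illustration.
template_without_ip = {"cpu": 8000, "memory": 32000}
template_with_ip = {
    "cpu": 8000,
    "memory": 32000,
    "vpc.amazonaws.com/PrivateIPv4Address": 14,
}

print(insufficient_resources(pod, template_without_ip))
print(insufficient_resources(pod, template_with_ip))
```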

lifecycle/rotten

chmielas

All 34 comments

Can confirm, on EKS 1.16.
If I manually update desired instances to 1, then it works. Even after auto-downscaling to 0, it can scale up again.

But if you never had any instance it doesn't work.

Haven't tried waiting a few days after downscaling to 0, it may stop working again.

jaimehrubiks on 16 Jun 2020

Have the same issue with 1.17.

iusergii on 29 Jul 2020

Did you add the labels as ASG tags?

Jeffwan on 5 Aug 2020

This should be resolved in the latest release. https://github.com/kubernetes/autoscaler/issues/2888

Jeffwan on 5 Aug 2020

/assign @Jeffwan

Jeffwan on 5 Aug 2020

@Jeffwan yes, it scales up if you already have at least one node up.
external-dns image: 0.7.2-debian-10-r46
EKS: 1.17.6

iusergii on 5 Aug 2020

@iusergii

Scale from 0 should be working as well. Could you share your ASG tags?

Jeffwan on 5 Aug 2020

@Jeffwan Here my tags:

Name: eks-windows-node-1a-Node
alpha.eksctl.io/cluster-name: eks   
alpha.eksctl.io/eksctl-version: 0.24.0
alpha.eksctl.io/nodegroup-name: windows-node-1a 
alpha.eksctl.io/nodegroup-type: unmanaged
eksctl.cluster.k8s.io/v1alpha1/cluster-name: eks    
eksctl.io/v1alpha2/nodegroup-name: windows-node-1a  
k8s.io/cluster-autoscaler/eks: owned    
k8s.io/cluster-autoscaler/enabled: true 
k8s.io/cluster-autoscaler/node-template/label/windows-node: 1a  
k8s.io/cluster-autoscaler/node-template/taint/windows: true:NoSchedule  
kubernetes.io/cluster/eks: owned

As you can see, I also have taints on these nodes.

iusergii on 10 Aug 2020

@iusergii

Can you add these tags? Without them, CA won't know that your nodes have ENIs and IP addresses. Please check https://github.com/kubernetes/autoscaler/issues/2888#issue-575225019 for more details.

k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os windows
k8s.io/cluster-autoscaler/node-template/label/os windows
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address 14
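
A minimal sketch of how tags like these could be generated and applied with boto3, for readers who manage ASGs from scripts. The ASG name and the helper names are hypothetical. The tag key prefixes follow the tags shown in this thread, and `create_or_update_tags` is the boto3 Auto Scaling client call assumed for applying them. Note that a later reply in this thread reports that adding the tags did not fix scale-from-zero for them.

```python
PREFIX = "k8s.io/cluster-autoscaler/node-template"


def node_template_tags(asg_name, labels=None, taints=None, resources=None):
    """Build ASG tag dicts that describe a node the ASG would create."""
    entries = {}
    for key, value in (labels or {}).items():
        entries[f"{PREFIX}/label/{key}"] = value
    for key, value_effect in (taints or {}).items():
        # Taint values use the "<value>:<effect>" form, e.g. "true:NoSchedule".
        entries[f"{PREFIX}/taint/{key}"] = value_effect
    for key, amount in (resources or {}).items():
        entries[f"{PREFIX}/resources/{key}"] = str(amount)
    return [
        {
            "ResourceId": asg_name,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": value,
            "PropagateAtLaunch": False,
        }
        for key, value in entries.items()
    ]


def apply_tags(tags):
    """Send the tags to AWS (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder runs without it

    boto3.client("autoscaling").create_or_update_tags(Tags=tags)


tags = node_template_tags(
    "my-windows-asg",  # placeholder ASG name
    labels={"beta.kubernetes.io/os": "windows"},
    taints={"windows": "true:NoSchedule"},
    resources={
        "vpc.amazonaws.com/ENI": 1,
        "vpc.amazonaws.com/PrivateIPv4Address": 14,
    },
)
for tag in tags:
    print(tag["Key"], "=", tag["Value"])
```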

Jeffwan on 10 Aug 2020

@Jeffwan didn't help
```
I0812 07:08:35.605505 1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158 1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928 1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780 1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address,
```

After I scaled ASG up manually to one and added more workflows it successfully scales up by autoscaler:

```
I0812 07:23:36.962939 1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]
```

iusergii on 12 Aug 2020

@iusergii

Did you restart your CA or wait for a while after you apply the tag changes?

Jeffwan on 13 Aug 2020

@Jeffwan yes, I did:

created IG
restarted CA
Redeployed application.
Still have a pod in Pending state

iusergii on 20 Aug 2020

@iusergii One last thing: which patch version are you using?

Jeffwan on 20 Aug 2020

@Jeffwan sorry, didn't get you.
CA: k8s.gcr.io/cluster-autoscaler:v1.17.1
API: GitVersion:"v1.17.6-eks-4e7f64"

iusergii on 25 Aug 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 23 Nov 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 23 Dec 2020

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 22 Jan 2021

@fejta-bot: Closing this issue.

k8s-ci-robot on 22 Jan 2021

/reopen

dschunack on 25 Jan 2021

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

k8s-ci-robot on 25 Jan 2021

Hi,

The problem still exists on EKS 1.17 and 1.18; it has not been solved.
We see the same behavior on our EKS clusters, as described here.

@Jeffwan didn't help

I0812 07:08:35.605505       1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158       1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928       1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780       1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address,

After I scaled ASG up manually to one and added more workflows it successfully scales up by autoscaler:

I0812 07:23:36.962939       1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]

dschunack on 25 Jan 2021

Please reopen the issue.

/reopen

dschunack on 25 Jan 2021

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

Please reopen the issue.

/reopen

k8s-ci-robot on 25 Jan 2021

/reopen

chmielas on 26 Jan 2021

@chmielas: Reopened this issue.

In response to this:

/reopen

k8s-ci-robot on 26 Jan 2021

@dschunack Issue has been reopened

chmielas on 26 Jan 2021

any news?

dschunack on 22 Feb 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

fejta-bot on 24 Mar 2021

@fejta-bot: Closing this issue.

k8s-ci-robot on 24 Mar 2021

/reopen

dschunack on 24 Mar 2021

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

k8s-ci-robot on 24 Mar 2021

The problem still exists; please reopen the issue.

dschunack on 24 Mar 2021

I commented in June and can confirm that when you scale down to zero and wait some time (not sure how much), it stops working again. The only solution is either setting the min to 1 or scaling manually from zero every time.

jaimehrubiks on 24 Mar 2021

This is not really a solution, but I think a fix could be to add the stable APIs, as described in my other issue, #3802.
The new stable APIs are missing in the AWS manager.

dschunack on 24 Mar 2021
