Autoscaler: Scale up windows on AWS EKS cluster

Created on 14 May 2020  ·  34 Comments  ·  Source: kubernetes/autoscaler

Hi,
I'm using Kubernetes on EKS 1.15 with a Windows node group, the VPC controller and webhook, and cluster-autoscaler v1.15.6.

The problem I have is similar to https://github.com/kubernetes/autoscaler/issues/2888
When the ASG needs to be scaled from 0 to 2 instances after a couple of days of inactivity, the autoscaler doesn't trigger a scale-up.

The workaround is to set the minimum size of the ASG to 1. In that case, the autoscaler has no problem scaling up and down.
After updating to v1.15.6, the problem still occurs.
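For anyone who wants to script the minimum-size workaround, here is a minimal sketch. It assumes a boto3 Auto Scaling client is passed in (for example `boto3.client("autoscaling")`); the helper name `pin_min_size` and the group name in the usage line are hypothetical and not taken from this issue.

```python
def pin_min_size(autoscaling_client, asg_name, min_size=1):
    """Raise the ASG minimum so the group never sits at zero instances."""
    if min_size < 1:
        raise ValueError("min_size must be at least 1 for this workaround")
    # update_auto_scaling_group is the boto3 call that changes group bounds.
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=min_size,
    )
    # Return what was sent so callers can log or verify it.
    return {"AutoScalingGroupName": asg_name, "MinSize": min_size}
```

Usage would look like `pin_min_size(boto3.client("autoscaling"), "my-windows-asg")`. This only automates the workaround (one node stays up permanently); it does not address why scale-up from zero fails.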

Here is the pod description:

Name:                 job-038erq28k
Namespace:            default
Priority:             10000
Priority Class Name:  low-priority
Node:                 <none>
Labels:               app=my-eks-job
                      platform=WINDOWS
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        Job/job-038e11d2
Init Containers:
  init-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     250m
      memory:  300Mi
    Requests:
      cpu:     250m
      memory:  300Mi
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Containers:
  main-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Requests:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  mytoken:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mytoken
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=windows
Tolerations:     dedicated=WINDOWS:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   41m (x14 over 60m)      default-scheduler   0/31 nodes are available: 21 Insufficient memory, 31 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Warning  FailedScheduling   31m (x19 over 65m)      default-scheduler   0/31 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  16m (x1672 over 13h)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu, 2 max limit reached
  Normal   NotTriggerScaleUp  6m31s (x1720 over 13h)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
  Normal   NotTriggerScaleUp  89s (x301 over 13h)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 max limit reached, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu
  Warning  FailedScheduling   60s (x22 over 62m)      default-scheduler   0/30 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 30 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 30 node(s) didn't match node selector.

and some logs from the autoscaler:

I0513 06:49:01.806775       1 utils.go:229] Pod job-038erq28k can't be scheduled on linux-node-asg-20191203023958042900000018, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient cpu, Insufficient vpc.amazonaws.com/PrivateIPv4Address, 
I0513 06:49:01.806973       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"job-038erq28k", UID:"<removed>", APIVersion:"v1", ResourceVersion:"73451570", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
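The `NotTriggerScaleUp` events above list the same three reasons in a different order each time, which makes them hard to compare by eye. A small sketch that turns such a message into a reason-to-count mapping may help. The function name is hypothetical, and the parsing assumes the message layout shown in these events (a `): ` separator followed by comma-separated `<count> <reason>` items).

```python
import re

def parse_scale_up_reasons(message):
    """Split a NotTriggerScaleUp message into {reason: count}."""
    # Everything after the first "): " is the comma-separated reason list.
    _, _, tail = message.partition("): ")
    reasons = {}
    for part in tail.split(","):
        match = re.match(r"\s*(\d+)\s+(.+?)\s*$", part)
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return reasons

first = ("pod didn't trigger scale-up (it wouldn't fit if a new node is added): "
         "8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, "
         "4 Insufficient cpu, 2 max limit reached")
second = ("pod didn't trigger scale-up (it wouldn't fit if a new node is added): "
          "2 max limit reached, "
          "8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, "
          "4 Insufficient cpu")

# Both orderings reduce to the same mapping.
print(parse_scale_up_reasons(first) == parse_scale_up_reasons(second))
```

Parsed this way, the events are identical across the 13 hours shown, and the missing `vpc.amazonaws.com/PrivateIPv4Address` resource is the most frequent reason.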

All 34 comments

Can confirm on EKS 1.16.
If I manually update desired instances to 1, then it works. Even after auto-downscaling to 0, it can scale up again.

But if you never had any instance it doesn't work.

Haven't tried waiting a few days after downscaling to 0, it may stop working again.

Have the same issue with 1.17.

Did you add the node labels as ASG tags?

This should be resolved in the latest release. https://github.com/kubernetes/autoscaler/issues/2888

/assign @Jeffwan

@Jeffwan yes, it scales up if you already have at least one node up.
external-dns image: 0.7.2-debian-10-r46
EKS: 1.17.6

@iusergii

Scale from 0 should be working as well. Could you share your ASG tags?

@Jeffwan Here my tags:

Name: eks-windows-node-1a-Node
alpha.eksctl.io/cluster-name: eks   
alpha.eksctl.io/eksctl-version: 0.24.0
alpha.eksctl.io/nodegroup-name: windows-node-1a 
alpha.eksctl.io/nodegroup-type: unmanaged
eksctl.cluster.k8s.io/v1alpha1/cluster-name: eks    
eksctl.io/v1alpha2/nodegroup-name: windows-node-1a  
k8s.io/cluster-autoscaler/eks: owned    
k8s.io/cluster-autoscaler/enabled: true 
k8s.io/cluster-autoscaler/node-template/label/windows-node: 1a  
k8s.io/cluster-autoscaler/node-template/taint/windows: true:NoSchedule  
kubernetes.io/cluster/eks: owned    

As you can see, I also have taints on these nodes.

@iusergii

Can you add these tags? CA won't know your node has ENI and IP addresses. Please check https://github.com/kubernetes/autoscaler/issues/2888#issue-575225019 for more details

k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os windows
k8s.io/cluster-autoscaler/node-template/label/os windows
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address 14
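To see what the two tag sets in this exchange advertise for an empty node group, here is a rough sketch that groups `node-template` tags into labels, taints, and resources. It mirrors only the tag naming shown in this thread; it is not the autoscaler's own code, the function name is hypothetical, and whether a given CA version honors these tags is a separate question.

```python
PREFIX = "k8s.io/cluster-autoscaler/node-template/"

def node_template_from_tags(tags):
    """Group node-template ASG tags into labels, taints and resources."""
    template = {"label": {}, "taint": {}, "resources": {}}
    for key, value in tags.items():
        if not key.startswith(PREFIX):
            continue  # unrelated tag such as Name or kubernetes.io/cluster/...
        # The first path segment is the kind; the rest (which may itself
        # contain "/") is the label, taint, or resource name.
        kind, _, name = key[len(PREFIX):].partition("/")
        if kind in template and name:
            template[kind][name] = value.strip()
    return template

# Subset of the tags originally on the ASG.
original = {
    "k8s.io/cluster-autoscaler/enabled": "true",
    PREFIX + "label/windows-node": "1a",
    PREFIX + "taint/windows": "true:NoSchedule",
}
# Tags suggested in the reply above.
suggested = {
    PREFIX + "label/beta.kubernetes.io/os": "windows",
    PREFIX + "label/os": "windows",
    PREFIX + "resources/vpc.amazonaws.com/ENI": "1",
    PREFIX + "resources/vpc.amazonaws.com/PrivateIPv4Address": "14",
}

print(node_template_from_tags(original)["resources"])   # no resources advertised
print(node_template_from_tags(suggested)["resources"])
```

With the original tags, nothing says that a new node will provide `vpc.amazonaws.com/PrivateIPv4Address`. The suggested tags add that resource and the OS label.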

@Jeffwan didn't help

```
I0812 07:08:35.605505 1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158 1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928 1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780 1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address,
```

After I scaled the ASG up manually to one and added more workflows, the autoscaler scaled it up successfully:

```
I0812 07:23:36.962939 1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]
```

@iusergii

Did you restart your CA or wait for a while after you apply the tag changes?

@Jeffwan yes, I did:

  • Created IG
  • Restarted CA
  • Redeployed the application

I still have a pod in Pending state.

@iusergii One last thing: which patch version are you using?

@Jeffwan sorry, didn't get you.
CA: k8s.gcr.io/cluster-autoscaler:v1.17.1
API: GitVersion:"v1.17.6-eks-4e7f64"

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

Hi,

The problem still exists on EKS 1.17 and 1.18; it has not been solved yet.
We see the same behavior on our EKS clusters as described here:

@Jeffwan didn't help

I0812 07:08:35.605505       1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158       1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928       1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780       1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address, 

After I scaled ASG up manually to one and added more workflows it successfully scales up by autoscaler:

I0812 07:23:36.962939       1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]

Please reopen the issue.

/reopen

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

/reopen

@chmielas: Reopened this issue.

@dschunack Issue has been reopened

any news?

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@fejta-bot: Closing this issue.

/reopen

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

The problem still exists; please reopen the issue.

I commented in June and can confirm that when you scale down to zero and wait some time (not sure how long), it stops working again. The only options are setting the minimum to 1 or scaling up manually from zero every time.

This is not really a solution, but I think a fix could be to add the stable APIs, as described in my other issue, #3802.
The new stable APIs are missing in the AWS manager.