Hi,
I'm running Kubernetes on EKS 1.15 with a Windows node group, the VPC controller and webhook, and cluster-autoscaler v1.15.6.
The problem that I have is similar to https://github.com/kubernetes/autoscaler/issues/2888
When the ASG needs to be scaled from 0 to 2 instances after a couple of days of inactivity, the autoscaler doesn't trigger a scale-up.
The workaround is to set the minimum size of the ASG to 1. In that case, the autoscaler has no problem scaling up and down.
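For anyone who wants to script this workaround, here is a minimal sketch. It assumes a boto3 `autoscaling` client (or any object exposing the same `update_auto_scaling_group` method); the helper names and the ASG name in the usage note are hypothetical and not taken from this thread.

```python
from typing import Any, Dict


def build_min_size_update(asg_name: str, min_size: int = 1) -> Dict[str, Any]:
    """Build the keyword arguments for an UpdateAutoScalingGroup request."""
    if min_size < 0:
        raise ValueError("min_size must be non-negative")
    return {"AutoScalingGroupName": asg_name, "MinSize": min_size}


def apply_min_size(client: Any, asg_name: str, min_size: int = 1) -> None:
    """Apply the workaround through the given client.

    `client` is expected to behave like boto3.client("autoscaling").
    """
    client.update_auto_scaling_group(**build_min_size_update(asg_name, min_size))
```

With real credentials this would be called as `apply_min_size(boto3.client("autoscaling"), "my-windows-asg")`. Keeping one instance running has a cost, so this only sidesteps the scale-from-zero problem rather than fixing it.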
After updating to v1.15.6 the problem still occurs.
Here is the pod description:
Name:                 job-038erq28k
Namespace:            default
Priority:             10000
Priority Class Name:  low-priority
Node:                 <none>
Labels:               app=my-eks-job
                      platform=WINDOWS
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        Job/job-038e11d2
Init Containers:
  init-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     250m
      memory:  300Mi
    Requests:
      cpu:     250m
      memory:  300Mi
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Containers:
  main-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Requests:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  mytoken:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mytoken
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=windows
Tolerations:     dedicated=WINDOWS:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   41m (x14 over 60m)     default-scheduler   0/31 nodes are available: 21 Insufficient memory, 31 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Warning  FailedScheduling   31m (x19 over 65m)     default-scheduler   0/31 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  16m (x1672 over 13h)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu, 2 max limit reached
  Normal   NotTriggerScaleUp  6m31s (x1720 over 13h) cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
  Normal   NotTriggerScaleUp  89s (x301 over 13h)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 max limit reached, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu
  Warning  FailedScheduling   60s (x22 over 62m)     default-scheduler   0/30 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 30 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 30 node(s) didn't match node selector.
And some logs from the autoscaler:
I0513 06:49:01.806775 1 utils.go:229] Pod job-038erq28k can't be scheduled on linux-node-asg-20191203023958042900000018, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient cpu, Insufficient vpc.amazonaws.com/PrivateIPv4Address,
I0513 06:49:01.806973 1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"job-038erq28k", UID:"<removed>", APIVersion:"v1", ResourceVersion:"73451570", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
Can confirm on EKS 1.16.
If I manually update the desired instance count to 1, then it works. Even after auto-downscaling to 0, it can scale up again.
But if the ASG has never had any instance, it doesn't work.
I haven't tried waiting a few days after downscaling to 0; it may stop working again.
Have the same issue with 1.17.
Did you add the node labels as ASG tags?
This should be resolved in the latest release. https://github.com/kubernetes/autoscaler/issues/2888
/assign @Jeffwan
@Jeffwan yes, it scales up if you already have at least one node up.
external-dns image: 0.7.2-debian-10-r46
EKS: 1.17.6
@iusergii
Scale from 0 should be working as well. Could you share your ASG tags?
@Jeffwan Here are my tags:
Name: eks-windows-node-1a-Node
alpha.eksctl.io/cluster-name: eks
alpha.eksctl.io/eksctl-version: 0.24.0
alpha.eksctl.io/nodegroup-name: windows-node-1a
alpha.eksctl.io/nodegroup-type: unmanaged
eksctl.cluster.k8s.io/v1alpha1/cluster-name: eks
eksctl.io/v1alpha2/nodegroup-name: windows-node-1a
k8s.io/cluster-autoscaler/eks: owned
k8s.io/cluster-autoscaler/enabled: true
k8s.io/cluster-autoscaler/node-template/label/windows-node: 1a
k8s.io/cluster-autoscaler/node-template/taint/windows: true:NoSchedule
kubernetes.io/cluster/eks: owned
As you can see, I also have taints on these nodes.
@iusergii
Can you add these tags? Without them, CA won't know that your nodes provide ENI and IP address resources. Please check https://github.com/kubernetes/autoscaler/issues/2888#issue-575225019 for more details.
k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os: windows
k8s.io/cluster-autoscaler/node-template/label/os: windows
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI: 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address: 14
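To make the role of these tags concrete: when an ASG has no running instance, CA has no real node to inspect, so it builds a template node from the `node-template` tags. The sketch below illustrates that mapping in simplified form. It is not the autoscaler's actual code, and `NodeTemplate` and `template_from_tags` are hypothetical names.

```python
from typing import Dict, NamedTuple

# Tag prefix used by the node-template tags quoted in this thread.
PREFIX = "k8s.io/cluster-autoscaler/node-template/"


class NodeTemplate(NamedTuple):
    labels: Dict[str, str]
    taints: Dict[str, str]
    resources: Dict[str, str]


def template_from_tags(tags: Dict[str, str]) -> NodeTemplate:
    """Split node-template ASG tags into labels, taints, and resources."""
    labels: Dict[str, str] = {}
    taints: Dict[str, str] = {}
    resources: Dict[str, str] = {}
    buckets = {"label/": labels, "taint/": taints, "resources/": resources}
    for key, value in tags.items():
        if not key.startswith(PREFIX):
            continue  # not a node-template tag, e.g. kubernetes.io/cluster/eks
        rest = key[len(PREFIX):]
        for kind, bucket in buckets.items():
            if rest.startswith(kind):
                bucket[rest[len(kind):]] = value
                break
    return NodeTemplate(labels, taints, resources)


# Tags taken from the comments above.
asg_tags = {
    "kubernetes.io/cluster/eks": "owned",
    PREFIX + "label/beta.kubernetes.io/os": "windows",
    PREFIX + "taint/windows": "true:NoSchedule",
    PREFIX + "resources/vpc.amazonaws.com/ENI": "1",
    PREFIX + "resources/vpc.amazonaws.com/PrivateIPv4Address": "14",
}
template = template_from_tags(asg_tags)
```

A pod with a node selector such as `beta.kubernetes.io/os=windows` can only match the template if the corresponding `label/` tag is present, and the same holds for extended resources such as `vpc.amazonaws.com/PrivateIPv4Address`.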
@Jeffwan That didn't help:
```
I0812 07:08:35.605505 1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158 1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928 1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780 1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address,
```
After I manually scaled the ASG up to one instance and added more workflows, the autoscaler scaled it up successfully:
```
I0812 07:23:36.962939 1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]
```
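The `PodFitsResources` message in these logs can be read as a per-resource comparison between what the pod requests and what the template node for the ASG advertises. The following is a simplified sketch of that comparison under the assumption that quantities are plain integers; the real predicate works on Kubernetes resource quantities, and `insufficient_resources` is a hypothetical helper.

```python
from typing import Dict, List


def insufficient_resources(
    requests: Dict[str, int], allocatable: Dict[str, int]
) -> List[str]:
    """List the resources a pod requests beyond what a node offers.

    A resource missing from `allocatable` is treated as zero, which is why an
    extended resource that the template node does not advertise always fails.
    """
    return [
        f"Insufficient {name}"
        for name, amount in requests.items()
        if amount > allocatable.get(name, 0)
    ]


pod_requests = {"vpc.amazonaws.com/PrivateIPv4Address": 1}

# Template node that does not advertise the extended resource.
without_resource = insufficient_resources(pod_requests, {})

# Template node that advertises 14 addresses.
with_resource = insufficient_resources(
    pod_requests, {"vpc.amazonaws.com/PrivateIPv4Address": 14}
)
```

In this model the first call reports the same reason as the log line, which suggests that the template CA used for the empty ASG did not carry the resource at that point, even though the tags had been added.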
@iusergii
Did you restart your CA or wait for a while after you applied the tag changes?
@Jeffwan yes, I did, and the pod is still in the Pending state.
@iusergii One last thing: which patch version are you using?
@Jeffwan sorry, didn't get you.
CA: k8s.gcr.io/cluster-autoscaler:v1.17.1
API: GitVersion:"v1.17.6-eks-4e7f64"
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
/reopen
@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Hi,
The problem still exists on EKS 1.17 and 1.18; it has not been solved yet.
We see the same behavior on our EKS clusters that @iusergii reported earlier in this thread.
Please reopen the issue.
/reopen
@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@chmielas: Reopened this issue.
@dschunack Issue has been reopened
any news?
/close
@fejta-bot: Closing this issue.
/reopen
@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.
The problem still exists, please reopen the issue.
I commented in June and can confirm that when you scale down to zero and wait some time (I'm not sure how long), it stops working again. The only options are setting the min to 1 or scaling manually from zero every time.
That is not really a solution. I think a fix could be to add the stable APIs, as described in my other issue, #3802.
The new stable APIs are missing in the AWS manager.