Autoscaler: [Azure AKS] Can't scale-up nodes

Created on 13 Jul 2018 · 24Comments · Source: kubernetes/autoscaler

I have deployed cluster-autoscaler (v1.3.1) in AKS (with k8s 1.10.3) yesterday and to test, I created a pod with a large CPU/RAM request. I was expecting a new node to be created but that didn't happened and all I can see is the output below:

I0713 10:34:43.451225       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2018-07-13 10:35:43.451205157 +0000 UTC m=+577.450491985
I0713 10:34:43.706164       1 utils.go:503] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0713 10:34:43.706875       1 scale_up.go:249] Pod default/frontend is unschedulable
I0713 10:34:43.926322       1 scale_up.go:376] No expansion options

I manually scaled up my cluster with 1 extra node and the pod was scheduled and it is running fine. Interestingly, I still see the same output as above in cluster-autoscaler log even after the pod is running. In every ASG refresh, cluster-autoscaler still repeats the same log output.

Before scaling up the nodes manually, the following message was present in CA logs (it is not anymore):

Node group agentpool is not ready for scaleup

Am I missing something here? I didn't changed any default options in CA other than the required values.

Thanks!

areprovideazure cluster-autoscaler kinbug lifecyclstale

Source

guitmz

Most helpful comment

There is a known issue in AKS such that autoscaler will get 400 response from AKS RP when the cluster was created with Custom VNet due to autoscaler using older api-version 2017-08-31. We are working on the fix. It should go out in 1-2 weeks.

weinong on 20 Jul 2018

👍2

All 24 comments

Interestingly, I still see the same output as above in cluster-autoscaler log even after the pod is running. In every ASG refresh, cluster-autoscaler still repeats the same log output.

Can you provide more information on the pod? In particular, output of kubectl get pod frontend -o yaml may be helpful. Also, is there anything else in CA log related to it?

In general, this error message isn't published on refresh of data from cloud provider, but at the begining of every iteration, as long as there are unschedulable pods.

aleksandra-malinowska on 13 Jul 2018

@aleksandra-malinowska its just a random mysql pod, nothing important, heres the definition:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "password"
    resources:
      requests:
        memory: "5000Mi"
        cpu: "2000m"

there isnt anything else in the logs, it just kept repeating the same output as my first comment for a while and now this is the only output I see:

I0713 13:36:24.394739       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2018-07-13 13:37:24.394729752 +0000 UTC m=+3090.987714729
I0713 13:36:24.543894       1 utils.go:503] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0713 13:36:24.544435       1 static_autoscaler.go:258] No unschedulable pods
I0713 13:36:24.688536       1 scale_down.go:643] No candidates for scale down

It seems after some time (a long time) it doesnt see the frontend pod as unschedulable anymore. But the original issue still persists.

guitmz on 13 Jul 2018

Does the output of that command contain a nodeName: <node-where-the-pod-is-running> field in spec:? This is how Cluster Autoscaler identifies pending pods. If this field is empty or missing, but the pod is running, it may indicate a problem with scheduling.

aleksandra-malinowska on 13 Jul 2018

it does contains nodeName: aks-agentpool-17300195-11

scheduling itself works fine from kubernetes perspective.. I just had to add a new node manually and the pod was scheduled and it is running fine since. The problem is that CA did not added the node for me.

guitmz on 13 Jul 2018

The problem is that CA did not added the node for me.

Error message Node group agentpool is not ready for scaleup indicates there was a problem with the node group (e.g. recent failed scale-up or cloud provider errors). cc @feiskyer

aleksandra-malinowska on 13 Jul 2018

yep, there should be an error before that. Let me try whether the issue could be reproduced.

feiskyer on 16 Jul 2018

Node group agentpool is not ready for scaleup

Is agentpool same with node's label agentpool=<name>? (It's nodepool1 in my cluster) If not, you set it by --nodes=<min-node>:<max-node>:nodepool1.

feiskyer on 16 Jul 2018

@feiskyer hey my agent pool name is indeed agentpool and I set the container parameters as - --nodes=3:11:agentpool

guitmz on 16 Jul 2018

@guitmz Thanks for the info. Could you share the status of CA? .e.g by kubectl -n kube-system describe configmaps cluster-autoscaler-status

feiskyer on 17 Jul 2018

@feiskyer sure, here it is:

Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated=2018-07-17 08:46:12.819241555 +0000 UTC

Data
====
status:
----
Cluster-autoscaler status at 2018-07-17 08:46:12.819241555 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=11 unready=0 notStarted=0 longNotStarted=0 registered=11 longUnregistered=0)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleUp:     NoActivity (ready=11 registered=11)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048

NodeGroups:
  Name:        agentpool
  Health:      Healthy (ready=11 unready=0 notStarted=0 longNotStarted=0 registered=11 longUnregistered=0 cloudProviderTarget=11 (minSize=3, maxSize=11))
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleUp:     NoActivity (ready=11 cloudProviderTarget=11)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048


Events:  <none>

I will try today again to deploy a big pod and see if it scales to 12 nodes (or even delete the big pod that I have now and see if it scales down to 10 nodes - currently I have 11)

guitmz on 17 Jul 2018

@feiskyer Ok, some good news. I deleted the big test pod and the 11th node was indeed also deleted, cluster size is now 10.

Events:
  Type    Reason          Age   From                Message
  ----    ------          ----  ----                -------
  Normal  ScaleDownEmpty  3m    cluster-autoscaler  Scale-down: removing empty node aks-agentpool-17300195-11

But when I redeployed the very same pod, I can't see CA doing anything about it.

Pod events:

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  6s (x14 over 2m)  default-scheduler  0/10 nodes are available: 10 Insufficient cpu, 2 Insufficient memory.

CA logs after deploying the pod (I gave it some minutes, it did like 3 ASG refreshes and they all look like below):

I0717 09:17:38.209251       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2018-07-17 09:18:38.209235453 +0000 UTC m=+4505.380814757
I0717 09:17:38.407732       1 utils.go:503] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0717 09:17:38.408275       1 static_autoscaler.go:258] No unschedulable pods
I0717 09:17:38.585546       1 scale_down.go:643] No candidates for scale down

CA events show me nothing, only the scaledown from 11 to 10 nodes.

So I believe I can confirm scaling down is indeed working but not scaling up.

I did see this message after the node was deleted in CA logs, not sure if important:

I0717 09:12:11.957649       1 azure_container_service_pool.go:206] Set size request: 10
I0717 09:12:12.030917       1 azure_container_service_pool.go:241] Current size: 11, Target size requested: 10
E0717 09:12:12.157347       1 azure_container_service_pool.go:262] containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="" Message=""

And now, some minutes later in CA logs, I see a lot of those:

I0717 09:27:40.154865       1 utils.go:422] Decreasing size of agentpool, expected=11 current=10 delta=-1
I0717 09:27:40.408037       1 azure_container_service_pool.go:206] Set size request: 10
I0717 09:27:40.481879       1 azure_container_service_pool.go:241] Current size: 11, Target size requested: 10
E0717 09:27:40.579679       1 azure_container_service_pool.go:262] containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="" Message=""
E0717 09:27:40.579840       1 static_autoscaler.go:190] Failed to fix node group sizes: Failed to decrease agentpool: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="" Message=""
I0717 09:27:51.000853       1 utils.go:422] Decreasing size of agentpool, expected=11 current=10 delta=-1

guitmz on 17 Jul 2018

@guitmz Thanks for providing the information. Seems it has stuck on scaling down. Could you send the error logs together with (subscription ID, resource group, aks cluster name) to [email protected]? We need to dive into what's wrong.

feiskyer on 17 Jul 2018

@feiskyer just tried to send the email but got a postmaster reply stating that

The group aks-gh-feedback isn't set up to receive messages from my-email-address-here.

guitmz on 18 Jul 2018

@guitmz Not sure what's wrong here. Could you try again ?

feiskyer on 20 Jul 2018

@feiskyer just did, now it seems to have worked! Thanks

guitmz on 20 Jul 2018

weinong on 20 Jul 2018

👍2

facing the same issue : - Scale-up failed for group agentpool: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PropertyChangeNotAllowed" Message="Changing property 'enableRBAC' is not allowed."

The cluster is on Azure AKS and is not created in the CustomVNet. But Still Facing issue I did not used the advanced Networking while creating the Kubernetes Cluster

abizake on 1 Aug 2018

since https://github.com/kubernetes/autoscaler/pull/1071 is merged, if I try to upgrade my azure cluster to 1.10.6, should this fix be applied with that process? or should we wait for the 2 PRs in autoscaler to be merged (with the cherry picks)?

guitmz on 3 Aug 2018

There is an internal bug related with this issue. The fix should be released soon.

feiskyer on 6 Aug 2018

New CA releases with AKS API 2018-03-31: https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.2.3-beta.1 and https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.3.2-beta.1.

@guitmz @abizake Could you check whether new releases fix the issue on your cluster?

feiskyer on 15 Aug 2018

👍1

@feiskyer cool, thanks! I will test as soon as I can and post the results

guitmz on 15 Aug 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 13 Nov 2018

Already fixed according to https://github.com/kubernetes/autoscaler/issues/1063#issuecomment-413055865.

/close

feiskyer on 20 Nov 2018

@feiskyer: Closing this issue.

In response to this:

Already fixed according to https://github.com/kubernetes/autoscaler/issues/1063#issuecomment-413055865.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.