Autoscaler: [Azure AKS] Can't scale-up nodes

Created on 13 Jul 2018  路  24Comments  路  Source: kubernetes/autoscaler

I have deployed cluster-autoscaler (v1.3.1) in AKS (with k8s 1.10.3) yesterday and to test, I created a pod with a large CPU/RAM request. I was expecting a new node to be created but that didn't happened and all I can see is the output below:

I0713 10:34:43.451225       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2018-07-13 10:35:43.451205157 +0000 UTC m=+577.450491985
I0713 10:34:43.706164       1 utils.go:503] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0713 10:34:43.706875       1 scale_up.go:249] Pod default/frontend is unschedulable
I0713 10:34:43.926322       1 scale_up.go:376] No expansion options

I manually scaled up my cluster with 1 extra node and the pod was scheduled and it is running fine. Interestingly, I still see the same output as above in cluster-autoscaler log even after the pod is running. In every ASG refresh, cluster-autoscaler still repeats the same log output.

Before scaling up the nodes manually, the following message was present in CA logs (it is not anymore):

Node group agentpool is not ready for scaleup

Am I missing something here? I didn't changed any default options in CA other than the required values.

Thanks!

areprovideazure cluster-autoscaler kinbug lifecyclstale

Most helpful comment

There is a known issue in AKS such that autoscaler will get 400 response from AKS RP when the cluster was created with Custom VNet due to autoscaler using older api-version 2017-08-31. We are working on the fix. It should go out in 1-2 weeks.

All 24 comments

Interestingly, I still see the same output as above in cluster-autoscaler log even after the pod is running. In every ASG refresh, cluster-autoscaler still repeats the same log output.

Can you provide more information on the pod? In particular, output of kubectl get pod frontend -o yaml may be helpful. Also, is there anything else in CA log related to it?

In general, this error message isn't published on refresh of data from cloud provider, but at the begining of every iteration, as long as there are unschedulable pods.

@aleksandra-malinowska its just a random mysql pod, nothing important, heres the definition:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "password"
    resources:
      requests:
        memory: "5000Mi"
        cpu: "2000m"

there isnt anything else in the logs, it just kept repeating the same output as my first comment for a while and now this is the only output I see:

I0713 13:36:24.394739       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2018-07-13 13:37:24.394729752 +0000 UTC m=+3090.987714729
I0713 13:36:24.543894       1 utils.go:503] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0713 13:36:24.544435       1 static_autoscaler.go:258] No unschedulable pods
I0713 13:36:24.688536       1 scale_down.go:643] No candidates for scale down

It seems after some time (a long time) it doesnt see the frontend pod as unschedulable anymore. But the original issue still persists.

Does the output of that command contain a nodeName: <node-where-the-pod-is-running> field in spec:? This is how Cluster Autoscaler identifies pending pods. If this field is empty or missing, but the pod is running, it may indicate a problem with scheduling.

it does contains nodeName: aks-agentpool-17300195-11

scheduling itself works fine from kubernetes perspective.. I just had to add a new node manually and the pod was scheduled and it is running fine since. The problem is that CA did not added the node for me.

The problem is that CA did not added the node for me.

Error message Node group agentpool is not ready for scaleup indicates there was a problem with the node group (e.g. recent failed scale-up or cloud provider errors). cc @feiskyer

yep, there should be an error before that. Let me try whether the issue could be reproduced.

Node group agentpool is not ready for scaleup

Is agentpool same with node's label agentpool=<name>? (It's nodepool1 in my cluster) If not, you set it by --nodes=<min-node>:<max-node>:nodepool1.

@feiskyer hey my agent pool name is indeed agentpool and I set the container parameters as - --nodes=3:11:agentpool

@guitmz Thanks for the info. Could you share the status of CA? .e.g by kubectl -n kube-system describe configmaps cluster-autoscaler-status

@feiskyer sure, here it is:

Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated=2018-07-17 08:46:12.819241555 +0000 UTC

Data
====
status:
----
Cluster-autoscaler status at 2018-07-17 08:46:12.819241555 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=11 unready=0 notStarted=0 longNotStarted=0 registered=11 longUnregistered=0)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleUp:     NoActivity (ready=11 registered=11)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048

NodeGroups:
  Name:        agentpool
  Health:      Healthy (ready=11 unready=0 notStarted=0 longNotStarted=0 registered=11 longUnregistered=0 cloudProviderTarget=11 (minSize=3, maxSize=11))
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleUp:     NoActivity (ready=11 cloudProviderTarget=11)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2018-07-17 08:46:12.463547509 +0000 UTC m=+2559.635126913
               LastTransitionTime: 2018-07-17 08:03:50.778431644 +0000 UTC m=+17.950011048


Events:  <none>

I will try today again to deploy a big pod and see if it scales to 12 nodes (or even delete the big pod that I have now and see if it scales down to 10 nodes - currently I have 11)

@feiskyer Ok, some good news. I deleted the big test pod and the 11th node was indeed also deleted, cluster size is now 10.

Events:
  Type    Reason          Age   From                Message
  ----    ------          ----  ----                -------
  Normal  ScaleDownEmpty  3m    cluster-autoscaler  Scale-down: removing empty node aks-agentpool-17300195-11

But when I redeployed the very same pod, I can't see CA doing anything about it.

Pod events:

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  6s (x14 over 2m)  default-scheduler  0/10 nodes are available: 10 Insufficient cpu, 2 Insufficient memory.

CA logs after deploying the pod (I gave it some minutes, it did like 3 ASG refreshes and they all look like below):

I0717 09:17:38.209251       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2018-07-17 09:18:38.209235453 +0000 UTC m=+4505.380814757
I0717 09:17:38.407732       1 utils.go:503] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0717 09:17:38.408275       1 static_autoscaler.go:258] No unschedulable pods
I0717 09:17:38.585546       1 scale_down.go:643] No candidates for scale down

CA events show me nothing, only the scaledown from 11 to 10 nodes.

So I believe I can confirm scaling down is indeed working but not scaling up.

I did see this message after the node was deleted in CA logs, not sure if important:

I0717 09:12:11.957649       1 azure_container_service_pool.go:206] Set size request: 10
I0717 09:12:12.030917       1 azure_container_service_pool.go:241] Current size: 11, Target size requested: 10
E0717 09:12:12.157347       1 azure_container_service_pool.go:262] containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="" Message=""

And now, some minutes later in CA logs, I see a lot of those:

I0717 09:27:40.154865       1 utils.go:422] Decreasing size of agentpool, expected=11 current=10 delta=-1
I0717 09:27:40.408037       1 azure_container_service_pool.go:206] Set size request: 10
I0717 09:27:40.481879       1 azure_container_service_pool.go:241] Current size: 11, Target size requested: 10
E0717 09:27:40.579679       1 azure_container_service_pool.go:262] containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="" Message=""
E0717 09:27:40.579840       1 static_autoscaler.go:190] Failed to fix node group sizes: Failed to decrease agentpool: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="" Message=""
I0717 09:27:51.000853       1 utils.go:422] Decreasing size of agentpool, expected=11 current=10 delta=-1

@guitmz Thanks for providing the information. Seems it has stuck on scaling down. Could you send the error logs together with (subscription ID, resource group, aks cluster name) to [email protected]? We need to dive into what's wrong.

@feiskyer just tried to send the email but got a postmaster reply stating that

The group aks-gh-feedback isn't set up to receive messages from my-email-address-here.

@guitmz Not sure what's wrong here. Could you try again ?

@feiskyer just did, now it seems to have worked! Thanks

There is a known issue in AKS such that autoscaler will get 400 response from AKS RP when the cluster was created with Custom VNet due to autoscaler using older api-version 2017-08-31. We are working on the fix. It should go out in 1-2 weeks.

facing the same issue : - Scale-up failed for group agentpool: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PropertyChangeNotAllowed" Message="Changing property 'enableRBAC' is not allowed."

The cluster is on Azure AKS and is not created in the CustomVNet. But Still Facing issue I did not used the advanced Networking while creating the Kubernetes Cluster

since https://github.com/kubernetes/autoscaler/pull/1071 is merged, if I try to upgrade my azure cluster to 1.10.6, should this fix be applied with that process? or should we wait for the 2 PRs in autoscaler to be merged (with the cherry picks)?

There is an internal bug related with this issue. The fix should be released soon.

New CA releases with AKS API 2018-03-31: https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.2.3-beta.1 and https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.3.2-beta.1.

@guitmz @abizake Could you check whether new releases fix the issue on your cluster?

@feiskyer cool, thanks! I will test as soon as I can and post the results

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Already fixed according to https://github.com/kubernetes/autoscaler/issues/1063#issuecomment-413055865.

/close

@feiskyer: Closing this issue.

In response to this:

Already fixed according to https://github.com/kubernetes/autoscaler/issues/1063#issuecomment-413055865.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

hjkatz picture hjkatz  路  4Comments

paramacharya picture paramacharya  路  5Comments

tjliupeng picture tjliupeng  路  6Comments

jadelafuente picture jadelafuente  路  4Comments

chapati23 picture chapati23  路  4Comments