Autoscaler: Azure CA doesn't scale multiple agent pools in parallel

Created on 18 May 2019 · 30 Comments · Source: kubernetes/autoscaler

Scale-up plans in Azure never include multiple agent pools, even when more than one is needed to make room for new pods. It looks like, for some reason, the Azure implementation never includes more than one agent pool in the same scale-up plan. The consequence is that it serializes the operations, scaling only one agent pool at a time. Can you confirm whether this is a bug or the expected behavior?

At least cluster-autoscaler 1.3.9 for Azure (K8S 1.11.8) is affected. The autoscaler works (all the agent pools end up being scaled), but the issue makes it much slower than other implementations (e.g. AWS is able to scale multiple node pools in parallel, as part of the same scale-up plan). With N agent pools in the cluster that need to be scaled and M minutes to scale each one, you need to wait N x M minutes instead of just M.
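
To make the cost concrete, here is a minimal Go sketch of that arithmetic. The function names and the example figures (3 pools, 5 minutes each) are illustrative assumptions, not measurements from a real cluster.

```go
package main

import "fmt"

// sequentialWait returns the total wait, in minutes, when agent pools are
// scaled one after another: N pools times M minutes each.
func sequentialWait(pools, minutesPerPool int) int {
	return pools * minutesPerPool
}

// parallelWait returns the wait, in minutes, when every pool is scaled as
// part of the same scale-up plan: just M, provided there is something to scale.
func parallelWait(pools, minutesPerPool int) int {
	if pools == 0 {
		return 0
	}
	return minutesPerPool
}

func main() {
	pools, minutes := 3, 5 // hypothetical values, for illustration only
	fmt.Printf("sequential: %d min, parallel: %d min\n",
		sequentialWait(pools, minutes), parallelWait(pools, minutes))
	// prints "sequential: 15 min, parallel: 5 min"
}
```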

All 30 comments

That's exactly what happened to me and we had to move some services to AWS; I hope this will be fixed to come back to Azure...

Same issue here :disappointed:

It was a critical part of our last release. We were not able to solve it and, like @ealogar, we had to move this part of the service to Google Cloud.

I've also tested the latest cluster-autoscaler 1.14.2 for Azure and it is still buggy. Frequent pod restarts make the issue worse. @feiskyer is this a known issue?

@palmerabollo Yep, it's a known bug, but I haven't had a chance to fix it yet. @palmerabollo @ealogar @dgalpaj Which vmtypes are you using?

/sig azure

@feiskyer We were using Standard_DS2_v2 and Standard_DS3_v2 for our deployments

@dgalpaj Sorry, I wasn't clear about the VM types. I mean the types documented in CA, e.g. vmss, standard or aks.

@feiskyer In my case we were using vmss

Yes, vmss. The Kubernetes cluster is deployed with aks-engine. I can share the relevant parts of my aks-engine cluster definition if it helps.

@feiskyer yes, in my case they are also vmss

@feiskyer I'd love to take this on if you can give me some pointers

@CecileRobertMichon thanks. I haven't dug into the details yet, so I'd appreciate it if you could check why this is happening.

Thanks, @CecileRobertMichon. Can I help somehow?

@palmerabollo I haven't had time to look into this yet so if you want to look into it go for it. Or even if you have any pointers to get me started in the right direction when I get to it that'd be great.

Okay so I opened #2094. I found that the reason this doesn't work for aks-engine & AKS clusters is that we add an "agentpool" label to each agent node so two node pools are never identified as similar by the cluster-autoscaler node group comparator. @feiskyer please take a look. I'm not sure if the PR is an acceptable way to fix this since the label is specific to Azure and the comparator is shared amongst cloud providers. I also thought about removing that label from aks-engine but the issue with that is that there might be users and/or AKS flows depending on it at this point since it's been there for such a long time so removing it isn't straightforward. Also, removing the label would only fix it for new clusters, not for existing clusters. Thoughts?
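
A simplified sketch of the effect, for anyone following along. This is not the autoscaler's comparator code: labelsSimilar, the ignored-key sets, and the label values are hypothetical and only illustrate why a per-pool "agentpool" label prevents two otherwise identical pools from matching.

```go
package main

import "fmt"

// labelsSimilar reports whether two label sets are equal once every key in
// ignored has been dropped. Simplified stand-in, not the real comparator.
func labelsSimilar(a, b map[string]string, ignored map[string]bool) bool {
	// filter copies the labels that are not in the ignored set.
	filter := func(in map[string]string) map[string]string {
		out := make(map[string]string, len(in))
		for k, v := range in {
			if !ignored[k] {
				out[k] = v
			}
		}
		return out
	}
	fa, fb := filter(a), filter(b)
	if len(fa) != len(fb) {
		return false
	}
	for k, v := range fa {
		if w, ok := fb[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical labels for nodes from two pools with the same VM size.
	pool1 := map[string]string{"agentpool": "pool1", "kubernetes.io/os": "linux"}
	pool2 := map[string]string{"agentpool": "pool2", "kubernetes.io/os": "linux"}

	hostnameOnly := map[string]bool{"kubernetes.io/hostname": true}
	withAgentpool := map[string]bool{"kubernetes.io/hostname": true, "agentpool": true}

	fmt.Println(labelsSimilar(pool1, pool2, hostnameOnly))  // false: "agentpool" differs
	fmt.Println(labelsSimilar(pool1, pool2, withAgentpool)) // true: "agentpool" is ignored
}
```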

Thanks @CecileRobertMichon, great news.

One way to remove the "agentpool" label in AKS without breaking backwards compatibility would be to add a flag to the agentPoolProfiles in the aks-engine cluster definition that disables the label. This way there would be no need to include Azure-dependent labels in the cluster-autoscaler.

For example, adding a disableLegacyNodeLabels flag (default false for backwards compatibility):

"agentPoolProfiles": [
  {
    "name": "mypool",
    "count": 5,
    "vmSize": "Standard_DS2_v2",
    "disableLegacyNodeLabels": true
  }
...

or, maybe a bit trickier, allowing null values to be specified in the existing customNodeLabels attribute:

"agentPoolProfiles": [
  {
    "name": "mypool",
    "count": 5,
    "vmSize": "Standard_DS2_v2",
    "customNodeLabels": {
       "agentpool": null
    }
  }
...

@palmerabollo there is a case to be made for keeping the "agentpool" label (or some version of it) as it is a convenience for quickly knowing which pool a node is part of. Sometimes, for example in the case of Windows nodes, it's not immediately obvious based on the machine name.

I've updated the PR to add an Azure specific node comparator that overrides the comparator for the Azure cloud provider instead of adding the label exemption to the existing one.

For those of you here who are using AWS and have this working, can you help me clarify a few things? Is the --balance-similar-node-groups flag enabled in your cluster-autoscaler deployment? Are the node pools that you expect to scale in parallel "similar" in that they have the same CPU + Memory specs?

Thanks @CecileRobertMichon. The --balance-similar-node-groups flag is not enabled. The nodes are similar; in some cases they even use the same instance types. I can check it again in AWS with whichever setup you consider most interesting.

@CecileRobertMichon , many thanks for the fix. Do you have any plans on backporting it to current releases of the cluster autoscaler?

@CecileRobertMichon is there a workaround in the meantime?

@edrevo I'm still waiting on a repo maintainer to approve the PR so it can merge. I don't know of any backporting plans, maybe @feiskyer can help answer that.

@posix4e a workaround would be to delete the "agentpool" label on your cluster's nodes, since that is what is causing the pools to not be identified as "similar". I would be careful about doing so, however, because it may have undesired side effects if any operations depend on those labels existing.

Sure!

Sorry I'm a bit dumb, do you mean deleting the azure tags, or some k8s thing?

Ooh of course! Thanks!

> @edrevo I'm still waiting on a repo maintainer to approve the PR so it can merge. I don't know of any backporting plans, maybe @feiskyer can help answer that.

Yep, we should cherry pick this to stable release branches.

Hi again. I've been thinking about this, and I am not sure the open PR addresses @palmerabollo's description of the issue. Please do correct me if I'm wrong, but my impression is that after the PR is merged, the CAS will be able to scale two agent pools in a single call _only if they use the same node type_.

This means that, in general, this still stands:

> With N agent pools in the cluster that need to be scaled and M minutes to scale each one, you need to wait N x M minutes instead of just M.

I believe that the root cause of this is that Azure's CAS, in contrast to what AWS does, blocks while waiting for virtualMachineScaleSetsClient.CreateOrUpdate. Is there any way to avoid this synchronous block?

Yes, I agree with @edrevo. It should work with agentpools using different node types.

@edrevo @palmerabollo the way I see it, my PR is a prerequisite for what you are asking for. It might not resolve the issue completely, but without the node label patch no two agent pools can be considered "similar" from the autoscaler's perspective, as each will have a unique "agentpool" label no matter what its VM size is. My PR relaxes that requirement; it does not change or add any similarity requirements. The current implementation uses VM size as a similarity requirement. The current node group comparator, which I believe is also used by AWS, uses the following criteria to compare two node pools:

// IsNodeInfoSimilar returns true if two NodeInfos are similar enough to consider
// that the NodeGroups they come from are part of the same NodeGroupSet. The criteria are
// somewhat arbitrary, but generally we check if resources provided by both nodes
// are similar enough to likely be the same type of machine and if the set of labels
// is the same (except for a pre-defined set of labels like hostname or zone).

My proposal: get the open PR to ignore agentpool labels to a mergeable state and merge it without closing this issue. This issue can remain open to track the remaining work to make autoscaling between node pools truly "parallel".
