Autoscaler: Uneven scale-up of AWS ASGs

Created on 15 May 2019 · 44 Comments · Source: kubernetes/autoscaler

Hi,
I'm testing the cluster autoscaler on our AWS EKS 1.12 cluster.
I created 3 identical ASGs in zones a/b/c and a test deployment using a basic nginx pod, which I scale up with commands like:
`kubectl -n playground scale --replicas=4 deployment nginx-scaleout`
I've sized the pods so that 2 fit on each node.

I started with 3 nodes, one per AZ, and began scaling up the deployment. At first it added nodes evenly, so that each zone had 2 nodes. I then scaled up further until I had 3/3/2 nodes across the zones (so far so good). On the next scale-up, however, it added a fourth node in zone A, leaving me with 4/3/2. Why did it do this instead of adding a new node in zone C?

The relevant part of the log is this:
I0515 10:17:35.100431 1 static_autoscaler.go:121] Starting main loop
I0515 10:17:35.133540 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler
I0515 10:17:35.743084 1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [eks-zeus-autoscaler20190501200602699100000004 eks-zeus-scalemultiaz-a-20190514154009657700000006 eks-zeus-scalemultiaz-b-20190514154009657700000004 eks-zeus-scalemultiaz-c-20190514154009657700000005]
I0515 10:17:35.845006 1 aws_manager.go:157] Refreshed ASG list, next refresh after 2019-05-15 10:17:45.84499724 +0000 UTC m=+45941.859559752
I0515 10:17:35.845350 1 utils.go:552] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0515 10:17:35.845378 1 static_autoscaler.go:252] Filtering out schedulables
I0515 10:17:35.942508 1 static_autoscaler.go:262] No schedulable pods
I0515 10:17:35.942538 1 scale_up.go:263] Pod playground/nginx-scaleout-85cf87558d-z2pmd is unschedulable
I0515 10:17:35.942546 1 scale_up.go:263] Pod playground/nginx-scaleout-85cf87558d-w2h2j is unschedulable
I0515 10:17:35.942667 1 scale_up.go:300] Upcoming 0 nodes
I0515 10:17:35.942721 1 utils.go:208] Pod nginx-scaleout-85cf87558d-z2pmd can't be scheduled on eks-zeus-autoscaler20190501200602699100000004, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
I0515 10:17:35.942853 1 utils.go:198] Pod nginx-scaleout-85cf87558d-w2h2j can't be scheduled on eks-zeus-autoscaler20190501200602699100000004. Used cached predicate check results
I0515 10:17:35.942869 1 scale_up.go:406] No pod can fit to eks-zeus-autoscaler20190501200602699100000004
I0515 10:17:36.042655 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-a-20190514154009657700000006 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042693 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-b-20190514154009657700000004 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042705 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-c-20190514154009657700000005 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042721 1 scale_up.go:418] Best option to resize: eks-zeus-scalemultiaz-a-20190514154009657700000006
I0515 10:17:36.042732 1 scale_up.go:422] Estimated 1 nodes needed in eks-zeus-scalemultiaz-a-20190514154009657700000006
I0515 10:17:36.042894 1 scale_up.go:501] Final scale-up plan: [{eks-zeus-scalemultiaz-a-20190514154009657700000006 3->4 (max: 6)}]
I0515 10:17:36.042918 1 scale_up.go:579] Scale-up: setting group eks-zeus-scalemultiaz-a-20190514154009657700000006 size to 4
I0515 10:17:36.042963 1 auto_scaling_groups.go:203] Setting asg eks-zeus-scalemultiaz-a-20190514154009657700000006 size to 4

And my configuration looks like this:
`spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --v=4
    - --stderrthreshold=info
    - --cloud-provider=aws
    - --skip-nodes-with-local-storage=false
    - --expander=least-waste
    - --balance-similar-node-groups
    - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/zeus
    image: k8s.gcr.io/cluster-autoscaler:v1.12.3`

Any help would be appreciated. Thanks.

area/provider/aws lifecycle/rotten

danmcnulty

👍3

Most helpful comment

We’re hitting this issue in our clusters too. I think I know why.

We have 3 ASGs with c5.2xlarge spread across 3 AZs. It looks like when Amazon creates an EC2 instance, the VM is provisioned with either 15835076Ki or 15835084Ki of total memory. This is what is reported in the node's status.capacity.memory, and I verified it with free -k on the node.

When the cluster autoscaler attempts to discover similar node groups, it requires an exact match in memory capacity here: https://github.com/kubernetes/autoscaler/blob/096873623e4c7eaa592a0ff53ff91aec49d8a22b/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79-L81

It also seems that node groups that are scaled to zero get yet another memory capacity, maybe because of this? https://github.com/kubernetes/autoscaler/blob/096873623e4c7eaa592a0ff53ff91aec49d8a22b/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L129

Do we need some sort of tolerance in the capacity comparison, similar to the allocatable and free comparisons?
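
To make the question concrete, here is a minimal Python sketch (not the autoscaler's Go code) of an exact capacity match versus one with an absolute tolerance, using the two memory values quoted above. The helper names `parse_ki` and `capacity_matches` and the 16Ki tolerance are assumptions chosen for illustration only.

```python
def parse_ki(quantity: str) -> int:
    """Parse a quantity with a Ki suffix (e.g. '15835076Ki') into bytes."""
    if not quantity.endswith("Ki"):
        raise ValueError(f"expected a Ki quantity, got {quantity!r}")
    return int(quantity[:-2]) * 1024


def capacity_matches(a_bytes: int, b_bytes: int, tolerance_bytes: int = 0) -> bool:
    """Return True when two capacities differ by at most tolerance_bytes.

    With the default tolerance of 0 this is an exact-match check.
    """
    return abs(a_bytes - b_bytes) <= tolerance_bytes


# The two capacities observed on c5.2xlarge nodes in this comment.
node_a = parse_ki("15835076Ki")
node_b = parse_ki("15835084Ki")

# Exact match: the 8Ki gap is enough to treat the groups as different.
exact = capacity_matches(node_a, node_b)

# Hypothetical 16Ki tolerance: the same two nodes now compare as equal.
tolerant = capacity_matches(node_a, node_b, tolerance_bytes=16 * 1024)

print(f"exact={exact} tolerant={tolerant}")
```

Any real tolerance would need to be chosen from measured differences; the value here only shows the effect.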

meringu on 28 May 2019

👍5

All 44 comments

/assign

Jeffwan on 16 May 2019

/sig aws

Jeffwan on 20 May 2019

By default the "random" expander is used. If that is the case, the scenario you described does not seem highly improbable. Does it happen each time?

vikaschoudhary16 on 21 May 2019

We're using the least-waste expander at present, but in this case the calculations for each of the ASGs are identical, so shouldn't it then choose to scale up in a way that keeps the ASGs balanced?

danmcnulty on 21 May 2019

None of the existing expanders care about zone balancing, so it shouldn't matter which one you use. CA has a separate mechanism for balancing: it finds 'similar' NodeGroups (ASGs) and splits any scale-up between them. This happens after the expander makes a decision, and it shouldn't depend on the expander at all.
My guess would be that your ASGs are not actually 'similar' according to the definition used by CA (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/balance_similar.md#similar-node-groups). I'd look at the set of labels the nodes in each ASG have and see if they're identical except for Kubernetes defined zone and host labels.
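
As a rough illustration of that label check, the following Python sketch compares two label sets after dropping the zone and hostname labels. It is not the autoscaler's implementation: the ignore list contains only the two labels named above (the real list may be longer), and the node labels, including `example.com/pool`, are made-up sample values.

```python
# Assumed ignore list: only the zone and hostname labels mentioned above.
IGNORED_LABELS = frozenset({
    "failure-domain.beta.kubernetes.io/zone",
    "kubernetes.io/hostname",
})


def parse_labels(raw: str) -> dict:
    """Turn 'k1=v1,k2=v2' (the kubectl --show-labels format) into a dict."""
    return dict(pair.split("=", 1) for pair in raw.split(","))


def labels_similar(a: dict, b: dict, ignored=IGNORED_LABELS) -> bool:
    """Compare two label sets after removing the ignored keys."""
    def keep(labels):
        return {k: v for k, v in labels.items() if k not in ignored}
    return keep(a) == keep(b)


# Illustrative nodes: a and b differ only in zone and hostname.
node_a = parse_labels(
    "beta.kubernetes.io/instance-type=t3.medium,"
    "failure-domain.beta.kubernetes.io/zone=eu-west-1a,"
    "kubernetes.io/hostname=node-a"
)
node_b = parse_labels(
    "beta.kubernetes.io/instance-type=t3.medium,"
    "failure-domain.beta.kubernetes.io/zone=eu-west-1b,"
    "kubernetes.io/hostname=node-b"
)
# Node c carries one extra, non-ignored label (hypothetical key).
node_c = dict(node_b, **{"example.com/pool": "c"})

similar_ab = labels_similar(node_a, node_b)
similar_ac = labels_similar(node_a, node_c)
print(f"a~b={similar_ab} a~c={similar_ac}")
```

A single extra label that is not on the ignore list is enough to make two otherwise identical groups compare as dissimilar.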

MaciekPytel on 21 May 2019

All of the nodes have identical labels except zone and hostname. We used Terraform to create and tag the ASGs they reside in.

`kubectl get nodes -l environment=playground --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-173-7.eu-west-1.compute.internal Ready 8m6s v1.12.7 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.medium,beta.kubernetes.io/os=linux,cluster=zeus,environment=playground,failure-domain.beta.kubernetes.io/region=eu-west-1,failure-domain.beta.kubernetes.io/zone=eu-west-1a,kubernetes.io/hostname=ip-10-0-173-7.domain.com,nodegroup=scalemultiaz,workload=scalemultiaz

ip-10-0-187-220.eu-west-1.compute.internal Ready 7m59s v1.12.7 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.medium,beta.kubernetes.io/os=linux,cluster=zeus,environment=playground,failure-domain.beta.kubernetes.io/region=eu-west-1,failure-domain.beta.kubernetes.io/zone=eu-west-1b,kubernetes.io/hostname=ip-10-0-187-220.domain.com,nodegroup=scalemultiaz,workload=scalemultiaz`

danmcnulty on 21 May 2019

The least-waste expander falls back to the random strategy:
https://github.com/kubernetes/autoscaler/blob/cb4e60f8d4ffb7f0836509e65ba6270738f1e15c/cluster-autoscaler/expander/waste/waste.go#L33-L35

In your case, I think all 3 node groups qualify, so the random strategy kicks in and picks one of them.
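
A simplified Python sketch of that behaviour, assuming the expander picks the lowest blended waste and breaks ties randomly. The function name `pick_least_waste` is hypothetical, the group names are shortened, and the waste figures are the 48.64% values from the log in the original report.

```python
import random


def pick_least_waste(options: dict, rng: random.Random) -> str:
    """Pick the node group with the lowest waste; break ties at random.

    options maps a node group name to its blended waste fraction.
    """
    best = min(options.values())
    tied = [name for name, waste in options.items() if waste == best]
    return rng.choice(tied)


# All three groups report the same blended waste, so the tie-break decides.
options = {
    "eks-zeus-scalemultiaz-a": 0.4864,
    "eks-zeus-scalemultiaz-b": 0.4864,
    "eks-zeus-scalemultiaz-c": 0.4864,
}

choice = pick_least_waste(options, random.Random())
print(f"selected: {choice}")
```

Under this model the pick among tied groups ignores how many nodes each group already has, which would match the 4/3/2 outcome if balancing did not then redistribute the scale-up.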

Jeffwan on 21 May 2019

@meringu If that's the case, we have some options:

1. Figure out why some instances have different memory provisioned. I can check with the EC2 team on this. I suspect not all c5.2xlarge instances get exactly the same memory? Also, the ec2_instance_type mapping has to match the provisioned memory.

2. Use a comparison with tolerance. My concern is that this is common code, and if it's only an AWS issue, we should not touch it.

Jeffwan on 29 May 2019

Regarding point 2: the comparison logic lives in a processor, i.e. it's hidden behind an interface specifically to allow adding custom implementations without touching common code. If required, you should be able to customize the logic by adding your own implementation of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/nodegroup_set_processor.go. You may be able to reuse most of the implementation of the balancing processor (the default) and just change the comparison logic.

MaciekPytel on 29 May 2019

> Regarding point 2 - the comparison logic lives in processor, ie. it's hidden behind an interface specifically to allow adding custom implementations without touching common code. If required you should be able to customize the logic by adding your own implementation of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/nodegroup_set_processor.go. You may be able to reuse most of implementation of balancing processor (default), just change the comparison logic.

Thanks for guidance! @MaciekPytel

Jeffwan on 29 May 2019

@meringu I will try to reproduce this issue on my end and check with the EC2 team at the same time. It will take some time; I will get back to you later.

Jeffwan on 30 May 2019

Any update on the issue @Jeffwan?

As I understand it, this issue would cause --balance-similar-node-groups to not work for all AWS users of the autoscaler who configure one Auto Scaling group per instance type per Availability Zone.

meringu on 13 Jun 2019

@meringu Sorry, I was on call for the past two weeks. I will have some time this week to check this issue.

Jeffwan on 14 Jun 2019

Thanks @Jeffwan. Let me know if there is anything I can help with.

meringu on 14 Jun 2019

I just ran into a similar thing, with a kops-built cluster, and I know exactly why it is ignoring the balance-similar-node-groups flag.

The problem, at least in my case (and it sounds similar to the description above), is that kops adds its own labels to the node, e.g. kops.k8s.io/instancegroup: nodes-us-east-1a, so the groups do not match on labels.

One possible fix, specific to kops clusters, would be to add the kops.k8s.io/instancegroup string to the ignoredLabels here

Not sure whether other tools (e.g. EKS) have well-known labels like this that could also be added, or whether the better solution would be a flag that lets additional ignored labels be specified at runtime.

Thoughts?

jhohertz on 18 Jun 2019

Not the case for our EKS cluster. We do have some control over the labels. The only different node labels on our clusters are in the ignoredLabels list.

It is the capacity check in our case. The logs show that it sometimes finds a similar node group or two, depending on whether the node group has any instances, and somewhat at random, as AWS doesn't provide exactly the same amount of memory every time.

meringu on 18 Jun 2019

Hi @Jeffwan, did you get a chance to look at this last week? Is there anything I can help with?

We are happy to contribute engineer time if that would be helpful.

meringu on 24 Jun 2019

@meringu Sorry for the late response. It would be helpful if you could help identify the problem. I am trying to see whether this can be easily reproduced. In the CA log @danmcnulty provided, the failure was on the node selector. Could you share your logs and confirm that it failed on a memory mismatch?

> Pod nginx-scaleout-85cf87558d-z2pmd can't be scheduled on eks-zeus-autoscaler20190501200602699100000004, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector

I did some searching and didn't find any clues about a memory mismatch for one instance. Could you file a service ticket with the AWS EC2 team? (I don't have all your details.)

Jeffwan on 24 Jun 2019

Unfortunately there are no logs for the comparisons. As you can see from my link above to the IsNodeInfoSimilar function, it contains no log statements. See my comment above for the memory disparity.

Summarised:

c5.2xlarge:
Cluster Autoscaler is coded to 16384MB
Some nodes free -k 15835076Ki
Other nodes free -k 15835084Ki

I'll raise a support ticket asking them about the difference in memory from instance to instance, and why the advertised memory is different again.

meringu on 25 Jun 2019

@meringu this sounds like a similar problem to what I'm facing. In the event it is relevant, you may wish to skim aleksandra-malinowska's comment. Here's a snippet:

> That's probably because actual node's memory is slightly different from predicted memory based on machine type (due to kernel reservation, specific to a given OS/machine combination)

leonsodhi-lf on 25 Jun 2019

@Jeffwan Just a quick note on the logs I provided - the ASG with prefix "eks-zeus-autoscaler" was used for something unrelated by us, and so it was expected that this group didn't match the node selector.

The question I have is about the 3 similar groups with these prefixes:
eks-zeus-scalemultiaz-a
eks-zeus-scalemultiaz-b
eks-zeus-scalemultiaz-c
These are the ones with identical labels, and they scaled up unevenly.

danmcnulty on 1 Jul 2019

Hey team,

I've heard back from AWS about the memory capacity disparity.

For the difference reported on the running hosts, I was using two different AMIs with different kernel versions. After updating all the instances to the newer AMI, they all report the same memory capacity.

With regards to the difference between what is reported in the API and what is reported on the node, I got this response:

> Pricing API provides total memory that comes with the instance type (which is also displayed in our websites). However, there will be certain memory that will reserve for the use of kernel, BIOS etc.
> Hence total available memory for the use will be less than listed in pricing API

This explains why two different AMIs can have different memory capacity, as they can have different kernel versions etc. This also means the capacity in the pricing API will never match the reported capacity from the node.
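
For a sense of scale, this short Python calculation compares the advertised c5.2xlarge memory with the two node-reported values from the summary earlier in the thread. It assumes the "16384MB" figure means 16384 MiB; if it were decimal megabytes the gap would differ.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumption: the 16384MB coded into the autoscaler is 16384 MiB.
advertised_bytes = 16384 * MIB

# Capacities reported by free -k on running c5.2xlarge nodes.
observed_bytes = [15835076 * KIB, 15835084 * KIB]

# Gap between the advertised value and each observed value.
gaps = [advertised_bytes - observed for observed in observed_bytes]

for observed, gap in zip(observed_bytes, gaps):
    print(f"{observed // KIB}Ki observed, {gap / MIB:.1f}Mi below advertised")

# Gap between the two running nodes themselves.
print(f"node-to-node difference: {abs(gaps[0] - gaps[1]) // KIB}Ki")
```

Under that assumption a template built from the advertised value sits roughly 920Mi above either running node, far more than the 8Ki difference between the nodes themselves.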

So now looking at this comment in the code: https://github.com/kubernetes/autoscaler/blob/096873623e4c7eaa592a0ff53ff91aec49d8a22b/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79-L81 A course of action could be to look at a different way to satisfy the MaxMemoryTotal requirement so we can implement a tolerance on the capacity check.

meringu on 5 Jul 2019

Or I could disable scaling to zero for my ASGs. This means extra overhead however.

meringu on 5 Jul 2019

Isn't it possible to simplify the "similar check" to comparing just the instance type?

ewoutp on 12 Sep 2019

@ewoutp You could have multiple node pools of the same instance type that you want scaled separately for one reason or another.

The main issue is where you have pools that are the same for all intents across differing zones, and some tag/label is different between the zones and getting in the way. See this PR for what I am talking about: kubernetes/autoscaler#2207

jhohertz on 12 Sep 2019

Not for the complete check, but what about for the CPU and especially the memory part (which seems to be a little off for now)?

Robert-Stam on 12 Sep 2019

@jhohertz That is true. However, I suspect there are many cases where comparing on instanceType alone is sufficient (and works very well).
Shouldn't that at least be an option?

ewoutp on 13 Sep 2019

I have 3 ASGs with c5.9xlarge spread across 3 AZs. However, when CA decides how to split scale-up nodes between the node groups, most of the time it splits them among only 2 AZs. Why do we want to check the 5% free resources? Aren't labels, capacity and allocatable sufficient to compare node group similarity?

sulixu on 9 Oct 2019

I'm seeing this issue with m5.xlarge.
One node group reports Memory:16404668416 and another Memory:16228507648, for a difference of 176MB.

&nodeinfo.Resource{MilliCPU:4000, Memory:16228507648, EphemeralStorage:48307038948, AllowedPodNumber:58, ScalarResources:map[v1.ResourceName]int64{"attachable-volumes-aws-ebs":25, "hugepages-1Gi":0, "hugepages-2Mi":0}}

&nodeinfo.Resource{MilliCPU:4000, Memory:16404668416, EphemeralStorage:48307038948, AllowedPodNumber:58, ScalarResources:map[v1.ResourceName]int64{"attachable-volumes-aws-ebs":25, "hugepages-1Gi":0, "hugepages-2Mi":0}}

Updating MaxMemoryDifferenceInKiloBytes to 256000 fixes the issue. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L36

For those using the AWS CNI with custom networking, the label "k8s.amazonaws.com/eniConfig" needs to be whitelisted as well.

cdmurph32 on 16 Oct 2019

👍1

> I'm seeing this issue with m5.xlarge.
> One node group reports Memory:16404668416 and another Memory:16228507648.

Seen here on m5.large

JacobHenner on 16 Oct 2019

We're hitting this same issue. We can see from the logs (the "Splitting..." messages) that the autoscaler decided only 2 of our 3 node groups were similar, despite all non-ignored labels being identical. These were m5s, so it's possible our problem is fixed by #2462. In general, though, it would be very nice to have some logging in the comparator that says _why_ it decided 2 node groups were dissimilar; that would have made this much easier to diagnose.
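
One way to picture such diagnostics is a comparator that returns the reasons for a mismatch instead of a bare boolean. The Python sketch below is hypothetical and is not how the autoscaler is written: `explain_dissimilarity` and the dict layout are assumptions, and the memory values and the 256000 limit are the m5.xlarge figures quoted earlier in the thread.

```python
def explain_dissimilarity(a: dict, b: dict, ignored_labels: set,
                          max_memory_diff: int) -> list:
    """Return human-readable reasons why two node groups are not similar.

    An empty list means the groups are considered similar.
    Each group is a dict with 'labels' (dict) and 'memory' (bytes).
    """
    reasons = []
    for key in sorted(set(a["labels"]) | set(b["labels"])):
        if key in ignored_labels:
            continue
        left, right = a["labels"].get(key), b["labels"].get(key)
        if left != right:
            reasons.append(f"label {key}: {left!r} != {right!r}")
    diff = abs(a["memory"] - b["memory"])
    if diff > max_memory_diff:
        reasons.append(
            f"memory capacity differs by {diff} bytes (limit {max_memory_diff})"
        )
    return reasons


group_a = {"labels": {"nodegroup": "scalemultiaz"}, "memory": 16404668416}
group_b = {"labels": {"nodegroup": "scalemultiaz"}, "memory": 16228507648}

reasons = explain_dissimilarity(group_a, group_b, set(), max_memory_diff=256000)
for reason in reasons:
    print(reason)
```

Logging a list like this at a verbose level would show directly whether labels or capacity caused a group to be excluded from balancing.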

bazzargh on 7 Nov 2019

👍1

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 18 Mar 2020

/remove-lifecycle stale

See https://github.com/kubernetes/autoscaler/issues/1676#issuecomment-568189444 for list of some of the reasons this might be happening.

JacobHenner on 8 Apr 2020

I was looking into this issue last week and came up with this PR, which you may all be interested in: #3124

In my debugging, I was seeing a difference of 172032Ki across availability zones of m5.xlarge instances in us-east-2. I will add I was using the cluster-api provider rather than the AWS provider so this bug affects multiple providers.

My finding was that the value of MaxMemoryDifferenceInKiloBytes has caused some confusion. It was initially introduced by a colleague of mine who coded the difference to tolerate 128Ki as he described it. However in this original PR MaxMemoryDifferenceInKiloBytes was set to 128000. This is because the value was actually a number of Bytes and not a number of KiloBytes. So when this was doubled, it allowed a 256Ki diff in memory and not a 256Mi diff as was described in that PR.

To avoid confusion, I've reworked this difference calculation to use Kubernetes Quantities and prevent conversions to integers where possible to reduce the likelihood of mistakes being made in the future. So now the MaxMemoryDifference comes from resource.MustParse("256Mi") and all of the maths is done as Quantities. I've also added a test case that demonstrates some real world values I got from my testing.
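
The unit mix-up can be illustrated with plain arithmetic. This Python sketch is not the patched Go code; the variable names are made up, and it only contrasts the constant as it was effectively applied (256000 treated as bytes) with the intended 256Mi, using the 172032Ki m5.xlarge difference reported above.

```python
KIB = 1024
MIB = 1024 * KIB

# The constant was named "...InKiloBytes" but compared against byte values.
max_diff_as_coded = 256000          # bytes in practice, about 250KiB
max_diff_as_intended = 256 * MIB    # 256Mi expressed in bytes

# Difference observed across availability zones on m5.xlarge.
observed_diff = 172032 * KIB


def within(diff_bytes: int, limit_bytes: int) -> bool:
    """Return True when the difference fits inside the limit."""
    return diff_bytes <= limit_bytes


passes_as_coded = within(observed_diff, max_diff_as_coded)
passes_as_intended = within(observed_diff, max_diff_as_intended)

print(f"observed diff: {observed_diff // MIB}Mi")
print(f"as coded: {passes_as_coded}, as intended: {passes_as_intended}")
```

Keeping every value in one explicit unit, which is what doing the maths as Quantities achieves, removes this class of mistake.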

Before this patch I added a bunch of debug logic to see exactly which values the code was receiving, which led me to this discovery. I was able to consistently reproduce the problem, and now, with this fix, I can see that the values coming through are being compared properly against the 256Mi tolerance.

What I would like to understand is why there are differences across instances (I've seen across AZs and within AZs)? In my testing, all machines booted from the same configuration, same AMI, so there should be no difference in the kernel version as reported earlier.

Also, I would like to understand which instance types are affected by this memory difference. Is it all Nitro-based EC2 instances? I've seen mention of C5 and M5; are R5 and T3 instances affected as well?

JoelSpeed on 11 May 2020

I've been doing some more testing of this today and have come up with some more results.

Managed to find differences on the following instance types:

m5.xlarge - Biggest diff around 168Mi
r5.4xlarge - Biggest diff 16Ki
t3.large - Biggest diff 16Ki
m5.16xlarge - Biggest diff around 2688Mi
m4.2xlarge - Biggest diff 200Ki
c5.4xlarge - Biggest diff around 224Mi

The large differences have only been seen on Nitro instances, but there are some differences on older instances as well

The larger differences are about 1% difference (1.14, 1.05, 0.7%). Perhaps we should just allow a small difference as the allocatable and free are allowing? Are there any problems or implications with that, that anyone can think of?
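
A relative check of that kind could look like the following Python sketch. The function name `within_ratio` and the 1.5% threshold are assumptions picked for illustration, not values from the autoscaler; the two capacities are the m5.xlarge figures quoted earlier in the thread.

```python
def within_ratio(a: int, b: int, max_ratio: float) -> bool:
    """Return True when a and b differ by at most max_ratio of the larger."""
    larger = max(a, b)
    if larger == 0:
        return True
    return abs(a - b) / larger <= max_ratio


# m5.xlarge capacities (bytes) reported by two node groups.
mem_a = 16404668416
mem_b = 16228507648

ratio = abs(mem_a - mem_b) / max(mem_a, mem_b)
similar = within_ratio(mem_a, mem_b, max_ratio=0.015)  # hypothetical 1.5%

print(f"relative difference: {ratio:.2%}, similar: {similar}")
```

A ratio scales with instance size, so one threshold could cover both the 168Mi gap on m5.xlarge and the 2688Mi gap on m5.16xlarge, which a fixed byte limit cannot do without being very loose for small instances.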

JoelSpeed on 12 May 2020

> My finding was that the value of MaxMemoryDifferenceInKiloBytes has caused some confusion. It was initially introduced by a colleague of mine who coded the difference to tolerate 128Ki as he described it.

When I first noticed this, the differences I observed were way, way (way!) smaller than 128KiB. Sometimes just 8Kb. I just scaled this up as a first pass in absence of any hard numbers and also not to dramatically perturb the existing behaviour.

frobware on 15 May 2020

> When I first noticed this, the differences I observed were way, way (way!) smaller than 128KiB. Sometimes just 8Kb. I just scaled this up as a first pass in absence of any hard numbers and also not to dramatically perturb the existing behaviour.

That makes sense! Thanks for chipping in. Do you happen to remember if you ever tested with 5th generation instances?

I've just had an update from an AWS contact who suggested it may be due to different CPUs backing the same instance type. E.g. you may get two instances of the same type, but one may run on Skylake and the other on Cascade Lake. Perhaps there are also subtle differences in memory capacity between these two generations.

JoelSpeed on 15 May 2020

> Perhaps there are also subtle difference in memory capacity for these two generations

My guess at the time was "video memory". And I think I concluded on that for the m4 instance types we were using by looking at the dmesg output. I'm sure BIOS revisions could also play a part here. But dmesg should help show what's what.

frobware on 15 May 2020

I don't know if this is relevant here, but spot-enabled auto scaling groups typically have multiple instance types attached to them, and one of them will be selected based on spot market costs (I guess) at the time of creation.

trondhindenes on 4 Aug 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 2 Nov 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 2 Dec 2020

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 1 Jan 2021

@fejta-bot: Closing this issue.
