Hi,
I'm testing the cluster autoscaler on our AWS EKS 1.12 cluster.
I created 3 identical ASGs in zones a/b/c, and created a test deployment using a basic nginx pod, which I scale up with commands like
kubectl -n playground scale --replicas=4 deployment nginx-scaleout
I've sized the pods so that 2 will fit on each node.
I started with 3 nodes, one per AZ, and began scaling up the deployment. At first it added nodes evenly, so that each zone had 2 nodes. I then scaled up further until I had 3/3/2 nodes across the zones (so far so good). The next scale-up, however, added a fourth node in zone A, giving 4/3/2. I'm unsure why it did this instead of adding a new node in zone C.
The relevant part of the log is this:
I0515 10:17:35.100431 1 static_autoscaler.go:121] Starting main loop
I0515 10:17:35.133540 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler
I0515 10:17:35.743084 1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [eks-zeus-autoscaler20190501200602699100000004 eks-zeus-scalemultiaz-a-20190514154009657700000006 eks-zeus-scalemultiaz-b-20190514154009657700000004 eks-zeus-scalemultiaz-c-20190514154009657700000005]
I0515 10:17:35.845006 1 aws_manager.go:157] Refreshed ASG list, next refresh after 2019-05-15 10:17:45.84499724 +0000 UTC m=+45941.859559752
I0515 10:17:35.845350 1 utils.go:552] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0515 10:17:35.845378 1 static_autoscaler.go:252] Filtering out schedulables
I0515 10:17:35.942508 1 static_autoscaler.go:262] No schedulable pods
I0515 10:17:35.942538 1 scale_up.go:263] Pod playground/nginx-scaleout-85cf87558d-z2pmd is unschedulable
I0515 10:17:35.942546 1 scale_up.go:263] Pod playground/nginx-scaleout-85cf87558d-w2h2j is unschedulable
I0515 10:17:35.942667 1 scale_up.go:300] Upcoming 0 nodes
I0515 10:17:35.942721 1 utils.go:208] Pod nginx-scaleout-85cf87558d-z2pmd can't be scheduled on eks-zeus-autoscaler20190501200602699100000004, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
I0515 10:17:35.942853 1 utils.go:198] Pod nginx-scaleout-85cf87558d-w2h2j can't be scheduled on eks-zeus-autoscaler20190501200602699100000004. Used cached predicate check results
I0515 10:17:35.942869 1 scale_up.go:406] No pod can fit to eks-zeus-autoscaler20190501200602699100000004
I0515 10:17:36.042655 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-a-20190514154009657700000006 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042693 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-b-20190514154009657700000004 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042705 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-c-20190514154009657700000005 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042721 1 scale_up.go:418] Best option to resize: eks-zeus-scalemultiaz-a-20190514154009657700000006
I0515 10:17:36.042732 1 scale_up.go:422] Estimated 1 nodes needed in eks-zeus-scalemultiaz-a-20190514154009657700000006
I0515 10:17:36.042894 1 scale_up.go:501] Final scale-up plan: [{eks-zeus-scalemultiaz-a-20190514154009657700000006 3->4 (max: 6)}]
I0515 10:17:36.042918 1 scale_up.go:579] Scale-up: setting group eks-zeus-scalemultiaz-a-20190514154009657700000006 size to 4
I0515 10:17:36.042963 1 auto_scaling_groups.go:203] Setting asg eks-zeus-scalemultiaz-a-20190514154009657700000006 size to 4
And my configuration looks like this:
`spec:
containers:
Any help would be appreciated, Thanks.
/assign
/sig aws
By default it uses the "random" expander. If that is the case, the scenario you described does not seem highly improbable. Does it happen each time?
We're using the least-waste expander at present, but in this case the calculations for each of the ASGs are identical, so shouldn't it then choose to scale up in a way that keeps the ASGs balanced?
None of the existing expanders cares about zone balancing, so it shouldn't matter which one you use. CA has a separate mechanism for balancing: it finds 'similar' NodeGroups (ASGs) and splits any scale-up between them. This happens after the expander makes a decision and it shouldn't depend on the expander at all.
My guess would be that your ASGs are not actually 'similar' according to the definition used by CA (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/balance_similar.md#similar-node-groups). I'd look at the set of labels the nodes in each ASG have and see if they're identical except for Kubernetes defined zone and host labels.
All of the nodes have identical labels except zone and hostname. We used Terraform to create and tag the ASGs they reside in.
`kubectl get nodes -l environment=playground --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-173-7.eu-west-1.compute.internal Ready
ip-10-0-187-220.eu-west-1.compute.internal Ready
The least-waste expander falls back to the random strategy:
https://github.com/kubernetes/autoscaler/blob/cb4e60f8d4ffb7f0836509e65ba6270738f1e15c/cluster-autoscaler/expander/waste/waste.go#L33-L35
In your case, I think all 3 node groups qualify, so the random strategy kicks in and picks one of them.
We're hitting this issue in our clusters too. I think I know why.
We have 3 ASGs with c5.2xlarge spread across 3 AZs. Looks like when Amazon creates an EC2 instance either 15835076Ki or 15835084Ki total memory is provisioned for the VM. This is what is reported in the node status.capacity.memory, and verified with free -k on the node.
When the cluster autoscaler attempts to discover similar node groups, it requires an exact match in memory capacity here: https://github.com/kubernetes/autoscaler/blob/096873623e4c7eaa592a0ff53ff91aec49d8a22b/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79-L81
It also seems that node groups scaled to zero get a different memory capacity again, maybe because of this? https://github.com/kubernetes/autoscaler/blob/096873623e4c7eaa592a0ff53ff91aec49d8a22b/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L129
Do we need some sort of tolerance in the capacity comparison, similar to the allocatable and free comparisons?
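As a rough illustration of what such a tolerance could look like, the sketch below compares two capacities by ratio instead of by exact equality. The `withinRatio` helper and the 1% threshold are assumptions made for the example, not the autoscaler's implementation; the two memory values are the ones reported above.

```go
package main

import "fmt"

// withinRatio reports whether a and b differ by at most maxRatio of the
// larger value. Two zero values are treated as equal.
func withinRatio(a, b int64, maxRatio float64) bool {
	larger, smaller := a, b
	if smaller > larger {
		larger, smaller = smaller, larger
	}
	if larger == 0 {
		return true
	}
	return float64(larger-smaller) <= float64(larger)*maxRatio
}

func main() {
	// Memory capacities in KiB seen on two c5.2xlarge nodes.
	a, b := int64(15835076), int64(15835084)

	fmt.Println("exact match:", a == b)
	fmt.Println("within 1%:  ", withinRatio(a, b, 0.01)) // 0.01 is an assumed threshold
}
```

An 8Ki gap fails the exact comparison but is far inside any percentage-style tolerance.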
@meringu If that's the case, we have some options
Regarding point 2 - the comparison logic lives in a processor, i.e. it's hidden behind an interface specifically to allow adding custom implementations without touching common code. If required, you should be able to customize the logic by adding your own implementation of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/nodegroup_set_processor.go. You may be able to reuse most of the implementation of the balancing processor (the default) and just change the comparison logic.
Thanks for guidance! @MaciekPytel
@meringu I will try to reproduce this issue on my end and check with the EC2 team at the same time. It will take some time; I'll come back to you later.
Any update on the issue @Jeffwan?
As I understand, this issue would be causing --balance-similar-nodegroups to not work for all AWS users of the autoscaler who configure an AutoScaling group per instance type per Availability Zone.
@meringu Sorry, I was on call for the past two weeks. I'll have some time this week to check this issue.
Thanks @Jeffwan. Let me know if there is anything I can help with.
I just ran into a similar thing, with a kops-built cluster, and I know exactly why it is ignoring the balance-similar-node-groups flag.
The problem, at least in my case (and it sounds similar to the description above), is that kops adds its own label to the node, i.e. kops.k8s.io/instancegroup: nodes-us-east-1a, so the groups do not match on labels.
One possible fix, specific to kops clusters, would be to add the kops.k8s.io/instancegroup string to the ignoredLabels here
I'm not sure whether other tools (e.g. for EKS) have well-known labels like this that could also be added, or whether the better solution would be a flag that lets additional ignored labels be specified at runtime.
Thoughts?
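For anyone trying to spot which label is breaking the match, a small sketch of the label comparison with a configurable ignore list may help. Everything here is illustrative: `differingLabels` is a hypothetical helper, the ignore list contains only the zone and hostname labels (the real list lives in the autoscaler source), and the second node's label value `nodes-us-east-1b` and the hostnames are assumed.

```go
package main

import (
	"fmt"
	"sort"
)

// differingLabels returns the sorted label keys whose presence or value
// differs between two nodes, skipping every key listed in ignored.
func differingLabels(a, b map[string]string, ignored map[string]bool) []string {
	seen := map[string]bool{}
	var diff []string
	for _, labels := range []map[string]string{a, b} {
		for key := range labels {
			if ignored[key] || seen[key] {
				continue
			}
			seen[key] = true
			va, inA := a[key]
			vb, inB := b[key]
			if inA != inB || va != vb {
				diff = append(diff, key)
			}
		}
	}
	sort.Strings(diff)
	return diff
}

func main() {
	ignored := map[string]bool{
		"kubernetes.io/hostname":                 true,
		"failure-domain.beta.kubernetes.io/zone": true,
	}
	nodeA := map[string]string{
		"kubernetes.io/hostname":                 "node-a",
		"failure-domain.beta.kubernetes.io/zone": "us-east-1a",
		"kops.k8s.io/instancegroup":              "nodes-us-east-1a",
	}
	nodeB := map[string]string{
		"kubernetes.io/hostname":                 "node-b",
		"failure-domain.beta.kubernetes.io/zone": "us-east-1b",
		"kops.k8s.io/instancegroup":              "nodes-us-east-1b",
	}

	fmt.Println(differingLabels(nodeA, nodeB, ignored)) // [kops.k8s.io/instancegroup]

	// Adding the kops label to the ignore list makes the two nodes match.
	ignored["kops.k8s.io/instancegroup"] = true
	fmt.Println(differingLabels(nodeA, nodeB, ignored)) // []
}
```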
Not the case for our EKS cluster. We do have some control over the labels. The only different node labels on our clusters are in the ignoredLabels list.
It is the capacity check in our case: the logs show it sometimes finds a similar node group or two, depending on whether the node group has any instances, and it is a bit random because AWS doesn't give exactly the same amount of memory every time.
Hi @Jeffwan, did you get a chance to look at this last week? Is there anything I can help with?
We are happy to contribute engineer time if that would be helpful.
@meringu Sorry for the late response. It would be helpful if you could help identify the problem. I am trying to see whether this is easily reproducible. In the CA log @danmcnulty provided, the failure was on the node selector. Could you share your logs and confirm that it failed on a memory mismatch?
Pod nginx-scaleout-85cf87558d-z2pmd can't be scheduled on eks-zeus-autoscaler20190501200602699100000004, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
I did some searching and didn't find any clues about a memory mismatch for one instance. Could you file a service ticket with the AWS EC2 team? (I don't have all your details.)
Unfortunately there are no logs for the comparisons. As you can see in the IsNodeInfoSimilar function I linked above, there are no log statements. See my comment above for the memory disparity.
Summarised:
c5.2xlarge:
Cluster Autoscaler is coded to 16384MB
Some nodes free -k 15835076Ki
Other nodes free -k 15835084Ki
I'll raise a support ticket asking them about the difference in memory from instance to instance, and why the advertised memory is different again.
@meringu this sounds like a similar problem to what I'm facing. In the event it is relevant, you may wish to skim aleksandra-malinowska's comment. Here's a snippet:
That's probably because actual node's memory is slightly different from predicted memory based on machine type (due to kernel reservation, specific to a given OS/machine combination)
@Jeffwan Just a quick note on the logs I provided - the ASG with prefix "eks-zeus-autoscaler" was used for something unrelated by us, and so it was expected that this group didn't match the node selector.
The question I have is about the 3 similar groups with prefixes:
eks-zeus-scalemultiaz-a
eks-zeus-scalemultiaz-b
eks-zeus-scalemultiaz-c
These are the ones with identical labels that scaled up unevenly.
Hey team,
I've heard back from AWS about the memory capacity disparity.
For the difference reported on the running hosts, I was using two different AMIs with different kernel versions. After updating all the instances to the newer AMI, they all report the same memory capacity.
With regard to the difference between what is reported in the API and what is reported on the node, I got this response:
Pricing API provides total memory that comes with the instance type (which is also displayed in our websites). However, there will be certain memory that will reserve for the use of kernel , BIOs etc.
Hence total available memory for the use will be less than listed in pricing API
This explains why two different AMIs can have different memory capacity, as they can have different kernel versions etc. This also means the capacity in the pricing API will never match the reported capacity from the node.
So now looking at this comment in the code: https://github.com/kubernetes/autoscaler/blob/096873623e4c7eaa592a0ff53ff91aec49d8a22b/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79-L81 A course of action could be to look at a different way to satisfy the MaxMemoryTotal requirement so we can implement a tolerance on the capacity check.
Or I could disable scaling to zero for my ASGs. This means extra overhead however.
Isn't it possible to simplify the "similar check" to comparing just the instance type?
@ewoutp You could have multiple node pools of the same instance type that you want scaled separately for one reason or another.
The main issue is where you have pools that are the same for all intents across differing zones, and some tag/label is different between the zones and getting in the way. See this PR for what I am talking about: kubernetes/autoscaler#2207
Not for the complete check, but for the CPU and especially the memory part (which seems to be a little off for now)?
@jhohertz That is true. However, I suspect there are many cases where comparison on instanceType alone is sufficient (and works very well).
Shouldn't that at least be an option?
I have 3 ASGs with c5.9xlarge spread across 3 AZs. However, when CA decides how to split scale-up nodes between the node groups, most of the time it splits them among only 2 AZs. Why do we want to check the 5% free resources? Aren't labels, capacity and allocatable sufficient to compare node group similarity?
I'm seeing this issue with m5.xlarge.
One node group reports Memory:16404668416 and another Memory:16228507648, for a difference of 176MB.
&nodeinfo.Resource{MilliCPU:4000, Memory:16228507648, EphemeralStorage:48307038948, AllowedPodNumber:58, ScalarResources:map[v1.ResourceName]int64{"attachable-volumes-aws-ebs":25, "hugepages-1Gi":0, "hugepages-2Mi":0}}
&nodeinfo.Resource{MilliCPU:4000, Memory:16404668416, EphemeralStorage:48307038948, AllowedPodNumber:58, ScalarResources:map[v1.ResourceName]int64{"attachable-volumes-aws-ebs":25, "hugepages-1Gi":0, "hugepages-2Mi":0}}
Updating MaxMemoryDifferenceInKiloBytes to 256000 fixes the issue. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L36
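For reference, the gap between the two reported values can be converted between units with a few lines of arithmetic. This is only a worked calculation on the numbers quoted above; `absDiff` is a helper defined for the example.

```go
package main

import "fmt"

const (
	kib = int64(1024)
	mib = 1024 * kib
)

// absDiff returns the absolute difference between a and b.
func absDiff(a, b int64) int64 {
	if a < b {
		return b - a
	}
	return a - b
}

func main() {
	// Memory in bytes reported for the two m5.xlarge node groups.
	a, b := int64(16404668416), int64(16228507648)
	d := absDiff(a, b)
	fmt.Printf("%d bytes = %d KiB = %d MiB\n", d, d/kib, d/mib)
	// prints "176160768 bytes = 172032 KiB = 168 MiB"
}
```

So the "176MB" figure is in decimal megabytes; in binary units the same gap is 168Mi.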
For those using the AWS CNI with custom networking, the label "k8s.amazonaws.com/eniConfig" needs to be whitelisted as well.
I'm seeing this issue with m5.xlarge.
One node group reports Memory:16404668416 and another Memory:16228507648.
Seen here on m5.large
We're hitting this same issue. We can see from the logs (by looking at the "Splitting..." log messages) that CA decided only 2 of our 3 node groups were similar, despite all non-ignored labels being identical. These were m5s, so it's possible our problem is fixed by #2462. In general, though, it would be very nice to have some logging in the comparator that says _why_ it decided 2 node groups were dissimilar; that would have made this much easier to diagnose.
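One way such diagnostics could look is a comparator that returns a reason string instead of a bare boolean. The sketch below is hypothetical and not the autoscaler's API: `groupInfo`, `dissimilarReason` and the 256000-byte tolerance passed in `main` are all assumptions, and only CPU and memory are compared to keep it short. The resource values are the m5.xlarge ones quoted earlier in the thread.

```go
package main

import "fmt"

// groupInfo is a hypothetical summary of one node group's template node.
type groupInfo struct {
	name        string
	milliCPU    int64
	memoryBytes int64
}

// dissimilarReason returns an empty string when the two groups are considered
// similar, and otherwise a message naming the first check that failed.
func dissimilarReason(a, b groupInfo, maxMemDiffBytes int64) string {
	if a.milliCPU != b.milliCPU {
		return fmt.Sprintf("%s vs %s: CPU differs (%dm vs %dm)",
			a.name, b.name, a.milliCPU, b.milliCPU)
	}
	diff := a.memoryBytes - b.memoryBytes
	if diff < 0 {
		diff = -diff
	}
	if diff > maxMemDiffBytes {
		return fmt.Sprintf("%s vs %s: memory differs by %d bytes (tolerance %d)",
			a.name, b.name, diff, maxMemDiffBytes)
	}
	return ""
}

func main() {
	a := groupInfo{"group-a", 4000, 16228507648}
	b := groupInfo{"group-b", 4000, 16404668416}

	if reason := dissimilarReason(a, b, 256000); reason != "" {
		fmt.Println("not similar:", reason)
	} else {
		fmt.Println("similar")
	}
}
```

Logging the returned reason at a verbose level would show at a glance whether labels, CPU or memory caused a group to be left out of the split.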
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
See https://github.com/kubernetes/autoscaler/issues/1676#issuecomment-568189444 for list of some of the reasons this might be happening.
I was looking into this issue last week and came up with a PR which you may all be interested in: #3124
In my debugging, I was seeing a difference of 172032Ki across availability zones of m5.xlarge instances in us-east-2. I will add I was using the cluster-api provider rather than the AWS provider so this bug affects multiple providers.
My finding was that the value of MaxMemoryDifferenceInKiloBytes has caused some confusion. It was initially introduced by a colleague of mine who coded the difference to tolerate 128Ki as he described it. However in this original PR MaxMemoryDifferenceInKiloBytes was set to 128000. This is because the value was actually a number of Bytes and not a number of KiloBytes. So when this was doubled, it allowed a 256Ki diff in memory and not a 256Mi diff as was described in that PR.
To avoid confusion, I've reworked this difference calculation to use Kubernetes Quantities and prevent conversions to integers where possible to reduce the likelihood of mistakes being made in the future. So now the MaxMemoryDifference comes from resource.MustParse("256Mi") and all of the maths is done as Quantities. I've also added a test case that demonstrates some real world values I got from my testing.
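The unit mix-up is easy to reproduce with plain integers. The sketch below is not the code from the PR (which uses Kubernetes Quantities); it uses a hypothetical `Bytes` type with explicit unit constants to show the same idea using only the standard library, and checks the 172032Ki difference mentioned above against a raw 256000 (read as bytes) and against 256Mi.

```go
package main

import "fmt"

// Bytes is a hypothetical typed size so that the unit is visible at every call site.
type Bytes int64

const (
	KiB Bytes = 1024
	MiB       = 1024 * KiB
)

// withinTolerance reports whether a and b differ by no more than tol.
func withinTolerance(a, b, tol Bytes) bool {
	d := a - b
	if d < 0 {
		d = -d
	}
	return d <= tol
}

func main() {
	base := Bytes(16228507648)
	other := base + 172032*KiB // difference observed across AZs

	// A bare 256000 interpreted as bytes is only 250 KiB.
	fmt.Println("256000 bytes in KiB:", Bytes(256000)/KiB)
	fmt.Println("within 256000 bytes:", withinTolerance(base, other, Bytes(256000)))
	fmt.Println("within 256Mi:       ", withinTolerance(base, other, 256*MiB))
}
```

Writing the tolerance as `256*MiB` (or parsing it from a string such as "256Mi") leaves no room to misread the unit from the constant's name.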
Before this patch I added a bunch of debug logic to see exactly which values the code was receiving, which led me to this discovery. I was able to consistently reproduce the problem and now, with this fix, I can see that the values coming through are being compared properly to the 256Mi tolerance.
What I would like to understand is why there are differences across instances (I've seen across AZs and within AZs)? In my testing, all machines booted from the same configuration, same AMI, so there should be no difference in the kernel version as reported earlier.
Also, I would like to understand which instance types are affected by this memory difference. Is it all Nitro-based EC2 instances? I've seen mention of C5 and M5; are R5 and T3 instances affected as well?
I've been doing some more testing of this today and have come up with some more results.
Managed to find differences on the following instance types:
The large differences have only been seen on Nitro instances, but there are some differences on older instances as well
The larger differences are about 1% (1.14%, 1.05%, 0.7%). Perhaps we should just allow a small difference, as the allocatable and free comparisons do? Are there any problems or implications with that which anyone can think of?
My finding was that the value of MaxMemoryDifferenceInKiloBytes has caused some confusion. It was initially introduced by a colleague of mine who coded the difference to tolerate 128Ki as he described it.
When I first noticed this, the differences I observed were way, way (way!) smaller than 128KiB. Sometimes just 8Kb. I just scaled this up as a first pass in absence of any hard numbers and also not to dramatically perturb the existing behaviour.
That makes sense! Thanks for chipping in. Do you happen to remember if you ever tested with 5th generation instances?
I've just had an update from an AWS contact who suggested it may be due to different CPUs backing the same instance type. E.g. you may get two instances of the same type, but one may run on Skylake and the other on Cascade Lake. Perhaps there are also subtle differences in memory capacity for these two generations.
Perhaps there are also subtle difference in memory capacity for these two generations
My guess at the time was "video memory". And I think I concluded on that for the m4 instance types we were using by looking at the dmesg output. I'm sure BIOS revisions could also play a part here. But dmesg should help show what's what.
I don't know if this is relevant here, but spot-enabled auto scaling groups typically have multiple instance types attached to them, and one of them will be selected based on spot market costs (I guess) at the time of creation.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.