As the FAQ says, there are several reasons that scale-up might not happen:
CA doesn't add nodes to the cluster if it wouldn't make a pod schedulable. It will only consider adding nodes to node groups for which it was configured. So one of the reasons it doesn't scale up the cluster may be that the pod has too large (e.g. 100 CPUs), or too specific requests (like node selector), and wouldn't fit on any of the available node types. Another possible reason is that all suitable node groups are already at their maximum size.
When the "all suitable node groups are already at their maximum size" case occurs, the failure message (e.g., in the NotTriggerScaleUp event on the pod) says "pod didn't trigger scale-up (it wouldn't fit if a new node is added)". (This is based on having just experienced this on GKE with a pod that wants a certain type of node that exists in only one node pool, which is at its maximum size.)
This appears to be working as intended according to the FAQ, but I'd argue that the failure message is misleading — unless you've read the FAQ, it leads you in the direction of "make the pod smaller", not "raise the limit on the node group".
If there is at least one group at its maximum size, the message could say "a node group is at its maximum size, and the pod wouldn't fit if a new node is added to other node groups". (For even better results, it could give a count of maxed-out node groups, and only include the second half of the phrase if there are any non-maxed-out node groups.)
If the code structure is such that this information is difficult to surface when determining the message, the message could just always say "it wouldn't fit if a new node is added to any node group not at its maximum size" or something, which I think would still be a bit of an improvement.
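To make the suggestion concrete, here is a minimal sketch of the message logic I have in mind. It is written in Python purely for illustration and is not the autoscaler's actual code; the function name and both parameters are hypothetical.

```python
def not_trigger_scale_up_message(maxed_out_groups: int, other_groups: int) -> str:
    """Build a NotTriggerScaleUp message (hypothetical helper, illustration only).

    maxed_out_groups: node groups that were skipped because they are at max size.
    other_groups: node groups that could still grow but where the pod would not fit.
    """
    prefix = "pod didn't trigger scale-up"
    if maxed_out_groups == 0:
        # Nothing is at its limit, so the existing wording is accurate.
        return f"{prefix} (it wouldn't fit if a new node is added)"

    parts = [f"{maxed_out_groups} node group(s) at maximum size"]
    if other_groups > 0:
        # Only mention "wouldn't fit" when some group could actually have grown.
        parts.append("it wouldn't fit if a new node is added to the other node groups")
    return f"{prefix} ({', and '.join(parts)})"


if __name__ == "__main__":
    print(not_trigger_scale_up_message(1, 0))
    print(not_trigger_scale_up_message(2, 1))
```

With one maxed-out group and no others, the message no longer mentions fit at all, which is the case I ran into.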
In a way CA doesn't actually know why the pod doesn't fit - it uses scheduler code as a black-box to check if the pod fits on the node or not. It doesn't understand what the criteria are, it just gets a yes/no answer. There is only so much we can do with that.
That being said, in 1.12+ the NotTriggerScaleUp event actually contains the information we got back from the scheduler about why the pod didn't fit. It's still not a 'do x and it will work' kind of message, but it lists the reasons why the pod doesn't fit on different node pools.
Ok, so you're saying that something like my "if the code structure" suggestion in the last paragraph is coming in 1.12? (GKE is not quite on 1.12 I believe.) That's great to hear!
Or I see, sounds even better than the static string I'm suggesting. If so, very happy to close this!
It should look similar to the messages the scheduler puts on FailedScheduling events. And yes, it's coming to GKE in 1.12.
Yep, looks like https://github.com/kubernetes/autoscaler/pull/1188/ implements exactly what I'm looking for (including the max size reason). Thanks @aleksandra-malinowska !
Can this verbiage be updated? In our experience, the message is always incorrect. We've seen this error message because...
Maybe we're more careful than a lot of k8s users, but I've literally never seen this message appear when the actual cause was that the Pod spec required more resources than a new Node would provide. So in our case, that verbiage is the least helpful verbiage that could possibly be presented.
If we're not going to add logic to make the error message be more correct, perhaps we could add a few other reasons the error might be appearing as static text? Because error messages guide the troubleshooter where to start, and often also mentally block out other things.
UPDATE: Oh I see, closed issue. PR#1188, if it does this correctly, will be a godsend.
Same issue as was described.
In my case, it was related to the MAX_NODES limit in the autoscaler configuration.
You can check it with this command:
kubectl get -n kube-system configmap cluster-autoscaler-status -o yaml
Fixed with a change to the cluster-autoscaler deployment, but it would be nice to receive an understandable error message in such a case.
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --nodes=3:11:nodes.test
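For anyone checking their own limits: the --nodes value in the command above reads as min:max:name. A minimal Python sketch of that check follows; the helper names are hypothetical, and the current group size is something you would have to look up yourself (for example from the status configmap mentioned above).

```python
from typing import NamedTuple


class NodeGroupSpec(NamedTuple):
    min_size: int
    max_size: int
    name: str


def parse_nodes_flag(flag: str) -> NodeGroupSpec:
    """Parse a '--nodes=<min>:<max>:<name>' argument (hypothetical helper)."""
    value = flag.split("=", 1)[1] if flag.startswith("--nodes=") else flag
    # Split at most twice so any later colons stay part of the name.
    min_str, max_str, name = value.split(":", 2)
    spec = NodeGroupSpec(int(min_str), int(max_str), name)
    if spec.min_size > spec.max_size:
        raise ValueError(f"min {spec.min_size} exceeds max {spec.max_size}")
    return spec


def at_max_size(spec: NodeGroupSpec, current_size: int) -> bool:
    """True when the group cannot grow any further."""
    return current_size >= spec.max_size


if __name__ == "__main__":
    spec = parse_nodes_flag("--nodes=3:11:nodes.test")
    # current_size=11 is an assumed value for the example.
    print(spec, at_max_size(spec, current_size=11))
```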
So far these events show up only if a scale-up is not possible due to failing scheduler predicate functions (i.e., if adding a node would be possible, but a pod still wouldn't be able to schedule on it). So hitting MAX_NODES wouldn't trigger it.
It's not ideal, but it's not easy to implement something that covers all the reasons: there is no one place in the autoscaler that understands why a scale-up can't be done. Various parts of the logic are independent, and it's not easy to figure out the reason behind the end result.
In the output of kubectl describe pods:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m12s default-scheduler 0/14 nodes are available: 11 node(s) didn't have free ports for the requested pod ports, 13 node(s) didn't match node selector, 4 Insufficient pods.
Normal NotTriggerScaleUp 3m6s (x29 over 8m13s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached
Can someone suggest a fix?
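As a starting point for troubleshooting, the event messages above can be split into per-reason counts. This is a rough Python sketch that assumes the "N reason, M reason" layout shown in those two events; the function name is hypothetical, and a reason that itself contains ", " would not be handled.

```python
import re
from typing import Dict


def parse_reasons(message: str) -> Dict[str, int]:
    """Split an event message into {reason: count} (hypothetical helper)."""
    # Everything after the first ": " is the comma-separated reason list.
    _, _, tail = message.partition(": ")
    reasons: Dict[str, int] = {}
    for item in tail.rstrip(".").split(", "):
        match = re.fullmatch(r"(\d+) (.+)", item.strip())
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return reasons


if __name__ == "__main__":
    event = ("pod didn't trigger scale-up "
             "(it wouldn't fit if a new node is added): 1 max limit reached")
    for reason, count in parse_reasons(event).items():
        print(count, reason)
```

In the events above, the only reason the autoscaler reports is "max limit reached", which going by the earlier discussion points at a node group size limit rather than at the size of the pod.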