The autoscaler has a timeout for non-ready nodes which forces it to kill those nodes and potentially select a different node pool in the next iteration. However, in the situation where the node pool cannot scale up at all it'll happily wait forever, keeping pods in Pending state without trying to compensate.
For example, setting up multiple AWS Spot node pools with different instance types, or setting up a Spot pool and an On Demand pool, doesn't really work. We'd expect CA to scale up one of the ASGs, detect a few minutes later that there are still no nodes coming up (because the corresponding Spot pool doesn't have capacity) and fall back to another pool. What actually happens is that CA will scale up the node pool by increasing desired capacity and then not do anything at all other than printing Upcoming 1 nodes/Failed to find readiness information for ....
CA should time out the scale-up (IIRC after 15 minutes) and put the node group that failed to scale up in 'backoff' state. At that point it should try to scale up again, ignoring this node group. You can see this by looking for one of the following:
Scale-up timed out for node group <name> after <time> in logs
ScaleUpTimedOut event on cluster-autoscaler-status configmap (in kube-system ns)
failed_scale_ups_total metric increasing
If you can reproduce it and don't see any of the above, it's a bug. In that case, can you provide some more details (especially the CA version you're using)?
We're running version 1.2.2, but I can try with 1.3.1.
It shouldn't make any difference, this was added earlier than 1.2 (I don't remember exactly, but probably 1.1 timeframe?).
So you're saying it's stuck on Upcoming 1 node for more than 15 minutes? Can you provide a log of initial scale-up, a loop immediately after and another loop after 15+ minutes?
As @aleksandra-malinowska pointed out to me, the timeout is effectively reset if there is another scale-up on the same node group (i.e. CA only notices it if the last of multiple overlapping scale-ups times out). So that may be another thing to look for in the logs.
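To make the timeout and reset behaviour described above concrete, here is a minimal Python model of it. This is only a sketch and not the autoscaler's actual Go code: the `ScaleUpTracker` class and its method names are hypothetical, the 15-minute timeout is the value quoted in this thread, and the 5-minute backoff duration is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_NODE_PROVISION_TIME = timedelta(minutes=15)  # timeout quoted in the thread
BACKOFF_DURATION = timedelta(minutes=5)          # assumed value, for illustration only


@dataclass
class ScaleUpTracker:
    """Hypothetical model of per-node-group scale-up timeouts."""
    # node group name -> deadline for the pending scale-up
    deadlines: dict = field(default_factory=dict)
    # node group name -> time until which the group is skipped
    backoff_until: dict = field(default_factory=dict)

    def register_scale_up(self, group: str, now: datetime) -> None:
        # A second scale-up on the same group overwrites the deadline,
        # which is the "timeout is effectively reset" behaviour.
        self.deadlines[group] = now + MAX_NODE_PROVISION_TIME

    def update(self, now: datetime) -> list:
        # Return groups whose scale-up timed out and put them in backoff.
        timed_out = [g for g, d in self.deadlines.items() if now >= d]
        for group in timed_out:
            del self.deadlines[group]
            self.backoff_until[group] = now + BACKOFF_DURATION
        return timed_out

    def is_backed_off(self, group: str, now: datetime) -> bool:
        until = self.backoff_until.get(group)
        return until is not None and now < until
```

In this model, two overlapping scale-ups at minute 0 and minute 10 only time out at minute 25, so a group that keeps getting new scale-up requests would look stuck for longer than 15 minutes.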
The logs are in this gist. Please ignore all ASGs and pods in 1b, the nodes there were constantly being created and destroyed by Spot termination.
It looks like the scale-up timeout is working just fine and the pool is marked as unhealthy. However, in the next loop iteration CA doesn't consider another group that would've fit the same pod (scale_up.go:178] No need for any nodes in nodepool-default-worker-m4-splitaz-aws-ACCT-eu-central-1-kube-aws-test-aermakov64-AutoScalingGroup1a-LJECHLYTYAKY in 03-next-iteration.txt). It only tries to scale it up after I manually disable the unhealthy node group by setting its max size to 0.
I think the problem is that building upcomingNodes in https://github.com/kubernetes/autoscaler/blob/f646b9a5570fa0ef92f7bd4e181d5880a7667a9d/cluster-autoscaler/core/scale_up.go#L278 ignores whether the node group is healthy or not, so the loop in https://github.com/kubernetes/autoscaler/blob/f646b9a5570fa0ef92f7bd4e181d5880a7667a9d/cluster-autoscaler/core/scale_up.go#L304 thinks that the pod could be scheduled on the perma-upcoming node and doesn't consider other groups, but I could also be completely wrong.
It seems that an easy fix would be to completely ignore upcoming nodes from unhealthy groups in ScaleUp(). It doesn't look like the list is used for anything other than scheduling estimation, so it shouldn't affect anything else. WDYT?
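A rough Python model of the proposed change, to show why skipping unhealthy groups would matter. This is a sketch under simplifying assumptions, not the code in scale_up.go: the function names are hypothetical, and it assumes every upcoming node can fit a fixed number of pending pods.

```python
def count_upcoming(upcoming_by_group, healthy, skip_unhealthy):
    """Sum upcoming nodes, optionally ignoring groups marked unhealthy."""
    total = 0
    for group, count in upcoming_by_group.items():
        if skip_unhealthy and not healthy.get(group, False):
            continue  # proposed fix: nodes from unhealthy groups do not count
        total += count
    return total


def pods_needing_scale_up(pending_pods, upcoming_by_group, healthy,
                          skip_unhealthy, pods_per_node=1):
    """Pending pods that do not fit on upcoming nodes (simplified estimate)."""
    capacity = count_upcoming(upcoming_by_group, healthy, skip_unhealthy) * pods_per_node
    return max(0, pending_pods - capacity)


upcoming = {"spot-a": 1}                      # node that never arrives
healthy = {"spot-a": False, "on-demand": True}

# Current behaviour in this model: the pod "fits" on the perma-upcoming node.
current = pods_needing_scale_up(1, upcoming, healthy, skip_unhealthy=False)
# Proposed behaviour: the pod still needs a node, so other groups get considered.
proposed = pods_needing_scale_up(1, upcoming, healthy, skip_unhealthy=True)
```

With `skip_unhealthy=False` the model reports no pods needing a scale-up, which would match the "No need for any nodes" log line; with `skip_unhealthy=True` the pod is still counted as needing capacity.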
CA handles this by resizing the node group back to original size after timed-out scale-up. This is done in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L180. However, going through the code it looks like this may not work for timed-out scale-from-0. It would be signified by lines in log looking like Readiness for node group <group> not found, which I do see in your log.
I'll try to reproduce later to confirm this theory (not enough time this week, sorry), but I'm fairly confident that's what's happening. If I'm right it's a bug in clusterstate. I have an idea how to fix it, but clusterstate is not the easiest thing to reason about and I need to have some time to dig into it to make sure I'm not breaking anything.
@MaciekPytel did you have a chance to look into it yet?
@MaciekPytel could you reproduce this?
/sigh I keep meaning to go back and fix this..
cc: @losipiuk
👍 Using 1.2.4 (K8S 1.10.12), we had problems using spot and on-demand node pools. CA tries (ad infinitum) to request a spot instance (even with the price-too-low state) and never falls back to the next node pool (on-demand).
With recent code changes cloudprovider.NodeGroup.Nodes() returns Instance objects. The intention of this method is to return an Instance object not only for running nodes but also for those which are being created or deleted. Additionally, for nodes which are being created (with Status.State==InstanceCreating), ErrorInfo can be provided. The core logic now reacts to errors specified via ErrorInfo and will retract from the scale-up of the given node group (and back off the node group) immediately, instead of waiting 15 minutes for the timeout.
If you want to implement it for AWS you may take a look at implementation for GCE which is already in.
Note: For the CA logic to react to an error, the ErrorClass must be set to OutOfResourcesErrorClass. It will probably be extended to OtherErrorClass too.
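The reaction described above can be pictured with a small Python model. This is an illustrative sketch only, not the Go cloudprovider API: the dataclasses merely mirror the names mentioned in this thread (Instance, Status.State, ErrorInfo, OutOfResourcesErrorClass), and `groups_to_back_off` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InstanceState(Enum):
    RUNNING = "InstanceRunning"
    CREATING = "InstanceCreating"
    DELETING = "InstanceDeleting"


class ErrorClass(Enum):
    OUT_OF_RESOURCES = "OutOfResourcesErrorClass"
    OTHER = "OtherErrorClass"


@dataclass
class ErrorInfo:
    error_class: ErrorClass
    message: str = ""


@dataclass
class Instance:
    id: str
    state: InstanceState
    error_info: Optional[ErrorInfo] = None


def groups_to_back_off(instances_by_group):
    """Return groups with a creating instance that reports an out-of-resources error."""
    failed = []
    for group, instances in instances_by_group.items():
        for inst in instances:
            if (inst.state is InstanceState.CREATING
                    and inst.error_info is not None
                    and inst.error_info.error_class is ErrorClass.OUT_OF_RESOURCES):
                failed.append(group)
                break  # one failing instance is enough to back off the group
    return failed
```

In this model an error of class OtherErrorClass is ignored, matching the note that only OutOfResourcesErrorClass triggers the reaction for now.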
Circling back to @aermakov-zalando's point: does it not make sense to skip upcoming nodes from unhealthy groups? We should be able to assume that the node is no longer upcoming and not increment upcomingNodes. That would make cluster-autoscaler fall back on a different ASG for the pod that cannot be scheduled.
@viggeh Are you still planning on tackling this in a follow-up PR? We're hitting this and could take a stab at the recommended implementation, if that's helpful.
@davidquarles Sorry for the late reply. I've not had the time to take a proper look at this and don't expect to do so in the next couple of weeks. If you can take the problem on that would be awesome!
@MaciekPytel @mwielgus @viggeh I'm no AWS expert, but AFAICT instances aren't added to the ASG when spot requests are made. With an open, unfulfillable spot request triggered by an ASG, I see that the ASG's desired number of instances has been incremented, and the spot request shows up as an associated scaling activity, but there are no new instances under the ASG. Is the aforementioned strategy for solving this still valid? If so, would I just use AwsNodeGroup.TemplateNodeInfo to create fake nodes (with Status.State==InstanceCreating and corresponding ErrorInfo)?
I'll add that this is also valid for any failing reason other than spot requests not being fulfilled, eg: corrupted LC or LT config - AMI not found, missing subnet or SG, etc..
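One way to picture the "fake nodes" idea from the question above is to pad the list of real instances with placeholders until it matches the ASG's desired capacity. The Python below is only a sketch of that idea, not the AWS cloud provider code: the function name and the placeholder ID format are hypothetical.

```python
def instances_with_placeholders(asg_name, desired_capacity, real_instance_ids):
    """Pad real instance IDs with placeholders up to the desired capacity."""
    ids = list(real_instance_ids)
    missing = max(0, desired_capacity - len(ids))
    for i in range(missing):
        # Hypothetical ID format; it only needs to be distinguishable from real IDs.
        ids.append(f"placeholder-{asg_name}-{i}")
    return ids


def placeholder_ids(instance_ids):
    """Instances that were requested but have not materialised yet."""
    return [i for i in instance_ids if i.startswith("placeholder-")]
```

The placeholders would then be the entries that can be reported as still being created, and that can carry error information if the underlying request (a spot request, a broken launch configuration, and so on) cannot be fulfilled.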
> CA handles this by resizing the node group back to original size after timed-out scale-up.
I don't see this behaviour. In my test the ASG is just left at the new value even though the new instance didn't get provisioned.
Just to add my test results to this issue, if it helps any...
I have 3x ASGs (2 spot, 1 normal) I have unschedulable pods, CA triggers scale up:
I0417 13:42:04.765399 1 scale_up.go:427] Best option to resize: eu01-stg-spot-2
I0417 13:42:04.765417 1 scale_up.go:431] Estimated 1 nodes needed in eu01-stg-spot-2
I0417 13:42:04.765439 1 scale_up.go:533] Final scale-up plan: [{eu01-stg-spot-2 2->3 (max: 20)}]
But spot price is not fulfilled so instance is not created. Then max-node-provision-time passes:
W0417 14:03:32.422046 1 clusterstate.go:198] Scale-up timed out for node group eu01-stg-spot-1 after 15m8.316405684s
W0417 14:03:32.422109 1 clusterstate.go:221] Disabling scale-up for node group eu01-stg-spot-1 until 2019-04-17 14:08:32.247733959 +0000 UTC m=+4217.943746338
W0417 14:03:32.532256 1 scale_up.go:329] Node group eu01-stg-spot-1 is not ready for scaleup - backoff
Now at this point I would expect CA to immediately choose 1 of the other 2 available ASGs but it does not:
I0417 14:09:23.451114 1 scale_up.go:412] No need for any nodes in eu01-stg
I0417 14:09:23.451516 1 scale_up.go:412] No need for any nodes in eu01-stg-spot-1
I0417 14:09:23.452883 1 scale_up.go:412] No need for any nodes in eu01-stg-spot-2
And pods are left unschedulable.
We're running into the same issues here. I know it's only been like 13 days, but did you happen to find a workaround @max-rocket-internet?
@choseh nope 😐
Please address this issue.
Stumbled across this too. I tried to use a scale-to-zero pool besides my existing cluster for new CI/CD gitlab workers. Running on GKE with 1.13.6-gke.0.
Sadly there is no scaling above 1 node, though 5 are allowed. Main-Cluster is currently running on 4 nodes.
Tried with and without preemptible nodes (guess the AWS term is SPOT for this)
pod description:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16s (x2 over 16s) default-scheduler 0/5 nodes are available: 4 node(s) didn't match node selector, 5 Insufficient cpu.
Normal NotTriggerScaleUp 4s (x2 over 15s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 in backoff after failed scale-up
AFAICT, #2235 properly handles the bug described here, and that PR has been merged. Can we close this bug out now?
@jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?
> @jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?
While the fix in #2235 is indeed for the AWS cloud provider, the solution is the one recommended by @MaciekPytel and @losipiuk for all cloud providers that suffer from the "I don't have an actual Instance (yet)" problem with their autoscaling API. See here for @MaciekPytel's recommendation:
https://github.com/kubernetes/autoscaler/pull/2008#issuecomment-491280315
Note that he mentions that the managed instance groups API in GCE does the whole placeholder thing behind the scenes, which is essentially what was implemented for the AWS cloud provider in #2235.
I see. We only really care about AWS anyway, so I'll just close the issue. Thanks for the update!