Autoscaler: Node pool scale up timeout

Created on 9 Aug 2018 · 30 Comments · Source: kubernetes/autoscaler

The autoscaler has a timeout for non-ready nodes which forces it to kill those nodes and potentially select a different node pool in the next iteration. However, in the situation where the node pool cannot scale up at all it'll happily wait forever, keeping pods in Pending state without trying to compensate.

For example, setting up multiple AWS Spot node pools with different instance types, or a Spot pool alongside an On Demand pool, doesn't really work. We'd expect CA to scale up one of the ASGs, detect a few minutes later that there are still no nodes coming up (because the corresponding Spot pool doesn't have capacity), and fall back to another pool. What actually happens is that CA scales up the node pool by increasing the desired capacity and then does nothing other than print "Upcoming 1 nodes" and "Failed to find readiness information for ...".
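
The fallback described above can be sketched roughly like this. It is only an illustration of the expected behaviour, not CA code: `scaleUpWithFallback` and the `provision` callback are hypothetical names, and the callback stands in for "raise desired capacity, then wait up to the provisioning timeout for a node to register".

```go
package main

import "fmt"

// scaleUpWithFallback tries each pool in priority order and returns the first
// one for which provision reports success. provision is a hypothetical
// callback, assumed to return false when no node registered in time.
func scaleUpWithFallback(pools []string, provision func(pool string) bool) (string, bool) {
	for _, pool := range pools {
		if provision(pool) {
			return pool, true
		}
		// No node appeared in time: move on to the next pool
		// instead of waiting on this one forever.
	}
	return "", false
}

func main() {
	// Assumed scenario: both Spot pools lack capacity, On Demand works.
	hasCapacity := map[string]bool{"spot-a": false, "spot-b": false, "on-demand": true}
	pool, ok := scaleUpWithFallback(
		[]string{"spot-a", "spot-b", "on-demand"},
		func(p string) bool { return hasCapacity[p] },
	)
	fmt.Println(pool, ok)
}
```

The ordering of `pools` is the only policy here; a real expander would choose between candidates in a more involved way.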

cluster-autoscaler kind/bug


aermakov-zalando

All 30 comments

CA should timeout scale-up (IIRC after 15 minutes) and put the node group that failed to scale-up in 'backoff' state. At that point it should try to scale-up again, ignoring this node group. You can see this by looking for one of the following:

Scale-up timed out for node group <name> after <time> in logs
ScaleUpTimedOut event on cluster-autoscaler-status configmap (in kube-system ns)
Node group showing scale-up status as "Backoff" in status configmap
failed_scale_ups_total metric increasing

If you can reproduce it and don't see any of the above, it's a bug. In that case, can you provide some more details (especially the CA version you're using)?
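
For the first signal listed above, a small helper can pull the node group and elapsed time out of a log line. This is a minimal sketch: the regular expression is an assumption based on the message text quoted in this thread, and `parseScaleUpTimeout` is a hypothetical name, not part of CA.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// timedOutRe matches the "Scale-up timed out" message as quoted in this
// thread. The exact wording may differ between CA versions (assumption).
var timedOutRe = regexp.MustCompile(`Scale-up timed out for node group (\S+) after (\S+)`)

// parseScaleUpTimeout extracts the node group name and the elapsed time from
// a single log line. ok is false when the line is not a timeout message.
func parseScaleUpTimeout(line string) (group string, waited time.Duration, ok bool) {
	m := timedOutRe.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	d, err := time.ParseDuration(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], d, true
}

func main() {
	line := "W0417 14:03:32.422046       1 clusterstate.go:198] " +
		"Scale-up timed out for node group eu01-stg-spot-1 after 15m8.316405684s"
	group, waited, ok := parseScaleUpTimeout(line)
	fmt.Println(group, waited, ok)
}
```

If no line in the CA log matches after well over 15 minutes of pending pods, that would point at the bug case rather than at normal backoff.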

MaciekPytel on 9 Aug 2018

We're running version 1.2.2, but I can try with 1.3.1.

aermakov-zalando on 9 Aug 2018

It shouldn't make any difference, this was added earlier than 1.2 (I don't remember exactly, but probably 1.1 timeframe?).

So you're saying it's stuck on Upcoming 1 node for more than 15 minutes? Can you provide a log of initial scale-up, a loop immediately after and another loop after 15+ minutes?

MaciekPytel on 9 Aug 2018

As @aleksandra-malinowska pointed out to me, the timeout is effectively reset if there is another scale-up on the same node group (i.e. CA only notices it if the last of multiple overlapping scale-ups times out). So that may be another thing to look for in the logs.
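
To make the reset concrete, here is a toy model of that bookkeeping. It is a sketch under assumptions, not the clusterstate implementation: `tracker`, `register`, `timedOut` and the helper `at` are hypothetical, and one deadline per node group is assumed.

```go
package main

import (
	"fmt"
	"time"
)

// tracker is a hypothetical, simplified model of per-node-group scale-up
// bookkeeping with a single deadline per group.
type tracker struct {
	maxProvision time.Duration
	deadline     map[string]time.Time // node group -> time by which nodes are expected
}

func newTracker(maxProvisionMinutes int) *tracker {
	return &tracker{
		maxProvision: time.Duration(maxProvisionMinutes) * time.Minute,
		deadline:     map[string]time.Time{},
	}
}

// register records a scale-up. A second scale-up on the same group overwrites
// the deadline, which models the "timeout is effectively reset" behaviour.
func (t *tracker) register(group string, now time.Time) {
	t.deadline[group] = now.Add(t.maxProvision)
}

// timedOut reports whether the group's latest scale-up has passed its deadline.
func (t *tracker) timedOut(group string, now time.Time) bool {
	d, ok := t.deadline[group]
	return ok && now.After(d)
}

// at returns a timestamp n minutes after an arbitrary start.
func at(n int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(n) * time.Minute)
}

func main() {
	t := newTracker(15)
	t.register("spot-1", at(0))
	t.register("spot-1", at(10)) // overlapping scale-up moves the deadline to minute 25
	fmt.Println(t.timedOut("spot-1", at(16)), t.timedOut("spot-1", at(26)))
}
```

In this model the first request alone would have timed out at minute 15, but the overlapping request at minute 10 hides that until minute 25.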

MaciekPytel on 9 Aug 2018

The logs are in this gist. Please ignore all ASGs and pods in 1b, the nodes there were constantly being created and destroyed by Spot termination.

It looks like the scale-up timeout is working just fine and the pool is marked as unhealthy. However, in the next loop iteration CA doesn't consider another group that would've fit the same pod (scale_up.go:178] No need for any nodes in nodepool-default-worker-m4-splitaz-aws-ACCT-eu-central-1-kube-aws-test-aermakov64-AutoScalingGroup1a-LJECHLYTYAKY in 03-next-iteration.txt). It only tries to scale it up after I manually disable the unhealthy node group by setting its max size to 0.

aermakov-zalando on 9 Aug 2018

I think the problem is that building upcomingNodes in https://github.com/kubernetes/autoscaler/blob/f646b9a5570fa0ef92f7bd4e181d5880a7667a9d/cluster-autoscaler/core/scale_up.go#L278 ignores whether the node group is healthy or not, so the loop in https://github.com/kubernetes/autoscaler/blob/f646b9a5570fa0ef92f7bd4e181d5880a7667a9d/cluster-autoscaler/core/scale_up.go#L304 thinks that the pod could be scheduled on the perma-upcoming node and doesn't consider other groups, but I could also be completely wrong.

aermakov-zalando on 9 Aug 2018

It seems that an easy fix would be to completely ignore upcoming nodes from unhealthy groups in ScaleUp(). It doesn't look like the list is used for anything other than scheduling estimation, so it shouldn't affect anything else. WDYT?
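
Roughly what I have in mind, as a standalone sketch rather than a patch: `upcomingFromHealthyGroups` and the `healthy` callback are made-up names, and the real `upcomingNodes` is built from node infos, not from a plain count per group.

```go
package main

import "fmt"

// upcomingFromHealthyGroups returns a copy of upcoming (node group -> number
// of nodes still expected) without the groups that healthy rejects, so that
// scheduling estimation stops counting nodes that will never show up.
func upcomingFromHealthyGroups(upcoming map[string]int, healthy func(group string) bool) map[string]int {
	out := make(map[string]int, len(upcoming))
	for group, n := range upcoming {
		if healthy(group) {
			out[group] = n
		}
	}
	return out
}

func main() {
	upcoming := map[string]int{"spot-1": 1, "on-demand": 2}
	backedOff := map[string]bool{"spot-1": true} // assumed health state
	filtered := upcomingFromHealthyGroups(upcoming, func(g string) bool { return !backedOff[g] })
	fmt.Println(filtered)
}
```

With the perma-upcoming node from the backed-off group gone, the pending pod no longer appears to fit anywhere, so other groups would be considered.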

aermakov-zalando on 9 Aug 2018

CA handles this by resizing the node group back to original size after timed-out scale-up. This is done in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L180. However, going through the code it looks like this may not work for timed-out scale-from-0. It would be signified by lines in log looking like Readiness for node group <group> not found, which I do see in your log.
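
As a rough illustration of the resize-back step (a simplified model, not the code behind the link): `group` and `revertTimedOut` are hypothetical, and the sketch assumes the target is simply lowered to the number of nodes that registered.

```go
package main

import "fmt"

// group is a hypothetical, simplified view of a node group.
type group struct {
	name       string
	targetSize int // size requested from the cloud provider
	registered int // nodes that actually joined the cluster
}

// revertTimedOut lowers the target of every timed-out group to the number of
// nodes that actually registered and returns the names of the changed groups.
func revertTimedOut(groups []*group, timedOut func(name string) bool) []string {
	var changed []string
	for _, g := range groups {
		if timedOut(g.name) && g.targetSize > g.registered {
			g.targetSize = g.registered
			changed = append(changed, g.name)
		}
	}
	return changed
}

func main() {
	groups := []*group{
		{name: "spot-1", targetSize: 3, registered: 2},
		{name: "spot-from-zero", targetSize: 1, registered: 0},
		{name: "on-demand", targetSize: 2, registered: 2},
	}
	changed := revertTimedOut(groups, func(string) bool { return true })
	fmt.Println(changed, groups[0].targetSize, groups[1].targetSize)
}
```

The sketch treats the scale-from-0 group like any other; the suspicion above is that the real code path does not reach that case.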

I'll try to reproduce later to confirm this theory (not enough time this week, sorry), but I'm fairly confident that's what's happening. If I'm right it's a bug in clusterstate. I have an idea how to fix it, but clusterstate is not the easiest thing to reason about and I need to have some time to dig into it to make sure I'm not breaking anything.

MaciekPytel on 9 Aug 2018

👍4

@MaciekPytel did you have a chance to look into it yet?

jrake-revelant on 20 Aug 2018

@MaciekPytel could you reproduce this?

szuecs on 27 Sep 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 26 Dec 2018

/remove-lifecycle stale
/sigh I keep meaning to go back and fix this..

MaciekPytel on 27 Dec 2018

cc: @losipiuk

MaciekPytel on 9 Jan 2019

👍 Using 1.2.4 (K8S 1.10.12), we had problems using spot and on-demand node pools. CA tries (ad infinitum) to request a spot instance (even in the price-too-low state) and never falls back to the next node pool (on-demand).

phspagiari on 16 Jan 2019

With recent code changes, cloudprovider.NodeGroup.Nodes() returns Instance objects. The intention of this method is to return an Instance object not only for running nodes but also for those which are being created or deleted. Additionally, for nodes which are being created (with Status.State==InstanceCreating), ErrorInfo can be provided. The core logic now reacts to errors specified via ErrorInfo and will retract the scale-up of the given node group (and back off the node group) immediately, instead of waiting 15 minutes for the timeout.

If you want to implement it for AWS you may take a look at implementation for GCE which is already in.

Note: For the CA logic to react to an error, the ErrorClass must be set to OutOfResourcesErrorClass. It will probably be extended to OtherErrorClass too.
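
A self-contained sketch of that reaction follows. The types are local stand-ins that only borrow names from the description above; they are not the real cloudprovider package, and `groupsToBackOff` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// Local stand-ins, loosely modelled on the names mentioned above.
// They are assumptions for illustration, not the cloudprovider types.
type instanceState int

const (
	instanceRunning instanceState = iota
	instanceCreating
	instanceDeleting
)

type errorClass int

const (
	outOfResourcesErrorClass errorClass = iota
	otherErrorClass
)

type errorInfo struct {
	class   errorClass
	message string
}

type instance struct {
	id    string
	state instanceState
	err   *errorInfo // nil when the provider reported no error
}

// groupsToBackOff returns, sorted by name, the node groups that have at least
// one instance stuck in creation with an out-of-resources error.
func groupsToBackOff(byGroup map[string][]instance) []string {
	var out []string
	for group, instances := range byGroup {
		for _, inst := range instances {
			if inst.state == instanceCreating && inst.err != nil &&
				inst.err.class == outOfResourcesErrorClass {
				out = append(out, group)
				break
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	byGroup := map[string][]instance{
		"spot-1": {{id: "a", state: instanceCreating,
			err: &errorInfo{class: outOfResourcesErrorClass, message: "no capacity"}}},
		"on-demand": {{id: "b", state: instanceRunning}},
	}
	fmt.Println(groupsToBackOff(byGroup))
}
```

In this sketch, errors of the other class are ignored, mirroring the note that only OutOfResourcesErrorClass triggers the reaction for now.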

losipiuk on 21 Jan 2019

Circling back to @aermakov-zalando's point: does it not make sense to skip upcoming nodes from unhealthy groups? We should be able to assume that the node is no longer upcoming and not increment upcomingNodes. That would make cluster-autoscaler fall back on a different ASG for the pod that cannot be scheduled.

viggeh on 8 Mar 2019

@viggeh Are you still planning on tackling this in a follow-up PR? We're hitting this and could take a stab at the recommended implementation, if that's helpful.

davidquarles on 28 Mar 2019

@davidquarles Sorry for the late reply. I've not had the time to take a proper look at this and don't expect to do so in the next couple of weeks. If you can take the problem on that would be awesome!

viggeh on 2 Apr 2019

@MaciekPytel @mwielgus @viggeh I'm no AWS expert, but AFAICT instances aren't added to the ASG when spot requests are made. With an open, unfulfillable spot request triggered by an ASG, I see that the ASG's desired number of instances has been incremented, and the spot request shows up as an associated scaling activity, but there are no new instances under the ASG. Is the aforementioned strategy for solving this still valid? If so, would I just use AwsNodeGroup.TemplateNodeInfo to create fake nodes (with Status.State==InstanceCreating and corresponding ErrorInfo)?

davidquarles on 9 Apr 2019

I'll add that this also applies to any failure reason other than spot requests not being fulfilled, e.g. a corrupted LC or LT config (AMI not found, missing subnet or SG, etc.).

mvisonneau on 11 Apr 2019

> CA handles this by resizing the node group back to original size after timed-out scale-up.

I don't see this behaviour. In my test the ASG is just left at the new value even though the new instance didn't get provisioned.

max-rocket-internet on 17 Apr 2019

Just to add my test results to this issue, if it helps any...

I have 3x ASGs (2 spot, 1 normal) I have unschedulable pods, CA triggers scale up:

I0417 13:42:04.765399       1 scale_up.go:427] Best option to resize: eu01-stg-spot-2
I0417 13:42:04.765417       1 scale_up.go:431] Estimated 1 nodes needed in eu01-stg-spot-2
I0417 13:42:04.765439       1 scale_up.go:533] Final scale-up plan: [{eu01-stg-spot-2 2->3 (max: 20)}]

But spot price is not fulfilled so instance is not created. Then max-node-provision-time passes:

W0417 14:03:32.422046       1 clusterstate.go:198] Scale-up timed out for node group eu01-stg-spot-1 after 15m8.316405684s
W0417 14:03:32.422109       1 clusterstate.go:221] Disabling scale-up for node group eu01-stg-spot-1 until 2019-04-17 14:08:32.247733959 +0000 UTC m=+4217.943746338
W0417 14:03:32.532256       1 scale_up.go:329] Node group eu01-stg-spot-1 is not ready for scaleup - backoff

Now at this point I would expect CA to immediately choose 1 of the other 2 available ASGs but it does not:

I0417 14:09:23.451114       1 scale_up.go:412] No need for any nodes in eu01-stg
I0417 14:09:23.451516       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-1
I0417 14:09:23.452883       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-2

And pods are left unschedulable.

max-rocket-internet on 17 Apr 2019

👍5

We're running into the same issues here. I know it's only been like 13 days, but did you happen to find a workaround @max-rocket-internet?

choseh on 30 Apr 2019

@choseh nope 😐

max-rocket-internet on 30 Apr 2019

Please address this issue.

baxor on 1 May 2019

EDIT: This was not a fault on the Kubernetes side, I simply ran into an IP address quota on the Google side...

Stumbled across this too. I tried to use a scale-to-zero pool besides my existing cluster for new CI/CD gitlab workers. Running on GKE with 1.13.6-gke.0.

Sadly, there is no scaling above 1 node, even though 5 are allowed. The main cluster is currently running on 4 nodes.

Tried with and without preemptible nodes (I guess the AWS term for this is Spot).

pod description:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   16s (x2 over 16s)  default-scheduler   0/5 nodes are available: 4 node(s) didn't match node selector, 5 Insufficient cpu.
  Normal   NotTriggerScaleUp  4s (x2 over 15s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 in backoff after failed scale-up

MaxWinterstein on 17 Jun 2019

👍1

AFAICT, #2235 properly handles the bug described here, and that PR has been merged. Can we close this bug out now?

jaypipes on 20 Aug 2019

@jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?

aermakov-zalando on 20 Aug 2019

> @jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?

While the fix in #2235 is indeed for the AWS cloud provider, the solution is the one recommended by @MaciekPytel and @losipiuk for all cloud providers that suffer from the "I don't have an actual Instance (yet)" problem with their autoscaling API. See here for @MaciekPytel's recommendation:

https://github.com/kubernetes/autoscaler/pull/2008#issuecomment-491280315

Note that he mentions that the managed instance groups API in GCE does the whole placeholder thing behind the scenes, which is essentially what was implemented for the AWS cloud provider in #2235.
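
For anyone unfamiliar with the placeholder idea, a minimal sketch: when an ASG's desired capacity exceeds the instances it actually has, the gap is filled with synthetic entries so the core logic has something to time out or attach errors to. The naming scheme and the functions `withPlaceholders` and `isPlaceholder` are assumptions for illustration, not the code from #2235.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholderPrefix is a hypothetical naming scheme for synthetic instances.
const placeholderPrefix = "i-placeholder-"

// withPlaceholders returns the real instance IDs followed by one synthetic ID
// for every instance the ASG wants (desired) but does not have yet.
func withPlaceholders(asg string, desired int, realIDs []string) []string {
	out := append([]string{}, realIDs...)
	for i := len(realIDs); i < desired; i++ {
		out = append(out, fmt.Sprintf("%s%s-%d", placeholderPrefix, asg, i))
	}
	return out
}

// isPlaceholder reports whether id was generated by withPlaceholders.
func isPlaceholder(id string) bool {
	return strings.HasPrefix(id, placeholderPrefix)
}

func main() {
	// Assumed state: desired capacity 3, but the spot request is unfulfilled,
	// so only two real instances exist.
	ids := withPlaceholders("eu01-stg-spot-2", 3, []string{"i-0aaa", "i-0bbb"})
	for _, id := range ids {
		fmt.Println(id, isPlaceholder(id))
	}
}
```

A provider following this approach would presumably report the placeholders as instances still being created, which is what lets the generic timeout and backoff logic apply to them.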

jaypipes on 20 Aug 2019

I see. We only really care about AWS anyway, so I'll just close the issue. Thanks for the update!

aermakov-zalando on 20 Aug 2019
