Autoscaler: [cluster-autoscaler] More quickly mark spot ASG in AWS as unavailable if InsufficientInstanceCapacity

Created on 24 Jun 2020 · 9 Comments · Source: kubernetes/autoscaler

I have two ASGs: a spot ASG and an on-demand ASG. They are GPU nodes, so spot instances frequently aren't available. AWS tells us very quickly that a spot instance is unavailable: we can see "Could not launch Spot Instances. InsufficientInstanceCapacity - There is no Spot capacity available that matches your request. Launching EC2 instance failed" in the ASG logs.

The current behavior is that the autoscaler tries to use the spot ASG for 15 minutes (my current timeout) before it gives up and tries a non-spot ASG. Ideally, it would notice that the reason the ASG did not scale up, InsufficientInstanceCapacity, is unlikely to go away in the next 15 minutes, mark that group as unable to scale up, and fall back to the on-demand ASG.

cep21

👍13

All 9 comments

Having the same issue here.

https://github.com/kubernetes/autoscaler/blob/852ea800914cae101824687a71236f7688ee653d/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L220

SetDesiredCapacity will not return any error related to InsufficientInstanceCapacity according to its doc. We might need to check the scaling activities by calling DescribeScalingActivities.

{
    "Activities": [
        {
            "ActivityId": "ee05cf07-241b-2f28-2be4-3b60f77a76e9",
            "AutoScalingGroupName": "nodes-gpu-spot-cn-north-1a.aws-cn-north-1.prod-1.k8s.local",
            "Description": "Launching a new EC2 instance.  Status Reason: There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
            "Cause": "At 2020-08-06T03:20:39Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            "StartTime": "2020-08-06T03:20:43.979Z",
            "EndTime": "2020-08-06T03:20:43Z",
            "StatusCode": "Failed",
            "StatusMessage": "There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
            "Progress": 100,
            "Details": "{\"Subnet ID\":\"subnet-5d6fb339\",\"Availability Zone\":\"cn-north-1a\"}"
        },
        ...
    ]
}
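A minimal sketch of what checking the scaling activities could look like, assuming the JSON shape shown above. It uses only the Go standard library and decodes a saved response rather than calling AWS; the `failedForCapacity` helper, the `Activity` struct, and matching on the message text are illustrative assumptions, not the autoscaler's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Activity declares only the fields of a scaling activity that this sketch
// reads; other fields in the response are ignored by the decoder.
type Activity struct {
	ActivityID           string `json:"ActivityId"`
	AutoScalingGroupName string `json:"AutoScalingGroupName"`
	StatusCode           string `json:"StatusCode"`
	StatusMessage        string `json:"StatusMessage"`
}

type activitiesResponse struct {
	Activities []Activity `json:"Activities"`
}

// noSpotCapacity is taken from the StatusMessage in the output above.
// Matching on message text is an assumption and may be brittle.
const noSpotCapacity = "There is no Spot capacity available"

// failedForCapacity (hypothetical helper) returns the activities that
// failed with the "no Spot capacity" message.
func failedForCapacity(raw []byte) ([]Activity, error) {
	var resp activitiesResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, fmt.Errorf("decode activities: %w", err)
	}
	var failed []Activity
	for _, a := range resp.Activities {
		if a.StatusCode == "Failed" && strings.Contains(a.StatusMessage, noSpotCapacity) {
			failed = append(failed, a)
		}
	}
	return failed, nil
}

// sample is a trimmed copy of the activity shown above.
const sample = `{
  "Activities": [
    {
      "ActivityId": "ee05cf07-241b-2f28-2be4-3b60f77a76e9",
      "AutoScalingGroupName": "nodes-gpu-spot-cn-north-1a.aws-cn-north-1.prod-1.k8s.local",
      "StatusCode": "Failed",
      "StatusMessage": "There is no Spot capacity available that matches your request. Launching EC2 instance failed."
    }
  ]
}`

func main() {
	failed, err := failedForCapacity([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range failed {
		fmt.Printf("%s: %s\n", a.AutoScalingGroupName, a.StatusMessage)
	}
}
```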

qqshfox on 6 Aug 2020

I think the title of this issue should be amended to include other holding states. For example, I'm running into a similar issue with price-too-low. If the maximum spot price for my ASGs is below the current spot prices, cluster-autoscaler waits quite a while before it attempts to use a non-spot ASG.

JacobHenner on 18 Sep 2020

It's not just spot. Another example: you can hit your account limit on the number of instances of a specific instance type. That is also unlikely to change in the next 15 minutes, so it's best to try another ASG.

A general understanding of failure states that are unlikely to change could be very helpful.
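One way to picture such a general understanding is a small table that maps a failure message to a failure kind. This is only a sketch: the `FailureKind` type and `classify` function are hypothetical, and while `InsufficientInstanceCapacity` appears in the ASG log quoted in this issue, the assumption that the EC2 error codes `SpotMaxPriceTooLow` and `InstanceLimitExceeded` show up in the message text is not confirmed anywhere in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureKind is a hypothetical classification of scale-up failures that
// are unlikely to clear within a typical provisioning timeout.
type FailureKind int

const (
	Unknown FailureKind = iota
	InsufficientCapacity
	SpotPriceTooLow
	InstanceLimit
)

func (k FailureKind) String() string {
	switch k {
	case InsufficientCapacity:
		return "insufficient capacity"
	case SpotPriceTooLow:
		return "spot price too low"
	case InstanceLimit:
		return "instance limit reached"
	default:
		return "unknown"
	}
}

// markers pairs a substring with a failure kind. The first two substrings
// come from messages quoted in this issue; the last two are EC2 error
// codes assumed (not verified here) to appear in the message text.
var markers = []struct {
	substr string
	kind   FailureKind
}{
	{"InsufficientInstanceCapacity", InsufficientCapacity},
	{"no Spot capacity available", InsufficientCapacity},
	{"SpotMaxPriceTooLow", SpotPriceTooLow},
	{"InstanceLimitExceeded", InstanceLimit},
}

// classify returns the first matching kind, or Unknown if nothing matches.
func classify(message string) FailureKind {
	for _, m := range markers {
		if strings.Contains(message, m.substr) {
			return m.kind
		}
	}
	return Unknown
}

func main() {
	msg := "Could not launch Spot Instances. InsufficientInstanceCapacity - " +
		"There is no Spot capacity available that matches your request. " +
		"Launching EC2 instance failed"
	fmt.Println(classify(msg))
}
```

Anything classified as `Unknown` would keep the existing timeout behavior, so a missing or wrong marker costs only the wait that already happens today.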

cep21 on 18 Sep 2020

👍2

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 17 Dec 2020

Super important!
/remove-lifecycle stale

cep21 on 18 Dec 2020

👍2

Looking at the AWS API, it seems there is no reliable way to find out that the scale-out for a particular SetDesiredCapacity call has failed. If SetDesiredCapacity returned an ActivityId for the scaling activity, that would work.
Otherwise, personally I can't come up with anything better than parsing autoscaling activities "younger" than my SetDesiredCapacity API call. I don't feel this approach is production-ready.
Any better ideas?

klebediev on 21 Dec 2020

I wouldn't expect anything that ties back to a single SetDesiredCapacity since it's async and there could be multiple calls.

parsing autoscaling activities "younger" than my SetDesiredCapacity API call

Maybe look at the last activity (rather than all of them). If it's recent (for some definition of recent) and it failed, then assume the capacity can't change right now and quickly fail over any scaling operation.
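A rough sketch of that "last activity" check, using only the Go standard library. The `Activity` struct, the `shouldFallBack` function, and the 5-minute `recentWindow` are all assumptions made for illustration; the thread leaves the definition of "recent" open.

```go
package main

import (
	"fmt"
	"time"
)

// Activity holds the two fields this check needs from a scaling activity.
type Activity struct {
	StatusCode string
	StartTime  time.Time
}

// recentWindow is an assumed definition of "recent".
const recentWindow = 5 * time.Minute

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
// It exists only to keep the example short.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// latest returns the most recently started activity, or false if the
// slice is empty.
func latest(acts []Activity) (Activity, bool) {
	if len(acts) == 0 {
		return Activity{}, false
	}
	newest := acts[0]
	for _, a := range acts[1:] {
		if a.StartTime.After(newest.StartTime) {
			newest = a
		}
	}
	return newest, true
}

// shouldFallBack reports whether the newest activity failed within
// recentWindow of now. Older failures and successes are ignored.
func shouldFallBack(acts []Activity, now time.Time) bool {
	a, ok := latest(acts)
	if !ok {
		return false
	}
	return a.StatusCode == "Failed" && now.Sub(a.StartTime) <= recentWindow
}

func main() {
	acts := []Activity{
		{StatusCode: "Failed", StartTime: mustTime("2020-08-06T03:20:43Z")},
	}
	now := mustTime("2020-08-06T03:22:00Z")
	fmt.Println("fall back to another ASG:", shouldFallBack(acts, now))
}
```

Looking only at the newest activity avoids tying the decision to a single SetDesiredCapacity call, at the cost of occasionally reacting to a failure caused by something other than the autoscaler's own request.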

cep21 on 22 Dec 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot on 22 Mar 2021

Super important!
/remove-lifecycle stale

cep21 on 22 Mar 2021
