Autoscaler: Unclear documentation on permissible alternatives for AWS ASG MixedInstancesPolicy

Created on 30 Jan 2020 · 7 Comments · Source: kubernetes/autoscaler

The AWS README currently advises:

Note that the instance types should have the same amount of RAM and number of CPU cores, since this is fundamental to CA's scaling calculations. Using mismatched instances types can produce unintended results.

The README also provides an example:

Set LaunchTemplateOverrides to include the 'base' instance type r5.2xlarge and suitable alternatives, e.g. r5d.2xlarge, i3.2xlarge, r5a.2xlarge and r5ad.2xlarge.

This raises two questions for me:

a) While r5.2xlarge has 64 GB of RAM, the i3.2xlarge has less: 61 GB of RAM. Wouldn't the 3 fewer GB of RAM play havoc with the CA's scaling calculations, as documented? Am I missing something here, or is i3.2xlarge erroneously included?

b) Would it be permissible to list as an alternative an instance type with slightly more CPU and/or RAM, accepting that the extra CPU and/or RAM will not be recognized or used by CA's scaling calculations? For example, could the C5n family be used as an alternative for the C5 family? If so, the documentation language should be changed from "the same amount" and "mismatched" to wording that makes it clear that larger alternatives are acceptable. If not, the documentation should state that larger alternatives are unacceptable, because from the naive perspective of someone unfamiliar with the specifics of the scheduling algorithm, it seems as though it should be fine.
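To make the two questions concrete, here is a minimal Python sketch that classifies each alternative relative to the base instance type. The vCPU and memory figures are hard-coded from AWS's published instance specifications as an assumption (verify them against current AWS documentation), and `Spec`, `SPECS`, and `compare_to_base` are hypothetical names for illustration, not part of Cluster Autoscaler.

```python
from typing import NamedTuple


class Spec(NamedTuple):
    vcpus: int
    mem_gib: float


# Assumed figures, hard-coded for illustration only.
SPECS = {
    "r5.2xlarge": Spec(8, 64),
    "r5d.2xlarge": Spec(8, 64),
    "i3.2xlarge": Spec(8, 61),
    "r5a.2xlarge": Spec(8, 64),
    "r5ad.2xlarge": Spec(8, 64),
    "c5.xlarge": Spec(4, 8),
    "c5n.xlarge": Spec(4, 10.5),
}


def compare_to_base(base: str, alternative: str) -> str:
    """Classify an alternative as same, larger, smaller, or mixed vs. the base."""
    b, a = SPECS[base], SPECS[alternative]
    if a == b:
        return "same"
    if a.vcpus >= b.vcpus and a.mem_gib >= b.mem_gib:
        return "larger"
    if a.vcpus <= b.vcpus and a.mem_gib <= b.mem_gib:
        return "smaller"
    return "mixed"  # more of one resource, less of the other


if __name__ == "__main__":
    for alt in ("r5d.2xlarge", "i3.2xlarge", "r5a.2xlarge", "r5ad.2xlarge"):
        print(alt, compare_to_base("r5.2xlarge", alt))
    print("c5n.xlarge", compare_to_base("c5.xlarge", "c5n.xlarge"))
```

Under these assumed figures, i3.2xlarge is the only README example that comes out "smaller" (question a), while c5n.xlarge comes out "larger" than c5.xlarge (question b).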

lifecycle/stale

ari-becker

Most helpful comment

Hi @ari-becker ,

Perhaps the i3 family is not the best example, but personally I have not had any issues with them. It depends on the use case in question, particularly around resource requests...

Typically CA will just add more nodes until the current requests are satisfied. However, there is a theoretical edge case: if a request is for 62 GB of RAM and CA adds i3.2xlarge instances, which have only 61 GB, the request may never be satisfied. I have not personally tested this edge case, but i3.2xlarge remains a "permissible alternative" in many cases.
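The edge case can be sketched in a few lines of Python. This is an illustration only: it assumes CA sizes its simulated node from the 64 GB base type while the ASG actually launches a 61 GB instance, and it ignores the memory that Kubernetes reserves on each node. The function name `request_stays_pending` is hypothetical.

```python
def request_stays_pending(request_gb: float,
                          simulated_node_gb: float,
                          actual_node_gb: float) -> bool:
    """True when the simulated node fits the request but the real node does not."""
    fits_in_simulation = request_gb <= simulated_node_gb
    fits_in_reality = request_gb <= actual_node_gb
    return fits_in_simulation and not fits_in_reality


if __name__ == "__main__":
    # 62 GB request, 64 GB assumed by the simulation, 61 GB actually launched.
    print(request_stays_pending(62, 64, 61))
    # A 60 GB request fits on either instance type.
    print(request_stays_pending(60, 64, 61))
```

Only requests that fall in the gap between the two sizes are affected, which is consistent with the observation that i3.2xlarge works fine in many cases.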

Your point about the c5n instances is probably valid in the sense that more memory is almost certainly better than not enough. For this reason, I have some groups set up to use t3.xlarge and fall back to t3.2xlarge if necessary. I don't mind if there is some capacity wasted (especially burstable and/or spot), as long as the workloads get scheduled.

The "mismatched" wording was put in because CA was not originally designed to handle multiple instance types, so I think the developers wanted something like a disclaimer for those of us who really want to use that feature.

drewhemm on 30 Jan 2020

👍2

All 7 comments

Paging @drewhemm, who wrote that section of the documentation.

ari-becker on 30 Jan 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 29 Apr 2020

/remove-lifecycle stale

@drewhemm's comment answered my question, but I see this issue as a call to improve the documentation in line with his comment.

ari-becker on 29 Apr 2020

@ari-becker I'd love your feedback on #3198.

otterley on 12 Jun 2020

👀1

This should have been closed automatically when #3198 merged.

ari-becker on 3 Oct 2020
