The AWS README currently advises:
Note that the instance types should have the same amount of RAM and number of CPU cores, since this is fundamental to CA's scaling calculations. Using mismatched instances types can produce unintended results.
The README also provides an example:
Set LaunchTemplateOverrides to include the 'base' instance type r5.2xlarge and suitable alternatives, e.g. r5d.2xlarge, i3.2xlarge, r5a.2xlarge and r5ad.2xlarge.
This raises two questions for me:
a) While r5.2xlarge has 64 GB of RAM, the i3.2xlarge has less: 61 GB of RAM. Wouldn't the 3 fewer GB of RAM play havoc with the CA's scaling calculations, as documented? Am I missing something here, or is i3.2xlarge erroneously included?
b) Would it be permissible to list as an alternative an instance type with slightly more CPU and/or RAM, accepting that the extra CPU and/or RAM will not be recognized or used by the CA's scheduler? For example, could the C5n family be used as an alternative for the C5 family? If so, the documentation should replace "the same amount" and "mismatched" with language that makes it clear that larger alternatives are acceptable. If not, the documentation should state that larger alternatives are unacceptable, because to somebody unfamiliar with the specifics of the scheduling algorithm, it seems as though it should be fine.
Paging @drewhemm, who wrote that section of the documentation.
Hi @ari-becker ,
Perhaps the i3 family is not the best example, but personally I have not had any issues with them. It depends on the use case in question, particularly around resource requests...
Typically CA will just add more nodes until the current requests are satisfied. However, there is a theoretical edge case where if a request is for 62GB of RAM and CA adds i3.2xlarge instances, then the request may never be satisfied. I have not personally tested this edge case, but i3.2xlarge remains a "permissible alternative" in many cases.
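To make that edge case concrete, here is a minimal sketch in Python. It is not CA's actual logic: it assumes, purely for illustration, that the autoscaler sizes its scale-up decision on one "template" instance type while the ASG is free to launch any of the listed alternatives. The function names are hypothetical, the memory figures are the ones quoted in this thread, and kubelet/system reservations are ignored.

```python
# Memory per instance type in GiB, as quoted earlier in this thread.
MEMORY_GIB = {"r5.2xlarge": 64, "i3.2xlarge": 61}


def fits(request_gib, instance_type):
    """True if a single pod request fits on an empty node of this type.

    Ignores kubelet/system reservations, so real allocatable memory is lower.
    """
    return request_gib <= MEMORY_GIB[instance_type]


def scale_up_outcome(request_gib, template_type, launched_type):
    """Hypothetical model: the decision is sized on template_type,
    but the node that actually arrives is launched_type."""
    if not fits(request_gib, template_type):
        return "no scale-up"    # the template itself is too small
    if fits(request_gib, launched_type):
        return "scheduled"      # the launched node is large enough
    return "still pending"      # a node was added, but the pod does not fit


# A 62 GiB request, with r5.2xlarge assumed as the template type.
outcomes = {t: scale_up_outcome(62, "r5.2xlarge", t) for t in MEMORY_GIB}
print(outcomes)
```

Under these assumptions, only a request larger than the smallest alternative's memory runs into trouble; anything at or below 61 GiB fits on either type.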
Your point about the c5n instances is probably valid in the sense that more memory is almost certainly better than not enough. For this reason, I have some groups set up to use t3.xlarge and fall back to t3.2xlarge if necessary. I don't mind if there is some capacity wasted (especially burstable and/or spot), as long as the workloads get scheduled.
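As a rough illustration of the "larger is acceptable" idea, the sketch below checks that an alternative is at least as large as the base type on both dimensions and reports the surplus. The helper `surplus_over_base` and the `Shape` type are hypothetical names for this example, not part of CA or any AWS API. The vCPU and memory figures are the published sizes for these instance types. Treating the surplus as capacity the scale-up estimate does not count is an assumption about sizing from the base type.

```python
from typing import NamedTuple, Optional


class Shape(NamedTuple):
    vcpu: int
    memory_gib: float


# Published sizes for the instance types mentioned in this thread.
SHAPES = {
    "t3.xlarge": Shape(4, 16),
    "t3.2xlarge": Shape(8, 32),
    "c5.xlarge": Shape(4, 8),
    "c5n.xlarge": Shape(4, 10.5),
}


def surplus_over_base(base: str, alternative: str) -> Optional[Shape]:
    """Return the extra (vcpu, memory_gib) the alternative has over the base,
    or None if the alternative is smaller on either dimension."""
    b, a = SHAPES[base], SHAPES[alternative]
    if a.vcpu < b.vcpu or a.memory_gib < b.memory_gib:
        return None  # smaller alternative: pods sized for the base may not fit
    return Shape(a.vcpu - b.vcpu, a.memory_gib - b.memory_gib)


print(surplus_over_base("t3.xlarge", "t3.2xlarge"))
print(surplus_over_base("c5.xlarge", "c5n.xlarge"))
print(surplus_over_base("t3.2xlarge", "t3.xlarge"))
```

A `None` result marks the risky direction (a smaller fallback), while a non-zero surplus is the capacity one accepts as potentially wasted.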
The word "mismatched" was put in because CA was never originally designed to handle multiple instance types, so I think the developers wanted something like a disclaimer for those of us who really want to use that feature.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
@drewhemm's comment answered my question, but I see this issue as a call to improve the documentation in line with his comment.
@ari-becker I'd love your feedback on #3198.
This should have been closed automatically when #3198 merged.