Forking this off a thread in Slack.
Motivation behind this is to have some of the CA parameters operate on the nodegroup level. For example, have a different scale-down-utilization-threshold for every nodegroup instead of a cluster-wide setting.
From @marwanad:
I went through the parameters and a lot of them seem to be well-suite to be per nodegroup. After all, CA handles heterogeneous nodegroups well so it shouldn鈥檛 be a problem. The part up for discussion would probably be how to bubble down those parameters? Annotations on the node? Modify the node group spec to have an extra field? The last approach is a bit ugly so I think ideally we鈥檇 have a configmap of a sort for CA-related configs.
From @MaciekPytel:
i've also been thinking about this and i was actually considering going the route of adding a field to nodegroup interface; the reason for that is that:
- way nodegroups are configured is generally provider specific (even --nodes flag have different format across providers, not to mention various autodiscovery implementations)
- afaik autodiscovery is the most common way of configuring CA (admittedly i'm not familiar with all providers, so that may not be true across the board) - having nodegroup return the config would allow to extend existing autodiscovery setups to allow setting those values (for example via ASG tags); i think that would be most convenient to users, allowing to keep all configs related to nodegroup in the same place
I've put some more thoughts on why I prefer cloudprovider-based implementation over configmap: https://github.com/kubernetes/autoscaler/pull/2989#issuecomment-712245350.
I'd like to pick this up and implement it for 1.21. Here is my proposal for how to implement this:
type NodeGroupAutoscalingOptions struct {
ScaleDownUtilizationThreshold float64
ScaleDownGpuUtilizationThreshold float64
ScaleDownUnneededTime time.Duration
...
}
type AutoscalingOptions struct {
NodeGroupDefaults NodeGroupAutoscalingOptions
MaxEmptyBulkDelete int
...
}
type NodeGroupOptionsProcessor interface {
GetScaleDownUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error)
GetScaleDownGpuUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error)
GetScaleDownUnneededTime(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (time.Duration, error)
...
}
type NodeGroup interface {
...
GetOptions(defaults NodeGroupAutoscalingOptions) (*NodeGroupAutoscalingOptions, error)
...
}
Note that design allows customization at NodeGroup level and not at the level of individual nodes. I think this is a good thing for the reasons mentioned in https://github.com/kubernetes/autoscaler/pull/2989#issuecomment-712245350.
Appendix: per-nodegroup options
This is a _tentative_ list of options that I think makes sense to be specified at NodeGroup level. The list is subject to change during implementation (if it turns out the logic of the particular option can't be easily changed to apply it per-nodegroup, etc).
There are a few other settings that would be fairly easy to change to per-nodegroup, but I'm not convinced it makes conceptual sense (ex. IgnoreDaemonSetsUtilization), but I'm not yet sure if they make conceptual sense on NodeGroup level. I will look at them and extend this list as needed.
Another set of fields to potentially migrate are the ones related to how often a node is checked for scale-down (UnremovableNodeRecheckTimeout, etc). This is quite easy to set per-nodegroup, however, setting a value to low for one NodeGroup can starve scale-down in other groups, so I'm not sure if we want to allow that.
this generally makes sense to me, i think the processor interface and split between the autoscaling options and per-node group options seems reasonable.
the main question i am left with is how will users be able to associate a set of options with a specific node group?
the main question i am left with is how will users be able to associate a set of options with a specific node group?
The idea is to leave this part up to each cloudprovider. This proposal is basically just a framework that will allow providers to expose the feature to users in whatever way makes most sense for the given provider.
The rationale is that the way in which you configure CA is already very provider specific:
Instead each provider can choose to add those additional settings to their existing way of configuring NodeGroups. For example in case of AWS it could be via ASG tags similarly to how various scale-from-0 settings are configured (or not :) the decision on this is with AWS owners), with GCE it could be included in autodiscovery specs, etc. I'm not sufficiently familiar with other providers to make good suggestions. I tried to make the processor flexible enough to hopefully fit each provider's use-case.
thanks @MaciekPytel , that makes good sense to me. i am wondering how users will interact with these mechanism too, would each provider be responsible for exposing flags or were you thinking about storing the configuration in some other way?
Each provider would be responsible for exposing a way to configure the feature to users and implementing NodeGroup.GetOptions() (or a completely different implementation of processor if needed). All the default implementation will do is poll NodeGroup.GetOptions() each loop and apply any provided overrides.
For a given provider doing nothing will result in no change in behavior - all NodeGroups will use default settings. At any point any provider interested in supporting the feature can add their implementation (this is the same pattern as scale-from-0).
i think it would be useful if we could have a standard methodology for providers to reference their configuration data, for example a command line flag to a yaml file or something. just to help contain the possible matrix of options that providers could choose. but aside from that, i don't have an objection to what you are describing. thanks for the extra detail =)
Specifically for cluster api - wouldn't something like an annotation on MD or MS be the most convenient way of configuring those settings? I'm not saying it's the right way to do it, I'm just looking for your opinion.
Anyway, the 'framework' changes above are completely independent from any mechanism for actually setting the values, so I think it's safe to implement the mechanics first and later add a flag if we decide we want it.
I've opened https://github.com/kubernetes/autoscaler/pull/3789 with the implementation.
Specifically for cluster api - wouldn't something like an annotation on MD or MS be the most convenient way of configuring those settings? I'm not saying it's the right way to do it, I'm just looking for your opinion.
yeah, adding an annotation to MD or MS would be a very straightforward way to do this from the capi perspective and it would give users an easy way to interact as well.
Anyway, the 'framework' changes above are completely independent from any mechanism for actually setting the values, so I think it's safe to implement the mechanics first and later add a flag if we decide we want it.
agreed, makes sense to me!