Autoscaler: Nodegroup Interface for per-nodegroup configuration

Created on 7 Oct 2020  路  10Comments  路  Source: kubernetes/autoscaler

Forking this off a thread in Slack.

Motivation behind this is to have some of the CA parameters operate on the nodegroup level. For example, have a different scale-down-utilization-threshold for every nodegroup instead of a cluster-wide setting.

From @marwanad:

I went through the parameters and a lot of them seem to be well-suite to be per nodegroup. After all, CA handles heterogeneous nodegroups well so it shouldn鈥檛 be a problem. The part up for discussion would probably be how to bubble down those parameters? Annotations on the node? Modify the node group spec to have an extra field? The last approach is a bit ugly so I think ideally we鈥檇 have a configmap of a sort for CA-related configs.

From @MaciekPytel:

i've also been thinking about this and i was actually considering going the route of adding a field to nodegroup interface; the reason for that is that:

  1. way nodegroups are configured is generally provider specific (even --nodes flag have different format across providers, not to mention various autodiscovery implementations)
  2. afaik autodiscovery is the most common way of configuring CA (admittedly i'm not familiar with all providers, so that may not be true across the board) - having nodegroup return the config would allow to extend existing autodiscovery setups to allow setting those values (for example via ASG tags); i think that would be most convenient to users, allowing to keep all configs related to nodegroup in the same place

All 10 comments

I've put some more thoughts on why I prefer cloudprovider-based implementation over configmap: https://github.com/kubernetes/autoscaler/pull/2989#issuecomment-712245350.

I'd like to pick this up and implement it for 1.21. Here is my proposal for how to implement this:

  1. We would split AutoscalingOptions into two structs: NodeGroupAutoscalingOptions and AutoscalingOptions. The options that make sense on NodeGroup level (tentative list below) would be moved to NodeGroupAutoscalingOptions. AutoscalingOptions would contain NodeGroupAutoscalingOptions that will be set via existing flags and used as a default for any NodeGroup that doesn't provide a value for a given option.
type NodeGroupAutoscalingOptions struct {
  ScaleDownUtilizationThreshold    float64
  ScaleDownGpuUtilizationThreshold float64
  ScaleDownUnneededTime            time.Duration
  ...
}

type AutoscalingOptions struct {
  NodeGroupDefaults  NodeGroupAutoscalingOptions
  MaxEmptyBulkDelete int
  ...
}
  1. We would introduce a new "processor" interface that will be used to determine the value of any NodeGroup level option. Anytime CA code requires the value of any of the options contained in NodeGroupAutoscalingOptions it will call the provided processors.
type NodeGroupOptionsProcessor interface {
  GetScaleDownUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error)
  GetScaleDownGpuUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error)
  GetScaleDownUnneededTime(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (time.Duration, error)
  ...
}
  1. We will add a method to NodeGroup interface that would return NodeGroupAutoscalingOptions that will be used for a given NodeGroup.

    • A NodeGroup will be provided default options (in case it only wants to change some values it will be able to set other ones to default).

    • A NodeGroup will be able to return nil NodeGroupAutoscalingOptions to indicate no special settings are configured for this NodeGroup. In that case default values will be used for this NodeGroup.

    • Implementation of NodeGroup.GetOptions method will be optional. A trivial implementation returning nil will result in the current behavior of using default values (specified via flags) for all NodeGroups. Any provider will be free to integrate this feature with their autodiscovery scheme (or other preferred mechanism) at any time they want.

type NodeGroup interface {
  ...
  GetOptions(defaults NodeGroupAutoscalingOptions) (*NodeGroupAutoscalingOptions, error)
  ...
}
  1. Default implementation of NodeGroupOptionsProcessor would rely on calling NodeGroup.GetOptions and filling in defaults as needed. If a user prefers a different way of configuring per-nodegroup settings (ex. via a configmap) they will be able to achieve it by swapping the processor implementation with a different one. This is the same pattern we use for any other advanced customization (ex. balancing similar NodeGroups).

Note that design allows customization at NodeGroup level and not at the level of individual nodes. I think this is a good thing for the reasons mentioned in https://github.com/kubernetes/autoscaler/pull/2989#issuecomment-712245350.


Appendix: per-nodegroup options

This is a _tentative_ list of options that I think makes sense to be specified at NodeGroup level. The list is subject to change during implementation (if it turns out the logic of the particular option can't be easily changed to apply it per-nodegroup, etc).

  • Below settings are checked independently for each node making them easy to migrate. They're also the most commonly requested ones, so they must be moved to NodeGroupAutoscalingOptions:

    • ScaleDownUtilizationThreshold

    • ScaleDownGpuUtilizationThreshold

    • ScaleDownUnneededTime

    • ScaleDownUnreadyTime

  • Below setting make conceptual sense on nodegroup level, but the way they're implemented makes migration more difficult. I'll try to include them in initial implementation, but it depends on number of changes required:

    • ScaleDownDelayAfterAdd

    • ScaleDownDelayAfterDelete

    • ScaleDownDelayAfterFailure

There are a few other settings that would be fairly easy to change to per-nodegroup, but I'm not convinced it makes conceptual sense (ex. IgnoreDaemonSetsUtilization), but I'm not yet sure if they make conceptual sense on NodeGroup level. I will look at them and extend this list as needed.

Another set of fields to potentially migrate are the ones related to how often a node is checked for scale-down (UnremovableNodeRecheckTimeout, etc). This is quite easy to set per-nodegroup, however, setting a value to low for one NodeGroup can starve scale-down in other groups, so I'm not sure if we want to allow that.

this generally makes sense to me, i think the processor interface and split between the autoscaling options and per-node group options seems reasonable.

the main question i am left with is how will users be able to associate a set of options with a specific node group?

the main question i am left with is how will users be able to associate a set of options with a specific node group?

The idea is to leave this part up to each cloudprovider. This proposal is basically just a framework that will allow providers to expose the feature to users in whatever way makes most sense for the given provider.

The rationale is that the way in which you configure CA is already very provider specific:

  • Since deploying CA across different providers already requires completely different way of configuring it, there is less value in making just one part of config standardized.
  • What NodeGroup maps to and how NodeGroup ids are determined is different for each provider. So even with a standardized way to configure it (such as configmap) one would still need to know about implementation of a particular provider to configure this.

Instead each provider can choose to add those additional settings to their existing way of configuring NodeGroups. For example in case of AWS it could be via ASG tags similarly to how various scale-from-0 settings are configured (or not :) the decision on this is with AWS owners), with GCE it could be included in autodiscovery specs, etc. I'm not sufficiently familiar with other providers to make good suggestions. I tried to make the processor flexible enough to hopefully fit each provider's use-case.

thanks @MaciekPytel , that makes good sense to me. i am wondering how users will interact with these mechanism too, would each provider be responsible for exposing flags or were you thinking about storing the configuration in some other way?

Each provider would be responsible for exposing a way to configure the feature to users and implementing NodeGroup.GetOptions() (or a completely different implementation of processor if needed). All the default implementation will do is poll NodeGroup.GetOptions() each loop and apply any provided overrides.

For a given provider doing nothing will result in no change in behavior - all NodeGroups will use default settings. At any point any provider interested in supporting the feature can add their implementation (this is the same pattern as scale-from-0).

i think it would be useful if we could have a standard methodology for providers to reference their configuration data, for example a command line flag to a yaml file or something. just to help contain the possible matrix of options that providers could choose. but aside from that, i don't have an objection to what you are describing. thanks for the extra detail =)

Specifically for cluster api - wouldn't something like an annotation on MD or MS be the most convenient way of configuring those settings? I'm not saying it's the right way to do it, I'm just looking for your opinion.

Anyway, the 'framework' changes above are completely independent from any mechanism for actually setting the values, so I think it's safe to implement the mechanics first and later add a flag if we decide we want it.

I've opened https://github.com/kubernetes/autoscaler/pull/3789 with the implementation.

Specifically for cluster api - wouldn't something like an annotation on MD or MS be the most convenient way of configuring those settings? I'm not saying it's the right way to do it, I'm just looking for your opinion.

yeah, adding an annotation to MD or MS would be a very straightforward way to do this from the capi perspective and it would give users an easy way to interact as well.

Anyway, the 'framework' changes above are completely independent from any mechanism for actually setting the values, so I think it's safe to implement the mechanics first and later add a flag if we decide we want it.

agreed, makes sense to me!

Was this page helpful?
0 / 5 - 0 ratings