Cluster-api: Support syncing a set of labels from MachineDeployment/MachineSet/Machine to Nodes

Created on 6 Sep 2018 · 65 Comments · Source: kubernetes-sigs/cluster-api

This issue is to add the capability of defining user-defined labels, annotations, and taints that are created automatically on the resulting nodes.

For example: the following MachineSet, when realized on any of the providers, should result in the two user-defined labels latency-sensitive: yes and service-class: gold being created on the nodes that belong to this MachineSet.

```yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  name: gold-workers
spec:
  replicas: 3
  selector:
    matchLabels:
      node-type: worker-node
  template:
    metadata:
      labels:
        node-type: worker-node
    spec:
      metadata:
        labels:
          latency-sensitive: "yes"
          service-class: "gold"
      providerConfig:
      ...
```

The same mechanics should be extended for annotations as well as taints.
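For illustration, one possible shape for that extended spec fragment (the annotation key below is just a made-up example, and the exact field layout was still under discussion):

```yaml
    spec:
      metadata:
        labels:
          latency-sensitive: "yes"
          service-class: "gold"
        annotations:
          example.com/billing-tier: "gold"   # hypothetical annotation
      taints:
      - key: dedicated
        value: gold
        effect: NoSchedule
      providerConfig:
      ...
```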

area/api kind/design priority/important-soon

Most helpful comment

the list of labels, taints, and annotations needs to be additive (not authoritative)

All 65 comments

This would be nice, we had to write a little controller ourselves to get those applied.

the list of labels, taints, and annotations needs to be additive (not authoritative)

@dgoodwin

This would be nice, we had to write a little controller ourselves to get those applied.

Wondering what the issue would be with extending the existing machine controller in cluster-api to perform this task in its reconcile loop as well?

/assign @hardikdr

@hardikdr are you currently working on this?

@sidharthsurana do you believe this is a hard requirement for v1alpha1, or could it slip to v1alpha2?

@ncdc I don't think this is a hard requirement from the cluster-api point of view for the v1alpha1 milestone, given that individual provider implementations can implement this independently as well. I opened this issue to see if we can figure out a provider-agnostic way of implementing this behavior.

/milestone Next

@ncdc Yes I have started looking into it, I will soon be providing a proposal-doc for the same.

@hardikdr What is the progress on this? If you are not working on it, can I take it over?

@qiujian16 @xunpan @clyang82

/area api

@gyliu513 I had prepared and presented the proposal doc. There was no conclusion, though. The doc has to be converted into the defined template as well.
I am more than happy to collaborate with you, to push it forward.

Thanks @hardikdr , the doc looks good ;-)

@vincepri @detiber can you please post some comments to the design doc as well? I think this is a very important project for cluster life cycle management.

The linked doc proposes full management of Labels, Annotations, and Taints for Machines. It might be better to initially target only the initial state (Labels and Taints are already supported in this manner when using kubeadm for bootstrapping).

@detiber but I think we still need to tell kubeadm what kind of labels and taints should be added when bootstrapping the nodes on different clouds, so it seems we still need some logic to set those labels, annotations, etc.

@gyliu513 fully agree

At least I saw vSphere supports taints and labels:
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/116/files
though this is pretty old stuff...

Maybe different providers will honor spec.Taints etc.?

@hardikdr

I have one doubt: in the current API, the MachineSpec already has the ObjectMeta and Taints fields that are defined precisely for this purpose.

see https://github.com/kubernetes-sigs/cluster-api/blob/0b215f4d102edd23977f04383eb8d2b028bd315b/pkg/apis/cluster/v1alpha1/machine_types.go#L58-L71

In the CAPV we have implemented this feature using those fields.
Do we really need to define a new field, as you are doing, given that the information already exists?
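For reference, a minimal Machine sketch using those existing v1alpha1 fields (field names taken from the linked machine_types.go; the exact layout may differ between releases):

```yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: gold-worker-0
spec:
  metadata:            # MachineSpec.ObjectMeta
    labels:
      service-class: "gold"
  taints:              # MachineSpec.Taints
  - key: dedicated
    value: gold
    effect: NoSchedule
  providerConfig:
  ...
```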

IIRC, back in the day when I brought this topic up, the main point of contention was how to enforce these labels and taints on the node. In CAPV we went ahead with the best available option at that time, using the kubelet flags. Maybe kubeadm now supports that out of the box, so the generic implementation should use that. The idea was to implement this in CAPI in a provider-agnostic way.

@jichenjc yes you are right CAPV already implements this feature.

If we want all consumers of Cluster API to be able to get their nodes' annotations, labels, and taints set based on what's defined in the common MachineSpec, it may make sense to consider adding code to Cluster API itself to set a node's attributes after it has joined the cluster, instead of trying to use kubeadm to do this. This would ensure that any node-related fields in the MachineSpec would be honored everywhere, regardless of bootstrap provider.

@timothysc could you please weigh in on the question of having Cluster API continue to reconcile a node's attributes forever, vs. only setting the initial values? Thanks.

@ncdc I think the issue with having cluster-api try to manage these Node attributes is that we have to potentially worry about battling user management of the same attributes, since users may try to add, remove, or update labels, annotations, or taints using kubectl or other tools.

@detiber I agree, and we need to come to a consensus re whether CAPI is allowed to reconcile beyond setting the initial values. I'm also wondering if it's worth considering a little bit of wiggle room here for the initial values - having a CAPI controller do a one-time best-effort patch of the node to set the initial values for annotations/labels/taints vs requiring the bootstrapper to do that.

There is a bunch of history around this issue (cf. https://github.com/kubernetes/kubeadm/issues/202 and/or last year's meeting notes). The previous consensus was that we would allow the setting of taints but that labels and annotations would be handled by another mechanism.

@detiber is right about the risk of battling controllers. This is why taints are additive: https://github.com/kubernetes-sigs/cluster-api/blob/9f53a3abf65c22ac8c98c2904946cf02f1038b4c/pkg/apis/cluster/v1alpha1/machine_types.go#L64

I also agree with @ncdc that we should provide a solution which works across provider implementations. This solution may require documentation or a change in consensus...

As discussed in today's grooming call, we plan to include support for setting the initial values of a node's labels and taints in a kubeadm-based bootstrap provider (you may choose to use this provider or pick another one).

I don't believe kubeadm currently supports setting the initial annotations for a node. @timothysc says people typically deploy a daemonset pod that can set these values based on the node's/server's characteristics.

Managing the continuing reconciliation of a node's labels/annotations/taints is not in scope for Cluster API - it has security implications because changing these values can affect what can/can't be run on a node. This is better handled by a tool/user with full cluster-admin access and a focus on cluster security. (@timothysc did I say that correctly? 😄)

I completely appreciate the difficulty of competing with cluster admins when tainting/labelling nodes, but only setting the values once does go against standard controller reconciliation behaviour of continually bringing desired and observed state into alignment. Are there other examples of this in Kubernetes APIs? We need to be careful with the principle of least surprise and make the intent of these fields in the machine spec clear.

An alternative would be to only allow setting of certain labels/taints, eg cluster-api.k8s.io/*, and continue to reconcile them as normal, leaving other taints/labels to be handled by cluster admins as normal.

Agree, it seems we would change the declarative API into an imperative API with first-time setting and no further reconciliation... Setting cluster-api.k8s.io/* is reasonable, as it would handle all resources belonging to cluster-api if we consider the label a resource of cluster-api, but how useful that is to users (only labels with this prefix) is a question...

I should have been clearer in what I wrote above. The suggestion is to allow a bootstrap provider (as proposed in #997), if it wants, to allow the user to specify the initial taints and labels. The bootstrap provider's API types are 100% up to it. Cluster API doesn't require any specific functionality from a bootstrap provider beyond the overall goal of turning a server into a Kubernetes node (our definition of bootstrapping).

Letting the bootstrap provider manage initial taints and labels is aligned with kubeadm's NodeRegistrationOptions type, which is part of kubeadm's InitConfiguration and JoinConfiguration types. When you run kubeadm init or kubeadm join, it is a one-time operation, and it can set the initial values for a node's taints and labels. Any further management of these fields is outside of the scope of kubeadm, and also outside of the scope of a bootstrap provider.
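For illustration, kubeadm's NodeRegistrationOptions can carry both initial taints and initial node labels (the latter via kubeletExtraArgs); a rough sketch, with the exact apiVersion depending on the kubeadm release in use:

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
nodeRegistration:
  taints:
  - key: dedicated
    value: gold
    effect: NoSchedule
  kubeletExtraArgs:
    node-labels: "service-class=gold,latency-sensitive=yes"
```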

I wrote above that it might be worth considering having Cluster API manage a node's labels, taints, and annotations consistently regardless of bootstrap or infrastructure provider. However, at last week's grooming discussion, @timothysc pointed out that manipulating a node's taints and labels can have security implications. Given that, I'm not sure Cluster API (via bootstrapping) should do anything other than set the initial taints and labels. We should probably also discuss removing Taints from MachineSpec, as it's really up to the bootstrap provider to handle them, and if we have a common field in MachineSpec for taints, a user could potentially be confused if they set it but their bootstrap provider doesn't do anything with the values.

I was hoping for a solution which is agnostic to both cloud-provider and bootstrap-provider. This seems like a generic feature and doesn't need to be coupled with providers I guess. I am also curious to know more about security implications if we decide to reconcile the taints/labels.

I don't believe this will make v1alpha2 (other than the kubeadm bootstrapper continues to support initial node labels and taints).

/milestone Next

We need to re-check the comments and behavior and verify current state.

As far as I'm aware, the cluster API manager doesn't interact with the created cluster's API, so it would be up to whichever bootstrap provider is in use to apply these labels. The question is where to put the API for creating them: putting it inside the bootstrap provider config means every bootstrap provider can support some, all, or none of these features.

Putting taints, labels, and annotations in the cluster API implicitly means every bootstrap provider has more surface area to support. My inclination is to avoid this, because it'll be hard to guarantee that the fields won't be ignored. The Control Plane CAEP (#1613) also means that this would only have an effect on worker nodes, not control plane nodes.

I think we should make a call on whether to include this in v1a3 or decline it entirely, three releases worth of purgatory is too many.

/cc @ncdc

The kubeadm bootstrap provider can do taints and initial node labels, but we don't manage them beyond initial registration. I'd like to stop there and close this issue.

I agree, I think this is something that could easily be layered on top of cluster-api if needed and doesn't need to be part of the core cluster-api behaviors.

/close

@liztio: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I understand the security concerns around reconciling labels/annotations on nodes. Assigning labels through kubeadm's nodeRegistration is a viable option for many use cases. However, it's inherently tied to the time of machine creation, thus adding/removing a label would require a brand new set of machines to be provisioned.

I would like to ask if the CAPI community sees value in providing a more flexible interface to users so that they can link their tooling (e.g. controllers running in workload cluster) with CAPI constructs (e.g. MachineDeployment).

If we can agree it is valuable, then we can discuss what such an interface should look like and how it should be implemented. To me, reconciling node labels is just one possible approach -- one which we clearly don't want to take. :)

In another approach, CAPI could maintain a node specific ConfigMap. MD could expose a field, say userData, and CAPI would do the job of populating each Node created for this MD with the data in this field. Users are then free to develop controllers that consume this configmap.
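A rough sketch of what such a per-node ConfigMap could look like (all names here, including the userData field, are hypothetical and only for illustration):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capi-node-data-gold-workers-abc12   # hypothetical, one per Node
  namespace: kube-system
data:
  machineDeployment: gold-workers
  userData: |                                # copied from the MD's hypothetical userData field
    latency-sensitive: "yes"
    service-class: "gold"
```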

Another approach is to use a CRD instead of a ConfigMap. The spec gets populated by CAPI, and user controllers work to achieve the desired state and utilize the status field to provide visibility to the user.

@Arvinderpal

... providing a more flexible interface to users so that they can link their tooling (e.g. controllers running in workload cluster) with CAPI constructs (e.g. MachineDeployment)

Could we explore this in more detail? I think teasing out specific use cases could help us arrive at an approach. Do you have specific examples in mind where a Pod needs to know something about the Cluster API Machine/MachineSet/MachineDeployment?

The initial use case is about making in-place changes to our nodes. The changes are in-place in the sense that they can be carried out without machine reprovisioning or reboot. Our initial set of controllers will focus on tweaking host configurations (sysctl), making network specific change (iptables), and updating software packages (apt update).

First issue is identifying the set of nodes on which a specific change must be carried out. Labels can be used for this purpose. Since we are already using MachineDeployments to organize our nodes, it's natural to ask if we can take that a step further. That is, user labels added to an MD become visible to the controllers (e.g. through a ConfigMap), which use them to identify the set of nodes.

Second issue is getting the input data to the controller (i.e. what it should do). The conventional approach to doing this is to have a controller-specific CRD. However, I wonder if there is scope to allow the user to specify this input data at the MD level?

@ncdc piggybacking on your comment here; in particular, option 2. Supporting reconciliation of labels with a specific prefix (e.g. cluster.x-k8s.io) would be another approach as well. I could imagine these "user labels" being exposed at the providerMachineTemplate level, with CAPI adding the prefix and handling reconciliation. I suppose individual providers may do this in their own controllers as well.

In any case, it would be great if we can continue this discussion.

Supporting reconcile on labels with a specific prefix (e.g. cluster.x-k8s.io)

My preference would be to reserve cluster.x-k8s.io/* labels for functionality defined by Cluster API, such as marking a machine as preemptible/interruptible/spot instance. This specific example does not provide the user the ability to set any labels and have them synced from Machine to Node. It instead enforces that a single label is present in the Node based on characteristics of the Machine.

I would imagine users would likely want full flexibility in the labels they assign to Nodes and not be restricted only to keys that are prefixed with cluster.x-k8s.io.

Going back to a previous comment:

I would like to ask if the CAPI community sees value in providing a more flexible interface to users so that they can link their tooling (e.g. controllers running in workload cluster) with CAPI constructs (e.g. MachineDeployment).

This makes me think that you have a persona who has access to and is interacting with both Cluster API resources and workload cluster resources in a single/combined context?

I would imagine users would likely want full flexibility in the labels they assign to Nodes and not be restricted only to keys that are prefixed with cluster.x-k8s.io.

That would be the ideal approach, but as noted it introduces security concerns. Limiting CAPI to only reconcile a specific set of labels is a compromise of sorts. Say we have user labels specified as {foo1: bar1, foo2: bar2} that show up on the Node(s) as:

    labels:
      cluster.x-k8s.io/foo1: "bar1"
      cluster.x-k8s.io/foo2: "bar2"

IMO, this should be okay, but I understand your reservations about using 'cluster.x-k8s.io'. What about letting the prefix also be user-specified -- user-label-prefix: "my.prefix.io"? So you end up with labels like my.prefix.io/foo1: "bar1". We could disallow well-known prefixes like *.kubernetes.io, *.x-k8s.io, etc.
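For illustration, with a hypothetical user-label-prefix field this might look like:

```yaml
# hypothetical fields, for illustration only
spec:
  template:
    spec:
      metadata:
        user-label-prefix: "my.prefix.io"
        labels:
          foo1: "bar1"
          foo2: "bar2"
```

which CAPI would then reconcile onto the Node as my.prefix.io/foo1: "bar1" and my.prefix.io/foo2: "bar2".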

This makes me think that you have a persona who has access to and is interacting with both Cluster API resources and workload cluster resources in a single/combined context?

We have built a tool -- airshipctl -- that does interact with both the mgmt and workload clusters. CAPI and a few providers (metal3 and capz) are used for cluster creation. It's targeted at large organizations (e.g. AT&T) that have wide ranging requirements for their workloads. We use CAPI constructs like MachineDeployment to create and organize our workload clusters, but we have additional functionality (like the in-place controllers I mentioned) that is outside the CAPI world. The ask here is to have a flexible means by which to link the two worlds. In particular, to address the node identification problem.

More than happy to elaborate further if there is any confusion. Perhaps a quick zoom chat.

Is this in any way similar to the ask in #3335? I realize you're talking about labels and that's talking about tokens, but I'm curious if the use cases are aligned?

I'm ok with CAPI reconciling labels under a well-defined prefix; taints and tolerations are a different story.

Is this in any way similar to the ask in #3335? I realize you're talking about labels and that's talking about tokens, but I'm curious if the use cases are aligned?

Just did a quick read. It seems that for the author having labels applied through the kubelet is not sufficient as they allow 3rd parties to add nodes to a workload cluster (i.e. handing out bootstrap tokens). I suppose if labels could be controlled at the MachineDeployment/Machine/MachineTemplate level, then that may be a possible route for the author to take. But I'll stop there as I don't fully understand the specifics of what the author is trying to accomplish.

@Arvinderpal ok thanks for reading that 😄. I wasn't sure if it fit, as it took me a while to understand the use case, but it sounded like maybe it was related.

@timothysc I'm curious why you'd be ok having CAPI reconcile labels matching a prefix and not all allowed labels? How would we explain to users why it only handles a subset? And what prefix are you thinking would be allowed?

Prefix because it should be non-colliding, and you would know the source. I think something like this "could" be a feature you opt into vs. being a default behavior.

I don't care about a specific prefix other than something shared and unique.

I think #3335 is relevant here. Am looking into bootstrap security, and it's a risk that nodes can set their own labels via kubeadm on node registration, e.g.

  • I have a protected workload that uses some set of secrets, and I use a node selector using labels to set this up.
  • During new node creation, I could potentially override the labels and gain access to the secrets, volumes, etc. on a node that wasn't supposed to have them.

Should add that kops is expecting Cluster API to provide a mechanism to replace what it's doing to mitigate this issue in https://github.com/kubernetes/kops/blob/master/docs/architecture/kops-controller.md#nodecontroller

cc @justinsb

@randomvariable Thanks for sharing the kops link; in particular, the proposal linked within it. It appears that a special prefix node-restriction.kubernetes.io/* has been created for node labels. kubelet's --node-labels flag can no longer be used to set labels with this prefix.

I tried this out on a local cluster:

```yaml
kind: KubeadmConfigTemplate
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: node-restriction.kubernetes.io/foo=bar,somethingelse=blah
```

It produced the following error when the kubelet tried to run.

```
Oct 06 19:20:40 test-test-control-plane-ll75l kubelet[2083]: --node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the specifically allowed set (beta.kubernetes.io/arch, beta.kubernetes.io/instance-type, ...
```

Given the security concerns described in the proposal, I would imagine that for most admins/users, using node-restriction.kubernetes.io/* labels is the right way forward.

CAPI already documents this restriction here. One possible alternative, as documented, is to use kubectl to add the desired labels.

```sh
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/master' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
```

The example is picking all worker nodes and is doing so by using the !node-role.kubernetes.io/master label selector.
This is fine for simplistic scenarios where all workers are from a single MachineDeployment, but what happens when you have different sets of workers? Any scheme that relies on labels added through kubelet's --node-labels will have the security vulnerability.

Still feels like we need a mechanism that allows a user (or a controller) safely identify different sets of nodes in a workload cluster.

Please correct me if I'm wrong, but the above proposal focuses on secure bootstrapping and thus labels applied at machine creation time only. Can we go back to the discussion on CAPI reconciling labels on a node?

If I'm not mistaken, the security concern around label spoofing can be addressed if CAPI were to reconcile labels with a prefix like cluster.kubernetes.io/*, since kubelet's --node-labels flag cannot be used to set labels outside the allowed ranges in *kubernetes.io/*. See the output in the earlier comment.

Alternatively, we could reconcile with the prefix node-restriction.kubernetes.io/*; however, this may raise concerns around possible collisions with other entities that can manipulate labels on a node.

I'm pretty sure it's just kubernetes.io that is restricted, but other values (foo.com/bar) should be fine.

@ncdc Yes, you're right, but I was more referring to the security concern addressed here. Labels like foo.com/bar can be set via --node-labels and thus are vulnerable. Labels in the *kubernetes.io/* are more restrictive (which is a good thing).

If we're not immediately concerned about this security vulnerability or feel that it will be addressed down the road by some other mechanism, then we can support arbitrary prefixes for labels (i.e. foo.com/*).

Should we reopen this issue and maybe rename it for the new scope, or do we want to keep it closed?

I would suggest that we keep the discussion here, come to some sort of consensus on what the actual RFE is, and then open a new issue specifically for that.

/retitle Support syncing a set of labels, annotations, or taints to Kubernetes Nodes

/assign @ncdc

To agree on a small design and open a different RFE

Questions we need to resolve:

  1. What labels should be synced?
     1. Everything allowable?
     2. All labels matching a prefix?
        1. Where is this prefix specified - on the Cluster? A command-line flag to the controller? ...
  2. What behavior(s) are appropriate?
     1. Full sync, i.e. CAPI "owns" the node's labels and will remove anything not in the CAPI spec
        1. This could step on toes and cause problems
        2. If we do labels matching a prefix, we could do a full sync of those, but leave everything else alone
     2. Additive only, i.e. once CAPI adds a label, it's impossible to remove it
     3. Other ideas?

I'm +1 to

  • sync all labels matching a prefix (or more prefixes if possible); empty prefix means no label sync
  • full sync of matching labels
  • to define the prefix via a command-line flag on the controller, because this provides stricter control over this feature while we gather the first round of user feedback (see the sketch after this list for one possible shape)
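For example, the prefix could be passed to the controller manager container like this (the flag name is hypothetical, purely for illustration):

```yaml
# hypothetical flag on the CAPI controller manager container
args:
- "--node-label-sync-prefix=my.prefix.io"   # empty value would mean no label sync
```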

/retitle Support syncing a set of labels from MachineDeployment/MachineSet/Machine to Nodes

I would also vote for full sync of labels matching a set of prefixes.

  • We should support user prefixes within the *kubernetes.io/ space (for reasons noted earlier) and also prefixes outside it (e.g. foo.com/). For this reason, perhaps having a set of prefixes vs a single prefix is worthy of consideration.
  • I like the idea of specifying the field at the Cluster level. It feels more user friendly than a command-line flag. We can make it an immutable field.

/unassign

Looking for someone else to carry this forward

@ncdc I'm happy to take this one.
I can put together a design proposal to move this further along.
