Kops: Drain nodes when they terminate

Created on 7 Jun 2019 · 32 comments · Source: kubernetes/kops

When the ASG scales down, the node should be drained and not just terminated (which kills pods without respecting PodDisruptionBudgets).

This could be done either via "Amazon EC2 Auto Scaling Lifecycle Hooks" (up to 60 min) or with a termination script like https://github.com/miglen/aws/blob/master/ec2/run-script-on-ec2-instance-termination.md (2 min max, so it will be tight).

The flow could be Scale-Down -> SQS -> Drainer, or Scale-Down -> SQS -> Node status -> https://github.com/planetlabs/draino
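To make the lifecycle-hook variant concrete, here is a minimal sketch of a drain-on-termination watcher that could run on each node (hypothetical, not an existing kops feature; the hook name, the presence of kubectl plus a node-scoped kubeconfig, and the IAM permissions are all assumptions):

```bash
#!/usr/bin/env bash
# Sketch: wait for this instance to enter Terminating:Wait (held there by a
# hypothetical terminate lifecycle hook named "node-drain"), drain the node,
# then release the hook. Region/credentials are expected to come from the
# instance profile / environment.
set -euo pipefail

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
NODE_NAME=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)

while true; do
  STATE=$(aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$INSTANCE_ID" \
    --query 'AutoScalingInstances[0].LifecycleState' --output text)

  if [ "$STATE" = "Terminating:Wait" ]; then
    # Evict pods while respecting PodDisruptionBudgets.
    kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-local-data --timeout=120s || true

    ASG_NAME=$(aws autoscaling describe-auto-scaling-instances \
      --instance-ids "$INSTANCE_ID" \
      --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)

    # Tell the ASG it may now terminate the instance.
    aws autoscaling complete-lifecycle-action \
      --lifecycle-hook-name node-drain \
      --auto-scaling-group-name "$ASG_NAME" \
      --instance-id "$INSTANCE_ID" \
      --lifecycle-action-result CONTINUE
    exit 0
  fi
  sleep 5
done
```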

We might be able to contribute this, but need some "yes that's a good idea" / "yes we want this" first :)

Labels: good first issue, hacktoberfest, lifecycle/frozen

All 32 comments

I would personally suggest starting with the easiest approach, which looks like the termination script, and then iterating on it over time.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

I did a detailed write-up on setting up LCH / SQS / node-drainer here:

https://mofoseng.github.io/posts/eks-migration/#node-maintenance-autoscaling-and-lifecyclehooks

I compared the kube-aws drainer (which has a single-replica deployment updating a ConfigMap, plus a DaemonSet on each node grepping that ConfigMap...) to a pure AWS Lambda-based approach, and this SQS approach seemed the most robust.

It's not kops-specific, but it may help adoption (I moved companies and am now back on kops after migrating away from it at my last company).

FYI, we use a similar workflow, but instead of a ConfigMap we set a label on the
node to mark it as draining... We are also looking at moving to a single
deployment to have fewer pods flying around.

@grosser - how do you manage LCH creation for kops currently? Do you generate Terraform and write config around that? Did you fork kops to add that functionality?

We are not doing that for kops yet, just for our old clusters that we create with CloudFormation; still in the process of figuring that out :(

Sorry for going off-topic.

I'm also trying to taint spot instances in a mixedInstancePolicy - for this I see two approaches (without modifying kops):

  1. A kubelet systemd service unit drop-in, installed as a hook or an asset, which adds --register-with-taints / --node-labels to the $DAEMON_ARGS based on the output of aws ec2 describe-instances --instance-ids ${iid} --query 'Reservations[0].Instances[0].InstanceLifecycle' (see the sketch below)
  2. A DaemonSet similar to iameli/kube-node-labeller to set taints and labels

The difference is that option 1 would ensure taints are set before node registration, whereas option 2 would only take effect at some later point (possibly after workloads which should not tolerate the taint have already started running on the node...)

With the EKS bootstrap script it is quite simple to ensure nodes are labeled/tainted properly before they register with the API (and kops only supports labels/taints at the instance group level, not per node).
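For what it's worth, a rough boot-time sketch of option 1 (the /etc/sysconfig/kubelet path, the DAEMON_ARGS variable, and the taint/label names are assumptions about how the kubelet unit is wired up on the image, not something kops provides today):

```bash
#!/usr/bin/env bash
# Sketch: run early in boot (e.g. from a kops hook) before the kubelet starts,
# and add spot taints/labels to the kubelet flags if this is a spot instance.
set -euo pipefail

IID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ%?}  # strip the trailing zone letter to get the region

# InstanceLifecycle is "spot" for spot instances; on-demand instances return "None".
LIFECYCLE=$(aws ec2 describe-instances --region "$REGION" --instance-ids "$IID" \
  --query 'Reservations[0].Instances[0].InstanceLifecycle' --output text)

if [ "$LIFECYCLE" = "spot" ]; then
  # Prepend the taint/label flags to the kubelet's arguments so they are in
  # place before the node registers with the API server.
  sed -i 's|^DAEMON_ARGS="|DAEMON_ARGS="--register-with-taints=spot=true:NoSchedule --node-labels=lifecycle=spot |' \
    /etc/sysconfig/kubelet
fi
```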

/remove-lifecycle stale

Another way of handling node draining on termination taken from Zalando's kubernetes-on-aws: https://github.com/zalando-incubator/kubernetes-on-aws/blob/449f8f3bf5c60e0d319be538460ff91266337abc/cluster/userdata-worker.yaml#L92-L120

I have just implemented node drain via systemd. The systemd unit gets provisioned via kops hooks. The kubeconfig is written to disk by a DaemonSet.

Works like a charm. Thanks, @thomaspeitz, for your support.

@kforsthoevel, that sounds very interesting. Do you mind sharing the implementation details?

@paalkr I will write a little blog post about it and let you know.

The kops hook I saw by following links was a good proof of concept, but it had the problem that it assumed the container runtime was Docker.

Thx!
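In the meantime, here is a minimal sketch of what such a systemd-based drain-on-shutdown hook could look like (this is not the implementation referenced above; the unit name, script path, and kubeconfig location are all hypothetical):

```bash
#!/usr/bin/env bash
# Sketch of a kops hook: install a systemd unit whose ExecStop drains the
# node when the machine shuts down. Paths and names are illustrative only.
set -euo pipefail

cat <<'EOF' > /usr/local/bin/drain-node.sh
#!/usr/bin/env bash
# Drain this node, respecting PodDisruptionBudgets; never block shutdown forever.
NODE_NAME=$(hostname -f)
kubectl --kubeconfig /var/lib/drainer/kubeconfig drain "$NODE_NAME" \
  --ignore-daemonsets --delete-local-data --grace-period=60 --timeout=120s || true
EOF
chmod +x /usr/local/bin/drain-node.sh

cat <<'EOF' > /etc/systemd/system/drain-on-shutdown.service
[Unit]
Description=Drain Kubernetes node before shutdown
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# The drain runs on "stop", i.e. while the machine is shutting down.
ExecStop=/usr/local/bin/drain-node.sh
TimeoutStopSec=180

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now drain-on-shutdown.service
```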

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

/lifecycle frozen

I've chatted with @olemarkus on Slack about adding aws-node-termination-handler (NTH) as an add-on for kops.

NTH operates in two different modes: IMDS Processor (the old way) and Queue Processor. I think the queue-processor mode would be a good fit for kops since it's more resilient and can respond to more types of events (spot ITNs, spot rebalance recommendations, EC2 instance status changes, ASG termination lifecycle hooks, and more coming).

NTH queue-processor mode works by listening to an SQS queue that is fed events from Amazon EventBridge. kops can do a lot of the heavy lifting by setting up the Amazon EventBridge rules appropriately and creating an SQS queue.
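To make the moving parts concrete, here is a rough AWS CLI sketch of the SQS + EventBridge plumbing the queue-processor mode consumes (queue and rule names are examples; in kops these would presumably become cloudup tasks, and the queue additionally needs a policy that lets EventBridge send messages to it):

```bash
# Create the queue NTH will poll.
QUEUE_URL=$(aws sqs create-queue --queue-name nth-events \
  --attributes '{"MessageRetentionPeriod":"300"}' \
  --query QueueUrl --output text)
QUEUE_ARN=$(aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
  --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
# (Not shown: an SQS queue policy allowing events.amazonaws.com to SendMessage.)

# Route ASG termination lifecycle actions to the queue.
aws events put-rule --name nth-asg-termination \
  --event-pattern '{"source":["aws.autoscaling"],"detail-type":["EC2 Instance-terminate Lifecycle Action"]}'
aws events put-targets --rule nth-asg-termination \
  --targets "Id=nth-sqs,Arn=$QUEUE_ARN"

# Route spot interruption warnings to the same queue.
aws events put-rule --name nth-spot-interruption \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'
aws events put-targets --rule nth-spot-interruption \
  --targets "Id=nth-sqs,Arn=$QUEUE_ARN"
```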

Sounds good. I have used CloudFormation to create the EventBridge rules and SQS queue for now, but integrating this more closely into kops would be a nice addition.

BTW, I ended up using the kubernetes.io/cluster/<cluster-name> tag as the managedAsgTag, as this tag is already created and managed by kops ;)

Ref: https://github.com/aws/aws-node-termination-handler/issues/272

So the first step would be to add support for provisioning SQS + EventBridge rules, and then have cloudup provision those if NTH is enabled.

There are a number of changes needed to the template too. I think it is probably best to just have two separate templates and pick one based on which mode NTH is set to use. I would run the NTH deployment on masters and use the master IAM role for authenticating to SQS.

Also, an ASG lifecycle hook needs to be added to each ASG, so that a node stays in the ASG until NTH has finished draining it and can send an ASG lifecycle continue signal. The behavior of NTH also needs to be coordinated with kops rolling-update cluster: which component is responsible for sending the ASG lifecycle continue signal during a rolling update, NTH or kops? And both kops and NTH will also try to drain nodes during rolling updates.
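For reference, the lifecycle hook and the continue signal boil down to two CLI calls (the hook name, group name, instance id, and timeout below are just examples):

```bash
# Hold terminating instances in Terminating:Wait until the drain has finished.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name nth-drain \
  --auto-scaling-group-name nodes.example-cluster.k8s.local \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE

# Whichever component finishes the drain (NTH or kops) then releases the hook.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name nth-drain \
  --auto-scaling-group-name nodes.example-cluster.k8s.local \
  --instance-id i-0123456789abcdef0 \
  --lifecycle-action-result CONTINUE
```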

I think kops should just let NTH do drain and terminate.

I can see a problem when using --cloudonly though. In this case, I think kops can just send the signal immediately.

But we can also implement some of this without the ASG lifecycle hooks, and adding things incrementally may be a benefit here.

It looks like you both have a much better overview on how to approach this than I do. I can definitely help out with how to add the various bits.

Would any of you be able to try your hand at a PR for (parts of) this?

I can try to get something out. I'm not too familiar with the kops codebase, so I'll need some help as well. I'll be on vacation much of December, so probably can't do anything soon. But hopefully we can hammer out a plan and make sure this is going to work nicely.

How do the other cloud providers work for draining? I was going to say that if NTH is the termination-handling component, then the kops controller should just let NTH do the draining, but I'm not sure that's really possible since the kops controller would still need to finish the drain on the other providers. It's probably not the end of the world if both drain, since the eviction API calls are idempotent. Whoever completes the ASG lifecycle hook first would win, but it's kind of awkward.

Sounds good!

The code that drains a node can be found here: https://github.com/kubernetes/kops/blob/master/pkg/instancegroups/instancegroups.go#L328
This has access to the cluster spec, so you can skip the call if NTH is enabled and jump straight to deleteNode. Our validation logic prevents NTH from being enabled on other clouds, so you don't have to worry about those.

One bit in the code linked above that you need to take into consideration is that it deletes the k8s node object. That bit also needs to be skipped. I assume NTH also does that on its own.

For the AWS provisioning piece, see https://github.com/kubernetes/kops/tree/master/pkg/model
I would assume you need a task for EventBridge and one for SQS.

Unfortunately I'm not a developer, but I can contribute with testing and discussing design and implementation specs in general.

Since draining is idempotent, it is not necessary to disable the drain code in rolling update. It is advantageous to do the bulk of the draining from rolling update as that makes what is going on more visible in the rolling update logs.
