Terraform-aws-eks: How to handle worker upgrades automatically?

Created on 5 Aug 2019 · 19 comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

This module is great for deploying EKS clusters, but it deliberately leaves worker upgrades out of scope. That is fine for some users, but for us, handling worker upgrades manually is a painful and repetitive task, especially when you have a lot of workers.

This issue is more of a discussion to decide whether we want to implement this and how we should handle it.

I'm submitting a...

[ ] bug report
[x] feature request
[x] support request
[ ] kudos, thank you, warm fuzzy

Are you able to fix this problem and submit a PR? Link here if you have already.

Yes, I'll be happy to submit PRs for this. But before that, I want to know what direction I (we) should take for this.

To handle this, I would like to use autoscaling group lifecycle hooks to drain nodes during scale in. I want to use a Lambda function that subscribes to autoscaling:EC2_INSTANCE_TERMINATING events and drains nodes before the ASG terminates the EC2 instances.
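As a rough sketch of the wiring described above, the Terraform below routes the terminate lifecycle event to a drainer Lambda through a CloudWatch Events rule. It assumes Terraform 0.12 or later. The variables worker_asg_name and drainer_lambda_arn are hypothetical inputs, not something the module provides, and the Lambda itself (for example the aws-samples node drainer) is assumed to be deployed separately.

```hcl
# Hypothetical inputs: neither variable exists in the module discussed here.
variable "worker_asg_name" {
  description = "Name of the worker Auto Scaling group to watch (assumed input)"
  type        = string
}

variable "drainer_lambda_arn" {
  description = "ARN of a node-drainer Lambda managed elsewhere (assumed input)"
  type        = string
}

# Match the event emitted when an instance of this ASG enters Terminating:Wait.
resource "aws_cloudwatch_event_rule" "node_terminating" {
  name = "eks-worker-terminating"

  event_pattern = jsonencode({
    source        = ["aws.autoscaling"]
    "detail-type" = ["EC2 Instance-terminate Lifecycle Action"]
    detail = {
      AutoScalingGroupName = [var.worker_asg_name]
    }
  })
}

# Send matching events to the drainer function.
resource "aws_cloudwatch_event_target" "drainer" {
  rule = aws_cloudwatch_event_rule.node_terminating.name
  arn  = var.drainer_lambda_arn
}

# Allow CloudWatch Events to invoke the function for this rule only.
resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromEvents"
  action        = "lambda:InvokeFunction"
  function_name = var.drainer_lambda_arn
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.node_terminating.arn
}
```

The rule only fires if the ASG has a termination lifecycle hook. Without one, instances never enter Terminating:Wait and no lifecycle event is emitted.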

There is already a good proof of concept in aws-samples, called amazon k8s node drainer.

By using ASG lifecycle hooks, we can achieve what @max-rocket-internet proposed https://github.com/terraform-aws-modules/terraform-aws-eks/issues/412#issuecomment-507270902.

And by using both hooks and CloudFormation, we can tackle https://github.com/terraform-aws-modules/terraform-aws-eks/issues/333#issuecomment-480238089.

So, my point is NOT to handle all these with this module, but I think it should allow users to decide whether or not to scale in nodes after an LT change and let them handle node draining.

My questions here are:

How to handle node scale in? Can I add a variable like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes? This would only prefix the ASG name with the LT name.
For users who want more control over upgrades, are CloudFormation templates welcome? This can add rolling upgrades (node by node) and fits nicely with autoscaling group lifecycle hooks.
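To illustrate the first question, here is a minimal sketch of the "ASG name follows the launch template" pattern, assuming Terraform 0.12 or later. The resource names, sizes, instance type, and the variables ami_id and subnet_ids are placeholders chosen for illustration. This is not how the module implements worker_recreate_asg_when_lt_changes.

```hcl
# Placeholder inputs for illustration only.
variable "ami_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_launch_template" "workers" {
  name_prefix   = "eks-workers-"
  image_id      = var.ami_id
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "workers" {
  # The name embeds the LT name and version. Changing the name forces
  # Terraform to replace the ASG instead of updating it in place.
  name = "${aws_launch_template.workers.name}-v${aws_launch_template.workers.latest_version}"

  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.workers.id
    version = aws_launch_template.workers.latest_version
  }

  # Create the replacement ASG before destroying the old one.
  lifecycle {
    create_before_destroy = true
  }
}
```

When the old ASG is destroyed its instances are terminated, so this is only safe for workloads if something drains the nodes first. How soon latest_version reflects a template change has varied between AWS provider versions, so it is worth testing against the version you pin.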

Any other relevant info

Here are additional links for ASG lifecycle hooks:

barryib

👍5 ❤1

All 19 comments

Thanks for the detailed issue @barryib!

So, my point is to handle all these with this module,

I'm skeptical about including a Lambda function and all the other bits like SNS/SQS in this repo. It won't be simple and I'm not sure it belongs in the scope of this TF module. But we could just include an optional aws_autoscaling_lifecycle_hook resource, with its settings and notification_target_arn supplied from outside the module.
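A minimal sketch of what such an optional hook could look like, assuming Terraform 0.12 or later. Every variable name here (create_termination_hook, worker_asg_name, notification_target_arn, notification_role_arn) is hypothetical and chosen for illustration. None of them is an existing module input.

```hcl
# Hypothetical toggle and inputs, not existing module variables.
variable "create_termination_hook" {
  type    = bool
  default = false
}

variable "worker_asg_name" {
  type = string
}

variable "notification_target_arn" {
  description = "SNS topic or SQS queue ARN managed outside the module"
  type        = string
  default     = null
}

variable "notification_role_arn" {
  description = "IAM role the ASG assumes to publish to the target"
  type        = string
  default     = null
}

resource "aws_autoscaling_lifecycle_hook" "terminating" {
  # Created only when the user opts in.
  count = var.create_termination_hook ? 1 : 0

  name                   = "eks-worker-terminating"
  autoscaling_group_name = var.worker_asg_name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"

  # Seconds an instance may stay in Terminating:Wait before default_result applies.
  heartbeat_timeout = 300
  default_result    = "CONTINUE"

  notification_target_arn = var.notification_target_arn
  role_arn                = var.notification_role_arn
}
```

Leaving both ARNs null still creates the hook. The instance then simply waits in Terminating:Wait until something completes the lifecycle action or the heartbeat timeout expires.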

Questions:

I would like to use autoscaling group lifecycle hooks to drain nodes during scale in.

At what point is the ASG ever choosing to terminate instances? In my experience the cluster-autoscaler tells the ASG which instance to remove after it drains the node.

and cloud formation

Where and why is CFN involved here?

Can I add a variable like worker_recreate_asg_when_lt_changes to let terraform recreate the ASG when the aws_launch_template changes ?

But as soon as the ASG is deleted, all instances are terminated?

are cloud formation templates welcome?

I would say hard no as I hate CFN 😆 but open to hear other people's opinions.

max-rocket-internet on 7 Aug 2019

I would say hard no as I hate CFN 😆 but open to hear other people's opinions.

Rock hard

RothAndrew on 7 Aug 2019

😄4

OK to answer my own questions..

At what point is the ASG ever choosing to terminate instances?

This can only be achieved with a rolling update from CFN. As I understand it, this is not achievable in TF.

Where and why is CFN involved here?

See above

Overall I like the idea. I love AWS Lambda. I would like to have automation around this process. But IIRC, this is what you are proposing:

ASG and LC/LTs must be controlled by CFN
Include AWS Lambda function
Include associated Lambda trigger resources (SNS or SQS, IAM policy etc)

I'm always open to other opinions, so add yours if it's missing, but I think most people won't be happy with this direction.

max-rocket-internet on 7 Aug 2019

Also, the node update process really isn't that difficult? I mean you could script what I wrote here in about 10 lines of shell, right?

max-rocket-internet on 7 Aug 2019

Where and why is CFN involved here?

Unlike Terraform, CloudFormation allows you to replace nodes in batches of N instances (plus you have resource signaling to indicate that an instance is actually ready). When N is 1 and you have some mechanism like the mentioned node drainer, you can safely update all worker nodes with minimum disruption.
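For readers unfamiliar with the approach, the rolling replacement described above could be sketched roughly as follows, with the ASG defined in a CloudFormation stack that Terraform manages. This assumes Terraform 0.12 or later. The variables and the sizes are placeholders, and the sketch assumes the worker user data signals readiness with cfn-signal. Without that signal the update would time out.

```hcl
# Placeholder inputs for illustration only.
variable "subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

variable "launch_template_version" {
  type = string
}

resource "aws_cloudformation_stack" "workers" {
  name = "eks-workers-asg"

  # Changing the launch template version updates the stack, and the
  # UpdatePolicy then replaces instances one at a time.
  template_body = <<-EOT
    Resources:
      Workers:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
          MinSize: "2"
          MaxSize: "4"
          VPCZoneIdentifier: ${jsonencode(var.subnet_ids)}
          LaunchTemplate:
            LaunchTemplateId: ${var.launch_template_id}
            Version: "${var.launch_template_version}"
        UpdatePolicy:
          AutoScalingRollingUpdate:
            MaxBatchSize: 1
            MinInstancesInService: 2
            PauseTime: PT5M
            WaitOnResourceSignals: true
  EOT
}
```

MaxBatchSize: 1 is the "N is 1" case. MinInstancesInService has to stay below MaxSize, otherwise CloudFormation has no room to launch the replacement instance.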

I recommend reading https://medium.com/@endofcake/using-terraform-for-zero-downtime-updates-of-an-auto-scaling-group-in-aws-60faca582664 on the subject.

mlafeldt on 7 Aug 2019

It sounds interesting, but yeah, that's a no from me dawg (RandyJacksonMemeHere)

RothAndrew on 7 Aug 2019

@max-rocket-internet

So, my point is to handle all these with this module,

Sorry for the typo. My point is not to handle all these with this module. In fact, this module should only provide something to trigger the change and, why not, an option to create an initial lifecycle hook.

But as soon as the ASG is deleted, all instances are terminated?

Yes. This is what I want. When coupled with a lifecycle hook, the ASG doesn't terminate the instance directly. It only puts the EC2 instance into Pending:Wait or Terminating:Wait. From there, you can run a custom action, with a Lambda for example.

@RothAndrew The CloudFormation discussion is quite a long debate. I don't like it either, but it gives us more flexibility for EC2 upgrades. This is a fact! In addition to @mlafeldt's comment https://github.com/terraform-aws-modules/terraform-aws-eks/issues/462#issuecomment-519056825, I'd say that upgrading node by node can also prevent you from hitting the EC2 resource limit.

@max-rocket-internet So my proposal so far is to only add:

an option to recreate the ASG when the LT or LC changes => PR https://github.com/terraform-aws-modules/terraform-aws-eks/pull/465
an option to create an initial lifecycle hook => PR https://github.com/terraform-aws-modules/terraform-aws-eks/pull/466

barryib on 7 Aug 2019

I was opposed to adding CloudFormation to this Terraform module. This sounds more reasonable. I think it still needs some more discussion, but I'm less disgusted by the idea now. I share Max's hatred for all things CloudFormation.

RothAndrew on 7 Aug 2019

an option to recreate ASG when LT or LC change => PR #465

Looks OK to me. Even outside of k8s node updates this would be useful. Still keen to see what others think though

an option to create initial life cycle hook

As I understand, this is pointless without the CFN part?

I don't like it either, but it give us more flexibility on EC2 upgrade

I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅

max-rocket-internet on 7 Aug 2019

As I understand, this is pointless without the CFN part?

This is not related to CloudFormation at all. Lifecycle hooks are an autoscaling group feature. One can be added during ASG creation: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#initial_lifecycle_hook
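For illustration, a minimal sketch of an ASG that gets its termination hook at creation time through the initial_lifecycle_hook block, with no CloudFormation involved. It assumes Terraform 0.12 or later. The names, sizes, and variables are placeholders and do not reflect what PR #466 actually adds.

```hcl
# Placeholder inputs for illustration only.
variable "subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  # Attached when the ASG is created, so the hook is in place before
  # the first instance is launched or terminated.
  initial_lifecycle_hook {
    name                 = "drain-before-terminate"
    lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
    heartbeat_timeout    = 300
    default_result       = "CONTINUE"
  }
}
```

With this in place, a terminating instance waits in Terminating:Wait for up to heartbeat_timeout seconds, which gives an external drainer time to evict pods and complete the lifecycle action.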

I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅

I totally disagree with the term "overkill" here. We are just offering users the ability to use AWS features (like Lambda, which is quite standard today) to achieve something on EKS worker nodes.

Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users in their own TF code.

As you noticed, I have added #465 and #466 to give users the ability to handle this by themselves.

barryib on 7 Aug 2019

FWIW, I actually tested the K8s node drainer with Terraform/CloudFormation: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/333#issuecomment-520752885

mlafeldt on 13 Aug 2019

Don't get me wrong, the purpose of this module is not to create lambda functions or any notifications with cloudwatch event or SNS. Those will be created and maintained by user with its own TF scripts.

OK cool!

As you noticed, I have added #465 and #466 to give users the ability to handle this by themselves.

Great. Thanks for the effort, let's merge these 😃

max-rocket-internet on 13 Aug 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 3 Jan 2020

In the next couple of days, I'll add a short doc with links to some projects and issues to follow for achieving this.

barryib on 4 Jan 2020

@barryib, interested in learning about your findings and what's the best and cleanest way to achieve worker upgrades.

manfredlift on 22 Jan 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 22 Apr 2020

/remove stale

barryib on 23 Apr 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 22 Jul 2020

This issue has been automatically closed because it has not had recent activity since being marked as stale.

stale[bot] on 21 Aug 2020
