This module is great for deploying EKS clusters, but it deliberately leaves worker upgrades out of its scope. That is fine for some users, but for us, handling worker upgrades manually is a painful and repetitive task, especially when you have a lot of workers.
This issue is more of a discussion to decide whether we want to implement this and how we should handle it.
Yes, I'll be happy to submit PRs for this. But before that, I want to know what direction I (we) should take for this.
To handle this, I would like to use autoscaling group lifecycle hooks to drain nodes during scale-in. I want to use a Lambda function that subscribes to autoscaling:EC2_INSTANCE_TERMINATING events and drains nodes before the ASG terminates the EC2 instances.
There is already a good proof of concept in aws-samples, called amazon k8s node drainer.
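To make the idea concrete, here is a minimal Python sketch of what such a handler could look like. It is not the aws-samples code. The `handle` function and the `drain_node` callable are hypothetical names used only for illustration, and `autoscaling` is assumed to be a boto3 Auto Scaling client (or anything exposing `complete_lifecycle_action`). The event shape assumed here is the one Auto Scaling sends for a terminate lifecycle action.

```python
TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def parse_lifecycle_event(event):
    """Return the hook details for a terminate lifecycle event, else None."""
    detail = event.get("detail", {})
    if detail.get("LifecycleTransition") != TERMINATING:
        return None
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
    }


def handle(event, drain_node, autoscaling):
    """Drain the node behind the instance, then release the lifecycle hook.

    drain_node: hypothetical callable taking an instance ID (for example a
    wrapper that cordons and evicts pods through the Kubernetes API).
    autoscaling: assumed to be a boto3 "autoscaling" client.
    """
    hook = parse_lifecycle_event(event)
    if hook is None:
        return "ignored"
    try:
        drain_node(hook["InstanceId"])
        result = "CONTINUE"
    except Exception:
        # For a terminating instance both results end in termination;
        # ABANDON only skips any remaining lifecycle hooks.
        result = "ABANDON"
    autoscaling.complete_lifecycle_action(LifecycleActionResult=result, **hook)
    return result
```

In a real Lambda the entry point would build the client with `boto3.client("autoscaling")` and pass it in. Keeping the client and the drain step as parameters makes the logic easy to test without AWS access.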
By using ASG lifecycle hooks, we can achieve what @max-rocket-internet proposed https://github.com/terraform-aws-modules/terraform-aws-eks/issues/412#issuecomment-507270902.
And by using both hooks and cloud formation, we can tackle https://github.com/terraform-aws-modules/terraform-aws-eks/issues/333#issuecomment-480238089.
So, my point is NOT to handle all these with this module, but I think it should allow users to decide whether or not to scale in nodes after an LT change and let them handle node draining.
My questions here are:
Can I add a variable like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes? This will only prefix the ASG name with the LT name.
Here are additional links for ASG lifecycle hooks:
Thanks for the detailed issue @barryib!
So, my point is to handle all these with this module,
I'm skeptical about including a Lambda function and all the other bits like SNS/SQS in this repo. It won't be simple and I'm not sure it belongs in the scope of this TF module. But we could just include an optional aws_autoscaling_lifecycle_hook resource with its settings and notification_target_arn coming from external.
Questions:
I would like to use autoscaling group lifecycle hooks to drain nodes during scale in.
At what point is the ASG ever choosing to terminate instances? In my experience the cluster-autoscaler tells the ASG what instance to remove after it drains the node.
and cloud formation
Where and why is CFN involved here?
Can I add a variable like worker_recreate_asg_when_lt_changes to let terraform recreate the ASG when the aws_launch_template changes ?
But as soon as the ASG is deleted, all instances are terminated?
are cloud formation templates welcome?
I would say hard no as I hate CFN 😆 but open to hear other people's opinions.
I would say hard no as I hate CFN 😆 but open to hear other people's opinions.
Rock hard
OK to answer my own questions..
At what point is the ASG ever choosing to terminate instances?
This can only be achieved with rollingupdate from CFN. As I understand this is not achievable in TF.
Where and why is CFN involved here?
See above
Overall I like the idea. I love AWS Lambda. I would like to have automation around this process. But IIRC, this is what you are proposing:
I'm always open to other opinions, so add yours if it's missing, but I think most people won't be happy with this direction.
Also, the node update process really isn't that difficult? I mean you could script what I wrote here in about 10 lines of shell, right?
Where and why is CFN involved here?
Unlike Terraform, CloudFormation allows you to replace nodes in batches of N instances (plus you have resource signaling to indicate that an instance is actually ready). When N is 1 and you have some mechanism like the mentioned node drainer, you can safely update all worker nodes with minimum disruption.
I recommend reading https://medium.com/@endofcake/using-terraform-for-zero-downtime-updates-of-an-auto-scaling-group-in-aws-60faca582664 on the subject.
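As a toy illustration of the batching idea (not how CloudFormation implements it), the Python sketch below splits a node list into batches of N. The function name `batches` and the node names are made up for the example. With N set to 1, only one node is out of service at any time.

```python
def batches(nodes, batch_size):
    """Yield successive batches of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


# Each batch would be drained and replaced, and the next batch would only
# start once the replacements have signalled that they are ready.
for batch in batches(["node-a", "node-b", "node-c"], 1):
    print("replacing", batch)
```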
It sounds interesting, but yeah, that's a no from me dawg (RandyJacksonMemeHere)
@max-rocket-internet
So, my point is to handle all these with this module,
Sorry for the typo. My point is not to handle all of this with this module. In fact, this module should only provide something to trigger the change and, why not, an option to create an initial lifecycle hook.
But as soon as the ASG is deleted, all instances are terminated?
Yes. This is what I want. When coupled with a lifecycle hook, the ASG doesn't terminate the instance directly; it only puts the EC2 instance into Pending:Wait or Terminating:Wait. From there, you can run a custom action, with a Lambda for example.
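For readers unfamiliar with those states, here is a small Python sketch of the instance lifecycle when hooks are attached, as I understand it from the Auto Scaling documentation. The `path_from` helper is a hypothetical name used for illustration. An instance stays in a Wait state until the hook is completed or its heartbeat timeout expires.

```python
# State transitions for an instance in an ASG with lifecycle hooks attached.
TRANSITIONS = {
    "Pending": "Pending:Wait",
    "Pending:Wait": "Pending:Proceed",
    "Pending:Proceed": "InService",
    "Terminating": "Terminating:Wait",
    "Terminating:Wait": "Terminating:Proceed",
    "Terminating:Proceed": "Terminated",
}

# States in which a custom action (such as draining a node) can run.
WAIT_STATES = {"Pending:Wait", "Terminating:Wait"}


def path_from(state):
    """Return the sequence of states an instance goes through from state."""
    path = [state]
    while path[-1] in TRANSITIONS:
        path.append(TRANSITIONS[path[-1]])
    return path


print(path_from("Terminating"))
```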
@RothAndrew The CloudFormation discussion is quite a long debate. I don't like it either, but it gives us more flexibility for EC2 upgrades. That is a fact! In addition to @mlafeldt's comment https://github.com/terraform-aws-modules/terraform-aws-eks/issues/462#issuecomment-519056825, I'd say that upgrading node by node can also prevent you from hitting the EC2 resource limit.
@max-rocket-internet So my proposal so far is to only add:
I was opposed to adding CloudFormation to this Terraform module. This sounds more reasonable. I think it still needs some more discussion, but I'm less disgusted by the idea now. I share Max's hatred for all things CloudFormation.
an option to recreate ASG when LT or LC change => PR #465
Looks OK to me. Even outside of k8s node updates this would be useful. Still keen to see what others think though
an option to create initial life cycle hook
As I understand, this is pointless without the CFN part?
I don't like it either, but it gives us more flexibility for EC2 upgrades
I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅
As I understand, this is pointless without the CFN part?
This is not related to CloudFormation at all. Lifecycle hooks are an autoscaling group feature. They can be added during ASG creation: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#initial_lifecycle_hook
I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅
I totally disagree with the term "overkill" here. We are just offering users the ability to use AWS features (like Lambda, which is quite standard today) to achieve something on EKS worker nodes.
Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts.
As you noticed, I have added #465 and #466 to give users the ability to handle this by themselves.
FWIW, I actually tested the K8s node drainer with Terraform/CloudFormation: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/333#issuecomment-520752885
Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts.
OK cool!
As you noticed, I have added #465 and #466 to give users the ability to handle this by themselves.
Great. Thanks for the effort, let's merge these 😃
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In the next couple of days, I'll add a small doc with link to some projects and issues to track to achieve this.
@barryib, interested in learning about your findings and what's the best and cleanest way to achieve worker upgrades.
/remove stale
This issue has been automatically closed because it has not had recent activity since being marked as stale.