Terraform-aws-eks: Nodes didn't get automatically updated after a version upgrade.

Created on 25 Jun 2019 · 9 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

I'm submitting a...

  • [ ] bug report
  • [ ] feature request
  • [x] support request
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

I had a running EKS 1.12 cluster with a single worker group (two worker nodes), and I've updated cluster_version to 1.13. The control-plane update worked, and the launch configuration's AMI was updated to the correct version, but the worker nodes didn't get automatically updated - I had to manually scale down to 0 and scale back to the desired capacity for the change to take effect.

I am not sure whether this is a bug or just the expected behaviour, so it would be really nice to have someone looking into this and providing guidance - a section about cluster and worker group upgrades would be awesome! 💯

As a side note, even though this probably represents a different issue, kube-proxy and CoreDNS didn't get automatically updated to the relevant versions. It would be awesome if that could be handled automatically as well - I might be able to contribute with code if someone is available to guide me through what's required.

If this is a bug, how to reproduce? Please include a code sample if relevant.

As I've mentioned above, I am not entirely sure this is a bug or just the expected behaviour. My Terraform configuration was initially the following:

module "my-eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_create_timeout = "30m"
  cluster_delete_timeout = "30m"

  cluster_enabled_log_types = [
    "api",
  ]

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true
  cluster_name                    = "my-eks"
  cluster_version                 = "1.12" // Later changed to "1.13".

  subnets = [
    "${aws_subnet.my-eks-internal-001.id}",
    "${aws_subnet.my-eks-internal-002.id}",
    "${aws_subnet.my-eks-internal-003.id}",
    "${aws_subnet.my-eks-external-001.id}",
    "${aws_subnet.my-eks-external-002.id}",
    "${aws_subnet.my-eks-external-003.id}",
  ]

  vpc_id = "${aws_vpc.my-eks.id}"

  worker_groups = [
    {
      asg_desired_capacity = 2
      asg_min_size         = 1
      asg_max_size         = 3
      instance_type        = "m5.large"
      name                 = "my-eks-wg-0"

      subnets = [
        "${aws_subnet.my-eks-internal-001.id}",
        "${aws_subnet.my-eks-internal-002.id}",
        "${aws_subnet.my-eks-internal-003.id}",
      ]
    }
  ]

  write_kubeconfig = true
}

What's the expected behavior?

I was expecting the old worker nodes to be replaced by two new worker nodes running the correct AMI automatically.

Are you able to fix this problem and submit a PR? Link here if you have already.

N/A.

Environment details

  • Affected module version: v5.0.0.
  • OS: macOS 10.14.5.
  • Terraform version: v0.12.2.

Any other relevant info

N/A.

All 9 comments

I am not sure whether this is a bug or just the expected behaviour

It's expected behaviour. Updating nodes will be a different process for different workloads so we don't attempt to control this process in this module.

a section about cluster and worker group upgrades would be awesome!

You're right. Feel free to create a PR to add some details 😃

kube-proxy and CoreDNS didn't get automatically updated to the relevant versions. It would be awesome if that could be handled automatically as well

It could definitely be automated but this won't be part of this module. Details in https://github.com/terraform-aws-modules/terraform-aws-eks/issues/99

@max-rocket-internet, would you share some best practices for doing a rolling upgrade of the ASG nodes when EKS is upgraded? So far, all I can think of is having two ASG worker groups and manually changing the desired capacity!

Here's what I do:

  1. Ensure you have cluster-autoscaler running
  2. Apply the TF changes that update the LC of the ASG to the new AMI
  3. Drain 1 older version node: kubectl drain --force --ignore-daemonsets --delete-local-data ip-xxxxxxx.eu-west-1.compute.internal
  4. Wait until the workload is rescheduled
  5. cluster-autoscaler will create new nodes when required. These new nodes will have the new AMI version.
  6. Repeat 3-5 until all older version nodes are drained
  7. cluster-autoscaler will terminate the old nodes after 10-60 minutes automatically.

🚀
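
Steps 3 to 6 above can be sketched as a small shell loop. This is only an illustration under assumptions: `drain_nodes`, `KUBECTL`, and `SETTLE_SECONDS` are hypothetical names, and the fixed pause is a crude stand-in for step 4 (it does not check that the workload has actually been rescheduled). The drain flags are the ones from the list above; newer kubectl releases renamed `--delete-local-data` to `--delete-emptydir-data`.

```shell
# Sketch only: drain old nodes one at a time (steps 3-6 above).
# KUBECTL and SETTLE_SECONDS are illustrative knobs, not part of the module.
KUBECTL="${KUBECTL:-kubectl}"
SETTLE_SECONDS="${SETTLE_SECONDS:-120}"

# Usage: drain_nodes NODE [NODE...]
# Stops at the first failed drain so a problem is noticed early.
drain_nodes() {
  for node in "$@"; do
    echo "Draining ${node}"
    "$KUBECTL" drain --force --ignore-daemonsets --delete-local-data "$node" || return 1
    # Crude stand-in for step 4: give pods time to reschedule and
    # cluster-autoscaler time to add capacity before the next drain.
    sleep "$SETTLE_SECONDS"
  done
}
```

It would be called with the names of the old nodes, for example `drain_nodes ip-xxxxxxx.eu-west-1.compute.internal`. Draining one node at a time keeps the amount of displaced workload small while cluster-autoscaler brings up replacements.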

Thanks @max-rocket-internet, it's helpful to know this is the "standard" approach. Would setting termination_policies = ["OldestLaunchConfiguration"] help, as a hint to CA about which nodes to delete?

I would think setting termination_policies = ["OldestLaunchConfiguration"] would help ?

Essentially you don't want the ASG doing anything at all as it doesn't gracefully drain the node, it just shuts the instance down. This is much too aggressive. That's why you use the cluster-autoscaler and kubectl drain. This asks the pods to stop, respects all the timeout and shutdown settings in the pods (e.g. terminationGracePeriodSeconds and lifecycle settings) and stops any further pods being scheduled on the node.

In normal autoscaling, cluster-autoscaler will also drain a node before telling the ASG to terminate that specific node.

Thanks @max-rocket-internet .. I understand what you just mentioned. I guess what's not clear, is what makes the CA prefer to kill the nodes that I've just drained? I assume CA has no visibility that those nodes are started from an older LC, or does it ? Thanks!

I assume CA has no visibility that those nodes are started from an older LC, or does it ?

No it doesn't. You are choosing to drain a node because it's an old one, as shown here:

$ kubectl get nodes
NAME                                             STATUS   ROLES    AGE     VERSION
ip-10-6-22-158.ap-southeast-1.compute.internal   Ready    <none>   23d     v1.12.7
ip-10-6-22-221.ap-southeast-1.compute.internal   Ready    <none>   31d     v1.11.9

Then CA will eventually terminate that node because its status is SchedulingDisabled, as shown here:

NAME                                        STATUS                     ROLES    AGE   VERSION
ip-10-0-27-15.eu-west-1.compute.internal    Ready,SchedulingDisabled   <none>   34d   v1.12.7

The CA will gracefully terminate nodes that are SchedulingDisabled or if they are not needed due to resources.
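
With more than a handful of nodes, picking the old ones by eye gets tedious. A minimal sketch of filtering the `kubectl get nodes` output, assuming the default column layout shown above (NAME, STATUS, ROLES, AGE, VERSION); `old_nodes` and `cordoned_nodes` are hypothetical helper names:

```shell
# Sketch only: both helpers read `kubectl get nodes --no-headers` output on stdin
# and assume the default columns NAME STATUS ROLES AGE VERSION.

# Print the names of nodes whose VERSION does not start with the given prefix.
# Usage: old_nodes v1.13
old_nodes() {
  awk -v target="$1" 'index($5, target) != 1 { print $1 }'
}

# Print the names of nodes that are already cordoned or drained.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}
```

For example, `kubectl get nodes --no-headers | old_nodes v1.13` would list the nodes still to be drained. The match is a plain prefix comparison, so the prefix should be specific enough to avoid matching other versions.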

Thanks a ton @max-rocket-internet. It would be really awesome if nodes got a k8s label carrying the launch-configuration version, the AMI ID, or similar, so that one could easily evict all nodes matching the old label (in case of a large number of nodes). Is it possible to do that today?

PS: I'm happy to send a docs PR on autoscaling.md summarizing everything you mentioned here!

if nodes got a k8s label that is the launch-configuration version, or the ami-id

Yeah that's a good idea. PR welcome 😃

PS: I'm happy to send a docs PR on autoscaling.md summarizing everything you mentioned here!

Please do 💯
