I had a running EKS 1.12 cluster with a single worker group (two worker nodes), and I updated cluster_version to 1.13. The control-plane update worked, and the launch configuration's AMI was updated to the correct version, but the worker nodes were not updated automatically. I had to manually scale down to 0 and scale back up to the desired capacity for the change to take effect.
I am not sure whether this is a bug or just the expected behaviour, so it would be really nice to have someone looking into this and providing guidance - a section about cluster and worker group upgrades would be awesome! 💯
As a side note, even though this is probably a separate issue, kube-proxy and CoreDNS didn't get automatically updated to the relevant versions. It would be awesome if that could be handled automatically as well - I might be able to contribute code if someone is available to guide me through what's required.
As I've mentioned above, I am not entirely sure this is a bug or just the expected behaviour. My Terraform configuration was initially the following:
module "my-eks" {
  source                 = "terraform-aws-modules/eks/aws"
  cluster_create_timeout = "30m"
  cluster_delete_timeout = "30m"
  cluster_enabled_log_types = [
    "api",
  ]
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true
  cluster_name                    = "my-eks"
  cluster_version                 = "1.12" // Later changed to "1.13".
  subnets = [
    "${aws_subnet.my-eks-internal-001.id}",
    "${aws_subnet.my-eks-internal-002.id}",
    "${aws_subnet.my-eks-internal-003.id}",
    "${aws_subnet.my-eks-external-001.id}",
    "${aws_subnet.my-eks-external-002.id}",
    "${aws_subnet.my-eks-external-003.id}",
  ]
  vpc_id = "${aws_vpc.my-eks.id}"
  worker_groups = [
    {
      asg_desired_capacity = 2
      asg_min_size         = 1
      asg_max_size         = 3
      instance_type        = "m5.large"
      name                 = "my-eks-wg-0"
      subnets = [
        "${aws_subnet.my-eks-internal-001.id}",
        "${aws_subnet.my-eks-internal-002.id}",
        "${aws_subnet.my-eks-internal-003.id}",
      ]
    }
  ]
  write_kubeconfig = true
}
I was expecting the old worker nodes to be replaced by two new worker nodes running the correct AMI automatically.
N/A.
N/A.
am not sure whether this is a bug or just the expected behaviour
It's expected behaviour. Updating nodes will be a different process for different workloads so we don't attempt to control this process in this module.
a section about cluster and worker group upgrades would be awesome!
You're right. Feel free to create a PR to add some details 😃
kube-proxy and CoreDNS didn't get automatically updated to the relevant versions. It would be awesome if that could be handled automatically as well
It could definitely be automated but this won't be part of this module. Details in https://github.com/terraform-aws-modules/terraform-aws-eks/issues/99
@max-rocket-internet, would you share some best practices for doing a rolling upgrade of the ASG nodes when EKS is upgraded? So far, the only thing I can think of is having 2 ASG worker nodes and manually changing the desired capacity!
Here's what I do:
kubectl drain --force --ignore-daemonsets --delete-local-data ip-xxxxxxx.eu-west-1.compute.internal 🚀
Thanks @max-rocket-internet. It's helpful to know this is the "standard" approach. I would think setting termination_policies = ["OldestLaunchConfiguration"] would help? It would hint to the CA which nodes to delete.
I would think setting termination_policies = ["OldestLaunchConfiguration"] would help ?
Essentially you don't want the ASG doing anything at all, because it doesn't gracefully drain the node; it just shuts the instance down. This is much too aggressive. That's why you use the cluster-autoscaler and kubectl drain. Draining asks the pods to stop, respects all the timeout and shutdown settings in the pods (e.g. terminationGracePeriodSeconds and lifecycle settings), and stops any further pods from being scheduled on the node.
During normal autoscaling, the cluster-autoscaler will also drain a node before telling the ASG to terminate that specific node.
Thanks @max-rocket-internet. I understand what you just mentioned. I guess what's not clear is what makes the CA prefer to kill the nodes that I've just drained. I assume CA has no visibility that those nodes are started from an older LC, or does it? Thanks!
I assume CA has no visibility that those nodes are started from an older LC, or does it ?
No, it doesn't. You are choosing to drain a node because it's an old one, as shown here:
$ kubectl get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
ip-10-6-22-158.ap-southeast-1.compute.internal   Ready    <none>   23d   v1.12.7
ip-10-6-22-221.ap-southeast-1.compute.internal   Ready    <none>   31d   v1.11.9
Then the CA will eventually terminate that node because its status is SchedulingDisabled, as shown here:
NAME                                       STATUS                     ROLES    AGE   VERSION
ip-10-0-27-15.eu-west-1.compute.internal   Ready,SchedulingDisabled   <none>   34d   v1.12.7
The CA will gracefully terminate nodes that are SchedulingDisabled, as well as nodes that are no longer needed to satisfy resource demand.
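The manual step described above (spot the old node in the kubectl get nodes output, then drain it) could be scripted roughly as follows. This is only a minimal sketch, not something the module provides: the function names and the version-prefix argument are made up for illustration, and it assumes kubectl is already pointed at the cluster and that the cluster-autoscaler is running to remove the drained nodes afterwards.

```shell
#!/usr/bin/env bash
# Sketch: drain every node whose kubelet version does not match a target prefix.
# Function names and the prefix argument are illustrative assumptions.

# Reads "NAME VERSION" pairs on stdin and prints the names of nodes whose
# version does NOT start with the given prefix (e.g. "v1.13.").
filter_old_nodes() {
  local prefix="$1"
  local name version
  while read -r name version; do
    [ -z "$name" ] && continue
    case "$version" in
      "$prefix"*) ;;                   # already on the target version: skip
      *) printf '%s\n' "$name" ;;      # older (or different) version: report
    esac
  done
}

# Prints "NAME VERSION" for every node in the cluster.
list_node_versions() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
}

# Drains each old node, one at a time, with the flags quoted in this thread.
# Note: newer kubectl releases renamed --delete-local-data to
# --delete-emptydir-data.
drain_old_nodes() {
  local prefix="$1"
  local node
  list_node_versions | filter_old_nodes "$prefix" | while read -r node; do
    kubectl drain --force --ignore-daemonsets --delete-local-data "$node" < /dev/null
  done
}
```

After sourcing the script, something like drain_old_nodes "v1.13." would drain every node not yet running a 1.13 kubelet. Draining one node at a time keeps the disruption small, but it is a design choice of this sketch rather than a requirement.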
Thanks a ton @max-rocket-internet. It would be really awesome if nodes got a k8s label that is the launch-configuration version, or the ami-id, etc., so that one can easily evict all nodes matching the old label (in case of a large number of nodes). Is it possible to do that today?
PS: I'm happy to send a docs PR on autoscaling.md summarizing everything you mentioned here!
if nodes got a k8s label that is the launch-configuration version, or the ami-id
Yeah that's a good idea. PR welcome 😃
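Until nodes carry such a label automatically, one possible workaround is to label the old nodes by hand and then drain by selector. The sketch below assumes a made-up label key (example.com/ami-generation) that nothing in the module sets, and the function names are hypothetical too; it only relies on kubectl label and the --selector flag of kubectl drain.

```shell
#!/usr/bin/env bash
# Sketch: mark old nodes with a label, then drain everything carrying it.
# LABEL_KEY is a hypothetical key chosen for this example only.
LABEL_KEY="example.com/ami-generation"

# Marks a single node as belonging to the old generation.
mark_old() {
  kubectl label node "$1" "${LABEL_KEY}=old" --overwrite
}

# Drains every node that carries the "old" label, using the flags quoted
# earlier in this thread.
drain_marked() {
  kubectl drain --selector "${LABEL_KEY}=old" \
    --force --ignore-daemonsets --delete-local-data
}
```

For example, mark_old ip-xxxxxxx.eu-west-1.compute.internal followed by drain_marked. Unlike draining node by node, a selector drain targets all labelled nodes in one command, so it suits cases where enough new capacity is already available.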
PS: I'm happy to send a docs PR on autoscaling.md summarizing everything you mentioned here!
Please do 💯