What happened?
Running 'eksctl delete nodegroup' immediately killed all the nodes, without draining them and moving their pods to the new nodegroup first.
What you expected to happen?
The nodegroup should have been drained first, i.e. existing pods moved to the new nodegroup, before the nodes were deleted.
The logs show the nodes being cordoned, but in reality the nodegroup was deleted instantly and there was massive downtime on the cluster while pods were slowly rescheduled onto the new nodegroup. No actual draining of the nodes took place.
This is the behavior described in the official docs, and the EKS upgrade docs recommend the same approach.
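For comparison, draining a node manually with kubectl before deleting the nodegroup does evict pods gradually. A hedged sketch as a workaround, using one of the node names from the logs below (the flags shown are the ones available in kubectl 1.14):

kubectl drain ip-192-168-18-105.ec2.internal --ignore-daemonsets --delete-local-data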
How to reproduce it?
Follow the upgrade steps for EKS cluster using eksctl.
Create a new nodegroup, then delete the previous nodegroup with the command:
eksctl delete nodegroup --cluster mycluster --name ng-1
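For context, the replacement nodegroup was created first with something along these lines (the name "ng-2" and the node count are illustrative, not the exact values used):

eksctl create nodegroup --cluster mycluster --name ng-2 --nodes 3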
Anything else we need to know?
Windows 10
Versions
$ eksctl version 0.8
$ kubectl version 1.14
Logs
eksctl delete nodegroup --cluster mycluster --name ng-1
[ℹ] eksctl version 0.8.0
[ℹ] using region us-east-1
[ℹ] combined include rules: ng-1
[ℹ] 1 nodegroup (ng-1) was included (based on the include/exclude rules)
[ℹ] will delete 1 nodegroups from auth ConfigMap in cluster "mycluster"
[ℹ] removing identity "arn:aws:iam::xxx:role/eksctl-mycluster-nodegroup-ng-1-NodeInstanceRole-T528C66G9SYK" from auth ConfigMap (username = "system:node:{{EC2PrivateDNSName}}", groups = ["system:bootstrappers" "system:nodes"])
[ℹ] will drain 1 nodegroups in cluster "mycluster"
[ℹ] cordon node "ip-192-168-18-105.ec2.internal"
[ℹ] cordon node "ip-192-168-62-100.ec2.internal"
[ℹ] cordon node "ip-192-168-67-209.ec2.internal"
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-ksrv9, default/dsinfra-agent-q4f9k, default/logdna-agent-m2xtw, kube-system/aws-node-jkgd2, kube-system/kube-proxy-9htpd
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-8vzkz, default/dsinfra-agent-p2lf9, default/logdna-agent-nb5wv, kube-system/aws-node-g55k4, kube-system/kube-proxy-gc6dj
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-x9d99, default/dsinfra-agent-7qnsr, default/logdna-agent-bzndz, kube-system/aws-node-b5427, kube-system/kube-proxy-8rplq
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-ksrv9, default/dsinfra-agent-q4f9k, default/logdna-agent-m2xtw, kube-system/aws-node-jkgd2, kube-system/kube-proxy-9htpd
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-8vzkz, default/dsinfra-agent-p2lf9, default/logdna-agent-nb5wv, kube-system/aws-node-g55k4, kube-system/kube-proxy-gc6dj
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-ksrv9, default/dsinfra-agent-q4f9k, default/logdna-agent-m2xtw, kube-system/aws-node-jkgd2, kube-system/kube-proxy-9htpd
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-8vzkz, default/dsinfra-agent-p2lf9, default/logdna-agent-nb5wv, kube-system/aws-node-g55k4, kube-system/kube-proxy-gc6dj
[!] ignoring DaemonSet-managed Pods: default/datadog-agent-x9d99, default/dsinfra-agent-7qnsr, default/logdna-agent-bzndz, kube-system/aws-node-b5427, kube-system/kube-proxy-8rplq
[✔] drained nodes: [ip-192-168-18-105.ec2.internal ip-192-168-62-100.ec2.internal ip-192-168-67-209.ec2.internal]
[ℹ] will delete 1 nodegroups from cluster "mycluster"
[ℹ] 1 task: { delete nodegroup "ng-1" [async] }
[✔] deleted 1 nodegroups from cluster "mycluster"
Most helpful comment
It seems the pods were deleted instantly because no PodDisruptionBudget was set.
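A minimal sketch of such a PodDisruptionBudget, assuming a hypothetical "my-app" workload running on ng-1 (the name and label are placeholders; policy/v1beta1 is the API version available on Kubernetes 1.14):

kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas running during voluntary evictions
  selector:
    matchLabels:
      app: my-app          # must match the labels on the pods being protected
EOF

With a budget like this in place, a drain can only evict pods as fast as the budget allows, rather than removing them all at once.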