Eksctl: Hanging node group after delete

Created on 12 Sep 2019  ·  18 comments  ·  Source: weaveworks/eksctl

What happened?
I performed an eksctl delete nodegroup --cluster prod-eks --name ng-1; the drain failed because of existing DaemonSets and pods with local data.

I drained the nodes manually with kubectl drain -l 'alpha.eksctl.io/nodegroup-name=ng-1' --force --ignore-daemonsets --delete-local-data

I ran eksctl delete nodegroup --cluster prod-eks --name ng-1 again and got the error

2019-09-11T18:20:08-05:00 [!]  error getting instance role ARN for nodegroup "ng-1"

The CloudFormation delete has also failed to run with the events


2019-08-28 14:06:18 UTC-0500 | eksctl-mim-prod-eks-nodegroup-ng-1 | DELETE_FAILED | The following resource(s) failed to delete: [NodeInstanceRole].
-- | -- | -- | --
2019-08-28 14:06:17 UTC-0500 | NodeInstanceRole | DELETE_FAILED | Cannot delete entity, must detach all policies first. (Service: AmazonIdentityManagement; Status Code: 409; Error Code: DeleteConflict; Request ID: e9ebc137-c9c6-11e9-a56a-e1f2488279d7)

All instances were terminated, but running eksctl get nodegroup --cluster prod-eks still lists the node group:

→ eksctl get nodegroup --cluster mim-prod-eks
CLUSTER         NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID
prod-eks    ng-1            2019-08-14T16:28:19Z    1               4               3                       t3.medium       ami-0f2e8e5663e16b436
prod-eks    ng-6            2019-09-11T19:21:31Z    1               10              4                       t3.large        ami-0d3998d69ebe9b214

What you expected to happen?
eksctl would no longer list the deleted node group

How to reproduce it?
Not sure why it failed tbh

Anything else we need to know?
Very standard install

Versions
Please paste in the output of these commands:

$ eksctl version
[ℹ]  version.Info{BuiltAt:"", GitCommit:"", GitTag:"0.5.3"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:36:28Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.10-eks-5ac0f1", GitCommit:"5ac0f1d9ab2c254ea2b0ce3534fd72932094c6e1", GitTreeState:"clean", BuildDate:"2019-08-20T22:39:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

Logs
Include the output of the command line when running eksctl. If possible, eksctl should be run with debug logs. For example:
eksctl get clusters -v 4
Make sure you redact any sensitive information before posting.
If the output is long, please consider a Gist.

area/deletions kind/bug needs-investigation

Most helpful comment

I had this issue, which in my case I found a solution for.

For me it related to dangling ENIs left behind by auto-scaling instances up and down (spot in my case). These ENIs were still attached to the node group security group, so the security groups could not be deleted when deleting the cloudformation stack (initiated by eksctl).

Deleting these ENIs (they have a status of Available and not attached to an instance, also will have the node group security group listed) allowed cloudformation to properly delete the stack for the node group and it appears completely deleted to eksctl.

Deleting these dangling ENIs every so often (depending on how quickly they build up for you) is also good policy as they have caused other issues for me (and others) as well:

See:
https://github.com/aws/amazon-vpc-cni-k8s/issues/59
https://github.com/aws/amazon-vpc-cni-k8s/issues/608
etc

All 18 comments

hey

Have you been able to fix this issue in any way? The same thing happened to me yesterday and I can't find a way to permanently delete nodegroup from my EKS cluster.

I had this issue, which in my case I found a solution for.

For me it related to dangling ENIs left behind by auto-scaling instances up and down (spot in my case). These ENIs were still attached to the node group security group, so the security groups could not be deleted when deleting the cloudformation stack (initiated by eksctl).

Deleting these ENIs (they have a status of Available and not attached to an instance, also will have the node group security group listed) allowed cloudformation to properly delete the stack for the node group and it appears completely deleted to eksctl.

Deleting these dangling ENIs every so often (depending on how quickly they build up for you) is also good policy as they have caused other issues for me (and others) as well:

See:
https://github.com/aws/amazon-vpc-cni-k8s/issues/59
https://github.com/aws/amazon-vpc-cni-k8s/issues/608
etc
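For anyone scripting this cleanup, the selection logic described above can be sketched in Python (the function name and sample data are hypothetical, not part of eksctl; in practice you would feed it the JSON from aws ec2 describe-network-interfaces and then delete each returned ID with aws ec2 delete-network-interface):

```python
# Hypothetical sketch: pick out "dangling" ENIs from the JSON shape returned
# by `aws ec2 describe-network-interfaces`, given the node group's security
# group ID.

def find_dangling_enis(network_interfaces, nodegroup_sg_id):
    """Return IDs of ENIs that are unattached (status 'available') and
    still reference the node group security group."""
    dangling = []
    for eni in network_interfaces:
        in_sg = any(g["GroupId"] == nodegroup_sg_id for g in eni.get("Groups", []))
        if eni.get("Status") == "available" and in_sg:
            dangling.append(eni["NetworkInterfaceId"])
    return dangling

# Sample data shaped like the describe-network-interfaces response
# (IDs are placeholders):
enis = [
    {"NetworkInterfaceId": "eni-aaa", "Status": "available",
     "Groups": [{"GroupId": "sg-nodegroup"}]},
    {"NetworkInterfaceId": "eni-bbb", "Status": "in-use",
     "Groups": [{"GroupId": "sg-nodegroup"}]},
    {"NetworkInterfaceId": "eni-ccc", "Status": "available",
     "Groups": [{"GroupId": "sg-other"}]},
]

print(find_dangling_enis(enis, "sg-nodegroup"))  # ['eni-aaa']
```

Only ENIs that are both unattached and in the node group's security group are candidates; deleting in-use ENIs would break running instances.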

+1 faced the same issue

Facing the same issue here and not sure how to proceed.

In trying to delete the cluster I see the following error

eksctl delete cluster --name floral-rainbow-1574743755
eksctl version 0.10.2
using region us-east-1
deleting EKS cluster "floral-rainbow-1574743755"
cleaning up LoadBalancer services
no eksctl-managed CloudFormation stacks found for "floral-rainbow-1574743755"

I went to the AWS console, I see the EKS cluster there, trying to delete the cluster manually I am seeing the following error.

ResourceInUseException
Cluster has node groups attached

Drilling into the node group, I see it listed there. I tried to manually delete the node group from the AWS console and it errored out as well with DELETE_FAILED.

With kubectl I am not seeing the nodes anymore but I see the following resources

kubectl get all --all-namespaces
NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-77f96c54b6-c78x4   0/1     Pending   0          4d2h
kube-system   pod/coredns-77f96c54b6-j8jh4   0/1     Pending   0          4d2h

NAMESPACE     NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
default       service/kubernetes   ClusterIP   10.100.0.1    <none>        443/TCP         6d20h
kube-system   service/kube-dns     ClusterIP   10.100.0.10   <none>        53/UDP,53/TCP   6d20h

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/aws-node     0         0         0       0            0           <none>          6d20h
kube-system   daemonset.apps/kube-proxy   0         0         0       0            0           <none>          6d20h

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           6d20h

NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-77f96c54b6   2         2         0       6d20h

Any help is appreciated, as I don't know of a way to clean up and remove this cluster now.

Thanks

Hey @ddavtian did you manage to delete the EKS cluster ? How did you go about it ?

@Chidiebube I did. The issue is that there is a broken ConfigMap in the cluster that needs to be manually fixed first. Looking at my history of commands, try poking around with this to make sure the YAML is valid:

kubectl edit -n kube-system configmap/aws-auth

Fix it and try to delete the cluster again from the AWS console.
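For reference, a minimal well-formed aws-auth ConfigMap looks roughly like this (the account ID and role name in the ARN are placeholders; a malformed mapRoles block is a common cause of the breakage described above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/eksctl-prod-eks-nodegroup-ng-1-NodeInstanceRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```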

+1 faced the same issue


Thanks, this worked for me.

+1

+1

+1

We also experienced this issue; what worked for us:

  • Delete dangling ENIs (as mentioned above)
  • Resume the deletion by manually deleting the associated CloudFormation stack

+1

I ran into the same problem :((

============

These steps worked for me:

  1. Delete all ENIs associated with the EKS cluster
  2. Delete all security groups associated with the EKS cluster

After that, I was able to delete the node group and the cluster...

Yeah!

I'll also say this is not an eksctl-specific issue. Our EKS cluster was not created or managed with eksctl, and we had the same issue of dangling ENIs.

Same issue here. Although eksctl said it deleted the node group, the CloudFormation stack had failed to delete. The message "must detach all policies first" made me look at the node group's NodeInstanceRole in IAM. I removed the last remaining policy (CloudWatchLogsFullAccess) on that role and that worked for me.
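That detach step can be scripted. A hypothetical Python sketch: given the AttachedPolicies list from aws iam list-attached-role-policies, build the aws iam detach-role-policy commands to run before retrying the stack delete (the role name and ARNs below are placeholders):

```python
def detach_commands(role_name, attached_policies):
    """Build one `aws iam detach-role-policy` command per attached policy.

    `attached_policies` is shaped like the AttachedPolicies list returned by
    `aws iam list-attached-role-policies --role-name <role>`.
    """
    return [
        f"aws iam detach-role-policy --role-name {role_name} "
        f"--policy-arn {p['PolicyArn']}"
        for p in attached_policies
    ]

# Placeholder role name and policy, mirroring the comment above:
policies = [{"PolicyName": "CloudWatchLogsFullAccess",
             "PolicyArn": "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"}]
for cmd in detach_commands("eksctl-prod-eks-NodeInstanceRole", policies):
    print(cmd)
```

Once the role has no attached policies, retrying the CloudFormation stack delete should let it remove NodeInstanceRole.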

Same issue. I deleted the autoscaling group, the NAT gateways, and the VPCs (thanks to the billing alerts). I couldn't find any cluster to delete.

There was another way: the CloudFormation stacks were still running, so I went ahead and deleted those. That worked too the second time around!

See #2172 and potentially fixed by https://github.com/weaveworks/eksctl/pull/2762

@MG40 +1. I also deleted the hanging node groups by deleting the associated CloudFormation stack.
