Eksctl: deleting nodegroup while creating new ng leads to empty authconfigmap

Created on 9 Jan 2020 · 8 Comments · Source: weaveworks/eksctl

What happened?
Running eksctl delete nodegroup -f nodegroup.yaml causes the node/bootstrap role to be removed from the aws-auth ConfigMap. This instantly breaks every application and node in the cluster.

What you expected to happen?
The cause of this is cancelling a nodegroup creation before it's finished and running the delete command straight after. It has caused outages for me twice now and I'm never using eksctl again because of this.

Fix both the AWS API throttling logic and aws-auth change validation.

How to reproduce it?
eksctl delete nodegroup -f file

Anything else we need to know?
The nodegroup definition IAM key is as follows

iam:
  instanceRoleARN: "arn:aws:iam::123456789012:role/EKS-NodesRole"
  instanceProfileARN: "arn:aws:iam::123456789012:instance-profile/EKS-NodesInstanceProfile"
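For context, the iam block above sits inside a nodegroup entry of a ClusterConfig file. A minimal sketch of such a file, reusing the shared pre-existing role, might look like this (cluster name and region are taken from the logs below; instance type and capacity are illustrative assumptions):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: k8s-ssp            # cluster name from the logs in this issue
  region: ap-southeast-2

nodeGroups:
  - name: spot-b
    instanceType: m5.large # illustrative
    desiredCapacity: 2     # illustrative
    iam:
      instanceRoleARN: "arn:aws:iam::123456789012:role/EKS-NodesRole"
      instanceProfileARN: "arn:aws:iam::123456789012:instance-profile/EKS-NodesInstanceProfile"
```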

Versions
eksctl: 0.10.2

Kubectl:
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-14T04:24:34Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Logs
eksctl delete nodegroup -f nodegroup-spot-nvme-b.yaml --approve
eksctl version 0.10.2
using region ap-southeast-2
comparing 1 nodegroups defined in the given config ("nodegroup-spot-nvme-b.yaml") against remote state
nodegroup "spot-a" present in the cluster, but missing from the given config
nodegroup "nodes-b" present in the cluster, but missing from the given config
nodegroup "spot-c" present in the cluster, but missing from the given config
nodegroup "nodes-c" present in the cluster, but missing from the given config
nodegroup "nodes-a" present in the cluster, but missing from the given config
nodegroup (spot-b) was included (based on the include/exclude rules)
will delete 1 nodegroups from auth ConfigMap in cluster "k8s-ssp"

Shared between all nodes in the cluster
_removing identity "arn:aws:iam::12345678901:role/EKS-NodesRole" from auth ConfigMap (username = "system:node:{{EC2PrivateDNSName}}", groups = ["system:bootstrappers" "system:nodes"])_

kind/bug

All 8 comments

I faced a similar issue a few weeks back as well, then manually added the entry back into the config map with the eksctl create iamidentitymapping command.
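The restore step mentioned above can be sketched as follows (cluster name and role ARN taken from this thread; the username and groups mirror the values eksctl logged when it removed the identity):

```shell
# Re-create the node role mapping that the delete removed from aws-auth.
eksctl create iamidentitymapping \
  --cluster k8s-ssp \
  --arn arn:aws:iam::123456789012:role/EKS-NodesRole \
  --username 'system:node:{{EC2PrivateDNSName}}' \
  --group system:bootstrappers \
  --group system:nodes
```

This requires credentials for the cluster's AWS account; once the mapping is back, existing nodes can authenticate again without being replaced.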

Just checked and found out that the default value for --update-config-map is true, as per https://github.com/weaveworks/eksctl/blob/master/pkg/ctl/cmdutils/cmdutils.go#L179. Considering that there is no trivial way to check whether the role is used by other nodegroups, can we make the default value of this flag false? (There is no harm in having an additional entry in the auth config map.) We could also print out the details for the user's consideration.

Sorry for the emotional issue logging, it was a bad day. Thanks for the response, Tom @sayboras

I found the switch you mentioned and haven't experienced the issue again. It seemed to happen when the describe-stacks operation was being throttled.

I'm not sure if using a single shared instance role for nodes is supported by eksctl, but it works well currently.

@lachlanbb It's fine, we all have good days and bad days. Thanks for your issue anyway :), I still feel like some work needs to be done to avoid such issues going forward. PS: It's Tam :smile:

@martina-if @marccarre @cPu1 For me as a user, I would love to use a minimal but working configuration. This might require changing the default value of --update-config-map, or we could make this flag required. Looking forward to hearing your inputs.

Consider that there is no trivial way to check if the role is used by other node groups or not

There is some thought in #544 about this

@sayboras @lachlanbb
Looking at the code, I would expect the behavior below. Do you have an explanation why this doesn't happen for you? Do you pass --update-config-map=false in your create ng calls?

  • foo.yaml with 1 ng: create ng -f foo.yaml — creates 1 ng [+1 role in authconfigmap = 1 total]
  • replace ng in foo.yaml: create ng -f foo.yaml — creates 1 ng [+1 role = 2 total]
  • delete ng -f foo.yaml: deletes 1 ng [-1 role = 1 total]

Edit: just to give some clarification about identities here: eksctl adds duplicate identities, one for each nodegroup. If you create 3 nodegroups and delete 1, there are still 2 identities left.
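You can see those duplicate identities directly in the ConfigMap:

```shell
# Inspect the role mappings eksctl manages:
kubectl -n kube-system get configmap aws-auth -o yaml
```

With two nodegroups sharing one role, mapRoles would contain the same entry twice (illustrative excerpt, not actual output from this cluster):

```yaml
mapRoles: |
  - rolearn: arn:aws:iam::123456789012:role/EKS-NodesRole
    username: system:node:{{EC2PrivateDNSName}}
    groups: [system:bootstrappers, system:nodes]
  - rolearn: arn:aws:iam::123456789012:role/EKS-NodesRole
    username: system:node:{{EC2PrivateDNSName}}
    groups: [system:bootstrappers, system:nodes]
```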

The cause of this is cancelling out a nodegroup creation before its finished and running the delete command straight after.

I think I understand.
So _while_ eksctl creates a nodegroup for you, you switch to another shell and delete the old ng? Creation only adds the aws-auth ConfigMap entry _after_ it completes, but deletion removes the entry before the deletion starts.

So if you do this (all ngs have the same role):

  • have a running ng [=1 total role in acm]
  • while creating a new ng [=still 1 total since creation hasn't finished]
  • delete a ng [-1= 0 total roles]
  • creation from above finishes [+1 = 1 total roles]

So the conclusion here is that you shouldn't run two competing eksctl commands in parallel: wait for creation to finish before you delete old nodegroups. Alternatively, prevent the deletion from updating the auth ConfigMap by passing --update-auth-configmap=false.
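The safe sequence described above, sketched as shell (flag name as given in this comment; file names are illustrative, and you should verify the flag against your eksctl version):

```shell
# Create first and let eksctl block until the new nodes have joined.
eksctl create nodegroup -f nodegroup-new.yaml

# Only then delete the old nodegroup. If both groups share one IAM role,
# keep the shared identity in aws-auth by skipping the ConfigMap update.
eksctl delete nodegroup -f nodegroup-old.yaml --approve --update-auth-configmap=false
```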

I think the default of true is fine in general, because when you do things in order (there is a reason eksctl waits for the nodes to join before it finishes) it works as expected.

@lachlanbb thanks a lot for the report. I hope @rndstr's answer clarifies the behavior. I will close this issue, but if you find more problems, or you disagree and think something more should be done, please feel free to reopen it.

@martina-if
Yes it does. The only clarification I would give is that the competing eksctl create/delete commands were necessary because our CloudFormation API calls were being rate limited, and the exponential backoff eventually caused a timeout. (This was our problem, unrelated to anything eksctl was doing.)

Thanks for the assistance. Still using eksctl @ 1.15, it's been good :)
