What happened?
Playing around on eksworkshop.com.
Created an EKS cluster with eksctl create cluster -f eksworkshop.yaml, but the managed node group stack failed with "Nodegroup nodegroup failed to stabilize: Internal Failure".
I think it is easy to reproduce.
$ aws --version
aws-cli/1.18.36 Python/3.6.10 Linux/4.14.171-105.231.amzn1.x86_64 botocore/1.15.36
$ eksctl version
0.16.0
$ cat eksworkshop.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eksworkshop-eksctl
  region: us-west-2
managedNodeGroups:
  - name: nodegroup
    desiredCapacity: 3
    iam:
      withAddonPolicies:
        albIngress: true
secretsEncryption:
  keyARN: arn:aws:kms:us-west-2:<my aws account>:key/<kms key id or name>
What you expected to happen?
I expected the managed node group with desired capacity 3 to be created successfully.
How to reproduce it?
Playing around on eksworkshop.com. Just follow the steps specified at https://eksworkshop.com/030_eksctl/launcheks/
Anything else we need to know?
What OS are you using, are you using a downloaded binary or did you compile eksctl, what type of AWS credentials are you using (i.e. default/named profile, MFA) - please don't include actual credentials though!
I am using a downloaded binary, just following the steps in https://eksworkshop.com/030_eksctl/launcheks/ and the prerequisites specified in the documentation.
Versions
Please paste in the output of these commands:
$ eksctl version
0.16.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.10-eks-bac369", GitCommit:"bac3690554985327ae4d13e42169e8b1c2f37226", GitTreeState:"clean", BuildDate:"2020-02-21T23:37:18Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Logs
[ℹ] deploying stack "eksctl-eksworkshop-eksctl-nodegroup-nodegroup"
[✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-eksworkshop-eksctl-nodegroup-nodegroup"
[ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
[✖] AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_FAILED – "Nodegroup nodegroup failed to stabilize: Internal Failure"
[ℹ] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=eksworkshop-eksctl'
[✖] waiting for CloudFormation stack "eksctl-eksworkshop-eksctl-nodegroup-nodegroup": ResourceNotReady: failed waiting for successful resource state
I tried this with your config, and the cluster and nodegroup creation were successful.
"Nodegroup nodegroup failed to stabilize: Internal Failure"
The error suggests it might be a temporary failure. Can you retry and see if you still get that error?
Since you're from Amazon, is it possible you're trying this from an AWS account that's whitelisted for internal/beta features and it's failing because of that?
Since this was probably a temporary failure on AWS's side, I am closing this issue, but please feel free to reopen it if the issue wasn't resolved.
I'm having this same problem consistently today (four times, I think? lost count, really). The same YAML config was working several days ago without a hitch. See the redacted file (substituting MY_SERVICE and MY_ACCT_NO) below:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: MY_SERVICE
  region: us-east-1
managedNodeGroups:
  - name: MY_SERVICE-node-group
    instanceType: t3.medium
    desiredCapacity: 2
    minSize: 2
    maxSize: 3
    labels:
      app: MY_SERVICE
      actorSystemName: MY_SERVICE
    ssh:
      allow: true
availabilityZones:
  - us-east-1a
  - us-east-1f
The relevant portion of the output:
[ℹ] deploying stack "eksctl-MY_SERVICE-nodegroup-MY_SERVICE-node-group"
[✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-MY_SERVICE-nodegroup-MY_SERVICE-node-group"
[ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
[✖] AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_FAILED – "Nodegroup MY_SERVICE-node-group failed to stabilize: Internal Failure"
[ℹ] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-east-1 --name=MY_SERVICE'
[✖] waiting for CloudFormation stack "eksctl-MY_SERVICE-nodegroup-MY_SERVICE-node-group": ResourceNotReady: failed waiting for successful resource state
I am having the same issue and it is still not resolved. The same cluster spec worked before, and I have tried it multiple times today and hit the same issue. I have also tried different regions and different availability zones in case it was a capacity issue.
[ℹ] building cluster stack "eksctl-MY_EKS-cluster"
[ℹ] deploying stack "eksctl-MY_EKS-cluster"
[ℹ] building managed nodegroup stack "eksctl-MY_EKS-nodegroup-MY_NODEGROUP"
[ℹ] deploying stack "eksctl-MY_EKS-nodegroup-MY_NODEGROUP"
[✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-MY_EKS-nodegroup-MY_NODEGROUP"
[ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
[✖] AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_FAILED – "Nodegroup MY_NODEGROUP failed to stabilize: Internal Failure"
[ℹ] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-east-1 --name=MY_EKS'
[✖] waiting for CloudFormation stack "eksctl-MY_EKS-nodegroup-MY_NODEGROUP": ResourceNotReady: failed waiting for successful resource state
I have the same issue: it does not work with --managed, but works fine when I remove --managed.
eksctl create cluster --ssh-access --ssh-public-key=k8s-workshop --nodegroup-name airflowworkers --node-type t3.large --asg-access --vpc-private-subnets=subnet-0d578488c9c110198,subnet-0e8746acbeaad9f9a --vpc-public-subnets=subnet-0d69219ca76278fe1,subnet-0b3cf06c6e8a70a18 --managed
I am also not able to create a node group in an existing cluster.
eksctl create nodegroup --cluster=mymir-eks --name=kafka-workers --region us-east-1 --nodes-min 3 --nodes-max 4 --ssh-access --ssh-public-key "C:/Users/desind/kafka.pub" --managed
2020-05-05T16:34:44-04:00 [✖] AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_FAILED – "Nodegroup kafka-stateful failed to stabilize: Internal Failure"
What is "Internal Failure"?
Confirmed. Fails with managed node groups, works with non-managed ones.
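For anyone who needs a stopgap, a minimal sketch of the unmanaged fallback is the same command from the earlier comment with --managed dropped (subnet IDs and key name are that commenter's; substitute your own):

# Same command as above, minus --managed: creates a self-managed node group instead
eksctl create cluster --ssh-access --ssh-public-key=k8s-workshop \
  --nodegroup-name airflowworkers --node-type t3.large --asg-access \
  --vpc-private-subnets=subnet-0d578488c9c110198,subnet-0e8746acbeaad9f9a \
  --vpc-public-subnets=subnet-0d69219ca76278fe1,subnet-0b3cf06c6e8a70a18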
I experienced the same issue on eksctl 0.16.0 and found it was no longer an issue after upgrading to 0.18.0.
I believe it was related to this PR: https://github.com/weaveworks/eksctl/pull/2002
You can check whether you are having the same problem as me by manually creating a managed node group in the EKS console with the same settings eksctl would have used. The error message about not being able to assign IP addresses in the chosen subnets is far more useful than the CloudFormation error.
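If you'd rather check from the CLI first, something like this shows whether each public subnet auto-assigns public IPv4 addresses (the subnet IDs are the ones from the earlier comment; substitute your own):

# Check the auto-assign public IPv4 setting on the public subnets
aws ec2 describe-subnets \
  --subnet-ids subnet-0d69219ca76278fe1 subnet-0b3cf06c6e8a70a18 \
  --query 'Subnets[].{Subnet:SubnetId,AutoAssignPublicIp:MapPublicIpOnLaunch}' \
  --output table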
Kind of frustrating but I had the same behavior. Failed to stabilize with 0.16 but worked with 0.18.
Huh. I didn't even know 0.18 was out.
Thanks @miguel-aguirre-iBlocks !
This is because of the new default for the network setting "AssignPublicIpOnLaunch". The setting must now be enabled on every public subnet, since it no longer takes effect when set on the node group (that was the legacy behavior). This has been the default since eksctl 0.17.0.
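If that's the cause, one way to enable the setting on each public subnet from the CLI (reusing a subnet ID from the earlier comment as a placeholder):

# Enable auto-assign public IPv4 so nodes launched in this subnet get addresses
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-0d69219ca76278fe1 \
  --map-public-ip-on-launch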
@chaochn-amazon can you confirm that you can create managed nodegroups with eksctl 0.17.0 or higher?
Still facing the same issue with eksctl 0.30.0. Does anyone know of a workaround?
@shrutilamba Can you please post the exact commands or configs you're using as well as the output?
@michaelbeaumont I am using the following config:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-managed-cluster
  region: ap-south-1
vpc:
  id: "vpc-xxxxxxxxxxxxxx"
  subnets:
    private:
      ap-south-1a:
        id: "subnet-xxxxxxxxxxxx"
      ap-south-1b:
        id: "subnet-xxxxxxxxxxxx"
      ap-south-1c:
        id: "subnet-xxxxxxxxxxxxx"
    public:
      ap-south-1a:
        id: "subnet-xxxxxxxxxxxxxx"
      ap-south-1b:
        id: "subnet-xxxxxxxxxxxxx"
      ap-south-1c:
        id: "subnet-xxxxxxxxxxxxxx"
managedNodeGroups:
It's giving me this error:
[✖] AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_FAILED – "Nodegroup hs-eks-new-automation-testing failed to stabilize: []"
@shrutilamba your overrideBootstrapCommand is using the wrong cluster name. The first argument to the bootstrap script should be the cluster name.
overrideBootstrapCommand: |
  /etc/eks/bootstrap.sh eks-managed-cluster
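For context, a minimal sketch of where that override lives in the config; the node group name and AMI ID below are placeholders, and as far as I know overrideBootstrapCommand only applies to managed node groups that specify a custom AMI:

managedNodeGroups:
  - name: managed-ng-1              # placeholder node group name
    ami: ami-0123456789abcdef0      # placeholder custom AMI ID
    overrideBootstrapCommand: |
      #!/bin/bash
      # the first argument must match metadata.name of the cluster
      /etc/eks/bootstrap.sh eks-managed-cluster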
@cPu1 thanks. Changing the cluster name worked!
It's January 16, 2021 and I am having the same struggles listed above. I attempted to create node groups in my EKS clusters both from the eksctl CLI and from the EKS console, with the same results. I managed to get one node group to work after enabling auto-assign IPv4 on my subnets. However, every other attempt seems to fail now; it even failed on the first try after I made those modifications, yet worked 6 hours later.
Is there something wrong with my region "us-east-1" when generating node groups? Every tutorial and person I talk to says this should work like 1-2-3, but it runs for 25 minutes and then just bombs out with errors for the node group.
Also having issues creating Managed NodeGroups using a YAML file:
2021-01-17T13:27:19+02:00 [▶] nodegroups = []
I am having the same issue: I can't create a managed node group in a subnet without enabling public IP addresses on the subnet. Any help is appreciated.
@jontiefer @tewner @kellen-dunham There are countless reasons node groups may fail to create. Please create separate issues and provide full configs or commands
@michaelbeaumont I have an open issue with the config and details you need: https://github.com/weaveworks/eksctl/issues/2834. The bottom line is that, as far as I know, the AWS subnet has to have "Assign public IPv4" enabled, otherwise node groups can't be created in it; and if the setting is disabled after the node group is created, the node group becomes "Degraded" with an error similar to the one referenced in these issues.
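One way to surface that "Degraded" state and its underlying reason from the CLI (cluster and node group names are placeholders):

# Show the health issues EKS reports for a managed node group
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.health.issues'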