Cluster creation timeout - private subnets - windows worker
What happened?
A timeout error occurred and the cluster was not fully created. Creation works if I use public subnets and fails if I use private subnets.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: dave-eks2
  region: us-west-2
  version: '1.16'
vpc:
  id: "vpc-zzz"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2a:
        id: "subnet-xxx"
        cidr: "10.12.40.0/25"
      us-west-2b:
        id: "subnet-yyy"
        cidr: "10.12.40.128/25"
managedNodeGroups:
  - name: linux-ng
    instanceType: t2.large
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
nodeGroups:
  - name: windows-ng
    instanceType: t3.large
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-xxx']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer2019CoreContainer
eksctl create cluster -f .\eks-cluster2-spec-my-vpc-with-private-subnets.yaml --install-vpc-controllers --timeout 40m --verbose=5 *> create-cluster-output.txt
[create-cluster-output.txt](https://github.com/weaveworks/eksctl/files/4848650/create-cluster-output.txt)
cat create-cluster-output.txt
2020-06-29T21:10:44Z [ℹ]  eksctl version 0.22.0
2020-06-29T21:10:45Z [ℹ]  using region us-west-2
2020-06-29T21:10:45Z [▶]  DEBUG: Request sts/GetCallerIdentity Details:
...
2020-06-29T21:22:06Z [▶]  completed task: create cluster control plane "dave-eks2"
2020-06-29T21:22:06Z [▶]  started task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } }
2020-06-29T21:22:06Z [▶]  started task: install Windows VPC controller
2020-06-29T21:22:06Z [▶]  started task: install Windows VPC controller
2020-06-29T21:22:36Z [▶]  failed task: install Windows VPC controller (will not run other sequential tasks)
2020-06-29T21:22:36Z [▶]  failed task: install Windows VPC controller (will not run other sequential tasks)
2020-06-29T21:22:36Z [▶]  failed task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-06-29T21:22:36Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-06-29T21:22:36Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks2'
2020-06-29T21:22:36Z [✖]  getting list of API resources for raw REST client: Get "https://3C15FCAE7DAE5E25E346FD837BE3595C.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 52.34.186.59:443: i/o timeout
eksctl : Error: failed to create cluster "dave-eks2"
At line:1 char:1
+ eksctl create cluster -f .\eks-cluster2-spec-my-vpc-with-private-subn ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Error: failed t...ter "dave-eks2":String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
PS D:\dave> eksctl.exe version
0.22.0
PS D:\dave> .\kubectl.exe version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:40:13Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"windows/amd64"}
I also commented on #2201, but I decided to open a new issue because I am stuck: I cannot create a Windows cluster with private subnets.
@dave-meier I will see if I can reproduce this issue
Thanks @martina-if
Here is the most recent run. This time only the "eksctl-dave-eks-cluster" CloudFormation stack was created. I didn't see any nodegroup CloudFormation stacks as in previous runs, so it does not seem to get as far, and no EC2 instances are created:
yaml:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: dave-eks
  region: us-west-2
  version: '1.16'
vpc:
  id: "vpc-zzz"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2a:
        id: "subnet-xxx"
        cidr: "10.12.40.0/25"
      us-west-2b:
        id: "subnet-yyy"
        cidr: "10.12.40.128/25"
managedNodeGroups:
  - name: linux-ng
    instanceType: t3.medium
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
nodeGroups:
  - name: windows-ng
    instanceType: t3.medium
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-vvv']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer2019CoreContainer
Results:
PS D:\dave> eksctl create cluster -f .\eks-cluster-spec-my-vpc-with-private-subnets.yaml --install-vpc-controllers --timeout 40m
[ℹ]  eksctl version 0.22.0
[ℹ]  using region us-west-2
[✔]  using existing VPC (vpc-zzz) and subnets (private:[subnet-xxx subnet-yyy] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "windows-ng" will use "ami-0ee42fc568b2881e1" [WindowsServer2019CoreContainer/1.16]
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using Kubernetes version 1.16
[ℹ]  creating EKS cluster "dave-eks" in "us-west-2" region with managed nodes and un-managed nodes
[ℹ]  2 nodegroups (linux-ng, windows-ng) were included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=dave-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "dave-eks" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=dave-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "dave-eks" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "dave-eks", 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } }
[ℹ]  building cluster stack "eksctl-dave-eks-cluster"
[ℹ]  deploying stack "eksctl-dave-eks-cluster"
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
[✖]  getting list of API resources for raw REST client: Get "https://3A745D7D5F51ED0801EA32F68C52E0A9.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 54.149.63.46:443: i/o timeout
Error: failed to create cluster "dave-eks"
Hi @dave-meier, can you check what the error is in CloudFormation and post it here, please? Also, if you create the cluster with the verbose option -v 4, we may be able to see more about what the problem is.
@dave-meier I tried to reproduce this in eu-north-1 with a very similar configuration and had no issues, except for an error when trying to use t2.large instances (those are quite old and don't exist in modern regions).
I think there is a problem with your subnets, with the instances in that region, or perhaps with the security group. I think the key must be in the CloudFormation events in the console.
@martina-if So it seems to always work if I enable full CloudWatch logs. It also worked once today with the exact same yaml file as before, so there is definitely a problem that happens intermittently. For now I will just enable CloudWatch logging to create the cluster. What is the command to scale back or turn off CloudWatch logging after the cluster is up and running?
Here's what the failure looks like when it happens (just showing the end bit where it fails):
2020-07-10T15:49:53Z [▶]  waiting for CloudFormation stack "eksctl-dave-eks-cluster"
2020-07-10T15:49:53Z [▶]  done after 13m51.3887136s of waiting for CloudFormation stack "eksctl-dave-eks-cluster"
2020-07-10T15:49:53Z [▶]  processing stack outputs
2020-07-10T15:49:53Z [▶]  completed task: create cluster control plane "dave-eks"
2020-07-10T15:49:53Z [▶]  started task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } }
2020-07-10T15:49:53Z [▶]  started task: install Windows VPC controller
2020-07-10T15:49:53Z [▶]  started task: install Windows VPC controller
2020-07-10T15:50:23Z [▶]  failed task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-07-10T15:50:23Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-07-10T15:50:23Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-07-10T15:50:23Z [✖]  getting list of API resources for raw REST client: Get "https://1C658EF25478B04CAA1D7CF7964D9BD6.yl4.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 54.70.74.99:443: i/o timeout
Error: failed to create cluster "dave-eks"
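Every failed run in this thread ends with a `dial tcp ...:443: i/o timeout` against the cluster's API endpoint. As a diagnostic aid, here is a minimal Python sketch that polls a host and port until a TCP connection succeeds. It is an assumption-laden helper for checking reachability from the workstation, not what eksctl does internally and not the fix discussed later in the thread; the function name `wait_for_tcp` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_tcp(host, port=443, total_timeout=300.0, interval=5.0, connect_timeout=5.0):
    """Poll host:port until a TCP connection succeeds or total_timeout elapses.

    Returns True as soon as one connection attempt succeeds, False otherwise.
    """
    deadline = time.monotonic() + total_timeout
    while True:
        try:
            # create_connection resolves the name and opens the socket in one call.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts and DNS failures alike.
            pass
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)


# Example (the hostname is a placeholder; the real endpoint URL can be read with
# `aws eks describe-cluster --name dave-eks --query cluster.endpoint --output text`):
# wait_for_tcp("EXAMPLE.gr7.us-west-2.eks.amazonaws.com", 443, total_timeout=120)
```

If this returns False while the control plane stack is already complete, the problem is between the workstation and the endpoint (DNS, routing, or endpoint access settings) rather than in the nodegroup configuration.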
@dave-meier you can use eksctl utils update-cluster-logging --disable-types:
--disable-types strings Log types to be disabled, the rest will be disabled. Supported log types: (all, none, api, audit, authenticator, controllerManager, scheduler)
Thanks @michaelbeaumont
So to recap, I now add this to my yaml to enable CloudWatch logging:
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
Then I deploy with the script below, which works most of the time. NOTE: it still sometimes fails in the same way as shown above:
# Example:
# .\create-eks-cluster.ps1 -cluster_name dave-eks -eks_yaml d:\dave\eks-dave-cluster-test2.yaml -subnets "subnet-xxx,subnet-yyy" -route_table rtb-zzz
Param(
    [Parameter(Mandatory=$true)][string] $cluster_name,
    [Parameter(Mandatory=$true)][string] $eks_yaml,
    [Parameter(Mandatory=$true)][string] $subnets,
    [Parameter(Mandatory=$true)][string] $route_table
)
# Tag each subnet for the cluster and explicitly associate it with the route table.
$subnet_list = $subnets -split ","
foreach ($subnet in $subnet_list) {
    write-host "Subnet: ${subnet}"
    # Double quotes so that ${cluster_name} is expanded (must match metadata.name in the yaml).
    aws ec2 create-tags --resources $subnet --tags "Key=kubernetes.io/cluster/${cluster_name},Value=shared"
    aws ec2 create-tags --resources $subnet --tags 'Key=kubernetes.io/role/internal-elb,Value=1'
    aws ec2 associate-route-table --route-table-id $route_table --subnet-id $subnet
}
eksctl create cluster -f $eks_yaml --install-vpc-controllers -v 4 --timeout 40m
# Wait 60 seconds before fetching the kubeconfig.
start-sleep -seconds 60
aws eks --region us-west-2 update-kubeconfig --name $cluster_name
kubectl get no -o wide
In the script above, I assign the proper tags to the subnets and make sure that the route table is explicitly associated with each subnet. Before I added the route table association, I was getting an error in the vpc resource controller pod saying that there was no route table for the subnet. Check with:
.\kubectl.exe logs vpc-resource-controller-xxx -n kube-system
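The explicit route table association described above can also be checked before creating the cluster. The following is a rough Python sketch, assuming the JSON printed by `aws ec2 describe-route-tables --output json` has been saved or captured; the function name is hypothetical and the sample data is made up for illustration.

```python
import json


def subnets_without_explicit_route_table(route_tables_doc, subnet_ids):
    """Return the subnet IDs from subnet_ids that no route table lists explicitly.

    route_tables_doc is the parsed JSON printed by `aws ec2 describe-route-tables`.
    """
    associated = set()
    for table in route_tables_doc.get("RouteTables", []):
        for assoc in table.get("Associations", []):
            subnet_id = assoc.get("SubnetId")
            if subnet_id:  # the main-table association carries no SubnetId
                associated.add(subnet_id)
    return [s for s in subnet_ids if s not in associated]


# Hypothetical sample, trimmed to the fields the function reads.
sample = json.loads("""
{"RouteTables": [
  {"RouteTableId": "rtb-zzz",
   "Associations": [
     {"Main": true},
     {"Main": false, "SubnetId": "subnet-xxx"}
   ]}
]}
""")
print(subnets_without_explicit_route_table(sample, ["subnet-xxx", "subnet-yyy"]))
# prints ['subnet-yyy']
```

A subnet that shows up in the result only inherits the VPC's main route table, which is the situation that the `aws ec2 associate-route-table` call in the script is meant to avoid.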
Then, if I want to stop CloudWatch logging, I run the following script after the cluster is successfully launched:
Param(
    [Parameter(Mandatory=$true)][string] $cluster_name
)
# Disable all control plane log types, then delete the log group that was created.
eksctl utils update-cluster-logging --cluster $cluster_name --disable-types all --approve
aws logs delete-log-group --log-group-name "/aws/eks/${cluster_name}/cluster"
So at least now this works most of the time, but there is still some kind of timing-related problem that causes it to fail sometimes when CloudWatch logging is enabled, and most of the time when it is not.
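One detail in the excerpts posted so far supports the timing theory: in both verbose logs the failure is reported 30 seconds after `started task: install Windows VPC controller`. A small Python sketch of that arithmetic, using the timestamps copied from the logs above (the helper name is hypothetical):

```python
from datetime import datetime


def seconds_between(start_ts, end_ts):
    """Return the number of seconds between two eksctl log timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (datetime.strptime(end_ts, fmt) - datetime.strptime(start_ts, fmt)).total_seconds()


# (task started, task failed) pairs taken from the two verbose logs above.
pairs = [
    ("2020-06-29T21:22:06Z", "2020-06-29T21:22:36Z"),
    ("2020-07-10T15:49:53Z", "2020-07-10T15:50:23Z"),
]
for start, end in pairs:
    print(seconds_between(start, end))  # prints 30.0 for both pairs
```

A constant gap like this suggests that a single connection attempt to the API endpoint times out, rather than the cluster creation itself exceeding the `--timeout 40m` budget.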
Still getting the same failure. I have attached the CloudWatch log files. This time I was using amiFamily: WindowsServer1909CoreContainer.
eks-logs.zip
Failed again today:
...
2020-08-13T19:34:37Z [▶]  started task: update CloudWatch logging configuration
2020-08-13T19:34:38Z [▶]  start waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:34:38Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:34:55Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:35:15Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:35:15Z [▶]  done after 37.2986217s of waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:35:15Z [✔]  configured CloudWatch logging for cluster "dave-eks" in "us-west-2" (enabled types: api, audit, authenticator, controllerManager, scheduler & no types disabled)
2020-08-13T19:35:15Z [▶]  completed task: update CloudWatch logging configuration
2020-08-13T19:35:15Z [▶]  started task: install Windows VPC controller
2020-08-13T19:35:45Z [▶]  failed task: 2 sequential sub-tasks: { 3 sequential sub-tasks: { tag cluster, update CloudWatch logging configuration, install Windows VPC controller }, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-08-13T19:35:45Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-08-13T19:35:45Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-08-13T19:35:45Z [✖]  getting list of API resources for raw REST client: Get "https://F565C71B408D610B9BA9029B773B8728.yl4.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 54.189.25.67:443: i/o timeout
Error: failed to create cluster "dave-eks"
Still getting the same problem on the latest version of eksctl (0.26.0):
2020-08-26T23:18:47Z [✔]  tagged EKS cluster (Department=Dev/Cont, Environment=Dev, Name=dave-eks, Purpose=Development, Tenancy=PaaS)
2020-08-26T23:18:47Z [▶]  completed task: tag cluster
2020-08-26T23:18:47Z [▶]  started task: update CloudWatch logging configuration
2020-08-26T23:18:48Z [▶]  start waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:18:48Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:06Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:25Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:25Z [▶]  done after 37.2225598s of waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:26Z [✔]  configured CloudWatch logging for cluster "dave-eks" in "us-west-2" (enabled types: api, audit, authenticator, controllerManager, scheduler & no types disabled)
2020-08-26T23:19:26Z [▶]  completed task: update CloudWatch logging configuration
2020-08-26T23:19:26Z [▶]  started task: install Windows VPC controller
2020-08-26T23:19:56Z [▶]  failed task: 2 sequential sub-tasks: { 3 sequential sub-tasks: { tag cluster, update CloudWatch logging configuration, install Windows VPC controller }, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-08-26T23:19:56Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-08-26T23:19:56Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-08-26T23:19:56Z [✖]  getting list of API resources for raw REST client: Get "https://FB4EC8FD77B4D5475FBBA3981F81FCB9.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 44.234.248.25:443: i/o timeout
Error: failed to create cluster "dave-eks"
yaml:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  # https://eksctl.io/usage/schema/
  name: dave-eks
  region: us-west-2
  version: '1.17'
  tags:
    Department: Dev/Cont
    Environment: Dev
    Purpose: Development
    Tenancy: PaaS
    Name: dave-eks
vpc:
  id: "vpc-05d662efb65a29dac"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2a:
        id: "subnet-xxx"
        cidr: "10.12.40.0/25"
      us-west-2b:
        id: "subnet-yyy"
        cidr: "10.12.40.128/25"
    public:
      us-west-2a:
        id: "subnet-aaa"
        cidr: "10.12.30.0/25"
      us-west-2b:
        id: "subnet-bbb"
        cidr: "10.12.30.128/25"
managedNodeGroups:
  - name: linux-ng
    instanceType: t3.medium
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Linux'
      allow: true
nodeGroups:
  - name: windows-ng
    instanceType: t3.large
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    preBootstrapCommands:
      - I have commands to install flex volume, set docker logging level and pull the servercore image
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-aaa']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer1909CoreContainer
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
@dave-meier thanks for being patient. I was able to reproduce this issue a few times and a potential fix is out in https://github.com/weaveworks/eksctl/pull/3167. The cause of this issue is likely https://github.com/weaveworks/eksctl/issues/3166.