Eksctl: Unable to create windows cluster with private subnets - timeout error

Created on 30 Jun 2020 · 13 Comments · Source: weaveworks/eksctl

Cluster creation timeout - private subnets - windows worker

What happened?
A timeout error occurred and the cluster was not fully created. It works if I use public subnets, and fails if I use private subnets.

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dave-eks2
  region: us-west-2
  version: '1.16'

vpc:
  id: "vpc-zzz"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2a:
          id: "subnet-xxx"
          cidr: "10.12.40.0/25"      
      us-west-2b:
          id: "subnet-yyy"
          cidr: "10.12.40.128/25"

managedNodeGroups:
  - name: linux-ng
    instanceType: t2.large
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true

nodeGroups:
  - name: windows-ng
    instanceType: t3.large
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-xxx']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer2019CoreContainer

eksctl create cluster -f .\eks-cluster2-spec-my-vpc-with-private-subnets.yaml --install-vpc-controllers --timeout 40m --verbose=5 *> create-cluster-output.txt
[create-cluster-output.txt](https://github.com/weaveworks/eksctl/files/4848650/create-cluster-output.txt)


cat create-cluster-output.txt
2020-06-29T21:10:44Z [ℹ]  eksctl version 0.22.0
2020-06-29T21:10:45Z [ℹ]  using region us-west-2
2020-06-29T21:10:45Z [▶]  DEBUG: Request sts/GetCallerIdentity Details:

...

2020-06-29T21:22:06Z [▶]  completed task: create cluster control plane "dave-eks2"
2020-06-29T21:22:06Z [▶]  started task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } }
2020-06-29T21:22:06Z [▶]  started task: install Windows VPC controller
2020-06-29T21:22:06Z [▶]  started task: install Windows VPC controller
2020-06-29T21:22:36Z [▶]  failed task: install Windows VPC controller (will not run other sequential tasks)
2020-06-29T21:22:36Z [▶]  failed task: install Windows VPC controller (will not run other sequential tasks)
2020-06-29T21:22:36Z [▶]  failed task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-06-29T21:22:36Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-06-29T21:22:36Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks2'
2020-06-29T21:22:36Z [✖]  getting list of API resources for raw REST client: Get "https://3C15FCAE7DAE5E25E346FD837BE3595C.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 52.34.186.59:443: i/o timeout
eksctl : Error: failed to create cluster "dave-eks2"
At line:1 char:1
+ eksctl create cluster -f .\eks-cluster2-spec-my-vpc-with-private-subn ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Error: failed t...ter "dave-eks2":String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
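
The last eksctl error above is a plain TCP dial timeout against the cluster's public API endpoint, so one thing worth checking is whether that endpoint is reachable from the machine running eksctl. The Python sketch below is not part of eksctl: `parse_dial_error` and `can_connect` are hypothetical helper names, and the regular expression assumes the error format shown in the log above.

```python
import re
import socket

# Matches the Go-style error printed by eksctl in the log above:
#   Get "https://<host>/api?timeout=32s": dial tcp <ip>:<port>: i/o timeout
DIAL_ERROR = re.compile(
    r'Get "https://(?P<host>[^/"]+)[^"]*": '
    r'dial tcp (?P<ip>[\d.]+):(?P<port>\d+): i/o timeout'
)


def parse_dial_error(line):
    """Return (host, ip, port) from a dial-timeout log line, or None."""
    match = DIAL_ERROR.search(line)
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False
```

Running `can_connect(host, 443)` with the host taken from the log, from the same machine and network that ran eksctl, would show whether the timeout is reproducible outside eksctl. A successful connection only proves TCP reachability at that moment; it does not rule out an intermittent problem.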






PS D:\dave> eksctl.exe version
0.22.0
PS D:\dave> .\kubectl.exe version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:40:13Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"windows/amd64"}


awaiting more information, kind/bug, needs-investigation, priority/important-soon

All 13 comments

I have comments on #2201 as well, but decided to open a new issue as I am stuck and unable to create a windows cluster with private subnets.

@dave-meier I will see if I can reproduce this issue

Thanks @martina-if

Here is the most recent run. In this case only the "eksctl-dave-eks-cluster" CloudFormation stack was created, and I didn't see any node group CloudFormation stacks created as before, so it seems it is not getting as far. No EC2 instances are created in this case:

yaml:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dave-eks
  region: us-west-2
  version: '1.16'

vpc:
  id: "vpc-zzz"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2a:
          id: "subnet-xxx"
          cidr: "10.12.40.0/25"      
      us-west-2b:
          id: "subnet-yyy"
          cidr: "10.12.40.128/25"

managedNodeGroups:
  - name: linux-ng
    instanceType: t3.medium
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true

nodeGroups:
  - name: windows-ng
    instanceType: t3.medium
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-vvv']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer2019CoreContainer

Results:
PS D:\dave> eksctl create cluster -f .\eks-cluster-spec-my-vpc-with-private-subnets.yaml --install-vpc-controllers --timeout 40m
[ℹ]  eksctl version 0.22.0
[ℹ]  using region us-west-2
[✔]  using existing VPC (vpc-zzz) and subnets (private:[subnet-xxx subnet-yyy] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "windows-ng" will use "ami-0ee42fc568b2881e1" [WindowsServer2019CoreContainer/1.16]
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using Kubernetes version 1.16
[ℹ]  creating EKS cluster "dave-eks" in "us-west-2" region with managed nodes and un-managed nodes
[ℹ]  2 nodegroups (linux-ng, windows-ng) were included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=dave-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "dave-eks" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=dave-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "dave-eks" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "dave-eks", 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } }
[ℹ]  building cluster stack "eksctl-dave-eks-cluster"
[ℹ]  deploying stack "eksctl-dave-eks-cluster"
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
[✖]  getting list of API resources for raw REST client: Get "https://3A745D7D5F51ED0801EA32F68C52E0A9.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 54.149.63.46:443: i/o timeout
Error: failed to create cluster "dave-eks"
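
The `[!]  custom VPC/subnets will be used` warning in the output above asks for a review of the VPC/subnet configuration. As a small offline sanity check, the Python sketch below verifies that the subnet CIDRs written in the YAML sit inside the VPC CIDR and do not overlap each other. `check_subnets` is a hypothetical helper, it only looks at the numbers in the config file (not at what actually exists in AWS), and passing it does not explain the timeout.

```python
import ipaddress


def check_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems found; an empty list means none were found.

    subnet_cidrs maps a label (for example the availability zone) to a CIDR string.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []

    # Every subnet must be fully contained in the VPC range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name}: {net} is outside VPC {vpc}")

    # No two subnets may share addresses.
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} and {second} overlap")
    return problems


if __name__ == "__main__":
    # Values copied from the cluster config posted above.
    print(check_subnets(
        "10.12.0.0/16",
        {"us-west-2a": "10.12.40.0/25", "us-west-2b": "10.12.40.128/25"},
    ))
```

For the CIDRs in the posted config this reports no problems, which is consistent with the issue being intermittent rather than a static misconfiguration of the address ranges.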

Hi @dave-meier, can you see what the error is in CloudFormation and post it here, please? Also, if you create the cluster with the verbose option -v 4, we may be able to see more about what the problem is.

@dave-meier I tried to reproduce this in eu-north-1 with a very similar configuration and didn't have any issues, except that I got an error when trying to use t2.large instances (those are quite old and don't exist in modern regions).

I think there is a problem either with your subnets, with the instances in that region, or perhaps with the security group. I think the key must be in the CloudFormation events in the console.

@martina-if So it seems to always work if I enable full CloudWatch logs. It also worked once today with the exact same YAML file as before, so there is definitely a problem that happens intermittently. For now I will just enable CloudWatch logging to create the cluster. What is the command to scale back or turn off CloudWatch logging after the cluster is up and running?

Here's what the failure looks like when it happens (just showing the end bit where it fails):

2020-07-10T15:49:53Z [▶]  waiting for CloudFormation stack "eksctl-dave-eks-cluster"
2020-07-10T15:49:53Z [▶]  done after 13m51.3887136s of waiting for CloudFormation stack "eksctl-dave-eks-cluster"
2020-07-10T15:49:53Z [▶]  processing stack outputs
2020-07-10T15:49:53Z [▶]  completed task: create cluster control plane "dave-eks"
2020-07-10T15:49:53Z [▶]  started task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } }
2020-07-10T15:49:53Z [▶]  started task: install Windows VPC controller
2020-07-10T15:49:53Z [▶]  started task: install Windows VPC controller
2020-07-10T15:50:23Z [▶]  failed task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-07-10T15:50:23Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-07-10T15:50:23Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-07-10T15:50:23Z [✖]  getting list of API resources for raw REST client: Get "https://1C658EF25478B04CAA1D7CF7964D9BD6.yl4.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 54.70.74.99:443: i/o timeout
Error: failed to create cluster "dave-eks"

@dave-meier you can use eksctl utils update-cluster-logging --disable-types:

--disable-types strings   Log types to be disabled, the rest will be disabled. Supported log types: (all, none, api, audit, authenticator, controllerManager, scheduler)

Thanks @michaelbeaumont

So to recap, I now add this to my yaml to enable cloudwatch:

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

Then I deploy, which works most of the time. NOTE: it still fails sometimes, just as noted above:

# Example:
# .\create-eks-cluster.ps1 -cluster_name dave-eks -eks_yaml d:\dave\eks-dave-cluster-test2.yaml -subnets "subnet-xxx,subnet-yyy" -route_table rtb-zzz
# $cluster_name must match metadata.name in the YAML file passed as $eks_yaml.
Param(
  [Parameter(Mandatory=$true)][string] $cluster_name,
  [Parameter(Mandatory=$true)][string] $eks_yaml,
  [Parameter(Mandatory=$true)][string] $subnets,
  [Parameter(Mandatory=$true)][string] $route_table
)

# Tag each subnet for the cluster and explicitly associate the route table with it.
$subnet_list = $subnets -split ","
foreach ($subnet in $subnet_list) {
  Write-Host "Subnet: ${subnet}"
  # Double quotes so that ${cluster_name} is expanded.
  aws ec2 create-tags --resources $subnet --tags "Key=kubernetes.io/cluster/${cluster_name},Value=shared"
  aws ec2 create-tags --resources $subnet --tags 'Key=kubernetes.io/role/internal-elb,Value=1'
  aws ec2 associate-route-table --route-table-id $route_table --subnet-id $subnet
}

# Create the cluster, wait a minute, then fetch credentials and list the nodes.
eksctl create cluster -f $eks_yaml --install-vpc-controllers -v 4 --timeout 40m
Start-Sleep -Seconds 60
aws eks --region us-west-2 update-kubeconfig --name $cluster_name
kubectl get no -o wide

In the script above, I assign the proper tags to the subnets and make sure that the route table is explicitly associated with each subnet. Before I added the route table association, I was getting an error in the VPC resource controller pod saying that there was no route table for the subnet. Check with:

.\kubectl.exe logs vpc-resource-controller-xxx -n kube-system

Then, if I want to stop cloudwatch logging, I issue the following command after the cluster is successfully launched:

Param(
  [Parameter(Mandatory=$true)][string] $cluster_name
)

eksctl utils update-cluster-logging --cluster $cluster_name --disable-types all --approve

aws logs delete-log-group --log-group-name /aws/eks/${cluster_name}/cluster

So at least now this works most of the time, but there is still some kind of timing-related problem that causes it to fail sometimes when CloudWatch logging is enabled, and most of the time when it is not.
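
Since the failure looks timing related, one thing to experiment with in a wrapper script is polling for readiness instead of relying on a fixed delay such as the 60-second sleep in the deployment script earlier in this thread. The Python sketch below is a generic retry-with-backoff helper under that assumption; `wait_until` and its parameters are hypothetical names, and this is neither what eksctl does internally nor the fix proposed for this issue.

```python
import time


def wait_until(probe, attempts=10, delay=5.0, backoff=2.0, max_delay=60.0,
               sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out.

    Returns the 1-based number of the attempt that succeeded, or 0 if none did.
    The pause between attempts starts at `delay` seconds and is multiplied by
    `backoff` after each failure, capped at `max_delay`.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * backoff, max_delay)
    return 0


if __name__ == "__main__":
    # Stand-in probe that fails twice and then succeeds.
    answers = iter([False, False, True])
    print(wait_until(lambda: next(answers), delay=0.1))
```

A real probe could be, for instance, a function that attempts a TCP connection to the cluster's API endpoint on port 443 and returns whether it succeeded. Note that this can only guard steps that run after `eksctl create cluster` returns; it cannot change the timing of the VPC controller installation that happens inside that command.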

Still getting the same failure. Attached cloudwatch files. This time I was using amiFamily: WindowsServer1909CoreContainer.
eks-logs.zip

Failed again today:

...
2020-08-13T19:34:37Z [▶]  started task: update CloudWatch logging configuration
2020-08-13T19:34:38Z [▶]  start waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:34:38Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:34:55Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:35:15Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:35:15Z [▶]  done after 37.2986217s of waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-13T19:35:15Z [✔]  configured CloudWatch logging for cluster "dave-eks" in "us-west-2" (enabled types: api, audit, authenticator, controllerManager, scheduler & no types disabled)
2020-08-13T19:35:15Z [▶]  completed task: update CloudWatch logging configuration
2020-08-13T19:35:15Z [▶]  started task: install Windows VPC controller
2020-08-13T19:35:45Z [▶]  failed task: 2 sequential sub-tasks: { 3 sequential sub-tasks: { tag cluster, update CloudWatch logging configuration, install Windows VPC controller }, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-08-13T19:35:45Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-08-13T19:35:45Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-08-13T19:35:45Z [✖]  getting list of API resources for raw REST client: Get "https://F565C71B408D610B9BA9029B773B8728.yl4.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 54.189.25.67:443: i/o timeout
Error: failed to create cluster "dave-eks"

Still getting the same problem. On latest version of eksctl (0.26.0):

2020-08-26T23:18:47Z [✔]  tagged EKS cluster (Department=Dev/Cont, Environment=Dev, Name=dave-eks, Purpose=Development, Tenancy=PaaS)
2020-08-26T23:18:47Z [▶]  completed task: tag cluster
2020-08-26T23:18:47Z [▶]  started task: update CloudWatch logging configuration
2020-08-26T23:18:48Z [▶]  start waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:18:48Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:06Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:25Z [▶]  waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:25Z [▶]  done after 37.2225598s of waiting for requested "LoggingUpdate" in cluster "dave-eks" to succeed
2020-08-26T23:19:26Z [✔]  configured CloudWatch logging for cluster "dave-eks" in "us-west-2" (enabled types: api, audit, authenticator, controllerManager, scheduler & no types disabled)
2020-08-26T23:19:26Z [▶]  completed task: update CloudWatch logging configuration
2020-08-26T23:19:26Z [▶]  started task: install Windows VPC controller
2020-08-26T23:19:56Z [▶]  failed task: 2 sequential sub-tasks: { 3 sequential sub-tasks: { tag cluster, update CloudWatch logging configuration, install Windows VPC controller }, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-08-26T23:19:56Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-08-26T23:19:56Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-08-26T23:19:56Z [✖]  getting list of API resources for raw REST client: Get "https://FB4EC8FD77B4D5475FBBA3981F81FCB9.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 44.234.248.25:443: i/o timeout
Error: failed to create cluster "dave-eks"

yaml:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  # https://eksctl.io/usage/schema/
  name: dave-eks
  region: us-west-2
  version: '1.17'
  tags:
    Department: Dev/Cont
    Environment: Dev
    Purpose: Development
    Tenancy: PaaS
    Name: dave-eks

vpc:
  id: "vpc-05d662efb65a29dac"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2a:
          id: "subnet-xxx"
          cidr: "10.12.40.0/25"      
      us-west-2b:
          id: "subnet-yyy"
          cidr: "10.12.40.128/25"
    public:
      us-west-2a:
          id: "subnet-aaa"
          cidr: "10.12.30.0/25"      
      us-west-2b:
          id: "subnet-bbb"
          cidr: "10.12.30.128/25"

managedNodeGroups:
  - name: linux-ng
    instanceType: t3.medium
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Linux'
      allow: true

nodeGroups:
  - name: windows-ng
    instanceType: t3.large
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    preBootstrapCommands:
      - I have commands to install flex volume, set docker logging level and pull the servercore image
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-aaa']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer1909CoreContainer

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

@dave-meier thanks for being patient. I was able to reproduce this issue a few times and a potential fix is out in https://github.com/weaveworks/eksctl/pull/3167. The cause of this issue is likely https://github.com/weaveworks/eksctl/issues/3166.
