Eksctl: Private Cluster creation fails if an existing VPC uses the same RouteTable

Created on 24 Jul 2020 · 9 comments · Source: weaveworks/eksctl

What happened?

When creating a Private Cluster with a user-supplied VPC, if the subnets share the same RouteTable, cluster creation fails with the following error.

$ eksctl create cluster -f cluster.yaml
[ℹ]  eksctl version 0.24.0
[ℹ]  using region us-west-2
[✔]  using existing VPC (vpc-XXX...XXX) and subnets (private:[subnet-XXX...XXX subnet-XXX...XXX subnet-XXX...XXX] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  using Kubernetes version 1.16
[ℹ]  creating EKS cluster "private-cluster" in "us-west-2" region with
[ℹ]  will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=private-cluster'
[ℹ]  CloudWatch logging will not be enabled for cluster "private-cluster" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=private-cluster'
[ℹ]  Kubernetes API endpoint access will use provided values {publicAccess=true, privateAccess=true} for cluster "private-cluster" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "private-cluster", update cluster VPC endpoint access configuration }
[ℹ]  building cluster stack "eksctl-private-cluster-cluster"
[ℹ]  deploying stack "eksctl-private-cluster-cluster"
[✖]  unexpected status "ROLLBACK_COMPLETE" while waiting for CloudFormation stack "eksctl-private-cluster-cluster"
[ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
[!]  AWS::EC2::SecurityGroup/ClusterSharedNodeSecurityGroup: DELETE_IN_PROGRESS
[!]  AWS::IAM::Role/ServiceRole: DELETE_IN_PROGRESS
[✖]  AWS::EC2::SecurityGroup/ClusterSharedNodeSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖]  AWS::IAM::Role/ServiceRole: CREATE_FAILED – "Resource creation cancelled"
[✖]  AWS::EC2::SecurityGroup/ControlPlaneSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖]  AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "Property RouteTableIds contains duplicate values."
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=private-cluster'
[✖]  waiting for CloudFormation stack "eksctl-private-cluster-cluster": ResourceNotReady: failed waiting for successful resource state
Error: failed to create cluster "private-cluster"

This happens because the same RouteTable ID is emitted multiple times in the RouteTableIds property of the VPCEndpointS3 resource in the generated CloudFormation template.

...
        "VPCEndpointS3": {
            "Type": "AWS::EC2::VPCEndpoint",
            "Properties": {
                "RouteTableIds": [
                    "rtb-AAA...AAA",
                    "rtb-AAA...AAA",
                    "rtb-AAA...AAA"
                ],
                "ServiceName": "com.amazonaws.us-west-2.s3",
                "VpcEndpointType": "Gateway",
                "VpcId": "vpc-XXX...XXX"
            }
        },
...
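A natural fix (an assumption on my part, not necessarily how the eksctl PR implemented it) is to deduplicate the route table IDs collected from the subnets before rendering the template. A minimal Go sketch:

```go
package main

import "fmt"

// dedupeRouteTableIDs returns the unique route table IDs in their
// original order, so the generated VPCEndpointS3 resource never lists
// the same route table twice.
// (Hypothetical helper; eksctl's actual fix may differ.)
func dedupeRouteTableIDs(ids []string) []string {
	seen := make(map[string]struct{}, len(ids))
	unique := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue
		}
		seen[id] = struct{}{}
		unique = append(unique, id)
	}
	return unique
}

func main() {
	// Three subnets sharing one route table, as in this report.
	ids := []string{"rtb-AAA", "rtb-AAA", "rtb-AAA"}
	fmt.Println(dedupeRouteTableIDs(ids)) // [rtb-AAA]
}
```

With the duplicates collapsed, RouteTableIds would contain a single entry and CloudFormation's "contains duplicate values" validation would no longer trip.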

What you expected to happen?

Private Cluster creation should succeed when subnets share the same RouteTable.

How to reproduce it?

1. Prepare the configuration file

Use the following configuration file "cluster.yaml".

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: private-cluster1
  region: us-west-2

privateCluster:
  enabled: true

vpc:
  subnets:
    private:
      us-west-2a:
        id: subnet-aaaa
      us-west-2b:
        id: subnet-bbbb
      us-west-2c:
        id: subnet-cccc

Subnets (subnet-aaaa, subnet-bbbb, subnet-cccc) use the same route table.

2. Execute the following eksctl command

eksctl create cluster -f cluster.yaml

Running the above command reproduces the issue.

Versions

$ eksctl version
0.24.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T23:30:39Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.6-eks-4e7f64", GitCommit:"4e7f642f9f4cbb3c39a4fc6ee84fe341a8ade94c", GitTreeState:"clean", BuildDate:"2020-06-11T13:55:35Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
kind/bug priority/important-soon

Most helpful comment

This issue is fixed in 0.26.0.

All 9 comments

Hi

I am impacted by this bug, which I hit while trying to build a private cluster with version 0.25.0. Is there any ETA on a fix?
I also built eksctl from the latest git clone, which produced a binary reporting version
0.26.0-dev+100e9d0b.2020-08-05T16:10:12Z

Since this issue is marked as closed, I tried again with that build, but I still see the failure:

[ℹ] deploying stack "eksctl-lakedev-cdml-private-cluster-cluster"
[✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-lakedev-cdml-private-cluster-cluster"
[ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
[✖] AWS::EC2::SecurityGroup/ClusterSharedNodeSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖] AWS::EC2::SecurityGroup/ControlPlaneSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖] AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "route table rtb-09ec7cd1e8effc6cf already has a route with destination-prefix-list-id pl-68a54001 (Service: AmazonEC2; Status Code: 400; Error Code: RouteAlreadyExists; Request ID: 9815caa0-c444-46fc-9116-624b749e477a)"
[!] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=lakedev-cdml-private-cluster'
[✖] waiting for CloudFormation stack "eksctl-lakedev-cdml-private-cluster-cluster": ResourceNotReady: failed waiting for successful resource state

Here's the describe stack events

    {
        "StackId": "arn:aws:cloudformation:us-west-2:701843826545:stack/eksctl-lakedev-cdml-private-cluster-cluster/24896a60-d737-11ea-9211-02e34c33f014",
        "EventId": "ClusterSharedNodeSecurityGroup-CREATE_FAILED-2020-08-05T16:17:31.277Z",
        "StackName": "eksctl-lakedev-cdml-private-cluster-cluster",
        "LogicalResourceId": "ClusterSharedNodeSecurityGroup",
        "PhysicalResourceId": "eksctl-lakedev-cdml-private-cluster-cluster-ClusterSharedNodeSecurityGroup-18UQ4X25KGARB",
        "ResourceType": "AWS::EC2::SecurityGroup",
        "Timestamp": "2020-08-05T16:17:31.277Z",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Resource creation cancelled",
        "ResourceProperties": "{\"GroupDescription\":\"Communication between all nodes in the cluster\",\"VpcId\":\"vpc-0d02c75fe677fd1c6\",\"Tags\":[{\"Value\":\"eksctl-lakedev-cdml-private-cluster-cluster/ClusterSharedNodeSecurityGroup\",\"Key\":\"Name\"}]}"
    },
    {
        "StackId": "arn:aws:cloudformation:us-west-2:701843826545:stack/eksctl-lakedev-cdml-private-cluster-cluster/24896a60-d737-11ea-9211-02e34c33f014",
        "EventId": "ControlPlaneSecurityGroup-CREATE_FAILED-2020-08-05T16:17:31.212Z",
        "StackName": "eksctl-lakedev-cdml-private-cluster-cluster",
        "LogicalResourceId": "ControlPlaneSecurityGroup",
        "PhysicalResourceId": "eksctl-lakedev-cdml-private-cluster-cluster-ControlPlaneSecurityGroup-B5WD4SE2S17U",
        "ResourceType": "AWS::EC2::SecurityGroup",
        "Timestamp": "2020-08-05T16:17:31.212Z",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Resource creation cancelled",
        "ResourceProperties": "{\"GroupDescription\":\"Communication between the control plane and worker nodegroups\",\"VpcId\":\"vpc-0d02c75fe677fd1c6\",\"Tags\":[{\"Value\":\"eksctl-lakedev-cdml-private-cluster-cluster/ControlPlaneSecurityGroup\",\"Key\":\"Name\"}]}"
    },
    {
        "StackId": "arn:aws:cloudformation:us-west-2:701843826545:stack/eksctl-lakedev-cdml-private-cluster-cluster/24896a60-d737-11ea-9211-02e34c33f014",
        "EventId": "VPCEndpointS3-CREATE_FAILED-2020-08-05T16:17:27.802Z",
        "StackName": "eksctl-lakedev-cdml-private-cluster-cluster",
        "LogicalResourceId": "VPCEndpointS3",
        "PhysicalResourceId": "",
        "ResourceType": "AWS::EC2::VPCEndpoint",
        "Timestamp": "2020-08-05T16:17:27.802Z",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "route table rtb-09ec7cd1e8effc6cf already has a route with destination-prefix-list-id pl-68a54001 (Service: AmazonEC2; Status Code: 400; Error Code: RouteAlreadyExists; Request ID: 9815caa0-c444-46fc-9116-624b749e477a)",
        "ResourceProperties": "{\"VpcId\":\"vpc-0d02c75fe677fd1c6\",\"RouteTableIds\":[\"rtb-09ec7cd1e8effc6cf\"],\"ServiceName\":\"com.amazonaws.us-west-2.s3\",\"VpcEndpointType\":\"Gateway\"}"
    },

Our VPC has private subnets that share the same route table, and that route table already has the S3 VPC endpoint gateway as one of its routes.

Please suggest a workaround if there is one.

Thanks
Lucky

@lkr2des The commit mentioned above your comment fixed this issue a week ago.
The error message you're getting isn't the same one from the OP so this may be a completely different bug. Can you:

  • post your config
  • try version 0.24.0 which was released _before_ this was fixed and see if you're getting the same error as the OP?

@hiraken-w Can you confirm your fix did fix the issue for you?

Hi @michaelbeaumont
Apologies if this is not the same issue.

I tried with 0.24 and I am still seeing the same issue:

[ℹ]  eksctl version 0.24.0
[ℹ]  using region us-west-2
[✔]  using existing VPC (vpc-0d02c75fe677fd1c6) and subnets (private:[subnet-0cdb44b4eeb37ec32 subnet-03ae58f6baa404802] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "ng-1" will use "ami-037843f6aeb12e236" [AmazonLinux2/1.17]
[ℹ]  using EC2 key pair "lakedevecs-keypair"
[ℹ]  using Kubernetes version 1.17
[ℹ]  creating EKS cluster "lakedev-cdml-private-cluster" in "us-west-2" region with un-managed nodes
[ℹ]  1 nodegroup (ng-1) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=lakedev-cdml-private-cluster'
[ℹ]  Kubernetes API endpoint access will use provided values {publicAccess=true, privateAccess=true} for cluster "lakedev-cdml-private-cluster" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "lakedev-cdml-private-cluster", 2 sequential sub-tasks: { 3 sequential sub-tasks: { tag cluster, update CloudWatch logging configuration, update cluster VPC endpoint access configuration }, create nodegroup "ng-1" } }
[ℹ]  building cluster stack "eksctl-lakedev-cdml-private-cluster-cluster"
[ℹ]  deploying stack "eksctl-lakedev-cdml-private-cluster-cluster"
[✖]  unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-lakedev-cdml-private-cluster-cluster"
[ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
[✖]  AWS::EC2::SecurityGroup/ControlPlaneSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖]  AWS::EC2::SecurityGroup/ClusterSharedNodeSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖]  AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "Property RouteTableIds contains duplicate values."
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=lakedev-cdml-private-cluster'
[✖]  waiting for CloudFormation stack "eksctl-lakedev-cdml-private-cluster-cluster": ResourceNotReady: failed waiting for successful resource state

Basically, I am trying to set up this cluster with a node group on two private subnets in us-west-2a and us-west-2b, but both subnets share the same route table, which already has an S3 endpoint as one of its defined destinations.

Here's the cluster.yml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: lakedev-cdml-private-cluster
  region: us-west-2
  version: '1.17'
  tags:

privateCluster:
  enabled: true
  additionalEndpointServices:
  - "autoscaling"

vpc:
  id: "vpc-0d02c75fe677fd1c6"
  cidr: "172.23.0.0/18"
  subnets:
    private:
      us-west-2a:
          id: "subnet-03ae58f6baa404802"
          cidr: "172.23.2.0/24"
      us-west-2b:
          id: "subnet-0cdb44b4eeb37ec32"
          cidr: "172.23.18.0/24"
iam:
  serviceRoleARN: "arn:aws:iam::701843826545:role/EKSClusterRole"

nodeGroups:
  - name: ng-1
    privateNetworking: true
    desiredCapacity: 2
    minSize: 0
    maxSize: 3
    volumeSize: 100
    volumeType: gp2
    instanceType: m5a.2xlarge
    availabilityZones: ["us-west-2a", "us-west-2b"]
    labels: {role: worker-node}
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: OnDemand
      k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "false"
      k8s.io/cluster-autoscaler/node-template/label/gpu-count: "0"
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/kubeflow-us-west-2: "owned"
    ssh:
      allow: true
      publicKeyName: 'lakedevecs-keypair'
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-00bfbe5c440fa5eb8', 'sg-0ef52b744df998490']
    iam:
      instanceProfileARN: "arn:aws:iam::701843826545:instance-profile/EKSWorkersInstanceProfile-OZBYVL0UV03F"
      instanceRoleARN: "arn:aws:iam::701843826545:role/eksworkerinstancerole"

Thanks
Lucky

Notice you're getting a completely different error with 0.24 than in your first comment.
First error on master:

AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "route table rtb-09ec7cd1e8effc6cf already has a route with destination-prefix-list-id pl-68a54001 (Service: AmazonEC2; Status Code: 400; Error Code: RouteAlreadyExists; Request ID: 9815caa0-c444-46fc-9116-624b749e477a)"

With 0.24 (same as the OP):

AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "Property RouteTableIds contains duplicate values."

which would suggest the PR to fix the OP didn't completely solve the problem.

Thanks @michaelbeaumont

@michaelbeaumont

Notice you're getting a completely different error with 0.24 than in your first comment.
First error on master:

AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "route table rtb-09ec7cd1e8effc6cf already has a route with destination-prefix-list-id pl-68a54001 (Service: AmazonEC2; Status Code: 400; Error Code: RouteAlreadyExists; Request ID: 9815caa0-c444-46fc-9116-624b749e477a)"

I could replicate the above error too; I verified it on master.

This error occurs when the S3 VPC endpoint route already exists in the RouteTable associated with the subnet.
My fix did not address this error.
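For this RouteAlreadyExists case, the endpoint creation would presumably need to skip route tables that already carry the service's prefix-list route. A minimal sketch of that filtering logic in Go (hypothetical helper and parameter names; a real fix would discover the existing routes via the EC2 DescribeRouteTables API rather than take them as input):

```go
package main

import "fmt"

// filterRouteTables drops route tables that already have a route for the
// given prefix list (e.g. the S3 service prefix list), so a VPCEndpointS3
// resource only references tables that still need the route.
// existingRoutes maps route table ID -> prefix list IDs already routed.
// (Hypothetical sketch, not eksctl's actual implementation.)
func filterRouteTables(ids []string, prefixListID string, existingRoutes map[string][]string) []string {
	var out []string
	for _, id := range ids {
		hasRoute := false
		for _, pl := range existingRoutes[id] {
			if pl == prefixListID {
				hasRoute = true
				break
			}
		}
		if !hasRoute {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	// The route table from this report already routes to the S3 prefix list.
	existing := map[string][]string{
		"rtb-09ec7cd1e8effc6cf": {"pl-68a54001"},
	}
	ids := []string{"rtb-09ec7cd1e8effc6cf", "rtb-other"}
	fmt.Println(filterRouteTables(ids, "pl-68a54001", existing))
	// only rtb-other remains
}
```

If every route table already has the route, the endpoint creation could be skipped entirely rather than failing the whole stack.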

I have the same problem:

[ℹ] eksctl version 0.25.0
[ℹ] using region us-east-1
[✔] using existing VPC (vpc-xxx) and subnets (private:[subnet-xxx subnet-xxx] public:[])
[!] custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ] using Kubernetes version 1.17
[ℹ] creating EKS cluster "dev" in "us-east-1" region with
[ℹ] will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
[ℹ] will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
[ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=dev'
[ℹ] CloudWatch logging will not be enabled for cluster "dev" in "us-east-1"
[ℹ] you can enable it with 'eksctl utils update-cluster-logging --region=us-east-1 --cluster=dev'
[ℹ] Kubernetes API endpoint access will use provided values {publicAccess=true, privateAccess=true} for cluster "dev" in "us-east-1"
[ℹ] 2 sequential tasks: { create cluster control plane "dev", update cluster VPC endpoint access configuration }
[ℹ] building cluster stack "eksctl-dev-cluster"
[ℹ] deploying stack "eksctl-dev-cluster"
[✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-dev-cluster"
[ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
[✖] AWS::EC2::SecurityGroup/ControlPlaneSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖] AWS::IAM::Role/ServiceRole: CREATE_FAILED – "Resource creation cancelled"
[✖] AWS::EC2::SecurityGroup/ClusterSharedNodeSecurityGroup: CREATE_FAILED – "Resource creation cancelled"
[✖] AWS::EC2::VPCEndpoint/VPCEndpointS3: CREATE_FAILED – "Property RouteTableIds contains duplicate values."
[!] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-east-1 --name=dev'
[✖] waiting for CloudFormation stack "eksctl-dev-cluster": ResourceNotReady: failed waiting for successful resource state
Error: failed to create cluster "dev"

This issue is fixed in 0.26.0.

Yes, it is fixed in 0.26.0, thank you.
