Kops: LoadBalancer always reports instances as OutOfService

Created on 18 Jan 2018 · 24 Comments · Source: kubernetes/kops

I expose a deployment using a service of type LoadBalancer. That creates an ELB, but the EC2 instances attached to it never pass the health checks, so they are marked as OutOfService and I can't access my deployment from outside.

Am I doing something wrong? Or is this a kops bug?

cloud: aws
kops version: 1.8.0
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T20:00:41Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
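
For reference, the symptom can be reproduced and inspected with standard tooling; the deployment name and the `<elb-name>` placeholder below are illustrative, not from the original report:

```shell
# Expose a deployment via an ELB (deployment/service name is illustrative).
kubectl expose deployment my-app --port=80 --type=LoadBalancer

# Show the external hostname of the ELB Kubernetes created for the service.
kubectl get svc my-app -o wide

# Ask AWS for the per-instance health state and the reason a check fails
# (substitute the actual ELB name from the AWS console).
aws elb describe-instance-health --load-balancer-name <elb-name>
```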

lifecycle/rotten

Most helpful comment

Hi all, I see this issue is pretty old, but I am encountering the same one. My nodes and master nodes are up as per my AWS console, but the API ELB flags the master as OutOfService. How did you sort it out?

Please find my configuration below:

Cluster configuration

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: mycluster
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect" : "Allow",
          "Action" : ["sts:AssumeRole"],
          "Resource" : ["*"]
        }
      ]
    node: |
      [
        {
          "Effect" : "Allow",
          "Action" : ["sts:AssumeRole",
                      "route53:ChangeResourceRecordSets",
                      "route53:GetChange",
                      "route53:ListHostedZones",
                      "ec2:DescribeSpotInstanceRequests",
                      "ec2:CancelSpotInstanceRequests",
                      "ec2:GetConsoleOutput",
                      "ec2:RequestSpotInstances",
                      "ec2:RunInstances",
                      "ec2:StartInstances",
                      "ec2:StopInstances",
                      "ec2:TerminateInstances",
                      "ec2:CreateTags",
                      "ec2:DeleteTags",
                      "ec2:DescribeInstances",
                      "ec2:DescribeKeyPairs",
                      "ec2:DescribeRegions",
                      "ec2:DescribeImages",
                      "ec2:DescribeAvailabilityZones",
                      "ec2:DescribeSecurityGroups",
                      "ec2:DescribeSubnets",
                      "iam:ListInstanceProfilesForRole",
                      "iam:PassRole"],
          "Resource" : ["*"]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mys3bucket/mycluster
  docker:
    logDriver: ""
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer: {}
  kubelet:
    enableCustomMetrics: true
    featureGates:
      ExpandPersistentVolumes: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.8
  masterInternalName: api.internal.mycluster
  masterPublicName: api.mycluster
  networkCIDR: 10.165.14.0/24
  networkID: vpc-id
  networking:
    weave: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.165.14.96/27
    egress: nat-id
    id: subnet-id
    name: wkpr1
    type: Private
    zone: us-east-1a
  - cidr: 10.165.14.32/27
    egress: nat-id
    id: subnet-id
    name: wkpr2
    type: Private
    zone: us-east-1b
  - cidr: 10.165.14.64/27
    egress: nat-id
    id: subnet-id
    name: wkpr3
    type: Private
    zone: us-east-1c
  - cidr: 10.165.14.192/27
    id: subnet-id
    name: wkpu4
    type: Utility
    zone: us-east-1a
  - cidr: 10.165.14.128/27
    id: subnet-id
    name: wkpu2
    type: Utility
    zone: us-east-1b
  - cidr: 10.165.14.160/27
    id: subnet-id
    name: wkpu3
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

Master IG configuration

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-14T22:10:42Z
  labels:
    Application: Kubernetes cluster
    Environment: Production
    kops.k8s.io/cluster: mycluster
  name: master-us-east-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - wkpr1

kops version: Version 1.11.1 (git-0f2aa8d30)
kubectl version: Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:41:50Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Unable to connect to the server: EOF

All 24 comments

Do you think this is the same issue as #4301? I know this was reported first, but I think that issue has more info in it.

Ultimately the actual issue is probably #4156.

@grimborg are you using master or a released kops version?

I am also facing this problem using Kops 1.8.1

Same issue with Kops 1.8.1

Same issue with Kops 1.9.0-alpha.1

For me this was fixed in master. Seems odd that 1.9.0-alpha.1 would still have the issue. I should be standing up another cluster shortly and can report on what I find.

@mmacfadden - let me save you some time: I had to use master to get my already-running cluster back from this mess. Only master (built on Linux) worked.

@snoby I built from master on macOS using Homebrew, and that also worked.

I also came across this bug, using Kops 1.8.1 and K8s 1.10.0.

Then I tried reducing the number of master nodes from 3 to 1, recreated the cluster, and it worked. Far from ideal, but a good enough workaround for me while this bug gets resolved.
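
A sketch of that workaround, with an illustrative cluster name; note that this destroys and recreates the cluster, so it is only viable when the workloads can be rebuilt:

```shell
# WARNING: tears down the existing cluster entirely.
kops delete cluster --name mycluster.example.com --yes

# Recreate it with a single master instead of three.
kops create cluster \
  --name mycluster.example.com \
  --zones us-east-1a \
  --master-count 1 \
  --yes
```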

Has anyone checked whether kops 1.9.0 fixes this?

I'm using kops 1.9 and having this issue as well (with kube_version 1.9.7). It was able to create the cluster fine, and master was reachable (yay!).

However, after I updated the nodes configuration (kops edit ig nodes) and ran kops update cluster --yes && kops rolling-update cluster --yes, the master was OutOfService again. For some unknown reason it rebuilt the master and broke it.

It's really scary to use kops, and I have no _idea_ how to reach the cluster again.

Same here. Brand new cluster with kops 1.9.0, Kubernetes 1.10.2. Config bucket, masters, and nodes are all in the same region.

My issue was this: https://github.com/kubernetes/kops/issues/4844, so it was solved by moving the masters to m4 instances.
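
That fix amounts to changing the master instance group's machine type and rolling the masters; the instance group name below is illustrative:

```shell
# Open the master IG spec and change spec.machineType to m4.large.
kops edit ig master-us-east-1a

# Apply the change and replace the master instances.
kops update cluster --yes
kops rolling-update cluster --yes
```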

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

We are also experiencing the same issue in all 3 of our kops v1.16.0 clusters. However, all clusters validate successfully with kops validate cluster.

i-01a3axxxxxxx | nodes-us-east-1d.example.com | us-east-1d | InService
i-0f89f3xxxxxxx | nodes-us-east-1a.example.com | us-east-1a | InService
i-0c3943xxxxxxx | nodes-us-east-1b.example.com | us-east-1b | OutOfService
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T23:41:24Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.7", GitCommit:"be3d344ed06bff7a4fc60656200a93c74f31f9a4", GitTreeState:"clean", BuildDate:"2020-02-11T19:24:46Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

@demisx what are the health check settings on the load balancer? And does AWS provide any details on why the health check is failing (for example: timeout or connection refused)?

@rifelpet Thank you for your prompt response. This is what the "Health check" is set to. I am still digging for more info on why it fails the check. Strange that only 1 node is out, while the other 2 are fine. All masters are in service too.

Ping Target | HTTP:32502/healthz
Timeout | 5 seconds
Interval | 10 seconds
Unhealthy threshold | 6
Healthy threshold | 2
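
Those settings can also be read from the CLI, which is handy when comparing the three clusters; the `<elb-name>` placeholder stands in for the actual load balancer name:

```shell
# Show the health check configured on the classic ELB.
aws elb describe-load-balancers \
  --load-balancer-names <elb-name> \
  --query 'LoadBalancerDescriptions[0].HealthCheck'

# Per-instance state, including AWS's stated reason for a failing check.
aws elb describe-instance-health --load-balancer-name <elb-name>
```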

This is what I am getting back from each node. The "OutOfService" one returns "503 Service Unavailable". Any idea what may cause one of the nodes to return 503?

$ curl -I 172.20.53.122:32502/healthz
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:45:26 GMT
Content-Length: 105

$ curl -I 172.20.94.71:32502/healthz
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:47:19 GMT
Content-Length: 105

$ curl -I 172.20.127.105:32502/healthz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:47:47 GMT
Content-Length: 105

On the "OutOfService" node, the /healthz endpoint reports zero local endpoints:

$ curl localhost:32502/healthz
{
    "service": {
        "namespace": "default",
        "name": "nginx-ingress-controller"
    },
    "localEndpoints": 0
}
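
A `localEndpoints` count of 0 from this kube-proxy health check typically means the Service has `externalTrafficPolicy: Local` and no ready pod backing it is scheduled on that node, so the node correctly answers 503 to keep the ELB from routing traffic there. A quick check (the service name comes from the output above; the pod label selector is an assumption):

```shell
# Does the service only accept traffic on nodes with local endpoints?
kubectl get svc nginx-ingress-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}'

# Which nodes actually host ready endpoints for it?
kubectl get endpoints nginx-ingress-controller -o wide
kubectl get pods -l app=nginx-ingress-controller -o wide   # label is assumed
```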

I've created a new issue #8759 since my problem is slightly different than what was described by the OP.
