Kops: LoadBalancer always reports instances as OutOfService

Created on 18 Jan 2018 · 24 Comments · Source: kubernetes/kops

I expose a deployment using a service of type LoadBalancer. That creates an ELB, but the EC2 instances attached to it never pass the health checks, so they are marked as OutOfService and I can't access my deployment from outside.

Am I doing something wrong? Or is this a kops bug?

cloud: aws
kops version: 1.8.0
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T20:00:41Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
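
For reference, the symptom can be reproduced and inspected with standard tooling; the deployment name and the `<elb-name>` placeholder below are illustrative, not from the original report:

```shell
# Expose a deployment via an ELB (deployment/service name is illustrative).
kubectl expose deployment my-app --port=80 --type=LoadBalancer

# Show the external hostname of the ELB Kubernetes created for the service.
kubectl get svc my-app -o wide

# Ask AWS for the per-instance health state and the reason a check fails
# (substitute the actual ELB name from the AWS console).
aws elb describe-instance-health --load-balancer-name <elb-name>
```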

lifecycle/rotten

Most helpful comment

Hi all, I see this issue is pretty old, but I am encountering the same one. My nodes and master nodes are up as per my AWS console, but the API ELB flags the master as OutOfService. How did you sort it out?

Please find my configuration below:

Cluster configuration

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: mycluster
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect" : "Allow",
          "Action" : ["sts:AssumeRole"],
          "Resource" : ["*"]
        }
      ]
    node: |
      [
        {
          "Effect" : "Allow",
          "Action" : ["sts:AssumeRole",
                      "route53:ChangeResourceRecordSets",
                      "route53:GetChange",
                      "route53:ListHostedZones",
                      "ec2:DescribeSpotInstanceRequests",
                      "ec2:CancelSpotInstanceRequests",
                      "ec2:GetConsoleOutput",
                      "ec2:RequestSpotInstances",
                      "ec2:RunInstances",
                      "ec2:StartInstances",
                      "ec2:StopInstances",
                      "ec2:TerminateInstances",
                      "ec2:CreateTags",
                      "ec2:DeleteTags",
                      "ec2:DescribeInstances",
                      "ec2:DescribeKeyPairs",
                      "ec2:DescribeRegions",
                      "ec2:DescribeImages",
                      "ec2:DescribeAvailabilityZones",
                      "ec2:DescribeSecurityGroups",
                      "ec2:DescribeSubnets",
                      "iam:ListInstanceProfilesForRole",
                      "iam:PassRole"],
          "Resource" : ["*"]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mys3bucket/mycluster
  docker:
    logDriver: ""
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer: {}
  kubelet:
    enableCustomMetrics: true
    featureGates:
      ExpandPersistentVolumes: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.8
  masterInternalName: api.internal.mycluster
  masterPublicName: api.mycluster
  networkCIDR: 10.165.14.0/24
  networkID: vpc-id
  networking:
    weave: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.165.14.96/27
    egress: nat-id
    id: subnet-id
    name: wkpr1
    type: Private
    zone: us-east-1a
  - cidr: 10.165.14.32/27
    egress: nat-id
    id: subnet-id
    name: wkpr2
    type: Private
    zone: us-east-1b
  - cidr: 10.165.14.64/27
    egress: nat-id
    id: subnet-id
    name: wkpr3
    type: Private
    zone: us-east-1c
  - cidr: 10.165.14.192/27
    id: subnet-id
    name: wkpu4
    type: Utility
    zone: us-east-1a
  - cidr: 10.165.14.128/27
    id: subnet-id
    name: wkpu2
    type: Utility
    zone: us-east-1b
  - cidr: 10.165.14.160/27
    id: subnet-id
    name: wkpu3
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

Master IG configuration

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-14T22:10:42Z
  labels:
    Application: Kubernetes cluster
    Environment: Production
    kops.k8s.io/cluster: mycluster
  name: master-us-east-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - wkpr1

kops version: Version 1.11.1 (git-0f2aa8d30)
kubectl version: Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:41:50Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Unable to connect to the server: EOF

All 24 comments

Do you think this is the same issue as #4301? I know this was reported first, but I think that issue has more info in it.

Ultimately the actual issue is probably #4156.

@grimborg are you using master or a released kops version?

I am also facing this problem using Kops 1.8.1

Same issue with Kops 1.8.1

Same issue with Kops 1.9.0-alpha.1

For me this was fixed in master. Seems odd that 1.9.0-alpha.1 would still have the issue. I should be standing up another cluster shortly and can report on what I find.

@mmacfadden - let me save you some time: I had to use master to get my already-running cluster back from this mess. Only master (built on Linux) worked.

@snoby I built from master on macOS using Homebrew, and that also worked.

I also came across this bug, using Kops 1.8.1 and K8s 1.10.0.

Then I tried reducing the number of master nodes from 3 to 1, recreated the cluster, and it worked. Far from ideal, but a good enough workaround for me while this bug gets resolved.
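
A sketch of that workaround, with an illustrative cluster name; note that this destroys and recreates the cluster, so it is only viable when the workloads can be rebuilt:

```shell
# WARNING: tears down the existing cluster entirely.
kops delete cluster --name mycluster.example.com --yes

# Recreate it with a single master instead of three.
kops create cluster \
  --name mycluster.example.com \
  --zones us-east-1a \
  --master-count 1 \
  --yes
```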

Has anyone checked whether kops 1.9.0 fixes this?

I'm using kops 1.9 and having this issue as well (with kube_version 1.9.7). It was able to create the cluster fine, and master was reachable (yay!).

However, after I updated the nodes configuration (kops edit ig nodes) and ran kops update cluster --yes && kops rolling-update cluster --yes, the master was OutOfService again. For some unknown reason it rebuilt the master and broke it.

It's really scary to use kops, and I have no _idea_ how to reach the cluster again.

Same here. Brand new cluster with kops 1.9.0, Kubernetes 1.10.2. Config bucket, masters, and nodes are all in the same region.

My issue was this: https://github.com/kubernetes/kops/issues/4844, so it was solved by moving the masters to m4 instances.
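
That fix amounts to changing the master instance group's machine type and rolling the masters; the instance group name below is illustrative:

```shell
# Open the master IG spec and change spec.machineType to m4.large.
kops edit ig master-us-east-1a

# Apply the change and replace the master instances.
kops update cluster --yes
kops rolling-update cluster --yes
```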

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

We are also experiencing the same issue in all 3 of our kops v1.16.0 clusters. However, all clusters validate successfully with kops validate cluster.

i-01a3axxxxxxx | nodes-us-east-1d.example.com | us-east-1d | InService
i-0f89f3xxxxxxx | nodes-us-east-1a.example.com | us-east-1a | InService
i-0c3943xxxxxxx | nodes-us-east-1b.example.com | us-east-1b | OutOfService
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T23:41:24Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.7", GitCommit:"be3d344ed06bff7a4fc60656200a93c74f31f9a4", GitTreeState:"clean", BuildDate:"2020-02-11T19:24:46Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

@demisx what are the health check settings on the load balancer? And does AWS provide any details on why the health check is failing (for example: timeout or connection refused)?

@rifelpet Thank you for your prompt response. This is what the "Health check" is set to. I am still digging for more info on why it fails the check. Strange that only 1 node is out, while the other 2 are fine. All masters are in service too.

Ping Target | HTTP:32502/healthz
Timeout | 5 seconds
Interval | 10 seconds
Unhealthy threshold | 6
Healthy threshold | 2
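
Those settings can also be read from the CLI, which is handy when comparing the three clusters; the `<elb-name>` placeholder stands in for the actual load balancer name:

```shell
# Show the health check configured on the classic ELB.
aws elb describe-load-balancers \
  --load-balancer-names <elb-name> \
  --query 'LoadBalancerDescriptions[0].HealthCheck'

# Per-instance state, including AWS's stated reason for a failing check.
aws elb describe-instance-health --load-balancer-name <elb-name>
```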

This is what I am getting back from each node. The "OutOfService" one returns "503 Service Unavailable". Any idea what may cause one of the nodes to return 503?

$ curl -I 172.20.53.122:32502/healthz
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:45:26 GMT
Content-Length: 105

$ curl -I 172.20.94.71:32502/healthz
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:47:19 GMT
Content-Length: 105

$ curl -I 172.20.127.105:32502/healthz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:47:47 GMT
Content-Length: 105

On the "OutOfService" node, the /healthz endpoint reports zero local endpoints:

$ curl localhost:32502/healthz
{
    "service": {
        "namespace": "default",
        "name": "nginx-ingress-controller"
    },
    "localEndpoints": 0
}
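
A `localEndpoints` count of 0 from this kube-proxy health check typically means the Service has `externalTrafficPolicy: Local` and no ready pod backing it is scheduled on that node, so the node correctly answers 503 to keep the ELB from routing traffic there. A quick check (the service name comes from the output above; the pod label selector is an assumption):

```shell
# Does the service only accept traffic on nodes with local endpoints?
kubectl get svc nginx-ingress-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}'

# Which nodes actually host ready endpoints for it?
kubectl get endpoints nginx-ingress-controller -o wide
kubectl get pods -l app=nginx-ingress-controller -o wide   # label is assumed
```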

I've created a new issue #8759 since my problem is slightly different than what was described by the OP.
