I expose a deployment using a Service of type LoadBalancer. That creates an ELB, but the EC2 instances registered with it never pass the health checks, so they are marked as OutOfService and I can't access my deployment from outside.
Am I doing something wrong? Or is this a kops bug?
cloud: aws
kops version: 1.8.0
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T20:00:41Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Do you think this is the same issue as #4301? I know this was reported first, but I think that issue has more info in it.
Ultimately the actual issue is probably #4156.
@grimborg are you using master or a released kops version?
I am also facing this problem using Kops 1.8.1
Same issue with Kops 1.8.1
Same issue with Kops 1.9.0-alpha.1
For me this was fixed in master. Seems odd that 1.9.0-alpha.1 would still have the issue. I should be standing up another cluster shortly and can report on what I find.
@mmacfadden - let me save you some time: I had to use master to get my already running cluster back from this mess. Only master (built on Linux) worked for me.
@snoby I built from master on macOS using Homebrew, and that also worked.
I also came across this bug, using Kops 1.8.1 and K8s 1.10.0.
Then I tried reducing the number of master nodes from 3 to 1, recreated the cluster, and it worked. Far from ideal, but a good enough workaround for me while this bug gets resolved.
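Roughly what that workaround looks like with kops, in case it saves anyone some typing (cluster name, zones, and state store below are placeholders; adjust to your environment):

export KOPS_STATE_STORE=s3://my-kops-state-store
kops delete cluster my.cluster.example.com --yes
kops create cluster my.cluster.example.com \
  --zones us-east-1a,us-east-1b,us-east-1c \
  --master-zones us-east-1a \
  --master-count 1 \
  --node-count 3 \
  --yes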
Has anyone checked whether kops 1.9.0 fixes this?
I'm using kops 1.9 and hitting this issue as well (with kube_version 1.9.7). Kops created the cluster fine, and the master was reachable (yay!).
However, after I updated the nodes configuration (kops edit ig nodes) and ran kops update cluster --yes && kops rolling-update cluster --yes, the master was OutOfService again. For some unknown reason it rebuilt the master and broke it.
Really scary to use kops, and no _idea_ how to reach the cluster again.
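For completeness, the sequence I ran was essentially the standard edit/update/rolling-update flow, nothing exotic:

kops edit ig nodes
kops update cluster --yes
kops rolling-update cluster --yes
# then check what kops thinks of the result
kops validate cluster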
Same here. Brand new cluster with Kops 1.9.0, Kubernetes 1.10.2. Config Bucket and master and nodes in the same region.
My issue was this: https://github.com/kubernetes/kops/issues/4844, so I solved it by moving the master to an m4 instance type.
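For anyone hitting the same thing, the change is just the master instance group's machineType followed by a rolling update (the IG name and instance size below are examples, not prescriptive):

kops edit ig master-us-east-1a   # set spec.machineType to e.g. m4.large
kops update cluster --yes
kops rolling-update cluster --yes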
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Hi all, I see this issue is pretty old but I am encountering the same one. My nodes and master nodes are up according to my AWS console, but the API ELB flags the master as OutOfService. How did you sort it out?
Please find my configuration below:
Cluster configuration
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: mycluster
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect" : "Allow",
          "Action" : ["sts:AssumeRole"],
          "Resource" : ["*"]
        }
      ]
    node: |
      [
        {
          "Effect" : "Allow",
          "Action" : ["sts:AssumeRole",
            "route53:ChangeResourceRecordSets",
            "route53:GetChange",
            "route53:ListHostedZones",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:CancelSpotInstanceRequests",
            "ec2:GetConsoleOutput",
            "ec2:RequestSpotInstances",
            "ec2:RunInstances",
            "ec2:StartInstances",
            "ec2:StopInstances",
            "ec2:TerminateInstances",
            "ec2:CreateTags",
            "ec2:DeleteTags",
            "ec2:DescribeInstances",
            "ec2:DescribeKeyPairs",
            "ec2:DescribeRegions",
            "ec2:DescribeImages",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSubnets",
            "iam:ListInstanceProfilesForRole",
            "iam:PassRole"],
          "Resource" : ["*"]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mys3bucket/mycluster
  docker:
    logDriver: ""
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer: {}
  kubelet:
    enableCustomMetrics: true
    featureGates:
      ExpandPersistentVolumes: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.8
  masterInternalName: api.internal.mycluster
  masterPublicName: api.mycluster
  networkCIDR: 10.165.14.0/24
  networkID: vpc-id
  networking:
    weave: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.165.14.96/27
    egress: nat-id
    id: subnet-id
    name: wkpr1
    type: Private
    zone: us-east-1a
  - cidr: 10.165.14.32/27
    egress: nat-id
    id: subnet-id
    name: wkpr2
    type: Private
    zone: us-east-1b
  - cidr: 10.165.14.64/27
    egress: nat-id
    id: subnet-id
    name: wkpr3
    type: Private
    zone: us-east-1c
  - cidr: 10.165.14.192/27
    id: subnet-id
    name: wkpu4
    type: Utility
    zone: us-east-1a
  - cidr: 10.165.14.128/27
    id: subnet-id
    name: wkpu2
    type: Utility
    zone: us-east-1b
  - cidr: 10.165.14.160/27
    id: subnet-id
    name: wkpu3
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
Master IG configuration
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-14T22:10:42Z
  labels:
    Application: Kubernetes cluster
    Environment: Production
    kops.k8s.io/cluster: mycluster
  name: master-us-east-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - wkpr1
kops version: Version 1.11.1 (git-0f2aa8d30)
kubectl version: Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:41:50Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Unable to connect to the server: EOF
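Not a fix, but two things worth checking when the api ELB marks a master OutOfService (the ELB name below is a placeholder; kops names it after the cluster):

# From the AWS side: what the health check probes and why the instance is unhealthy.
aws elb describe-load-balancers --load-balancer-names <api-elb-name> \
  --query 'LoadBalancerDescriptions[].HealthCheck'
aws elb describe-instance-health --load-balancer-name <api-elb-name>

# From the master itself (e.g. over SSH through a utility-subnet bastion):
# even a 401/403 response here means kube-apiserver is up, while a connection
# refused or timeout points at the apiserver or the security groups instead.
curl -k https://localhost/healthz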
We are also experiencing the same issue in all 3 kops v1.16.0 clusters. However, all clusters are successfully validated with kops validate cluster.
i-01a3axxxxxxx | nodes-us-east-1d.example.com | us-east-1d | InService
i-0f89f3xxxxxxx | nodes-us-east-1a.example.com | us-east-1a | InService
i-0c3943xxxxxxx | nodes-us-east-1b.example.com | us-east-1b | OutOfService
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T23:41:24Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.7", GitCommit:"be3d344ed06bff7a4fc60656200a93c74f31f9a4", GitTreeState:"clean", BuildDate:"2020-02-11T19:24:46Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
@demisx what are the health check settings on the load balancer? And does AWS provide any details on why the health check is failing (e.g. timeout or connection refused)?
@rifelpet Thank you for your prompt response. This is what the "Health check" is set to. I am still digging for more info on why it fails the check. Strange that only 1 node is out, while the other 2 are fine. All masters are in service too.
Ping Target | HTTP:32502/healthz
Timeout | 5 seconds
Interval | 10 seconds
Unhealthy threshold | 6
Healthy threshold | 2
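Side note: port 32502 in that ping target looks like the healthCheckNodePort Kubernetes allocates for a LoadBalancer Service with externalTrafficPolicy: Local, so something like this should show which Service owns it:

kubectl get svc --all-namespaces \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,POLICY:.spec.externalTrafficPolicy,HCPORT:.spec.healthCheckNodePort \
  | grep 32502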
This is what I am getting back from each node. The "OutOfService" one returns "503 Service Unavailable". Any idea what may cause one of the nodes to return 503?
$ curl -I 172.20.53.122:32502/healthz
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:45:26 GMT
Content-Length: 105
$ curl -I 172.20.94.71:32502/healthz
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:47:19 GMT
Content-Length: 105
$ curl -I 172.20.127.105:32502/healthz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Date: Sun, 15 Mar 2020 21:47:47 GMT
Content-Length: 105
On the "OutOfService" node, the /healthz endpoint responds but reports zero local endpoints for the service:
$ curl localhost:32502/healthz
{
  "service": {
    "namespace": "default",
    "name": "nginx-ingress-controller"
  },
  "localEndpoints": 0
}
I've created a new issue #8759 since my problem is slightly different than what was described by the OP.