Kops: Unable to connect to the server: EOF - Kops rolling update of 1.10.11

Created on 6 Dec 2018 · 30 comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display
this information.
1.10.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

projects$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-04T07:48:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Unable to connect to the server: EOF

(The server had 1.9.6.)

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster prod-xxx.k8s.local --state s3://prod-xxx-store --yes

5. What happened after the commands executed?
The master node is running in AWS but is not reachable via kubectl.

6. What did you expect to happen? get the nodes!

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

lifecycle/rotten


All 30 comments

I need help ASAP - this is a prod cluster.

Hi @pkelleratwork
Did you check that the IP address of api.${your_domain_name} in Route 53 is the same as your master's public IP address in EC2?

@cychiang no - I will do that now, thanks.

I'm having the exact same problem. After upgrading due to CVE-2018-1002105, the master is in ec2, but unreachable. Same EOF. Not sure what the Route 53 records are supposed to look like, so I'm having some trouble trying to manually fix this.

Validation fails with unexpected error during validation: error listing nodes: Get https://xxx.amazonaws.com/api/v1/nodes: EOF.

I'm new to kubernetes/kops, so that's not helping. Any suggestions welcome.
kops get --name company-platform-stage.k8s.local -o yaml output

Hi @ssoroka

Could you describe your problem more specifically? And where do those error messages come from?

Most of the cases I have faced when trying to rolling-update a cluster fall into two categories, and all were solved by updating IP addresses in Route 53 (see the sketch after this list):

  1. The master node is not available when you validate it with kubectl or kops. This is usually caused by the public IP address of api.{$your_domain_name} in the Route 53 hosted zone not being updated properly. In this case, you can fix it by updating the IP address in Route 53 manually. The address of api.{$your_domain_name} should be the same as the public address of the master node in EC2.
  2. Worker nodes can't reach the master node. This can be tricky because many things can cause it. You can check that the IP addresses of api.internal.{$your_domain_name}, etcd.internal.{$your_domain_name}, and etcd-event.internal.{$your_domain_name} are the same as the master node's private IP.
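
For example, a minimal sketch of checking the first case with the AWS CLI, assuming it is configured for the right account; the hosted zone ID, domain name, and the k8s.io/role/master=1 tag filter are assumptions to adapt to your own cluster:

# Public and private IPs of the running master (kops normally tags masters with k8s.io/role/master=1)
aws ec2 describe-instances \
  --filters "Name=tag:k8s.io/role/master,Values=1" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[PublicIpAddress,PrivateIpAddress]" --output text

# Current value of the api record in the hosted zone (Z0000000000 and example.com are placeholders)
aws route53 list-resource-record-sets --hosted-zone-id Z0000000000 \
  --query "ResourceRecordSets[?Name=='api.example.com.']"

If the two differ, update the A record (an UPSERT via aws route53 change-resource-record-sets, or the Route 53 console) so api.{$your_domain_name} points at the master's current public IP; the *.internal records from the second case should point at the private IP.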

Just FYI, we upgraded 3 of our clusters to 1.10.11 from 1.10.7 and had no issues at all. cc'ing myself for visibility into the issues other people are facing.

@rajatjindal Same here, no issues upgrading from 1.10.7.

@rajatjindal Upgrade from 1.10.6 to 1.10.11 with no issues.

@ssoroka your attached cluster.txt indicates you have set kubernetesVersion: 1.13.0
Kops does not support 1.13.0, and probably won't for an extended period of time, or until a corporation invests in paying for someone to work full-time on upgrades.

I would roll back to 1.10.11, which is the most recent known-good combination of kops and Kubernetes. Make sure to also roll back the image versions on your InstanceGroups.

I had no issues going from 1.9.11 -> 1.10.11 on 2 different clusters.

I would definitely recommend trying a version of kubectl which more closely matches your cluster though (Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0" indicates you are using kubectl v1.13).
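
If it helps, a rough sketch of that rollback, reusing the cluster name and state store from the original report (adapt both to your setup):

kops edit cluster prod-xxx.k8s.local --state s3://prod-xxx-store
# in the editor, set spec.kubernetesVersion: 1.10.11, then apply and roll the change out:
kops update cluster prod-xxx.k8s.local --state s3://prod-xxx-store --yes
kops rolling-update cluster prod-xxx.k8s.local --state s3://prod-xxx-store --yes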

Just wanted to add that I just ran into the same issue following the guide from https://github.com/kubernetes/kops/blob/master/docs/aws.md

Seems that the master node is failing ELB health checks and never getting put into service.

Update: I'm seeing the following line in journalctl -u kops-configuration.service

Dec 06 15:56:42 ip-172-20-41-62 nodeup[702]: I1206 15:56:42.065540     702 s3fs.go:219] Reading file "s3://<redacted>/<redacted>.k8s.local/cluster.spec: AccessDenied: Access Denied

So, yeah, that probably has something to do with it for me. Investigating...

Update 2: Yeah, I put my S3 bucket in US East and we're working fine now.
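
If someone else hits the same AccessDenied from nodeup, a quick sketch of what to check (the bucket and cluster names are placeholders): confirm which region the state-store bucket lives in and that the instance role can actually read the cluster spec.

# Region of the state-store bucket (an empty/null LocationConstraint means us-east-1)
aws s3api get-bucket-location --bucket your-kops-state-store

# From the master itself, confirm the instance profile can read the cluster spec
aws s3 ls s3://your-kops-state-store/your-cluster.k8s.local/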

I'm seeing an issue that smells the same: the API ELB has SSL:443 "ping" healthchecks that fail with a fresh cluster created with 'kops create'. Switching to TCP:443 makes the healthchecks pass. This is only an issue in us-east-2; us-east-1 works fine.

This change is related to the choice between TCP and SSL healthchecks for the API:

https://github.com/kubernetes/kops/commit/2accc73a724e287d0dd540856acae4382ed0263b

Note that on the affected test cluster in us-east-2 (the one that fails to see the master hosts as healthy), kube-apiserver.log shows "http: TLS handshake error" errors whether the ELB health check uses TCP or SSL (TCP passes, SSL fails).

Part of my problem is updating from 1.8.x -> 1.13.0. :/
I downgraded as advised, and did a rolling update and it seems to have resolved most of my problems.

If anyone's curious, I switched kubernetes-cli versions like so:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.10.11/bin/darwin/amd64/kubectl
chmod 555 kubectl
sudo mv kubectl /usr/local/bin/kubectl

Thanks!

We should probably add a warning if you're running a version of k8s outside of the known-good range for a particular version of kops.

Looking to see if I can reproduce your issue @markine - I'm guessing it's --topology=private, anything else "unusual" that might help me reproduce the issue?

I was able to reproduce the issue @markine - opened a separate issue for it: https://github.com/kubernetes/kops/issues/6181

Also opened #6182 for "warn if using a newer version of kubernetes"

Do we think there are any other issues? If not, I'd like to close this one and track in the specific issues (I'm trying to get more disciplined about issue management!)

Starting this morning, I've had a similar issue - I can't connect to my cluster (kops or kubectl) - constantly getting Unable to connect to the server: EOF or unexpected error during validation: error listing nodes: Get https://api.cluster_name/api/v1/nodes: EOF

But I didn't make any changes or upgrades - I am using kops 1.9 and kubernetes 1.8 - it simply stopped working this morning. I am not sure whether the newly opened issue #6181 addresses this case (I am using an ALB in sa-east-1).

Same error for me upgrading from 1.9.10 to 1.10.11; I switched to a TCP health check on the ELB manually and now all is good, but SSL does not work.
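
For anyone making the same manual change, a minimal sketch with the AWS CLI; the load balancer name is a placeholder, and since kops manages this ELB, a later kops update cluster may revert it:

# Switch the API ELB health check from SSL:443 to TCP:443
aws elb configure-health-check \
  --load-balancer-name api-prod-xxx-k8s-local \
  --health-check Target=TCP:443,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2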

What version of KOPS did you use to upgrade to 1.10?
I used Kops 1.10 for my 1.9.11 -> 1.10.11 upgrade and had no issues (on 2 clusters) in AWS (with private topologies).

Following this closely because I'm planning to upgrade a production cluster to 1.10.11 tomorrow morning...

It was kops version 1.10. It was a development tier cluster with private topology. I've postponed production until I know more about this.

After downgrading kubectl to 1.10:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

My EOF errors have decreased if not gone away entirely (pending more testing/usage I'll update here after a few days).

This has significantly improved our CI/CD pipeline for deployments to a kops-created cluster.

I am facing the same issue. I created a cluster, then edited it to change the etcd version, and then applied a rolling update.

Now I get the same issue:

I0208 19:27:18.009377   29054 instancegroups.go:209] Validating the cluster.
I0208 19:27:18.396086   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:27:48.583165   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:18.588650   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:48.597613   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:18.589114   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:48.594966   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
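
When validation loops like this, the API server on the replaced master is usually not coming up, so it is worth looking at the node directly. A sketch, assuming the default kops Debian image (the admin user and these log paths are the usual kops defaults, not guaranteed for custom images):

ssh admin@<master-public-ip>
sudo journalctl -u kubelet --no-pager | tail -n 50   # is kubelet running and starting the static pods?
sudo tail -n 50 /var/log/kube-apiserver.log          # TLS/auth errors, etcd connection errors, etc.
sudo tail -n 50 /var/log/etcd.log                    # an unhealthy etcd will also keep the API down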

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I am facing the same issue again after kops rolling-update --yes.
My kops version earlier was 1.11.1 and kubernetes version on server was 1.11.9.
I downloaded kops 1.14.0-alpha.1 to upgrade the kubernetes version to 1.14.1.

The rolling update was stuck in validation state (logs were similar so I copied the logs from the previous comment):

I0208 19:27:18.009377   29054 instancegroups.go:209] Validating the cluster.
I0208 19:27:18.396086   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:27:48.583165   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:18.588650   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:48.597613   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:18.589114   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:48.594966   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhyuditjain
Got the same error today upgrading from 1.10.12 to 1.11.10.

Warning

Maybe you hit more issues, or something different, because you updated across too many versions at a time.

Note: the kops upgrade doc (see https://github.com/kubernetes/kops/blob/master/docs/upgrade.md#automated-update) should probably mention:

  • that master should not be upgraded more than one major version at a time
  • that nodes should not be upgraded more than two major versions at a time

So masters should go from 1.10.x to 1.11.y then to 1.12.z then ... and finally 1.14.1.

Also, by default, kops upgrade sets the latest available version (say 1.14.1), when it should probably set 1.11.10 (or any 1.11.y version) while 1.10.12 is running, not 1.14.1!
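
A rough sketch of stepping through the versions one release at a time, repeated for each hop (this assumes KOPS_STATE_STORE and CLUSTER_NAME are exported; the hops listed are just the ones from this comment):

# Repeat for 1.11.x, then 1.12.x, then 1.13.x, and finally 1.14.1
kops edit cluster $CLUSTER_NAME             # set spec.kubernetesVersion to the next release
kops update cluster $CLUSTER_NAME --yes
kops rolling-update cluster $CLUSTER_NAME --yes
kops validate cluster --name $CLUSTER_NAME  # confirm the cluster is healthy before the next hop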

To solve it

In my case, after checking syslog for errors and searching quite a lot on the internet, I finally saw the error unknown flag: --enable-custom-metrics and found this issue: https://github.com/kubernetes/kops/issues/6449

So from my cluster definition I removed this:

  kubelet:
    enableCustomMetrics: true

which was preventing kubelet from starting on the masters, and then ran kops update cluster $CLUSTER_NAME --yes.

Then I should have run kops rolling-update cluster $CLUSTER_NAME --yes --instance-group master-<az> for every AZ where I had broken a master with that option.

But it got rejected because the cluster was not reachable.
The good thing is that this change is in the cluster spec on S3, which nodeup loads during cloud-init. So I triggered a reboot of my two updated master nodes and it worked: after a couple of minutes I could see the Kubernetes services running again on those masters and could run kubectl get pods again.
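
A minimal sketch of that last step, with placeholder instance IDs; rebooting works here because nodeup re-reads the corrected cluster spec from S3 during boot:

# Reboot the affected masters so nodeup picks up the fixed spec
aws ec2 reboot-instances --instance-ids i-0aaaaaaaaaaaaaaaa i-0bbbbbbbbbbbbbbbb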

Hi team, I keep getting the error below. Please help me find a solution:
unexpected error during validation: error listing nodes: Get https://api.acc.poc.example.com/api/v1/nodes: EOF

Same error when executing kubectl get nodes

Unable to connect to the server: EOF

Note: just after creating the kops cluster it works well, but after a period of time it gives the above error!
