Kops: Unable to connect to the server: EOF - Kops rolling update of 1.10.11

Created on 6 Dec 2018 · 30 comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display
this information.
1.10.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

projects$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-04T07:48:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Unable to connect to the server: EOF

(The server had 1.9.6.)

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster prod-xxx.k8s.local --state s3://prod-xxx-store --yes

5. What happened after the commands executed?
The master node is running in AWS but is not reachable via kubectl.

6. What did you expect to happen? get the nodes!

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

lifecycle/rotten


All 30 comments

I need help ASAP - this is a prod cluster.

Hi @pkelleratwork
Did you check that the IP address of api.${your_domain_name} in Route 53 is the same as your master's public IP address in EC2?

@cychiang no - I will do that now, thanks.

I'm having the exact same problem. After upgrading due to CVE-2018-1002105, the master is in ec2, but unreachable. Same EOF. Not sure what the Route 53 records are supposed to look like, so I'm having some trouble trying to manually fix this.

Validation fails with unexpected error during validation: error listing nodes: Get https://xxx.amazonaws.com/api/v1/nodes: EOF.

I'm new to kubernetes/kops, so that's not helping. Any suggestions welcome.
kops get --name company-platform-stage.k8s.local -o yaml output

Hi @ssoroka

Could you describe your problem more specifically? And where do those error messages come from?

Most of the cases I have faced when trying to rolling-update a cluster fall into two categories, and all were solved by updating IP addresses in Route 53 (see the sketch after this list):

  1. The master node is not available when you validate it with kubectl or kops. This is usually caused by the public IP address of api.{$your_domain_name} in the Route 53 hosted zone not being updated properly. In this case, you can fix it by updating the IP address in Route 53 manually. The address of api.{$your_domain_name} should be the same as the public address of the master node in EC2.
  2. Worker nodes can't reach the master node. This can be tricky because many things can cause it. You can check that the IP addresses of api.internal.{$your_domain_name}, etcd.internal.{$your_domain_name}, and etcd-event.internal.{$your_domain_name} are the same as the master node's private IP.
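
For example, a minimal sketch of checking the first case with the AWS CLI, assuming it is configured for the right account; the hosted zone ID, domain name, and the k8s.io/role/master=1 tag filter are assumptions to adapt to your own cluster:

# Public and private IPs of the running master (kops normally tags masters with k8s.io/role/master=1)
aws ec2 describe-instances \
  --filters "Name=tag:k8s.io/role/master,Values=1" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[PublicIpAddress,PrivateIpAddress]" --output text

# Current value of the api record in the hosted zone (Z0000000000 and example.com are placeholders)
aws route53 list-resource-record-sets --hosted-zone-id Z0000000000 \
  --query "ResourceRecordSets[?Name=='api.example.com.']"

If the two differ, update the A record (an UPSERT via aws route53 change-resource-record-sets, or the Route 53 console) so api.{$your_domain_name} points at the master's current public IP; the *.internal records from the second case should point at the private IP.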

Just FYI, we upgraded 3 of our clusters to 1.10.11 from 1.10.7 and had no issues at all. cc'ing myself for visibility into the issues other people are facing.

@rajatjindal Same here, no issues upgrading from 1.10.7.

@rajatjindal Upgrade from 1.10.6 to 1.10.11 with no issues.

@ssoroka your attached cluster.txt indicates you have set kubernetesVersion: 1.13.0
Kops does not support 1.13.0, and probably won't for an extended period of time, or until a corporation invests in paying for someone to work full-time on upgrades.

I would roll back to 1.10.11, which is the most recent known-good combination of kops and Kubernetes. Make sure to also roll back the image versions on your InstanceGroups.

I had no issues going from 1.9.11 -> 1.10.11 on 2 different clusters.

I would definitely recommend trying a version of kubectl which more closely matches your cluster though (Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0" indicates you are using kubectl v1.13).
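
If it helps, a rough sketch of that rollback, reusing the cluster name and state store from the original report (adapt both to your setup):

kops edit cluster prod-xxx.k8s.local --state s3://prod-xxx-store
# in the editor, set spec.kubernetesVersion: 1.10.11, then apply and roll the change out:
kops update cluster prod-xxx.k8s.local --state s3://prod-xxx-store --yes
kops rolling-update cluster prod-xxx.k8s.local --state s3://prod-xxx-store --yes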

Just wanted to add that I just ran into the same issue following the guide from https://github.com/kubernetes/kops/blob/master/docs/aws.md

Seems that the master node is failing ELB health checks and never getting put into service.

Update: I'm seeing the following line in journalctl -u kops-configuration.service

Dec 06 15:56:42 ip-172-20-41-62 nodeup[702]: I1206 15:56:42.065540     702 s3fs.go:219] Reading file "s3://<redacted>/<redacted>.k8s.local/cluster.spec: AccessDenied: Access Denied

So, yeah, that probably has something to do with it for me. Investigating...

Update 2: Yeah, I put my S3 bucket in US East and we're working fine now.
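
If someone else hits the same AccessDenied from nodeup, a quick sketch of what to check (the bucket and cluster names are placeholders): confirm which region the state-store bucket lives in and that the instance role can actually read the cluster spec.

# Region of the state-store bucket (an empty/null LocationConstraint means us-east-1)
aws s3api get-bucket-location --bucket your-kops-state-store

# From the master itself, confirm the instance profile can read the cluster spec
aws s3 ls s3://your-kops-state-store/your-cluster.k8s.local/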

I'm seeing an issue that smells the same: the API ELB has SSL:443 "ping" healthchecks that fail with a fresh cluster created with 'kops create'. Switching to TCP:443 makes the healthchecks pass. This is only an issue in us-east-2; us-east-1 works fine.

This change is related to the choice between TCP and SSL healthchecks for the API:

https://github.com/kubernetes/kops/commit/2accc73a724e287d0dd540856acae4382ed0263b

Note that on the affected test cluster in us-east-2 (the one that fails to see the master hosts as healthy), kube-apiserver.log shows "http: TLS handshake error" errors whether the ELB health check uses TCP or SSL (TCP passes, SSL fails).

Part of my problem is updating from 1.8.x -> 1.13.0. :/
I downgraded as advised, and did a rolling update and it seems to have resolved most of my problems.

If anyone's curious, I switched kubernetes-cli versions like so:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.10.11/bin/darwin/amd64/kubectl
chmod 555 kubectl
sudo mv kubectl /usr/local/bin/kubectl

Thanks!

We should probably add a warning if you're running a version of k8s outside of the known-good range for a particular version of kops.

Looking to see if I can reproduce your issue @markine - I'm guessing it's --topology=private, anything else "unusual" that might help me reproduce the issue?

I was able to reproduce the issue @markine - opened a separate issue for it: https://github.com/kubernetes/kops/issues/6181

Also opened #6182 for "warn if using a newer version of kubernetes"

Do we think there are any other issues? If not, I'd like to close this one and track in the specific issues (I'm trying to get more disciplined about issue management!)

Starting this morning, I've had a similar issue - I can't connect to my cluster (kops or kubectl) - constantly getting Unable to connect to the server: EOF or unexpected error during validation: error listing nodes: Get https://api.cluster_name/api/v1/nodes: EOF

But I didn't make any changes or upgrades - I am using kops 1.9 and kubernetes 1.8 - it simply stopped working this morning. I am not sure whether the newly opened issue #6181 addresses this case (I am using an ALB in sa-east-1).

Same error for me upgrading from 1.9.10 to 1.10.11; I switched to a TCP health check on the ELB manually and now all is good, but SSL does not work.
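
For anyone making the same manual change, a minimal sketch with the AWS CLI; the load balancer name is a placeholder, and since kops manages this ELB, a later kops update cluster may revert it:

# Switch the API ELB health check from SSL:443 to TCP:443
aws elb configure-health-check \
  --load-balancer-name api-prod-xxx-k8s-local \
  --health-check Target=TCP:443,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2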

What version of KOPS did you use to upgrade to 1.10?
I used Kops 1.10 for my 1.9.11 -> 1.10.11 upgrade and had no issues (on 2 clusters) in AWS (with private topologies).

Following this closely because I'm planning to upgrade a production cluster to 1.10.11 tomorrow morning...

It was kops version 1.10. It was a development tier cluster with private topology. I've postponed production until I know more about this.

After downgrading kubectl to 1.10:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

My EOF errors have decreased if not gone away entirely (pending more testing/usage I'll update here after a few days).

This has significantly improved our CI/CD pipeline for deployments to a kops-created cluster.

I am facing the same issue. I created a cluster, then edited it to change the etcd version, and then applied a rolling update.

Now I get the same issue:

I0208 19:27:18.009377   29054 instancegroups.go:209] Validating the cluster.
I0208 19:27:18.396086   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:27:48.583165   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:18.588650   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:48.597613   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:18.589114   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:48.594966   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
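
When validation loops like this, the API server on the replaced master is usually not coming up, so it is worth looking at the node directly. A sketch, assuming the default kops Debian image (the admin user and these log paths are the usual kops defaults, not guaranteed for custom images):

ssh admin@<master-public-ip>
sudo journalctl -u kubelet --no-pager | tail -n 50   # is kubelet running and starting the static pods?
sudo tail -n 50 /var/log/kube-apiserver.log          # TLS/auth errors, etcd connection errors, etc.
sudo tail -n 50 /var/log/etcd.log                    # an unhealthy etcd will also keep the API down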

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I am facing the same issue again after kops rolling-update --yes.
My kops version earlier was 1.11.1 and kubernetes version on server was 1.11.9.
I downloaded kops 1.14.0-alpha.1 to upgrade the kubernetes version to 1.14.1.

The rolling update was stuck in validation state (logs were similar so I copied the logs from the previous comment):

I0208 19:27:18.009377   29054 instancegroups.go:209] Validating the cluster.
I0208 19:27:18.396086   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:27:48.583165   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:18.588650   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:28:48.597613   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:18.589114   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.
I0208 19:29:48.594966   29054 instancegroups.go:270] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://<DOMAIN>/api/v1/nodes: EOF.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhyuditjain
Got the same error today upgrading from 1.10.12 to 1.11.10.

Warning

Maybe you hit more issues, or something different, because you updated across too many versions at a time.

Note: the kops upgrade doc (see https://github.com/kubernetes/kops/blob/master/docs/upgrade.md#automated-update) should probably mention:

  • that master should not be upgraded more than one major version at a time
  • that nodes should not be upgraded more than two major versions at a time

So masters should go from 1.10.x to 1.11.y then to 1.12.z then ... and finally 1.14.1.

Also, by default, kops upgrade sets the latest available version (say 1.14.1), when it should probably set 1.11.10 (or any 1.11.y version) while 1.10.12 is running, not 1.14.1!
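
A rough sketch of stepping through the versions one release at a time, repeated for each hop (this assumes KOPS_STATE_STORE and CLUSTER_NAME are exported; the hops listed are just the ones from this comment):

# Repeat for 1.11.x, then 1.12.x, then 1.13.x, and finally 1.14.1
kops edit cluster $CLUSTER_NAME             # set spec.kubernetesVersion to the next release
kops update cluster $CLUSTER_NAME --yes
kops rolling-update cluster $CLUSTER_NAME --yes
kops validate cluster --name $CLUSTER_NAME  # confirm the cluster is healthy before the next hop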

To solve it

In my case, after checking syslog for errors and searching quite a lot on the internet, I finally saw the error unknown flag: --enable-custom-metrics and found this issue: https://github.com/kubernetes/kops/issues/6449

So from my cluster definition I removed this:

  kubelet:
    enableCustomMetrics: true

which was preventing kubelet from starting on the masters, and then ran kops update cluster $CLUSTER_NAME --yes.

Then I should have run kops rolling-update cluster $CLUSTER_NAME --yes --instance-group master-<az> for every AZ where I had broken a master with that option.

But it got rejected because the cluster was not reachable.
The good thing is that this change is in the cluster spec on S3, which nodeup loads during cloud-init. So I triggered a reboot of my two updated master nodes and it worked: after a couple of minutes I could see the Kubernetes services running again on those masters and could run kubectl get pods again.
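
A minimal sketch of that last step, with placeholder instance IDs; rebooting works here because nodeup re-reads the corrected cluster spec from S3 during boot:

# Reboot the affected masters so nodeup picks up the fixed spec
aws ec2 reboot-instances --instance-ids i-0aaaaaaaaaaaaaaaa i-0bbbbbbbbbbbbbbbb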

Hi team, I keep getting the error below. Please help me find a solution:
unexpected error during validation: error listing nodes: Get https://api.acc.poc.example.com/api/v1/nodes: EOF

Same error when executing kubectl get nodes

Unable to connect to the server: EOF

Note: just after creating the kops cluster it works well, but after a period of time it gives the above error!
