Kops: New masters could not join the cluster during rolling update and nodeup

Created on 10 Sep 2018 · 8 comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display
this information.

Version 1.10.0 (git-782ff1358)

This is branch release-1.10 with one commit cherry-picked

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.16", GitCommit:"e8846c1d7e7e632d4bd5ed46160eff3dc4c993c5", GitTreeState:"clean", BuildDate:"2018-04-04T08:55:07Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:52:34Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Edit instance type using:

kops edit ig master*
kops edit ig nodes
kops update cluster --yes
kops rolling-update cluster --yes

This introduced a lot of modifications, because the cluster was originally created with an older kops (around 1.5.x) and is now being managed with kops 1.10.

5. What happened after the commands executed?

Two masters were drained, but one of them was not deleted; two new instances were created but did not join the cluster.

In the meantime, etcd is unhealthy because only one master is left, so I can read the status of the cluster but cannot modify it.

After the first failed attempt, falling back to older versions of kops did not help, and with kops below 1.8.1, kops update cluster reported cert errors:

error running task "keypair/master" (9m46s remaining to succeed): error loading certificate "s3://s3-bucket/cluster-name/pki/issued/master/keyset.yaml": could not parse certificate

So I SSHed into the new instances and found that protokube is the only container running.

sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
protokube           1.10.0              757f84ea7739        3 weeks ago         278.5 MB

sudo docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
01c03c9f13a8        protokube:1.10.0    "/usr/bin/protokube -"   43 minutes ago      Up 43 minutes                           distracted_stonebraker

Most of protokube's log output looks like:

  • kubelet systemd service not running
  • error querying kubernetes version: Get http://127.0.0.1:8080/version: dial tcp 127.0.0.1:8080: getsockopt: connection refused
  • Get http://127.0.0.1:8080/api/v1/nodes?labelSelector=kubernetes.io%2Frole%3Dmaster: dial tcp 127.0.0.1:8080: getsockopt: connection refused

More logs at https://gist.github.com/idealhack/384cbb319448f4e58d160c1a97a51451
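The connection-refused lines above are protokube waiting on a local apiserver that never started. A minimal sketch of the same probe, run on the node (the endpoint and port are taken from the log lines; whether anything answers depends on the node's state):

```shell
# Probe the insecure apiserver endpoint protokube is polling
# (http://127.0.0.1:8080, per the log lines above). On a broken master
# this keeps failing because kubelet never launched the control-plane
# static pods.
if curl -s --max-time 2 http://127.0.0.1:8080/version >/dev/null; then
  echo "apiserver answering"
else
  echo "apiserver unreachable"
fi
```

If this stays unreachable, the next thing to check is kubelet itself (which the first log line already reports as not running).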

6. What did you expect to happen?

The rolling update should work as normal.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

https://gist.github.com/idealhack/8c5f8aaaf6e942a312eadc03209306b6

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?



All 8 comments

After the first failed attempt, falling back to older versions of kops did not help, and with kops below 1.8.1, kops update cluster reported cert errors.

It seems this is a separate issue, and similar issues have been reported before.

I am having a similar problem going from 1.9 to 1.10, it seems the masters don't have etcd or api manifests in /etc/kubernetes/manifests, also DNS records for the etcd nodes and internal API are not updated in route53.

I updated my cluster after updating Kops to 1.10 and everything went well:

kops update cluster sandbox.example.com --yes
kops rolling-update cluster sandbox.example.com --yes

I am now having the same issue while trying to update the instance type of a single master instance group. I have 3 instance groups for my masters. I did the following:

kops edit ig master-us-east-1a
kops update cluster sandbox.example.com --yes
kops rolling-update cluster sandbox.example.com --instance-group master-us-east-1a --yes

While rolling update, the master-us-east-1a instance is killed and recreated with the new instance type I specified. It is not able to join the cluster and I don't know what to look for anymore.

Kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-08T16:31:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T20:55:30Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Kops version
Version 1.10.0

@AlexRRR It seems your new master instances behave somewhat differently from mine? Also, after finding that, did you try to fix it manually?

@pric Maybe you can look into the new master instances, as AlexRRR and I did.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I have the same issue. I SSHed to the node, used journalctl -f to tail the logs, and found:
F0103 23:42:39.063663 8109 server.go:148] unknown flag: --enable-custom-metrics

The solution can be found here: https://github.com/kubernetes/kops/issues/1467
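Per that issue, the crash is kubelet rejecting a flag that newer releases removed. A quick way to spot this on the node is to scan the flags kubelet was started with; note that kubelet_flags below is a sample string standing in for the real command line from journalctl, not actual output from this cluster:

```shell
# Scan kubelet's command line for flags removed in newer releases.
# kubelet_flags is a SAMPLE standing in for the real flags shown by
# `journalctl -u kubelet`; --enable-custom-metrics is the flag from
# the fatal error above.
kubelet_flags='--enable-custom-metrics=true --v=2 --cloud-provider=aws'

for f in --enable-custom-metrics; do
  case "$kubelet_flags" in
    *"$f"*) echo "removed flag still set: $f" ;;
  esac
done
```

The actual fix, per the linked issue, is to remove the corresponding setting from the cluster spec (likely via kops edit cluster) so the flag is no longer rendered for kubelet.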

We should never upgrade across multiple versions at a time.

Since there's nothing to be done here...
/close

@idealhack: Closing this issue.

In response to this:

We should never upgrade across multiple versions at a time.

Since there's nothing to be done here...
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
