Kops: masters not ready when using kops 1.6.0-alpha.2 to upgrade a 1.5 cluster to 1.6 (due to weave upgrade failing)

Created on 18 Apr 2017 · 14 Comments · Source: kubernetes/kops

I am trying to upgrade a test cluster in AWS running Kubernetes 1.5.2 that was created using kops 1.5.1 to Kubernetes 1.6.1 using kops 1.6.0-alpha.2. Prior to the upgrade attempt, I made sure to create the kube-dns configmap mentioned in the kops 1.6.0-alpha.1 release notes first (since the alpha.2 release notes say to look at the alpha.1 release notes). This cluster uses private topology and weave networking, and has 3 masters and 2 nodes.
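For completeness, that configmap step just amounts to creating an (empty) kube-dns configmap in kube-system; a sketch of the command, though the release notes should be treated as authoritative:

  # Create the kube-dns configmap ahead of the upgrade (sketch; per the 1.6.0-alpha.1 release notes)
  kubectl create configmap kube-dns -n kube-system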

When I do the upgrade and rolling update, I always seem to end up with two of my masters stuck reporting that they are not ready, indefinitely. However, if I bring up a Kubernetes 1.6.1 cluster from scratch using kops 1.6.0-alpha.2, everything seems to come up fine. I noticed on the upgraded cluster that weave is not running on the masters that are in "not ready" status. I did some searching and found this issue, which talks about a change to update the weave daemonset to use tolerations as a field instead of an annotation.
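For reference, the upgrade and rolling update above follow the usual kops flow; a rough sketch, with the cluster name and state store as placeholders:

  # Placeholders: substitute your own cluster name and state store bucket
  export NAME=devitopskube.ontsys.com
  export KOPS_STATE_STORE=s3://my-kops-state-store

  kops upgrade cluster $NAME --yes         # bump the cluster spec to the new Kubernetes version
  kops update cluster $NAME --yes          # apply the updated configuration
  kops rolling-update cluster $NAME --yes  # replace instances so they pick up the new version
  kubectl get nodes                        # this is where the masters stay NotReady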

I took a look at the weave-net daemonset deployed to the cluster. I noticed in the 1.6.1 cluster I brought up from scratch, it appears to be this daemonset (which uses weave 1.9.4 and has tolerations as a field). However, on the cluster I upgraded from 1.5.2, the weave-net daemonset appears to have not been upgraded. It looks like this one from kops 1.5.1.
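A quick way to see which manifest is actually deployed is to check the image tags and whether tolerations are present as a field; a sketch:

  # Show the images and tolerations of the live weave-net daemonset
  kubectl -n kube-system get ds weave-net -o jsonpath='{.spec.template.spec.containers[*].image}'; echo
  kubectl -n kube-system get ds weave-net -o jsonpath='{.spec.template.spec.tolerations}'; echo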

If I manually upgrade the daemonset with kubectl (by deleting the old one and then creating the new one), weave comes up everywhere, and all the systems finally say they are ready.
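Concretely, the manual workaround was along these lines (a sketch; weave-k8s-1.6.yaml is a placeholder for the k8s-1.6 manifest linked above):

  # Remove the stale daemonset left over from the 1.5 addon...
  kubectl -n kube-system delete daemonset weave-net
  # ...then create the new one from the k8s-1.6 manifest (placeholder filename)
  kubectl apply -f weave-k8s-1.6.yaml
  # Wait for the weave pods to come back and the masters to report Ready
  kubectl -n kube-system get pods -l name=weave-net -o wide
  kubectl get nodes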

I have tried this several times and seem to be pretty consistently getting the same results. Curiously, after the upgrade, weave does end up running on one master (although it appears to be the old version).

When running kops update cluster command without --yes, I did note that it had this in the output:

  ManagedFile/devitopskube.ontsys.com-addons-networking.weave-k8s-1.6
    Location                addons/networking.weave/k8s-1.6.yaml

  ManagedFile/devitopskube.ontsys.com-addons-networking.weave-pre-k8s-1.6
    Location                addons/networking.weave/pre-k8s-1.6.yaml

~~However, when I look in the s3 bucket, the only file in that location is v1.8.2.yaml.~~ EDIT: I was wrong on that last sentence... the k8s-1.6.yaml and pre-k8s-1.6.yaml are in the bucket in addition to the v1.8.2.yaml. (Apparently it takes a refresh of the browser to make sure you are seeing everything.) However, everything else mentioned above still stands.
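For anyone wanting to double-check their own state store, the addon manifests can be listed directly from S3 (a sketch; the bucket name is a placeholder, the cluster name is mine):

  # List the weave addon manifests kops wrote to the state store
  aws s3 ls s3://my-kops-state-store/devitopskube.ontsys.com/addons/networking.weave/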

P1 blocks-next

All 14 comments

I discovered that protokube running on the masters is the thing that applies the cluster addons. After doing the cluster upgrade and rolling update, I noticed this over and over in the docker logs for protokube on one of the masters:

I0420 18:01:39.248077    1452 apply.go:66] Running command: kubectl apply -f /tmp/channel105077004/manifest.yaml
I0420 18:01:39.490693    1452 apply.go:69] error running kubectl apply -f /tmp/channel105077004/manifest.yaml
I0420 18:01:39.490717    1452 apply.go:70] clusterrole "weave-net" configured
serviceaccount "weave-net" configured
clusterrolebinding "weave-net" configured
The DaemonSet "weave-net" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"name":"weave-net"}: `selector` does not match template `labels`
Error: error updating "networking.weave": error applying update from "networking.weave/k8s-1.6.yaml": error running kubectl
Usage:
  channels apply channel [flags]

Flags:
      --f stringSlice   Apply from a local file
      --yes             Apply update

Global Flags:
      --alsologtostderr                  log to standard error as well as files
      --config string                    config file (default is $HOME/.channels.yaml)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files (default false)
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs (default 0)
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging


error updating "networking.weave": error applying update from "networking.weave/k8s-1.6.yaml": error running kubectl
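For reference, the output above came from the protokube container's docker logs on the master (protokube runs as a docker container on each master); roughly:

  # On the master: find the protokube container, then pull its logs
  docker ps | grep protokube
  docker logs <protokube-container-id> 2>&1 | grep -i weave   # id from the line above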

@justinsb is this a protokube mismatch?

This issue, at least, needs to be triaged before the 1.6 release.

@keiths-osc did you try this with a full custom build? Building a new version of protokube may be required.

@chrislovecnm I did not try with a custom build, just the 1.6.0-alpha.2 binary downloaded from github. I ran a docker inspect of the protokube container running on the kube masters and did notice that the image used was protokube:1.6.0-alpha.2.
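That check was along these lines (sketch):

  # Confirm which protokube image the master is actually running
  docker ps | grep protokube
  docker inspect --format '{{.Config.Image}}' <protokube-container-id>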

After I noticed the error message in the protokube docker logs, I got the weave manifest from here and tried to apply it to the kube cluster manually using kubectl apply with kubectl version 1.6.1 (since I was upgrading the cluster to Kubernetes 1.6.1). I got the exact same error message that protokube got:

clusterrole "weave-net" configured
serviceaccount "weave-net" configured
clusterrolebinding "weave-net" configured
The DaemonSet "weave-net" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"name":"weave-net"}: `selector` does not match template `labels`

To get it to work I had to delete the old daemonset first. After that the apply worked fine.
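The validation failure says the existing daemonset's selector no longer matches the template labels in the new manifest, so an in-place apply cannot succeed and a delete-and-recreate is needed. A sketch of how to see the mismatch (weave-k8s-1.6.yaml is a placeholder for the new manifest):

  # Live selector on the old daemonset vs. labels in the new manifest
  kubectl -n kube-system get ds weave-net -o jsonpath='{.spec.selector}'; echo
  grep -E -A3 'matchLabels|labels:' weave-k8s-1.6.yaml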

I can try a custom build if you would like. If so I would appreciate some pointers on the best way to make sure that the protokube version I build actually gets used on the servers that are created by kops.

See https://github.com/kubernetes/kops/blob/master/hack/dev-build.sh; that is what I use. https://github.com/kubernetes/kops/issues/2307 is an issue about making the developer builds easier.

  1. you need a public S3 bucket to upload the files into (or protect it with IAM; I do not)
  2. you need to run the make target that uploads your files
  3. you need to set the correct environment variables when running kops so that it pulls protokube and nodeup from your bucket (see the sketch after this list)
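To make item 3 concrete, a rough sketch of the environment I believe kops expects for a dev build (variable names and layout taken from hack/dev-build.sh as I read it; treat them as assumptions and check the script for the current form):

  # Assumed variable names and bucket layout; see hack/dev-build.sh for the authoritative version
  export KOPS_BASE_URL=https://my-kops-dev-bucket.s3.amazonaws.com/kops/dev   # placeholder bucket
  export NODEUP_URL=${KOPS_BASE_URL}/linux/amd64/nodeup
  export PROTOKUBE_IMAGE=${KOPS_BASE_URL}/images/protokube.tar.gz

  # Then run kops as usual; nodes it builds should pull nodeup and protokube from the bucket
  kops update cluster $NAME --yes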

We have developer office hours on every other Friday if you want to swing by. The info is in the project README.

I just upgraded 1.5.3 to 1.6.0 and this issue came up, but the only fix I had to do was kubectl edit ds weave-net --namespace kube-system and change the version string from 1.8.3 (I think) to 1.9.0, and it came up without issues. I did not even have to nuke the pods.

I think you could do this (for now) right after you do kops upgrade cluster but before you do kops rolling-update.
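If editing the daemonset interactively is awkward, the same change can be made non-interactively; a sketch, assuming the container in the weave-net daemonset is named weave (verify with kubectl describe first):

  # Bump the weave image on the existing daemonset without opening an editor
  # (container name "weave" and the target tag are assumptions; check your daemonset first)
  kubectl -n kube-system set image ds/weave-net weave=weaveworks/weave-kube:1.9.0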

@chrislovecnm That dev-build.sh script was helpful for seeing how to do everything. I did two tests. First, I made sure I could reproduce the original problem using my own local builds. I built the 1.5.1 tag of kops and created a cluster running kube 1.5.2. I then built the 1.6.0-alpha.2 tag of kops and upgraded the cluster to kube 1.6.1. Protokube gave the same error as before. After this, I started over with a fresh kube 1.5.2 cluster built from the kops 1.5.1 tag, then built the kops master branch and used that to upgrade the cluster. I did not see any errors regarding weave in the protokube logs, and the weave daemonset appears to have upgraded.

After that, I took another look at the history for the weave daemonset manifest, and I ended up finding this recent changeset. I believe that fixes this issue.

I suspect this is what just hit me on an upgrade from 1.5.6 to 1.6.2 using kops 1.6.0-beta.1. After the upgrade my nodes kept breaking with PLEG unhealthy failures, and I eventually traced that down to weave being totally broken and dropping connections all the time.

I was in a rush, so I moved to flannel in order to fix the cluster in question. I've got two more clusters to upgrade to 1.6, so I can verify whether the daemonset is busted when I do those.

I think this was fixed with kops 1.6.0-alpha.2, because when I upgraded my production cluster from 1.5.2 --> 1.6.2 it worked without issues.

I tested upgrading a cluster running Kubernetes 1.5.2 that was created using kops 1.5.1 to Kubernetes 1.6.2 using kops 1.6.0-beta.1 twice, and I ended up with masters saying they were not ready both times. The cluster had 3 masters and 2 nodes. Weave was not running on the masters that were stuck as NotReady. (The first time one master was not ready; the second time two masters were not ready.) Unlike before, however, I did _not_ see the above error in protokube, and the weave-net daemonset appears to have applied to the cluster. (I dumped out the manifest with kubectl and it referred to weave 1.9.4.) I used kubectl to delete the weave-net daemonset and then manually applied the weave manifest, and weave came up on all the systems and everything started saying it was ready.

When I first logged this issue, I was upgrading a cluster running Kubernetes 1.5.2 that was created using kops 1.5.1 to Kubernetes 1.6.1 using kops 1.6.0-alpha.2 (the latest at the time), and the error in protokube (posted above) occurred every time. (I tried more than a dozen times.) Protokube was attempting to apply this daemonset on top of this daemonset, which could not succeed. (I was able to reproduce the same error message just using kubectl apply.) The changes made by this changeset appear to have fixed that problem, and it looks like that changeset was first released in the 1.6.0-beta.1 release. (The commit comments on that changeset mention the exact same error message I was seeing in protokube.) That issue existed in 1.6.0-alpha.2 and went away in 1.6.0-beta.1. However, it seems that there is still some problem that ultimately results in the same symptom of masters being stuck as NotReady.

When I previously reported that I had tried this same test using a local build from master and that it seemed to work fine, the git revision I tested was 02fa859b20a2656b073ca4aa968544d0ad59b9c8 (the latest on that day). That revision is somewhere in between the 1.6.0-alpha.2 and 1.6.0-beta.1 tags. Since I was asked to try that before, I did a similar test again using both the latest commit on the master branch and the 1.6.0-beta.1 tag. With the 1.6.0-beta.1 build, I once again had a master stuck in NotReady status with weave not running on it. Manually deleting the weave-net daemonset and recreating it seemed to get things unstuck. With master (revision 2e5fa8916719ef1b77f4f9bc374ad67518a690c8) all systems seemed to upgrade fine. (None got stuck in NotReady status.) There are only a handful of commits between 1.6.0-beta.1 and that revision, and if one of them happened to fix the issue, it wasn't obvious to me which one at a quick glance.
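For anyone who wants to narrow it down, those commits can be listed from a kops checkout (sketch):

  # Commits between the 1.6.0-beta.1 tag and the master revision that worked for me
  git log --oneline 1.6.0-beta.1..2e5fa8916719ef1b77f4f9bc374ad67518a690c8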

Same issue here upgrading from 1.5 to the latest kops. One master node (out of three) was NotReady. Terminating and recreating the node did not solve it.

Weave pod was running on all nodes except the NotReady master.

Solutions suggested by @keiths-osc worked:

  • I deleted the weave daemonset
  • I applied this

Upgrading from 1.5 to 1.6.2 using kops 1.6.2. I was able to fix it using @keiths-osc suggestions. I tried with this version but it didn't work. I always got this error:

$ kubectl --namespace=kube-system apply -f weave.yml
clusterrole "weave-net" configured
serviceaccount "weave-net" configured
clusterrolebinding "weave-net" configured
error: error converting YAML to JSON: yaml: line 54: could not find expected ':'

I believe this has been resolved... Closing.
