Rke: Calico node failed to start after upgrading the cluster

Created on 21 Mar 2019 · 11 Comments · Source: rancher/rke

RKE version:
v0.1.17

I have a Kubernetes cluster of 3 nodes on bare metal. The cluster was initialized with RKE 0.1.16 and was running v1.12.4-rancher1-1. Yesterday I updated RKE and then upgraded the cluster to the latest supported Kubernetes version (v1.13.4-rancher1-1).

Everything seemed fine, but one canal pod is in CrashLoopBackOff status and the log for the calico-node container shows:

2019-03-20 23:57:58.163 [INFO][8] startup.go 244: Early log level set to info
2019-03-20 23:57:58.163 [INFO][8] startup.go 260: Using NODENAME environment for node name
2019-03-20 23:57:58.163 [INFO][8] startup.go 272: Determined node name: kube1
2019-03-20 23:57:58.164 [INFO][8] startup.go 304: Checking datastore connection
2019-03-20 23:57:58.235 [INFO][8] startup.go 328: Datastore connection verified
2019-03-20 23:57:58.235 [INFO][8] startup.go 92: Datastore is ready
2019-03-20 23:57:58.542 [INFO][8] node.go 77: Error updating Node resource error=nodes "kube1" is forbidden: User "system:serviceaccount:kube-system:canal" cannot update resource "nodes/status" in API group "" at the cluster scope
2019-03-20 23:57:58.542 [ERROR][8] startup.go 152: Unable to set node resource configuration error=connection is unauthorized: nodes "kube1" is forbidden: User "system:serviceaccount:kube-system:canal" cannot update resource "nodes/status" in API group "" at the cluster scope
2019-03-20 23:57:58.542 [WARNING][8] startup.go 1004: Terminating
Calico node failed to start

Any idea what might be the issue? Thanks in advance.

To Triage kind/bug priority/1

tlvenn

👍4

Most helpful comment

@Darkeye9 maybe a team member could confirm, but I think the best practice is not to set any image version unless you're really sure what you're doing.
When upgrading RKE, be aware that the default Kubernetes version changes (https://github.com/rancher/rke/releases), so setting kubernetes_version explicitly is probably a good idea.

remche on 1 Apr 2019

👍2

All 11 comments

Same here with a brand new cluster, calico CNI. Versions 0.1.17 and 0.2.0. Still working with 0.1.16.

remche on 22 Mar 2019

For 1.12, there is no nodes/status update permission in the template:
https://github.com/rancher/rke/blob/e79da956e948c16231d823715f1f0f7a297d8ef2/templates/calico.go

I don't know why it's broken with 0.2.0 too, though, as the permission seems to be set:
https://github.com/rancher/rke/blob/e79da956e948c16231d823715f1f0f7a297d8ef2/templates/calico.go#L565
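
For reference, the permission named in the error message corresponds to an RBAC rule of roughly the following shape. This is a minimal sketch, not a copy of the RKE template: the ClusterRole name is hypothetical, and only the update verb from the error message is shown. In a real cluster the rule comes from the calico/canal addon template that rke up applies.

```yaml
# Illustrative only: the name below is made up for this example.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: canal-node-status-example
rules:
  # "" is the core API group, matching 'API group ""' in the error.
  - apiGroups: [""]
    # nodes/status is a subresource; a rule on "nodes" alone does not cover it.
    resources: ["nodes/status"]
    verbs: ["update"]
```

To check whether the service account currently has the permission, kubectl auth can-i update nodes --subresource=status --as=system:serviceaccount:kube-system:canal should answer yes or no.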

remche on 22 Mar 2019

@tlvenn did you check that image versions are not hardcoded in the cluster.yml?

remche on 22 Mar 2019

👍1

Yes, I only specified kubernetes_version, which I had updated to v1.13.4-rancher1-1 once I updated RKE to the latest version. Somehow, after I ran rke up one or two more times, the problem disappeared.

tlvenn on 22 Mar 2019

I hit this after upgrading from 1.11.6 to 1.11.9. But it does not seem to affect connectivity, since I suppose there is nothing to update, and the flannel container has started just fine.

@remche One question regarding the upgrade procedure. What I do is let the new RKE binary generate a new config file, then diff it against my saved cluster.yml and merge the relevant changes. I saw the kubernetes_version parameter, but searching through the RKE source code I did not see anything handling it (maybe I missed it).
So, if I understood correctly, I can blank the Docker image parameters and only specify the kubernetes_version parameter?

Currently I have the following images defined. Could you eyeball them, in case there is some conflict that eventually led to this incident with calico?

  etcd: rancher/coreos-etcd:v3.2.18
  alpine: rancher/rke-tools:v0.1.15
  nginx_proxy: rancher/rke-tools:v0.1.15
  cert_downloader: rancher/rke-tools:v0.1.15
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.15
  kubedns: rancher/k8s-dns-kube-dns-amd64:1.14.10
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny-amd64:1.14.10
  kubedns_sidecar: rancher/k8s-dns-sidecar-amd64:1.14.10
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  coredns: ""
  coredns_autoscaler: ""
  kubernetes: rancher/hyperkube:v1.11.9-rancher1
  flannel: rancher/coreos-flannel:v0.10.0
  flannel_cni: rancher/coreos-flannel-cni:v0.3.0
  calico_node: rancher/calico-node:v3.1.3
  calico_cni: rancher/calico-cni:v3.1.3
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.1.3
  canal_cni: rancher/calico-cni:v3.1.3
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.1.2
  weave_cni: weaveworks/weave-npc:2.1.2
  pod_infra_container: rancher/pause-amd64:3.1
  ingress: rancher/nginx-ingress-controller:0.16.2-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.4
  metrics_server: rancher/metrics-server-amd64:v0.2.1

Darkeye9 on 31 Mar 2019

remche on 1 Apr 2019

👍2

@remche Thanks, that seemed to fix the issue. Although I did not see any new service images being pulled, calico works now. Still, I can't be sure this was the fix; it may have been the several retries, as @tlvenn mentioned.

One way or another, it gave me several hours of downtime and a big headache... Thanks all!

Darkeye9 on 3 Apr 2019

I think you should pin only kubernetes_version if you want rke to manage the images and addon templates automatically. If you want to manage the images yourself, you can always check the supported images using rke config --system-images; however, you still need to pin kubernetes_version for the addon templates to be updated correctly.
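
A minimal sketch of what that advice looks like in cluster.yml, assuming the version from the original report. Only the relevant keys are shown; the nodes section and the rest of the file are left out.

```yaml
# cluster.yml (fragment): pin the Kubernetes version and let rke pick
# the matching system images and addon templates.
kubernetes_version: v1.13.4-rancher1-1

# system_images is intentionally omitted here. If you do override images,
# keep kubernetes_version set as well, and take the image values from
# `rke config --system-images` for the rke release you are running.
```

After changing the file, re-run rke up so the images and addon templates are reconciled against the pinned version.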

galal-hussein on 4 Apr 2019

I confirm that I had to run rke up twice. One calico pod was in CrashLoopBackOff.

remche on 8 Apr 2019

Oh, I hit the same issue, so we need to wait for v0.3.0?

rke version v0.2.7
Kubernetes version v1.14.5

rke v0.2.8 with the same issue

CrashLoopBackOff: Back-off 5m0s restarting failed container=calico-node pod=canal-5zfmc_kube-system(8c8e2fb6-d005-11e9-8e2e-0050569ea834)

Aisuko on 5 Sep 2019

I just had the same issue in one of my rke/kubernetes clusters. canal wouldn't work, with the same errors as in the original report in this issue. Also, coredns was in a faulty state.

While investigating the issue I noticed my cluster.yml didn't have the "kubernetes_version" option set, i.e. I only had specific "system_images" versions listed, but no "kubernetes_version" configured at all.

My "system_images" versions were set to a specific version (v1.11.9-rancher1) and related addon versions, but that seems to confuse rke. When checking the "cluster.rkestate" file I found that "kubernetes_version" was set to v1.15.5 there, and probably because of that, coredns for example was deployed by rke, which is not correct.

So I set "kubernetes_version" in cluster.yml to "v1.11.9-rancher1" (matching my "system_images" versions) and re-ran "rke up". coredns was then deleted, and after that canal started working OK as well.
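
The resulting configuration can be sketched as follows. The kubernetes_version value is the one reported in this comment; the hyperkube image line is an assumption based on the v1.11.9-rancher1 image list posted earlier in the thread, and the other pinned images are elided.

```yaml
# cluster.yml (fragment): keep kubernetes_version in step with the
# pinned images so rke picks addon templates for the same release.
kubernetes_version: v1.11.9-rancher1   # value as reported in this comment

system_images:
  # Assumed to match the image list posted earlier in the thread.
  kubernetes: rancher/hyperkube:v1.11.9-rancher1
  # ...remaining pinned images unchanged
```

With the key missing, the comment above suggests rke fell back to the version recorded in cluster.rkestate (v1.15.5), which is why addons for a newer release were deployed next to the older images.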

Problem solved for me.