rke 0.2.8 fails to deploy k8s 1.14.6, rke-network-plugin deploy fails + workaround

Created on 19 Sep 2019 · 16 Comments · Source: rancher/rke

rke v0.2.8 doesn't seem to be able to deploy new kubernetes v1.14.6-rancher1 clusters. Earlier kubernetes versions deploy and work OK for me. The problem shows up when I try to deploy the latest k8s v1.14.6 version (the default in rke v0.2.8): rke's "rke-network-plugin" deployment job always fails with a FATAL error.

Environment:

docker hosts: CentOS 7.6 VMs, running native docker 1.13.1-102 (selinux enabled).
rke v0.2.8.
kubernetes v1.14.6-rancher1 (default).
networking: canal (default).

I'm using stock rke "cluster.yml" configuration generated with "rke config" for 3 nodes, except I've added bastion host info to cluster.yml, to be able to access the 3x docker hosts (fresh CentOS 7.6 VMs) I have.

"rke --debug up" always fails like this:

INFO[0271] [addons] Executing deploy job rke-network-plugin
DEBU[0271] [k8s] waiting for job rke-network-plugin-deploy-job to complete..
FATA[0301] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

The FATAL error shows up about 30 seconds after the "rke-network-plugin" deployment job is started (0271 to 0301 in the log above). If I check the logs on the docker hosts, all seems good: there are no errors that I can find, and networking seems to be successfully configured/enabled.

I spent many hours debugging and troubleshooting this until I noticed that the logs contained lines about the "docker.io/rancher/pause:3.1" container image not being there, so it was pulled as part of the "rke-network-plugin" deploy job..

..which made me wonder if this was some kind of timing/timeout issue. So I decided to pre-pull the container image on all the docker hosts before running "rke up", and that was it, now "rke-network-plugin" deployment job/task succeeds properly (in around 5 seconds) !!
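The pre-pull workaround described above could be scripted roughly as follows. This is only a sketch: the host names and the `HOSTS`, `IMAGE`, `DRY_RUN` and `prepull_image` names are placeholders of mine (not part of rke), and it assumes plain SSH access to each docker host, so add your own bastion/jump options as needed. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: pre-pull the pause image on every docker host before "rke up".
# HOSTS, IMAGE, DRY_RUN and prepull_image are hypothetical names, not part of rke.
HOSTS="${HOSTS:-node1 node2 node3}"   # placeholder host names
IMAGE="${IMAGE:-rancher/pause:3.1}"   # image named in the logs above
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands

prepull_image() {
    # $1 = host, $2 = image
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh $1 docker pull $2"
    else
        ssh "$1" docker pull "$2"
    fi
}

for host in $HOSTS; do
    prepull_image "$host" "$IMAGE"
done
```

Running it with `DRY_RUN=0` and a real `HOSTS` list would perform the pulls; after that, "rke up" should find the image already cached on each host.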

So it seems the "rke-network-plugin" deployment job clearly is broken with the "complete status" detection. If there's a maximum timeout configured, it needs to be increased, as that would allow the "rke up" to succeed without me first pre-pulling the container image manually on all the hosts.

I didn't take a look at the rke sources yet, but hopefully this is enough information for the first brave soul who wants to investigate and fix this issue :)

Some of the similar/related issues:


pasikarkkainen


Most helpful comment

I'll investigate more and post the info here.

Another workaround: if I run "rke up" (it fails with the FATAL error described above) and then simply re-run "rke up", the second run succeeds, probably because the pause image is already on the hosts by then (so it's pretty much the same scenario as when I pre-pull the image on the hosts before running rke up).
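The "just run it again" workaround can be wrapped in a small retry helper. This is a rough sketch, not an rke feature: `retry` is a hypothetical function of mine, and the attempt count of 2 in the usage comment simply mirrors the two runs described above.

```shell
#!/bin/sh
# Sketch: retry a command a few times, mirroring the "run rke up twice" workaround.
# retry() is a hypothetical helper, not something rke provides.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command exactly as given; stop at the first success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (not executed here):
#   retry 2 rke up --config cluster.yml
```

Note that this only hides the symptom; the first run still fails for the same reason.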

Something I noticed from the logs with a quick look:

when rke up fails with the FATAL error about rke-network-plugin deployment job, the nodes are still working on bringing the networking up.. only some 30-60 seconds AFTER rke has already failed the nodes get the log entry "Calico node started successfully".

pasikarkkainen on 19 Sep 2019


All 16 comments

The pause image is the same across (almost) all Kubernetes versions, and is used as "parent" container for other containers, in this case CNI/network is the first deployed and it is pulled. How long did the pull take when you pulled it manually? And what are the logs from the kubelet? Is the pull actually slow or is there some issue with the kubelet? What node specifications are you using?

This sounds identical to https://github.com/rancher/rancher/issues/19713#issuecomment-488006346, where a reference to pulling the pause container solved the issue as well.

superseb on 19 Sep 2019


Same for 0.2.7 and 0.3.0: changing an existing cluster works (the image is already there), but adding a new node or provisioning a new cluster ends with the error. The second attempt is always successful.

marcinbojko on 4 Oct 2019

Please specify the requested info in https://github.com/rancher/rke/issues/1652#issuecomment-533101626, or use the emoji reaction +1 if you just want to upvote the issue.

superseb on 4 Oct 2019

Just ran into the same issue repeatedly while trying to deploy a 3 node rke (v0.3.0) cluster. Considering it reliably works when running the command again, I agree with the previous posters that this seems to be an issue with the way the timeout is set and/or the way completion of rke-network-plugin-deploy-job is determined. I never ran into this when deploying onto RancherOS nodes, which I guess already have the image cached while Ubuntu doesn't.

rke:

...
INFO[0172] Starting container [rke-log-cleaner] on host [NODE3], try #1 
INFO[0172] Starting container [rke-log-cleaner] on host [NODE2], try #1 
INFO[0172] Starting container [rke-log-cleaner] on host [NODE1], try #1 
INFO[0172] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE2] 
INFO[0172] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE3] 
INFO[0172] Removing container [rke-log-cleaner] on host [NODE3], try #1 
INFO[0172] Removing container [rke-log-cleaner] on host [NODE2], try #1 
INFO[0172] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE1] 
INFO[0173] Removing container [rke-log-cleaner] on host [NODE1], try #1 
INFO[0173] [remove/rke-log-cleaner] Successfully removed container on host [NODE2] 
INFO[0173] [remove/rke-log-cleaner] Successfully removed container on host [NODE3] 
INFO[0173] [remove/rke-log-cleaner] Successfully removed container on host [NODE1] 
INFO[0173] [sync] Syncing nodes Labels and Taints       
INFO[0173] [sync] Successfully synced nodes Labels and Taints 
INFO[0173] [network] Setting up network plugin: canal   
INFO[0173] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes 
INFO[0173] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes 
INFO[0173] [addons] Executing deploy job rke-network-plugin 
FATA[0203] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

kube status a few seconds after the error:

> kubectl get nodes
  kubectl get pods --all-namespaces
NAME      STATUS   ROLES                      AGE    VERSION
[NODE3]   Ready    controlplane,etcd,worker   101s   v1.16.1
[NODE2]   Ready    controlplane,etcd,worker   81s    v1.16.1
[NODE1]   Ready    controlplane,etcd,worker   82s    v1.16.1
NAMESPACE     NAME                                  READY   STATUS      RESTARTS   AGE
kube-system   canal-7hcvd                           2/2     Running     0          47s
kube-system   canal-jr8qz                           2/2     Running     0          47s
kube-system   canal-xmk27                           1/2     Running     0          47s
kube-system   rke-network-plugin-deploy-job-pxm46   0/1     Completed   0          74s 
> kubectl logs rke-network-plugin-deploy-job-pxm46 -n kube-system
configmap/canal-config created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/canal created
serviceaccount/canal created

running `rke up --config cluster.yml` again

...
INFO[0042] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE3] 
INFO[0042] Removing container [rke-log-cleaner] on host [NODE3], try #1 
INFO[0042] [remove/rke-log-cleaner] Successfully removed container on host [NODE1] 
INFO[0042] [remove/rke-log-cleaner] Successfully removed container on host [NODE2] 
INFO[0042] [remove/rke-log-cleaner] Successfully removed container on host [NODE3] 
INFO[0042] [sync] Syncing nodes Labels and Taints       
INFO[0042] [sync] Successfully synced nodes Labels and Taints 
INFO[0042] [network] Setting up network plugin: canal   
INFO[0042] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes 
INFO[0042] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes 
INFO[0042] [addons] Executing deploy job rke-network-plugin 
INFO[0042] [addons] Setting up coredns                  
INFO[0042] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes 
INFO[0042] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes 
INFO[0042] [addons] Executing deploy job rke-coredns-addon 
INFO[0048] [addons] CoreDNS deployed successfully..     
INFO[0048] [dns] DNS provider coredns deployed successfully 
INFO[0048] [addons] Setting up Metrics Server           
INFO[0048] [addons] Saving ConfigMap for addon rke-metrics-addon to Kubernetes 
INFO[0048] [addons] Successfully saved ConfigMap for addon rke-metrics-addon to Kubernetes 
INFO[0048] [addons] Executing deploy job rke-metrics-addon 
INFO[0053] [addons] Metrics Server deployed successfully 
INFO[0053] [ingress] Setting up nginx ingress controller 
INFO[0053] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes 
INFO[0053] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes 
INFO[0053] [addons] Executing deploy job rke-ingress-controller 
INFO[0058] [ingress] ingress controller nginx deployed successfully 
INFO[0058] [addons] Setting up user addons              
INFO[0058] [addons] no user addons defined              
INFO[0058] Finished building Kubernetes cluster successfully

node specs

Node 1: 8 cores, 30GB RAM, SSD storage
Node 2: 6 cores, 32GB RAM, SSD storage
Node 3: 4 cores, 8GB RAM, SSD storage

Node OS: Ubuntu 18.04.3 LTS
Node Docker Version: 18.09.7

cluster.yml

nodes:
  - address: NODE1
    user: root
    role: [controlplane,worker,etcd]
    hostname_override: NODE1
  - address: NODE2
    user: root
    role: [controlplane,worker,etcd]
    hostname_override: NODE2
  - address: NODE3
    user: root
    role: [worker,controlplane,etcd]
    hostname_override: NODE3

cluster_name: test

ssh_agent_auth: true

kubernetes_version: v1.16.1-rancher1-1

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
  kubelet:
    extra_args:
      volume-plugin-dir: /var/lib/kubelet/volumeplugins
      resolv-conf: "/run/systemd/resolve/resolv.conf"
      feature-gates: "VolumeSnapshotDataSource=true, CSIDriverRegistry=true, PersistentLocalVolumes=true"
    extra_binds:
      - /var/lib/kubelet/volumeplugins:/var/lib/kubelet/volumeplugins
  kubeapi:
    extra_args:
      feature-gates: "VolumeSnapshotDataSource=true, CSIDriverRegistry=true, PersistentLocalVolumes=true"

related issue in rancher: rancher/rancher#19713

yannleretaille on 17 Oct 2019

[...] it'll succeed, probably because now the pause-image is already on the hosts (so it's pretty much the same scenario as when I pre-pull the image on the hosts before running rke up).

Something I noticed from the logs with a quick look:

when rke up fails with the FATAL error about rke-network-plugin deployment job, the nodes are still working on bringing the networking up.. only some 30-60 seconds AFTER rke has already failed the nodes get the log entry "Calico node started successfully".

Yeah, re-running rke solved the problem. You saved my day.

LaysDragon on 13 Jan 2020

Hmm, does rke "addon_job_timeout" affect rke-network-plugin deploy? I guess I'll try that..

pasikarkkainen on 13 Jan 2020

Following this, as I am hitting the same issue with rke version v1.0.2.

Re-running it fixes the issue, but this is not possible when using terraform-rke-provider: https://github.com/rancher/terraform-provider-rke/issues/152

olegTarassov on 30 Jan 2020

I forgot to post the latest status here earlier, so here goes.. I managed to work around this problem by configuring the rke "addon_job_timeout" to a larger value than the default 30 seconds. It seems that in my environment a 30 second addon job timeout is not long enough for the rke-network-plugin deploy job to finish successfully.

With a larger rke "addon_job_timeout" I don't need the "pre-pull rancher/pause:3.1 image" hack either.
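For anyone wanting to script this change, here is a rough sketch that sets the top-level "addon_job_timeout" key (in seconds) in a cluster.yml. The `set_addon_job_timeout` helper and the value 120 are illustrative choices of mine, not a recommendation from rke; pick whatever value suits your environment.

```shell
#!/bin/sh
# Sketch: set the top-level "addon_job_timeout" (seconds) in an rke cluster.yml.
# set_addon_job_timeout is a hypothetical helper; 120 is an arbitrary example value.
set_addon_job_timeout() {
    file=$1
    seconds=$2
    if grep -q '^addon_job_timeout:' "$file"; then
        # Key already present: replace its value in place (keeps a .bak copy).
        sed -i.bak "s/^addon_job_timeout:.*/addon_job_timeout: $seconds/" "$file"
    else
        # Key absent: append it as a new top-level entry.
        printf '\naddon_job_timeout: %s\n' "$seconds" >> "$file"
    fi
}

# Usage (not executed here):
#   set_addon_job_timeout cluster.yml 120 && rke up --config cluster.yml
```

Editing cluster.yml by hand and adding the single line `addon_job_timeout: 120` at the top level has the same effect.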

pasikarkkainen on 23 Apr 2020

Oh, and I had the same issue with more recent versions as well (rke v0.2.10 and kubernetes v1.14.10), but as I said, configuring the rke "addon_job_timeout" to a higher value seems to fix the problem for me.

pasikarkkainen on 23 Apr 2020

I solved this by turning off SELinux @yannleretaille

yeongjet on 16 May 2020

I had the same issue, and these two steps solved my problem:

Increase addon_job_timeout
Check node free space (at least 15%)

In my case, one of the nodes was in the DiskPressure state.
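The free-space check can be done by hand on each node with something like the following. It is only a sketch: `low_space` is a hypothetical helper that reads `df -P` style output on stdin, and the 15 is the threshold mentioned above, not a value taken from the kubelet configuration.

```shell
#!/bin/sh
# Sketch: list mount points with less than a given percentage of free space.
# low_space is a hypothetical helper; it reads "df -P" style output on stdin.
low_space() {
    min_free=$1
    awk -v min="$min_free" 'NR > 1 {
        used = $5
        sub(/%/, "", used)          # "87%" -> "87"
        if (100 - used < min) print $6, (100 - used) "% free"
    }'
}

# Run against the local node; 15 matches the threshold mentioned above.
df -P | low_space 15
```

On the cluster side, `kubectl describe nodes` lists each node's conditions, including DiskPressure.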

AliMD on 16 May 2020

I hit the same issue after I added the customer certificate to the kube-controller in cluster.yaml; there was no rancher/pause container running.

Aisuko on 17 Jul 2020

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.