rke v0.2.8 doesn't seem to be able to deploy new kubernetes v1.14.6-rancher1 clusters. Earlier kubernetes versions seem to deploy/work OK for me. Problems show up when I'm trying to deploy the latest k8s v1.14.6 version (which is default in rke v0.2.8), as rke's "rke-network-plugin" deployment task/job always fails with a FATAL error.
Environment:
I'm using stock rke "cluster.yml" configuration generated with "rke config" for 3 nodes, except I've added bastion host info to cluster.yml, to be able to access the 3x docker hosts (fresh CentOS 7.6 VMs) I have.
"rke --debug up" always fails like this:
INFO[0271] [addons] Executing deploy job rke-network-plugin
DEBU[0271] [k8s] waiting for job rke-network-plugin-deploy-job to complete..
FATA[0301] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
The FATAL error shows up about 30 seconds after the "rke-network-plugin" deployment job was started (compare the timestamps in the log lines above). If I check the logs on the docker hosts, all seems good: there are no errors that I can find, and networking seems to be successfully configured/enabled.
I spent many hours debugging and troubleshooting this until I noticed that the logs contained lines about the "docker.io/rancher/pause:3.1" container image not being there, so it was pulled as part of the "rke-network-plugin" deploy job..
..which made me wonder if this was some kind of timing/timeout issue. So I decided to pre-pull the container image on all the docker hosts before running "rke up", and that was it, now "rke-network-plugin" deployment job/task succeeds properly (in around 5 seconds) !!
So it seems the "rke-network-plugin" deployment job clearly is broken with the "complete status" detection. If there's a maximum timeout configured, it needs to be increased, as that would allow the "rke up" to succeed without me first pre-pulling the container image manually on all the hosts.
I didn't take a look at the rke sources yet, but hopefully this is enough information for the first brave soul who wants to investigate and fix this issue :)
Some of the similar/related issues:
The pause image is the same across (almost) all Kubernetes versions, and is used as "parent" container for other containers, in this case CNI/network is the first deployed and it is pulled. How long did the pull take when you pulled it manually? And what are the logs from the kubelet? Is the pull actually slow or is there some issue with the kubelet? What node specifications are you using?
This sounds identical to https://github.com/rancher/rancher/issues/19713#issuecomment-488006346, where a reference to pulling the pause container solved the issue as well.
I'll investigate more and post the info here.
Another workaround: if I run "rke up" (it fails with the FATAL error as described above) and then re-run "rke up", it succeeds the second time, probably because the pause image is now already on the hosts (so it's pretty much the same scenario as when I pre-pull the image on the hosts before running rke up).
Something I noticed from the logs with a quick look:
when rke up fails with the FATAL error about rke-network-plugin deployment job, the nodes are still working on bringing the networking up.. only some 30-60 seconds AFTER rke has already failed the nodes get the log entry "Calico node started successfully".
Same for 0.2.7 and 0.3.0 - changing old cluster works (image already there). Adding new node or provisioning new cluster - ends with error. Second attempt is always successful.
Please specify the requested info in https://github.com/rancher/rke/issues/1652#issuecomment-533101626, or use the emoji reaction +1 if you just want to upvote the issue.
Just ran into the same issue repeatedly trying to deploy a 3 node rke (v0.3.0) cluster. Considering it reliably works by running the command again, I agree with the previous posters that this seems to be an issue with the way the timeout is set and/or completion of rke-network-plugin-deploy-job is determined. I never ran into this when deploying onto rancherOS nodes, which I guess would already have the image cached while ubuntu doesn't.
...
INFO[0172] Starting container [rke-log-cleaner] on host [NODE3], try #1
INFO[0172] Starting container [rke-log-cleaner] on host [NODE2], try #1
INFO[0172] Starting container [rke-log-cleaner] on host [NODE1], try #1
INFO[0172] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE2]
INFO[0172] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE3]
INFO[0172] Removing container [rke-log-cleaner] on host [NODE3], try #1
INFO[0172] Removing container [rke-log-cleaner] on host [NODE2], try #1
INFO[0172] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE1]
INFO[0173] Removing container [rke-log-cleaner] on host [NODE1], try #1
INFO[0173] [remove/rke-log-cleaner] Successfully removed container on host [NODE2]
INFO[0173] [remove/rke-log-cleaner] Successfully removed container on host [NODE3]
INFO[0173] [remove/rke-log-cleaner] Successfully removed container on host [NODE1]
INFO[0173] [sync] Syncing nodes Labels and Taints
INFO[0173] [sync] Successfully synced nodes Labels and Taints
INFO[0173] [network] Setting up network plugin: canal
INFO[0173] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0173] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0173] [addons] Executing deploy job rke-network-plugin
FATA[0203] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
> kubectl get nodes
NAME      STATUS   ROLES                      AGE    VERSION
[NODE3]   Ready    controlplane,etcd,worker   101s   v1.16.1
[NODE2]   Ready    controlplane,etcd,worker   81s    v1.16.1
[NODE1]   Ready    controlplane,etcd,worker   82s    v1.16.1
> kubectl get pods --all-namespaces
NAMESPACE     NAME                                  READY   STATUS      RESTARTS   AGE
kube-system   canal-7hcvd                           2/2     Running     0          47s
kube-system   canal-jr8qz                           2/2     Running     0          47s
kube-system   canal-xmk27                           1/2     Running     0          47s
kube-system   rke-network-plugin-deploy-job-pxm46   0/1     Completed   0          74s
> kubectl logs rke-network-plugin-deploy-job-pxm46 -n kube-system
configmap/canal-config created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/canal created
serviceaccount/canal created
rke up --config cluster.yml again...
INFO[0042] [cleanup] Successfully started [rke-log-cleaner] container on host [NODE3]
INFO[0042] Removing container [rke-log-cleaner] on host [NODE3], try #1
INFO[0042] [remove/rke-log-cleaner] Successfully removed container on host [NODE1]
INFO[0042] [remove/rke-log-cleaner] Successfully removed container on host [NODE2]
INFO[0042] [remove/rke-log-cleaner] Successfully removed container on host [NODE3]
INFO[0042] [sync] Syncing nodes Labels and Taints
INFO[0042] [sync] Successfully synced nodes Labels and Taints
INFO[0042] [network] Setting up network plugin: canal
INFO[0042] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0042] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0042] [addons] Executing deploy job rke-network-plugin
INFO[0042] [addons] Setting up coredns
INFO[0042] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0042] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0042] [addons] Executing deploy job rke-coredns-addon
INFO[0048] [addons] CoreDNS deployed successfully..
INFO[0048] [dns] DNS provider coredns deployed successfully
INFO[0048] [addons] Setting up Metrics Server
INFO[0048] [addons] Saving ConfigMap for addon rke-metrics-addon to Kubernetes
INFO[0048] [addons] Successfully saved ConfigMap for addon rke-metrics-addon to Kubernetes
INFO[0048] [addons] Executing deploy job rke-metrics-addon
INFO[0053] [addons] Metrics Server deployed successfully
INFO[0053] [ingress] Setting up nginx ingress controller
INFO[0053] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0053] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0053] [addons] Executing deploy job rke-ingress-controller
INFO[0058] [ingress] ingress controller nginx deployed successfully
INFO[0058] [addons] Setting up user addons
INFO[0058] [addons] no user addons defined
INFO[0058] Finished building Kubernetes cluster successfully
Node OS: Ubuntu 18.04.3 LTS
Node Docker Version: 18.09.7
nodes:
  - address: NODE1
    user: root
    role: [controlplane,worker,etcd]
    hostname_override: NODE1
  - address: NODE2
    user: root
    role: [controlplane,worker,etcd]
    hostname_override: NODE2
  - address: NODE3
    user: root
    role: [worker,controlplane,etcd]
    hostname_override: NODE3
cluster_name: test
ssh_agent_auth: true
kubernetes_version: v1.16.1-rancher1-1
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
  kubelet:
    extra_args:
      volume-plugin-dir: /var/lib/kubelet/volumeplugins
      resolv-conf: "/run/systemd/resolve/resolv.conf"
      feature-gates: "VolumeSnapshotDataSource=true, CSIDriverRegistry=true, PersistentLocalVolumes=true"
    extra_binds:
      - /var/lib/kubelet/volumeplugins:/var/lib/kubelet/volumeplugins
  kubeapi:
    extra_args:
      feature-gates: "VolumeSnapshotDataSource=true, CSIDriverRegistry=true, PersistentLocalVolumes=true"
related issue in rancher: rancher/rancher#19713
Yeah, re-running rke solved the problem. You saved my day.
Hmm, does rke "addon_job_timeout" affect rke-network-plugin deploy? I guess I'll try that..
following this as i am hitting the same issue with: rke version v1.0.2
Re-running it fixes the issue but this is not possible when using terraform-rke-provider: https://github.com/rancher/terraform-provider-rke/issues/152
I forgot to post the latest status here earlier, so here goes. I managed to work around this problem by configuring the rke "addon_job_timeout" to a larger value than the default 30 seconds. It seems that in my environment a 30-second addon job timeout is not long enough for the rke-network-plugin deploy job to finish successfully.
When I configure bigger rke "addon_job_timeout" I don't need to do the "pre-pull rancher/pause:3.1 image" hack either.
Oh, and I had the same issue with more recent versions as well (rke v0.2.10 and kubernetes v1.14.10), but as I said, configuring rke "addon_job_timeout" to a higher value seems to fix the problem for me.
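For anyone looking for where this goes: a minimal sketch of the setting in cluster.yml, assuming the top-level addon_job_timeout key with a value in seconds (default 30, as mentioned above). The value 120 is only an illustrative guess, not a number tested in this thread; pick something that comfortably covers the image pull time on your hosts.

```yaml
# cluster.yml (fragment)
# Illustrative value only: raises the addon deploy job timeout above the 30 s default,
# so the rke-network-plugin job has time to finish while the pause image is pulled.
addon_job_timeout: 120
```

The rest of cluster.yml stays unchanged; the new timeout is picked up on the next "rke up".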
I solved this by turning off selinux @yannleretaille
I had the same issue, and these two steps solved my problem:
1. Increasing addon_job_timeout
2. In my case, one of the nodes was in the DiskPressure state, so I had to fix that as well
I hit the same issue after I added the customer certificate to the kube-controller in cluster.yaml; there was no rancher/pause container running.
This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
this is still a valid issue, the default rke addon job timeout needs increasing!
Merging into https://github.com/rancher/rke/issues/2143