Rke: Job rke-network-plugin-deploy-job never completes (virtualbox)

Created on 27 Mar 2019 · 13 Comments · Source: rancher/rke

RKE version: v0.2.0

Docker version: (docker version, docker info preferred)

Client:
 Version:           18.09.3
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        774a1f4
 Built:             Thu Feb 28 06:53:11 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.3
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       774a1f4
  Built:            Thu Feb 28 05:59:55 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Containers: 20
 Running: 7
 Paused: 0
 Stopped: 13
Images: 4
Server Version: 18.09.3
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-46-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 985.5MiB
Name: kanuahs
ID: 5EFK:2KX7:R64P:YT56:WCYV:653P:AFWT:TAS4:PMGA:YCOR:3FPX:4D2N
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support



Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

4.15.0-46-generic



**Steps to Reproduce:**

1. Create a fresh Ubuntu 18.04 server VirtualBox VM using [this ISO](http://cdimage.ubuntu.com/ubuntu/releases/18.04/release/ubuntu-18.04.2-server-amd64.iso), install Docker and etcd, and generate SSH keys.
2. Run rke up.


**Results:**



```

kubectl describe jobs -n kube-system
Name: rke-network-plugin-deploy-job
Namespace: kube-system
Selector: controller-uid=9a9d01ed-5069-11e9-b158-080027e84c2b
Labels: controller-uid=9a9d01ed-5069-11e9-b158-080027e84c2b
job-name=rke-network-plugin-deploy-job
Annotations:
Parallelism: 1
Completions: 1
Start Time: Wed, 27 Mar 2019 13:53:30 +0530
Pods Statuses: 1 Running / 0 Succeeded / 4 Failed
Pod Template:
Labels: controller-uid=9a9d01ed-5069-11e9-b158-080027e84c2b
job-name=rke-network-plugin-deploy-job
Service Account: rke-job-deployer
Containers:
rke-network-plugin-pod:
Image: rancher/hyperkube:v1.13.4-rancher1
Port:
Host Port:
Command:
kubectl
apply
-f
/etc/config/rke-network-plugin.yaml
Environment:
Mounts:
/etc/config from config-volume (rw)
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rke-network-plugin
Optional: false
Events:
Type    Reason            Age   From            Message
----    ------            ----  ----            -------
Normal  SuccessfulCreate  2m2s  job-controller  Created pod: rke-network-plugin-deploy-job-pk298
Normal  SuccessfulCreate  102s  job-controller  Created pod: rke-network-plugin-deploy-job-n8wx2
Normal  SuccessfulCreate  71s   job-controller  Created pod: rke-network-plugin-deploy-job-f6d9h
Normal  SuccessfulCreate  50s   job-controller  Created pod: rke-network-plugin-deploy-job-6bcqv
Normal  SuccessfulCreate  10s   job-controller  Created pod: rke-network-plugin-deploy-job-7kzkg

kubectl get pods -n kube-system
NAME                                  READY   STATUS   RESTARTS   AGE
rke-network-plugin-deploy-job-6bcqv   0/1     Error    0          84s
rke-network-plugin-deploy-job-7kzkg   0/1     Error    0          44s
rke-network-plugin-deploy-job-f6d9h   0/1     Error    0          105s
rke-network-plugin-deploy-job-n8wx2   0/1     Error    0          2m16s
rke-network-plugin-deploy-job-pk298   0/1     Error    0          2m37s

kubectl logs -n kube-system rke-network-plugin-deploy-job-6bcqv

...
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: connection refused
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: connection refused
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: connection refused
```

Additional Info:

The rke binary is inside the VM. I'm trying to create a single-node cluster from inside the VM, with the VM itself as the node.

rke up --local (with no config file) causes the same problem

To Triage kind/bug priority/1


kanuahs


All 13 comments

I'm having the exact same problem and am not sure what to do next. My output is as follows.

➜ rancher git:(master) ✗ k --kubeconfig kube_config_cluster.yml get pod --all-namespaces
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   rke-network-plugin-deploy-job-48ms4   0/1     Pending   0          19s

trankchung on 30 Mar 2019

We hit the same issue with v0.2.1. Since this is how we verify rke/k8s releases in CI/CD, it is preventing us from upgrading.

mtparet on 2 Apr 2019

As a workaround, I used the hostname of the VM instead of 127.0.0.1, and it worked on Ubuntu 16.04 and 18.04. It seems to be related to pods not being able to reach the API server.

kanuahs on 2 Apr 2019

Using Ubuntu or RancherOS it works fine but problem with CentOS 7x.

trankchung on 2 Apr 2019

👍4

We have the same problem here.

camilo-schoeningh-sociomantic on 23 May 2019

I changed the address in cluster.yml to the server's IP (192.168.1.10 in my case) instead of 127.0.0.1, and it started working for me.
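
A minimal sketch of what that change might look like for a single-node cluster. Only the address 192.168.1.10 comes from the comment above; the user name, the role list, and the scratch output path are assumptions for illustration, and the helper name write_cluster_yml is hypothetical.

```shell
# Sketch: write a single-node cluster.yml that uses the host's routable IP
# (or hostname) instead of 127.0.0.1. User, roles, and path are assumptions.
write_cluster_yml() {
  node_address="$1"   # e.g. 192.168.1.10, or the VM's hostname
  out_file="$2"       # where to write the generated file
  cat > "$out_file" <<EOF
nodes:
  - address: ${node_address}
    user: ubuntu
    role: [controlplane, worker, etcd]
EOF
}

# Write to a scratch file first and review it before using it with rke up.
write_cluster_yml 192.168.1.10 "${TMPDIR:-/tmp}/cluster.yml.sketch"
```

The point is only that the node address should be one that pods can actually reach, which a loopback address is not.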

dahlo on 12 Jun 2019

I'm getting the same kind of issue, also in VirtualBox. My setup is:

3 nodes with RancherOS
All nodes are worker and etcd; two of them are also controllers
rke is installed on a 4th node, and cluster.yaml refers to the three k8s nodes by their IP addresses

rke v0.2.4, installing kubernetes 1.13.5

chtardif on 21 Jun 2019

As mentioned by @trankchung, CentOS/RHEL doesn't work properly; I have to run the pipeline twice to get a successful job.

...
[info] [sync] Syncing nodes Labels and Taints
[info] [sync] Successfully synced nodes Labels and Taints
[info] [network] Setting up network plugin: calico
[info] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
[info] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
[info] [addons] Executing deploy job rke-network-plugin

Error: 
Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

ulm0 on 26 Aug 2019

I had a similar issue. The rke-network-plugin was never deployed. My nodes had only about 25% free disk space left. Once I deleted old data and had about 75% free disk space on all 3 of my nodes, the rke-network-plugin was deployed successfully.

KetoStheno on 29 Oct 2019

For me, the issue seems to have been that the default value of the rke "addon_job_timeout" setting is too low (the default is 30 seconds). I increased the value, and the rke network plugin deploy job started succeeding (https://github.com/rancher/rke/issues/1652).
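
A rough sketch of how the timeout could be raised in an existing cluster.yml, assuming addon_job_timeout is set as a top-level key with a value in seconds. The helper name set_addon_job_timeout and the value 120 are illustrative assumptions, not values from this thread.

```shell
# Sketch: set addon_job_timeout (in seconds) in a cluster.yml-style file.
# Replaces an existing top-level entry, or appends one if none is present.
set_addon_job_timeout() {
  file="$1"
  seconds="$2"
  if grep -q '^addon_job_timeout:' "$file"; then
    # Rewrite the existing line via a temp file (portable; avoids sed -i).
    sed "s/^addon_job_timeout:.*/addon_job_timeout: ${seconds}/" "$file" > "$file.tmp" &&
      mv "$file.tmp" "$file"
  else
    printf 'addon_job_timeout: %s\n' "$seconds" >> "$file"
  fi
}

# Example (120 is an arbitrary value chosen for illustration):
#   set_addon_job_timeout cluster.yml 120
```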

pasikarkkainen on 24 Apr 2020

I had the same issue, and these two steps solved my problem:

Increase addon_job_timeout
Check node free space (at least 15%)

In my case, one of the nodes was in the DiskPressure state.
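
The free-space check can be sketched as a small shell helper to run on each node. The 15% threshold is the figure from the list above, and /var/lib/docker is the Docker root directory shown in the docker info output of the original report; the function names are hypothetical, and the exact threshold at which a node reports DiskPressure depends on the kubelet configuration.

```shell
# Sketch: print the percentage of free space on the filesystem holding a path.
free_space_pct() {
  # df -P prints POSIX-format output; column 5 is the used percentage (e.g. 42%).
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Warn (and return non-zero) if a path has less free space than the threshold.
check_free_space() {
  path="$1"
  min_free="$2"   # minimum acceptable free space, in percent
  free="$(free_space_pct "$path")"
  if [ "$free" -lt "$min_free" ]; then
    echo "WARNING: $path has only ${free}% free (wanted at least ${min_free}%)"
    return 1
  fi
  echo "OK: $path has ${free}% free"
}

# Example, run on each node:
#   check_free_space /var/lib/docker 15
```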

AliMD on 16 May 2020

In my case, this answer ~is the workaround~ made some progress.

Setting addon_job_timeout to a long time didn't help, it just keeps failing the whole time. kubectl describe node shows that DiskPressure is False as well. I had to run docker network create --driver=bridge --subnet=10.43.0.0/16 br0_rke before rke up.

Note that a comment mentioned this would mean that RKE is broken.

Quick update on my situation:

logs of coredns and calico-kube-controller pods imply that there still seems to be no route to 10.43.0.1 for internal traffic across pods...

E0528 02:15:57.160530       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: no route to host
E0528 02:15:57.160530       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: no route to host
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-799dffd9c4-lndpq.unknownuser.log.ERROR.20200528-021557.1: no such file or directory

2020-05-28 02:21:10.062 [ERROR][1] client.go 238: Error getting cluster information config ClusterInformation="default" error=Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: connect: no route to host
2020-05-28 02:21:10.062 [FATAL][1] main.go 117: Failed to initialize Calico datastore error=Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: connect: no route to host

alfieyfc on 28 May 2020

You should not need the br0_rke network — all this is doing is applying a bandaid fix over your issue.

A more “suitable” workaround for this is to create a default route on the host network namespace — even if it doesn’t route anywhere due to you being airgapped, the iptables rules that are used for service IP resolution with rke will then start to work, and thus you’ll be able to route to 10.43.0.0/16
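
As a rough sketch of that suggestion, the helper below checks whether a Linux host already has an IPv4 default route by reading /proc/net/route, where a default route appears as a row whose destination and mask are both 00000000. The function name has_default_route is hypothetical, and the gateway 192.168.56.1 and interface eth0 in the printed hint are placeholders, not values from this thread.

```shell
# Sketch: succeed if a routing table dump in /proc/net/route format contains
# an IPv4 default route (destination and mask both 00000000).
has_default_route() {
  route_file="${1:-/proc/net/route}"
  [ -r "$route_file" ] || return 2
  awk 'NR > 1 && $2 == "00000000" && $8 == "00000000" { found = 1 }
       END { exit !found }' "$route_file"
}

if has_default_route; then
  echo "default route present"
else
  echo "no default route found; one could be added (as root) with something like:"
  # 192.168.56.1 and eth0 are placeholders for your own gateway and interface.
  echo "  ip route add default via 192.168.56.1 dev eth0"
fi
```

The script only prints the ip route command rather than running it, since the right gateway and interface depend on the host.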