Rke: Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

Created on 9 Jul 2019 · 15 Comments · Source: rancher/rke

I tried the workaround from rancher/rke#1295 and it didn't work for me; it produces the same error.

I'm not sure #19189 is related, since this is a new cluster and not an upgrade of an existing one.

This is my first time using rke and setting up a k8s cluster, so please let me know if I'm missing something obvious or if you need more information from me!

RKE version:
rke version v0.2.4

Docker version: (docker version,docker info preferred)

Same for both node 1 and node 2.

Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:23:02 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Containers: 10
 Running: 7
 Paused: 0
 Stopped: 3
Images: 4
Server Version: 18.09.7
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-54-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.789GiB
Name: k8s-node1
ID: ZC43:K3I7:HP2S:LFUA:JXMF:EXV2:V7UJ:H7QN:27IJ:S3DC:6XYW:CS2P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

Same for both node 1 and node 2.

NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

4.15.0-54-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

VM in Hyper-V

cluster.yml file:

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 10.0.1.74
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  - worker
  hostname_override: k8s-node1
  user: k8s
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
- address: 10.0.1.75
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  - worker
  hostname_override: k8s-node2
  user: k8s
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 172.24.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 172.25.0.0/24
    service_cluster_ip_range: 172.24.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: k8s.adminarsenal.net
    infra_container_image: ""
    cluster_dns_server: 172.24.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: flannel
  options:
    flannel_backend_type: vxlan
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.3.10-rancher1
  alpine: rancher/rke-tools:v0.1.34
  nginx_proxy: rancher/rke-tools:v0.1.34
  cert_downloader: rancher/rke-tools:v0.1.34
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.34
  kubedns: rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.3.0
  coredns: rancher/coredns-coredns:1.3.1
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.3.0
  kubernetes: rancher/hyperkube:v1.14.3-rancher1
  flannel: rancher/coreos-flannel:v0.10.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher1
  calico_node: rancher/calico-node:v3.4.0
  calico_cni: rancher/calico-cni:v3.4.0
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.4.0
  canal_cni: rancher/calico-cni:v3.4.0
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.5.0
  weave_cni: weaveworks/weave-npc:2.5.0
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:0.21.0-rancher3
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.1
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 30
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

Steps to Reproduce:

Save config and then run ./rke up

Results:

INFO[0075] [remove/rke-log-cleaner] Successfully removed container on host [10.0.1.74]
INFO[0075] [remove/rke-log-cleaner] Successfully removed container on host [10.0.1.75]
INFO[0075] [sync] Syncing nodes Labels and Taints
INFO[0075] [sync] Successfully synced nodes Labels and Taints
INFO[0075] [network] Setting up network plugin: flannel
INFO[0075] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0075] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0075] [addons] Executing deploy job rke-network-plugin
FATA[0105] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

status/stale

PDQDakota

👍1

All 15 comments

I'm having the same problem with rke + CentOS 7.6 VMs, running native docker 1.13.1 (selinux enabled).

rke v0.2.8 + kubernetes 1.13.10: works OK.
rke v0.2.8 + kubernetes 1.14.6: "rke up" fails with FATAL error "Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system". If I re-run "rke up", then rke finishes successfully and kubernetes cluster works OK.

pasikarkkainen on 18 Sep 2019

I got the same error.
I'm using Ubuntu 19.10 on 5 hosts with rke v0.3.1.
I ran rke up with a plain yml file and everything went OK.
Then I changed the image to rancher/hyperkube:v1.16.2-rancher1 and ran rke up again.
Everything went OK :)

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 192.168.1.120
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: bjorn
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.1.122
  port: "22"
  internal_address: ""
  role:
  - worker
  hostname_override: ""
  user: bjorn
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.1.123
  port: "22"
  internal_address: ""
  role:
  - worker
  hostname_override: ""
  user: bjorn
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.1.124
  port: "22"
  internal_address: ""
  role:
  - worker
  hostname_override: ""
  user: bjorn
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.1.125
  port: "22"
  internal_address: ""
  role:
  - worker
  hostname_override: ""
  user: bjorn
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: weave
  options: {}
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: ""
  alpine: ""
  nginx_proxy: ""
  cert_downloader: ""
  kubernetes_services_sidecar: ""
  kubedns: ""
  dnsmasq: ""
  kubedns_sidecar: ""
  kubedns_autoscaler: ""
  coredns: ""
  coredns_autoscaler: ""
  kubernetes: "rancher/hyperkube:v1.16.2-rancher1"
  flannel: ""
  flannel_cni: ""
  calico_node: ""
  calico_cni: ""
  calico_controllers: ""
  calico_ctl: ""
  calico_flexvol: ""
  canal_node: ""
  canal_cni: ""
  canal_flannel: ""
  canal_flexvol: ""
  weave_node: ""
  weave_cni: ""
  pod_infra_container: ""
  ingress: ""
  ingress_backend: ""
  metrics_server: ""
  windows_pod_infra_container: ""
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

bjornjorgensen on 21 Oct 2019

Another me too: I'm getting this when trying to run rke against a local Docker as a test. Re-running doesn't solve the issue, though; it never resolves or installs completely:

cluster_name: local
dns:
  provider: coredns
nodes:
  - address: 127.0.0.1
    user: tessa
    role:
      - controlplane
      - etcd
      - worker
ssh_agent_auth: true

nergdron on 11 Feb 2020

I have this problem too

After I run the rke up command, I get the error below:
FATA[0058] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

When I check the pods, rke-network-plugin-deploy-job is still in ContainerCreating status:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                  READY   STATUS              RESTARTS   AGE
kube-system   rke-network-plugin-deploy-job-4482c   0/1     ContainerCreating   0          56m

In other cases it takes about 5 minutes. Can anyone help me?

burakkiymaz on 12 Feb 2020

I am having the same problem.
[root@rancher01 ~]# docker ps -a | grep ebcb7b662f69
ebcb7b662f69 00405a225ef9 "kubectl apply -f ..." 28 minutes ago Exited (1) 28 minutes ago k8s_rke-network-plugin-pod_rke-network-plugin-deploy-job-8ndtx_kube-system_78c45434-bad1-4e2a-bdec-a769d8cf93fa_0

[root@rancher01 ~]# docker logs ebcb7b662f69 -f
error: the path "/etc/config/rke-network-plugin.yaml" cannot be accessed: stat /etc/config/rke-network-plugin.yaml: permission denied

lucky-sideburn on 20 Feb 2020

Same here on Vagrant CentOS 1905.1 (hello, @lucky-sideburn).
Disabling SELinux is the key.
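
For anyone who wants to test whether SELinux is the culprit before disabling it outright, here is a minimal sketch. It assumes the standard getenforce and setenforce tools are installed and that you have sudo rights; the function name selinux_permissive is made up for this example. Note that setenforce 0 only switches to permissive mode until the next reboot.

```shell
# Switch SELinux to permissive mode if it is currently enforcing,
# then print the resulting mode.
selinux_permissive() {
    # getenforce prints Enforcing, Permissive, or Disabled.
    mode=$(getenforce) || return 1
    if [ "$mode" = "Enforcing" ]; then
        # Temporary change: this does not survive a reboot.
        sudo setenforce 0 || return 1
    fi
    getenforce
}
```

Run `selinux_permissive` on each node and then re-run `rke up`. If that turns out to help, the persistent setting lives in /etc/selinux/config.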

aijanai on 23 Mar 2020

😄1 👍1

I ran into this issue and it was due to a node taking too long to become ready. I just had to wait until kubectl get nodes reported all as ready, and then run rke up --update-only to finish the cluster deployment.
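
The wait-then-finish approach described here could be scripted roughly as follows. This is only a sketch: the function name finish_cluster_deploy and the 600s timeout are arbitrary choices, and it assumes kubectl is pointed at the kubeconfig that rke wrote (kube_config_cluster.yml by default).

```shell
# Wait until every node reports the Ready condition, then let RKE
# finish deploying the cluster.
finish_cluster_deploy() {
    # kubectl wait exits non-zero if the timeout expires first.
    # The 600s timeout is an arbitrary value for this sketch.
    kubectl wait --for=condition=Ready nodes --all --timeout=600s || return 1
    rke up --update-only
}
```

Call `finish_cluster_deploy` from the directory that holds cluster.yml. If the nodes never become Ready, the function returns without touching the cluster.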

nevinsm on 15 Apr 2020

For me, the issue seems to have been the low default value of RKE's "addon_job_timeout" (the default is 30 seconds). After I increased the value, the rke network plugin deploy job started succeeding (https://github.com/rancher/rke/issues/1652).
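
A minimal sketch of raising the timeout in cluster.yml from the shell, for those who prefer not to edit the file by hand. The helper name set_addon_job_timeout and the 120-second value used below are illustrative assumptions, not recommended settings.

```shell
# Print the given cluster.yml with addon_job_timeout set to a new value.
# $1: path to cluster.yml, $2: timeout in seconds
set_addon_job_timeout() {
    awk -v t="$2" '
        # Replace an existing top-level addon_job_timeout line.
        /^addon_job_timeout:/ { print "addon_job_timeout: " t; found = 1; next }
        { print }
        # Append the key if the file did not contain it.
        END { if (!found) print "addon_job_timeout: " t }
    ' "$1"
}
```

For example, `set_addon_job_timeout cluster.yml 120 > cluster.yml.new && mv cluster.yml.new cluster.yml`, followed by `rke up`. The function writes to stdout rather than editing in place so the original file stays intact until you move the new one over it.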

pasikarkkainen on 24 Apr 2020

I had the same issue, and these two steps solved my problem:

1. Increase addon_job_timeout
2. Check node free space (at least 15%)

In my case, one of the nodes was in the DiskPressure state.
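
To spot nodes that are short on disk, something like the following sketch may help. It parses the POSIX `df -P` output format; the function name over_threshold and the 85% cutoff (that is, less than 15% free) are assumptions based on the numbers in this comment, and mount points containing spaces are not handled.

```shell
# Print mount points whose usage is above a percentage threshold.
# Reads `df -P` output on stdin; $1 is the maximum allowed usage in percent.
over_threshold() {
    awk -v max="$1" '
        # Skip the header row; column 5 is the capacity (e.g. "90%"),
        # column 6 is the mount point.
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 > max) print $6 " " $5
        }
    '
}
```

On each node, run `df -P | over_threshold 85`. From a machine with kubectl access, `kubectl describe nodes | grep -E '^Name:|DiskPressure'` should show the DiskPressure condition for every node.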

AliMD on 16 May 2020

👍2

I'm hitting this too. The rke-network-plugin-deploy-job job never completes and doesn't give any logs. The nodes are all NotReady. No pods are up. I set addon_job_timeout to 180 and my nodes have 97% free space (around 190GB free).

RKE v1.1.3
kubectl v1.18.3
cluster.yml is using:

kubernetes_version: v1.18.3-rancher2-2
calico plugin

joereyna on 16 Jul 2020

Watch out for SELinux or firewalls between the kubelet (port 10250, if I'm not mistaken) and the apiserver (port 6443).
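
A quick way to check those two ports from another node, as a rough sketch. It assumes a netcat build that supports -z and -w (the OpenBSD netcat shipped with Ubuntu does); the function name check_ports is invented for this example.

```shell
# Report whether each TCP port on a host accepts connections.
# $1: host, remaining arguments: ports to probe
check_ports() {
    host=$1
    shift
    for port in "$@"; do
        # -z: connect without sending data, -w 3: give up after 3 seconds.
        if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port reachable"
        else
            echo "$host:$port blocked or closed"
        fi
    done
}
```

For instance, `check_ports 10.0.1.74 6443 10250`, using a node address from the original report. A "blocked or closed" result does not say whether a firewall or a stopped service is to blame, only that the connection failed.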

aijanai on 16 Jul 2020

I've disabled firewalls and apparmor on Ubuntu 18 and still can't get CNI job to complete. Nodes are NotReady and CNI job won't complete. Also, why no logs???

kubectl logs -l rke-network-plugin-deploy-job -n kube-system

Looking in docker logs for now.
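
One possible reason the kubectl logs command above prints nothing: -l takes a label selector, and a bare name only matches pods that carry a label key of that name. Pods created by a Job are labelled job-name=<job name> instead. A pod that never got past ContainerCreating also has no logs yet, so describe is often more telling. A rough sketch (the function name addon_job_debug is made up for this example):

```shell
# Show status, events, and logs for the pods created by an RKE addon job.
# $1: job name, e.g. rke-network-plugin-deploy-job
addon_job_debug() {
    # Pods created by a Job carry the label job-name=<job>.
    kubectl -n kube-system get pods -l "job-name=$1" -o wide
    # describe lists events (image pulls, mount errors) even when the
    # container never started and therefore produced no logs.
    kubectl -n kube-system describe pods -l "job-name=$1"
    kubectl -n kube-system logs -l "job-name=$1" --tail=50
}
```

Usage: `addon_job_debug rke-network-plugin-deploy-job`.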

joereyna on 16 Jul 2020

Same issue, logs say:
$ kubectl -nkube-system logs pod/rke-network-plugin-deploy-job-6bn62

Error from server: no preferred addresses found; known addresses: []

ilya310300 on 16 Aug 2020

As @aijanai said, there can be firewall problems, so I allowed these two ports and that fixed the issue:

sudo ufw allow 6443
sudo ufw allow 10250

ilya310300 on 16 Aug 2020

🎉1

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.