I have tried the workaround from rancher/rke#1295 and it didn't work for me; it produces the same error.
I'm not sure that #19189 is related, as this is a new cluster and not an upgrade of an existing one.
This is my first time using rke and setting up a k8s cluster, so please let me know if I'm missing something obvious or if you need more information from me!
RKE version:
rke version v0.2.4
Docker version: (docker version,docker info preferred)
Same for both node 1 and node 2.
Client:
 Version: 18.09.7
 API version: 1.39
 Go version: go1.10.8
 Git commit: 2d0083d
 Built: Thu Jun 27 17:56:23 2019
 OS/Arch: linux/amd64
 Experimental: false
Server: Docker Engine - Community
 Engine:
  Version: 18.09.7
  API version: 1.39 (minimum version 1.12)
  Go version: go1.10.8
  Git commit: 2d0083d
  Built: Thu Jun 27 17:23:02 2019
  OS/Arch: linux/amd64
  Experimental: false

Containers: 10
 Running: 7
 Paused: 0
 Stopped: 3
Images: 4
Server Version: 18.09.7
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-54-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.789GiB
Name: k8s-node1
ID: ZC43:K3I7:HP2S:LFUA:JXMF:EXV2:V7UJ:H7QN:27IJ:S3DC:6XYW:CS2P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Same for both node 1 and node 2.
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
4.15.0-54-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
VM in Hyper-V
cluster.yml file:
# If you intened to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 10.0.1.74
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  - worker
  hostname_override: k8s-node1
  user: k8s
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
- address: 10.0.1.75
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  - worker
  hostname_override: k8s-node2
  user: k8s
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 172.24.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 172.25.0.0/24
    service_cluster_ip_range: 172.24.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: k8s.adminarsenal.net
    infra_container_image: ""
    cluster_dns_server: 172.24.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: flannel
  options:
    flannel_backend_type: vxlan
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.3.10-rancher1
  alpine: rancher/rke-tools:v0.1.34
  nginx_proxy: rancher/rke-tools:v0.1.34
  cert_downloader: rancher/rke-tools:v0.1.34
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.34
  kubedns: rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.3.0
  coredns: rancher/coredns-coredns:1.3.1
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.3.0
  kubernetes: rancher/hyperkube:v1.14.3-rancher1
  flannel: rancher/coreos-flannel:v0.10.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher1
  calico_node: rancher/calico-node:v3.4.0
  calico_cni: rancher/calico-cni:v3.4.0
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.4.0
  canal_cni: rancher/calico-cni:v3.4.0
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.5.0
  weave_cni: weaveworks/weave-npc:2.5.0
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:0.21.0-rancher3
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.1
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 30
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
restore:
  restore: false
  snapshot_name: ""
dns: null
Steps to Reproduce:
Save config and then run ./rke up
Results:
INFO[0075] [remove/rke-log-cleaner] Successfully removed container on host [10.0.1.74]
INFO[0075] [remove/rke-log-cleaner] Successfully removed container on host [10.0.1.75]
INFO[0075] [sync] Syncing nodes Labels and Taints
INFO[0075] [sync] Successfully synced nodes Labels and Taints
INFO[0075] [network] Setting up network plugin: flannel
INFO[0075] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0075] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0075] [addons] Executing deploy job rke-network-plugin
FATA[0105] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
I'm having the same problem with rke + CentOS 7.6 VMs, running native docker 1.13.1 (selinux enabled).
rke v0.2.8 + kubernetes 1.13.10: works OK.
rke v0.2.8 + kubernetes 1.14.6: "rke up" fails with FATAL error "Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system". If I re-run "rke up", then rke finishes successfully and kubernetes cluster works OK.
I got the same error.
I'm using Ubuntu 19.10 on 5 hosts and rke v0.3.1.
I ran rke up with a plain yml file and everything went OK.
Then I changed the image to rancher/hyperkube:v1.16.2-rancher1 and ran rke up again.
Everything went OK :)
Another me too: I'm getting this when attempting to just run rke against Docker locally as a test. Re-running doesn't solve the issue, though; it never resolves or installs completely:
cluster_name: local
dns:
  provider: coredns
nodes:
- address: 127.0.0.1
  user: tessa
  role:
  - controlplane
  - etcd
  - worker
ssh_agent_auth: true
I have this problem too
After I run the rke up command I get an error like the one below:
FATA[0058] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
When I check the pods, _rke-network-plugin-deploy-job_ is still in _ContainerCreating_ status:
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                  READY   STATUS              RESTARTS   AGE
kube-system   rke-network-plugin-deploy-job-4482c   0/1     ContainerCreating   0          56m
On other clusters it takes about 5 minutes. Can anyone help me?
I am having the same problem.
[root@rancher01 ~]# docker ps -a | grep ebcb7b662f69
ebcb7b662f69 00405a225ef9 "kubectl apply -f ..." 28 minutes ago Exited (1) 28 minutes ago k8s_rke-network-plugin-pod_rke-network-plugin-deploy-job-8ndtx_kube-system_78c45434-bad1-4e2a-bdec-a769d8cf93fa_0
[root@rancher01 ~]# docker logs ebcb7b662f69 -f
error: the path "/etc/config/rke-network-plugin.yaml" cannot be accessed: stat /etc/config/rke-network-plugin.yaml: permission denied
It is related to https://github.com/rancher/rancher/issues/23662
Same here on Vagrant CentOS 1905.1 (hello, @lucky-sideburn).
Disabling SELinux is the key.
I ran into this issue and it was due to a node taking too long to become ready. I just had to wait until kubectl get nodes reported all nodes as Ready, and then run rke up --update-only to finish the cluster deployment.
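The wait-then-resume approach described above could be scripted roughly as follows. This is a minimal sketch rather than an official rke procedure: all_nodes_ready is a hypothetical helper name, the 10-second polling interval is arbitrary, and the helper assumes the default column layout of kubectl get nodes --no-headers (node name first, status second).

```sh
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and succeeds only if at least one node is listed and every node's STATUS
# column is exactly "Ready" ("NotReady" or "Ready,SchedulingDisabled" fail).
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { if (NR == 0 || bad) exit 1 }'
}

# Assumed usage: poll until every node is Ready, then let rke finish.
#   until kubectl get nodes --no-headers | all_nodes_ready; do sleep 10; done
#   rke up --update-only
```

Keeping the check as a filter on stdin means it can be tried against saved kubectl output before being pointed at a live cluster.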
For me, the issue seems to have been the low default value of rke's "addon_job_timeout" (the default is 30 seconds). I increased the value, and the rke network plugin deploy job started succeeding (https://github.com/rancher/rke/issues/1652).
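Raising the timeout means editing the addon_job_timeout key in cluster.yml and running rke up again. A minimal sketch of that edit, assuming the key already exists at the top level of the file (as it does in the generated config earlier in this thread); set_addon_job_timeout is a hypothetical helper name and 120 is only an example value:

```sh
# Hypothetical helper: rewrite the top-level addon_job_timeout key in an rke
# cluster.yml. Arguments: path to the file, new timeout in seconds.
# If the key is not present, the file content is left unchanged.
set_addon_job_timeout() {
  file=$1
  seconds=$2
  # Write to a temporary file first so a failed sed leaves the original intact.
  sed "s/^addon_job_timeout:.*/addon_job_timeout: ${seconds}/" "$file" > "${file}.tmp" &&
    mv "${file}.tmp" "$file"
}

# Assumed usage (120 is an arbitrary example, not a recommended value):
#   set_addon_job_timeout cluster.yml 120
#   rke up
```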
I had the same issue, and these two steps solved my problem:
1. Increase addon_job_timeout.
2. Check the nodes for disk pressure. In my case, one of the nodes had DiskPressure state.
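To see whether any node is reporting disk pressure, the node conditions can be queried with kubectl. The following is a sketch under assumptions: nodes_with_disk_pressure is a hypothetical helper name, and it expects one "name<TAB>status" line per node, which the jsonpath template in the usage comment is meant to produce.

```sh
# Hypothetical helper: reads "name<TAB>status" lines on stdin and prints the
# names of nodes whose DiskPressure condition status is "True".
nodes_with_disk_pressure() {
  awk -F '\t' '$2 == "True" { print $1 }'
}

# Assumed usage: the jsonpath template emits one line per node.
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
#     | nodes_with_disk_pressure
```

An empty result means no node currently reports DiskPressure; it does not rule out other reasons for the deploy job timing out.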
I'm hitting this too. The rke-network-plugin-deploy-job job never completes and doesn't give any logs. The nodes are all NotReady. No pods are up. I set addon_job_timeout to 180 and my nodes have 97% free space (around 190GB free).
RKE v1.1.3
kubectl v1.18.3
cluster.yml is using:
Watch out for SELinux or firewalls between the kubelet (port 10250, if I'm not mistaken) and the apiserver (6443).
I've disabled firewalls and AppArmor on Ubuntu 18 and still can't get the CNI job to complete; the nodes stay NotReady. Also, why are there no logs???
kubectl logs -l rke-network-plugin-deploy-job -n kube-system
Looking in docker logs for now.
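One way to locate the deploy job's pod and ask for its logs, sketched under assumptions: the Kubernetes Job controller labels the pods it creates with job-name=<job name>, so that label can serve as the selector. deploy_job_pods is a hypothetical helper name, and it assumes the default column layout of kubectl get pods --no-headers for a single namespace (name, ready, status).

```sh
# Hypothetical helper: reads `kubectl get pods -n kube-system --no-headers`
# output on stdin and prints "pod-name status" for the network plugin deploy job.
deploy_job_pods() {
  awk '$1 ~ /^rke-network-plugin-deploy-job-/ { print $1, $3 }'
}

# Assumed usage:
#   kubectl get pods -n kube-system --no-headers | deploy_job_pods
#   kubectl logs -n kube-system -l job-name=rke-network-plugin-deploy-job
#   kubectl describe pods -n kube-system -l job-name=rke-network-plugin-deploy-job
```

If the pod never left ContainerCreating there may be no container logs at all; in that case the events printed by kubectl describe are usually the more useful output.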
Same issue, logs say:
$ kubectl -nkube-system logs pod/rke-network-plugin-deploy-job-6bn62
Error from server: no preferred addresses found; known addresses: []
As @aijanai said, there can be firewall problems, so I opened these two ports and it fixed the issue:
sudo ufw allow 6443
sudo ufw allow 10250