Version:
k3s version v1.0.0 (18bd921c)
Describe the bug
After installing k3s everything worked perfectly until I rebooted the whole cluster; now all nodes are in NotReady state and I can't find the reason why it's happening.
To Reproduce
After getting 2 Raspberry Pi 4 boards with 4 GB of RAM and 32 GB SD cards, I added cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory to /boot/cmdline.txt and installed k3s.
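For reference, a minimal sketch of that edit (assuming the stock Raspbian /boot/cmdline.txt): the file must stay a single line, so the parameters are appended to the existing line rather than added as a new one.
sudo sed -i '1 s/$/ cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory/' /boot/cmdline.txt
sudo reboot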
Expected behavior
I expected the nodes to come back healthy (Ready) after the reboot.
Actual behavior
All nodes stay NotReady after the reboot.
Additional context
k3s check-config:
Verifying binaries in /var/lib/rancher/k3s/data/93417efda3f1bfb0977d22d68559e0cccf71afecdf7dfc6b2df045c00421d7fa/bin:
- sha256sum: good
- links: good
System:
- /usr/sbin/iptables v1.8.2 (nf_tables): should be older than v1.8.0 or in legacy mode (fail)
- swap: disabled
- routes: ok
Limits:
- /proc/sys/kernel/keys/root_maxkeys: 1000000
info: reading kernel config from /proc/config.gz ...
Generally Necessary:
- cgroup hierarchy: properly mounted [/sys/fs/cgroup]
- CONFIG_NAMESPACES: enabled
- CONFIG_NET_NS: enabled
- CONFIG_PID_NS: enabled
- CONFIG_IPC_NS: enabled
- CONFIG_UTS_NS: enabled
- CONFIG_CGROUPS: enabled
- CONFIG_CGROUP_CPUACCT: enabled
- CONFIG_CGROUP_DEVICE: enabled
- CONFIG_CGROUP_FREEZER: enabled
- CONFIG_CGROUP_SCHED: enabled
- CONFIG_CPUSETS: enabled
- CONFIG_MEMCG: enabled
- CONFIG_KEYS: enabled
- CONFIG_VETH: enabled (as module)
- CONFIG_BRIDGE: enabled (as module)
- CONFIG_BRIDGE_NETFILTER: enabled (as module)
- CONFIG_NF_NAT_IPV4: enabled (as module)
- CONFIG_IP_NF_FILTER: enabled (as module)
- CONFIG_IP_NF_TARGET_MASQUERADE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_IPVS: enabled (as module)
- CONFIG_IP_NF_NAT: enabled (as module)
- CONFIG_NF_NAT: enabled (as module)
- CONFIG_NF_NAT_NEEDED: enabled
- CONFIG_POSIX_MQUEUE: enabled
Optional Features:
- CONFIG_USER_NS: enabled
- CONFIG_SECCOMP: enabled
- CONFIG_CGROUP_PIDS: enabled
- CONFIG_BLK_CGROUP: enabled
- CONFIG_BLK_DEV_THROTTLING: enabled
- CONFIG_CGROUP_PERF: missing
- CONFIG_CGROUP_HUGETLB: missing
- CONFIG_NET_CLS_CGROUP: enabled (as module)
- CONFIG_CGROUP_NET_PRIO: missing
- CONFIG_CFS_BANDWIDTH: missing
- CONFIG_FAIR_GROUP_SCHED: enabled
- CONFIG_RT_GROUP_SCHED: missing
- CONFIG_IP_NF_TARGET_REDIRECT: enabled (as module)
- CONFIG_IP_VS: enabled (as module)
- CONFIG_IP_VS_NFCT: enabled
- CONFIG_IP_VS_PROTO_TCP: enabled
- CONFIG_IP_VS_PROTO_UDP: enabled
- CONFIG_IP_VS_RR: enabled (as module)
- CONFIG_EXT4_FS: enabled
- CONFIG_EXT4_FS_POSIX_ACL: enabled
- CONFIG_EXT4_FS_SECURITY: enabled
- Network Drivers:
- "overlay":
- CONFIG_VXLAN: enabled (as module)
Optional (for encrypted networks):
- CONFIG_CRYPTO: enabled
- CONFIG_CRYPTO_AEAD: enabled (as module)
- CONFIG_CRYPTO_GCM: enabled (as module)
- CONFIG_CRYPTO_SEQIV: enabled (as module)
- CONFIG_CRYPTO_GHASH: enabled (as module)
- CONFIG_XFRM: enabled
- CONFIG_XFRM_USER: enabled
- CONFIG_XFRM_ALGO: enabled
- CONFIG_INET_ESP: enabled (as module)
- CONFIG_INET_XFRM_MODE_TRANSPORT: enabled (as module)
- Storage Drivers:
- "overlay":
- CONFIG_OVERLAY_FS: enabled (as module)
STATUS: 1 (fail)
sudo kubectl get nodes -o wide:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master NotReady master 23h v1.16.3-k3s.2 192.168.0.201 <none> Raspbian GNU/Linux 10 (buster) 4.19.75-v7l+ containerd://1.3.0-k3s.4
worker1 NotReady node 19h v1.16.3-k3s.2 192.168.0.202 <none> Raspbian GNU/Linux 10 (buster) 4.19.75-v7l+ containerd://1.3.0-k3s.4
sudo kubectl describe node master:
Name: master
Roles: master
Labels: beta.kubernetes.io/arch=arm
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
k3s.io/hostname=master
k3s.io/internal-ip=192.168.0.201
kubernetes.io/arch=arm
kubernetes.io/hostname=master
kubernetes.io/os=linux
kubernetes.io/role=master
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"be:1c:4b:13:c6:4b"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.0.201
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 11 Dec 2019 18:40:42 +0000
Taints: node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 11 Dec 2019 23:04:18 +0000 Wed, 11 Dec 2019 23:04:18 +0000 FlannelIsUp Flannel is running on this node
MemoryPressure Unknown Wed, 11 Dec 2019 23:08:28 +0000 Thu, 12 Dec 2019 17:33:57 +0000 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 11 Dec 2019 23:08:28 +0000 Thu, 12 Dec 2019 17:33:57 +0000 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 11 Dec 2019 23:08:28 +0000 Thu, 12 Dec 2019 17:33:57 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 11 Dec 2019 23:08:28 +0000 Thu, 12 Dec 2019 17:33:57 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.0.201
Hostname: master
Capacity:
cpu: 4
ephemeral-storage: 29567140Ki
memory: 3999784Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 28762913770
memory: 3999784Ki
pods: 110
System Info:
Machine ID: 7b2fdbf071984c60abe0ba09b3b020e9
System UUID: 7b2fdbf071984c60abe0ba09b3b020e9
Boot ID: fbd64ddb-699a-4062-8356-f8127a5aeea3
Kernel Version: 4.19.75-v7l+
OS Image: Raspbian GNU/Linux 10 (buster)
Operating System: linux
Architecture: arm
Container Runtime Version: containerd://1.3.0-k3s.4
Kubelet Version: v1.16.3-k3s.2
Kube-Proxy Version: v1.16.3-k3s.2
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://master
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system svclb-traefik-rmrjg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23h
kube-system metrics-server-6d684c7b5-kkx98 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23h
kube-system local-path-provisioner-58fb86bdfd-52llb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23h
kube-system coredns-d798c9dd-vnkkr 100m (2%) 0 (0%) 70Mi (1%) 170Mi (4%) 23h
kube-system traefik-65bccdc4bd-fpnf8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 100m (2%) 0 (0%)
memory 70Mi (1%) 170Mi (4%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 23h kubelet, master Starting kubelet.
Warning InvalidDiskCapacity 23h kubelet, master invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 23h kubelet, master Node master status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 23h kubelet, master Node master status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 23h kubelet, master Node master status is now: NodeHasSufficientPID
Normal Starting 23h kube-proxy, master Starting kube-proxy.
Normal NodeAllocatableEnforced 23h kubelet, master Updated Node Allocatable limit across pods
Normal NodeReady 23h kubelet, master Node master status is now: NodeReady
Normal Starting 19h kubelet, master Starting kubelet.
Warning InvalidDiskCapacity 19h kubelet, master invalid capacity 0 on image filesystem
Normal Starting 19h kube-proxy, master Starting kube-proxy.
Warning Rebooted 19h kubelet, master Node master has been rebooted, boot id: fbd64ddb-699a-4062-8356-f8127a5aeea3
Normal NodeNotSchedulable 19h kubelet, master Node master status is now: NodeNotSchedulable
Normal NodeHasSufficientMemory 19h (x2 over 19h) kubelet, master Node master status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 19h (x2 over 19h) kubelet, master Node master status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 19h (x2 over 19h) kubelet, master Node master status is now: NodeHasSufficientPID
Normal NodeNotReady 19h kubelet, master Node master status is now: NodeNotReady
Normal NodeAllocatableEnforced 19h kubelet, master Updated Node Allocatable limit across pods
Normal NodeReady 19h kubelet, master Node master status is now: NodeReady
Normal NodeSchedulable 19h kubelet, master Node master status is now: NodeSchedulable
systemctl status k3s
● k3s.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2019-12-12 18:13:46 GMT; 18min ago
Docs: https://k3s.io
Process: 494 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 498 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 502 (k3s-server)
Tasks: 144
Memory: 371.6M
CGroup: /system.slice/k3s.service
├─502 /usr/local/bin/k3s server
└─691 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/a
Dec 12 18:32:14 master k3s[502]: time="2019-12-12T18:32:14.816288667Z" level=error msg="Failed to connect to proxy" error="dial tcp: address 2a02:a31a:a23e:c980:4918:a950:8309:2fa:6
Dec 12 18:32:14 master k3s[502]: time="2019-12-12T18:32:14.816401314Z" level=error msg="Remotedialer proxy error" error="dial tcp: address 2a02:a31a:a23e:c980:4918:a950:8309:2fa:644
Dec 12 18:32:19 master k3s[502]: time="2019-12-12T18:32:19.816706836Z" level=info msg="Connecting to proxy" url="wss://2a02:a31a:a23e:c980:4918:a950:8309:2fa:6443/v1-k3s/connect"
Dec 12 18:32:19 master k3s[502]: time="2019-12-12T18:32:19.816926205Z" level=error msg="Failed to connect to proxy" error="dial tcp: address 2a02:a31a:a23e:c980:4918:a950:8309:2fa:6
Dec 12 18:32:19 master k3s[502]: time="2019-12-12T18:32:19.817039316Z" level=error msg="Remotedialer proxy error" error="dial tcp: address 2a02:a31a:a23e:c980:4918:a950:8309:2fa:644
Dec 12 18:32:24 master k3s[502]: time="2019-12-12T18:32:24.817324245Z" level=info msg="Connecting to proxy" url="wss://2a02:a31a:a23e:c980:4918:a950:8309:2fa:6443/v1-k3s/connect"
Dec 12 18:32:24 master k3s[502]: time="2019-12-12T18:32:24.817549929Z" level=error msg="Failed to connect to proxy" error="dial tcp: address 2a02:a31a:a23e:c980:4918:a950:8309:2fa:6
Dec 12 18:32:24 master k3s[502]: time="2019-12-12T18:32:24.817674928Z" level=error msg="Remotedialer proxy error" error="dial tcp: address 2a02:a31a:a23e:c980:4918:a950:8309:2fa:644
Dec 12 18:32:25 master k3s[502]: E1212 18:32:25.992906 502 resource_quota_controller.go:407] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the ser
Dec 12 18:32:27 master k3s[502]: W1212 18:32:27.196677 502 garbagecollector.go:640] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to
I think the IPv6 address should be wrapped in [...], which it is not here: url="wss://2a02:a31a:a23e:c980:4918:a950:8309:2fa:6443/v1-k3s/connect"
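To illustrate the bracketed form (not the actual k3s code, just a probe of the supervisor port):
# the URL in the log has no brackets, so the host can't be separated from the port:
#   wss://2a02:a31a:a23e:c980:4918:a950:8309:2fa:6443/v1-k3s/connect
# with brackets the port is unambiguous; -g stops curl from treating [ ] as globs, -k skips cert checks
curl -gk https://[2a02:a31a:a23e:c980:4918:a950:8309:2fa]:6443/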
After uninstalling the server & agent and installing them once again, everything was Ready, but after a reboot it stayed Ready for a few minutes and then went NotReady :|
I finally got some logs from the server; it looks like either a cert or DNS issue related to IPv6:
Dec 16 23:28:24 master k3s[521]: time="2019-12-16T23:28:24.898927242+01:00" level=info msg="Connecting to proxy" url="wss://2a02:a31a:a23e:c980:2286:b08b:e636:c655:6443/v1-k3s/connect"
Dec 16 23:28:24 master k3s[521]: time="2019-12-16T23:28:24.899175927+01:00" level=error msg="Failed to connect to proxy" error="dial tcp: address 2a02:a31a:a23e:c980:2286:b08b:e636:c655:6443: too many colons in address"
Dec 16 23:28:24 master k3s[521]: time="2019-12-16T23:28:24.899294946+01:00" level=error msg="Remotedialer proxy error" error="dial tcp: address 2a02:a31a:a23e:c980:2286:b08b:e636:c655:6443: too many colons in address"
Dec 16 23:28:25 master k3s[521]: http: TLS handshake error from 192.168.0.202:43392: remote error: tls: bad certificate
Dec 16 23:28:25 master k3s[521]: I1216 23:28:25.656211 521 node_lifecycle_controller.go:1208] Initializing eviction metric for zone:
Dec 16 23:28:27 master k3s[521]: http: TLS handshake error from 192.168.0.202:43400: remote error: tls: bad certificate
Dec 16 23:28:29 master k3s[521]: http: TLS handshake error from 192.168.0.202:43408: remote error: tls: bad certificate
Dec 16 23:28:29 master k3s[521]: time="2019-12-16T23:28:29.899611646+01:00" level=info msg="Connecting to proxy" url="wss://2a02:a31a:a23e:c980:2286:b08b:e636:c655:6443/v1-k3s/connect"
Dec 16 23:28:29 master k3s[521]: time="2019-12-16T23:28:29.899863035+01:00" level=e
Surprisingly, it constantly logs errors about failing to connect to the proxy due to too many colons, but after shutting it down with systemctl stop k3s and running sudo k3s server manually it works; only the bad certificate errors remain. Of course, after a reboot it goes back to the IPv6 issues again :(
Turning off IPv6 at the Raspberry Pi level helped, although it seems like a bug.
Add net.ipv6.conf.all.disable_ipv6 = 1 to /etc/sysctl.conf on every worker & master.
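A minimal sketch of applying that on a node (sysctl -p just reloads /etc/sysctl.conf; a reboot works too):
echo "net.ipv6.conf.all.disable_ipv6 = 1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p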
although seems like a bug
It's actually a bug.
Seems like IPv6 addresses have to be wrapped in [ ... ].
The error message regarding the IPv6 issue looks like it should be addressed by #1198, but there may be a bigger issue: the systemd service file is launching k3s before IPv4 networking is available.
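If that ordering theory is right, a hedged sketch of a drop-in that delays k3s until the network is up; this assumes network-online.target is actually wired up on Raspbian (e.g. a *-wait-online service is enabled), which may not be the case by default:
sudo mkdir -p /etc/systemd/system/k3s.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/k3s.service.d/wait-online.conf
[Unit]
Wants=network-online.target
After=network-online.target
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s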
Also, iptables should be in legacy mode. :)
I switched to legacy mode during my investigation without any difference. And yes, it only stopped working on boot; starting k3s manually after boot was fine.
I've faced the same issue before in #811. The solution for me was to turn off ipv6 in my entire internal network. Not ideal, I know, but it worked.
In my case that wasn't possible, so I disabled IPv6 on every Raspberry Pi.
Same for me when setting up my cluster in mid/late December 2019.
If someone is searching for how to disable IPv6, this worked for me:
# act as root
sudo su -
# /boot/cmdline.txt must stay a single line, so append to the existing line
# instead of adding a new one (skip any parameters that are already there)
sed -i '1 s/$/ ipv6.disable=1 cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory/' /boot/cmdline.txt
# stop the ipv6 module from loading
cat <<EOF > /etc/modprobe.d/ipv6.conf
# Don't load ipv6 by default
alias net-pf-10 off
alias ipv6 off
options ipv6 disable_ipv6=1
EOF
# Comment out IPv6 hosts
nano /etc/hosts
reboot
Hopefully this helps someone. Maybe there are better ways; I'm no sysadmin -.-
In my case, adding net.ipv6.conf.all.disable_ipv6 = 1 to /etc/sysctl.conf (on every node & master) and then rebooting worked perfectly.
Working on Raspbian, I was able to fix this by switching iptables to legacy mode.
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
After a reboot, everything is passing nicely!
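For completeness, a sketch of the full switch on Raspbian Buster; switching ip6tables as well is my assumption, but both alternatives exist there:
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo reboot
# afterwards, iptables --version should report "(legacy)" and k3s check-config should pass
sudo iptables --version
sudo k3s check-config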
Using legacy iptables is required regardless of whether you use IPv6 or IPv4.