k3s master does not allow re-entry of a worker node once deleted.

Created on 7 Oct 2019 · 20 comments · Source: k3s-io/k3s

Version:
k3s version v0.9.1 (755bd1c6)
Describe the bug
After deleting a node from the k3s cluster, the node is unable to re-join, even when using the same token as before.
Hardware: Raspberry Pi 4 (4 GB), 32 GB microSD, Raspbian Buster Lite 2019-07-10.

The worker node had previously been joined twice (same IP) due to a hostname mistake; the master allowed both entries to exist in the cluster at the same time (same IP, different hostnames).
pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS     ROLES    AGE   VERSION
worker01    NotReady   worker   88d   v1.15.4-k3s.1
kworker02   Ready      worker   88d   v1.15.4-k3s.1
kmaster     Ready      master   88d   v1.15.4-k3s.1
kworker01   Ready      worker   88d   v1.15.4-k3s.1
pi@kmaster:~ $ sudo kubectl drain kworker01
node/kworker01 cordoned
error: unable to drain node "kworker01", aborting command...

There are pending nodes to be drained:
kworker01
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/svclb-traefik-9pc65
pi@kmaster:~ $ sudo kubectl drain kworker01
node/kworker01 already cordoned
error: unable to drain node "kworker01", aborting command...

There are pending nodes to be drained:
kworker01
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/svclb-traefik-9pc65

pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS                        ROLES    AGE   VERSION
worker01    NotReady                      worker   88d   v1.15.4-k3s.1
kworker01   NotReady,SchedulingDisabled   worker   88d   v1.15.4-k3s.1
kworker03   Ready                         worker   67s   v1.15.4-k3s.1
kworker02   Ready                         worker   88d   v1.15.4-k3s.1
kmaster     Ready                         master   88d   v1.15.4-k3s.1
pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS                        ROLES    AGE    VERSION
worker01    NotReady                      worker   88d    v1.15.4-k3s.1
kworker01   NotReady,SchedulingDisabled   worker   88d    v1.15.4-k3s.1
kworker02   Ready                         worker   88d    v1.15.4-k3s.1
kmaster     Ready                         master   88d    v1.15.4-k3s.1
kworker03   Ready                         worker   101s   v1.15.4-k3s.1
pi@kmaster:~ $ sudo kubectl delete node worker01
node "worker01" deleted
pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS                        ROLES    AGE    VERSION
kworker01   NotReady,SchedulingDisabled   worker   88d    v1.15.4-k3s.1
kworker03   Ready                         worker   2m2s   v1.15.4-k3s.1
kworker02   Ready                         worker   88d    v1.15.4-k3s.1
kmaster     Ready                         master   88d    v1.15.4-k3s.1
To Reproduce
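(Steps reconstructed from the report; the hostname and token are placeholders, and the join command assumes the standard k3s install script.)
1. Join a worker to the cluster, e.g.:
   curl -sfL https://get.k3s.io | K3S_URL=https://kmaster:6443 K3S_TOKEN=<cluster token> sh -
2. On the master, delete the node:
   sudo kubectl delete node kworker01
3. Re-run the join on the worker with the same token; the agent fails to register.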

Expected behavior
After running kubectl delete node, the worker should be able to re-join the cluster.
Should you have to manually run the uninstall script on the worker node? Should the cleanup not happen by itself?
I ran it manually on kworker01 after deleting it:

pi@kworker01:~ $ sudo /usr/local/bin/k3s-agent-uninstall.sh

+ id -u
+ [ 0 -eq 0 ]
+ /usr/local/bin/k3s-killall.sh
+ id -u
+ [ 0 -eq 0 ]
+ [ -d /var/lib/rancher/k3s/data/43998d048ba0336d929ad6cb3c2308722839fea65bdbedfd818422e2daf2e201/bin/ ]
+ export PATH=/var/lib/rancher/k3s/data/43998d048ba0336d929ad6cb3c2308722839fea65bdbedfd818422e2daf2e201/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ [ -s /etc/systemd/system/k3s-agent.service ]
+ basename /etc/systemd/system/k3s-agent.service
+ systemctl stop k3s-agent.service
+ [ -x /etc/init.d/k3s* ]
+ lsof
+ sed -e s/^[^0-9]*//g; s/ */\t/g
+ sort -n -u
+ grep -w k3s/data/[^/]*/bin/containerd-shim
+ cut -f1
+ killtree
+ [ 0 -ne 0 ]
+ do_unmount /run/k3s
+ cat /proc/self/mounts
+ awk {print $2}
+ sort -r
+ grep ^/run/k3s
+ MOUNTS=
+ [ -n ]
+ do_unmount /var/lib/rancher/k3s
+ cat /proc/self/mounts
+ awk {print $2}
+ grep ^/var/lib/rancher/k3s
+ sort -r
+ MOUNTS=
+ [ -n ]
+ ip link show
+ grep master cni0
+ sed -e s|@.*||
+ awk -F: {print $2}
+ nets=
+ ip link delete cni0
Cannot find device "cni0"
+ ip link delete flannel.1
Cannot find device "flannel.1"
+ rm -rf /var/lib/cni/
+ which systemctl
/bin/systemctl
+ systemctl disable k3s-agent
Removed /etc/systemd/system/multi-user.target.wants/k3s-agent.service.
+ systemctl reset-failed k3s-agent
+ systemctl daemon-reload
+ which rc-update
+ rm -f /etc/systemd/system/k3s-agent.service
+ rm -f /etc/systemd/system/k3s-agent.service.env
+ trap remove_uninstall EXIT
+ [ -L /usr/local/bin/kubectl ]
+ rm -f /usr/local/bin/kubectl
+ [ -L /usr/local/bin/crictl ]
+ rm -f /usr/local/bin/crictl
+ [ -L /usr/local/bin/ctr ]
+ rm -f /usr/local/bin/ctr
+ rm -rf /etc/rancher/k3s
+ rm -rf /var/lib/rancher/k3s
+ rm -f /usr/local/bin/k3s
+ rm -f /usr/local/bin/k3s-killall.sh
+ remove_uninstall
+ rm -f /usr/local/bin/k3s-agent-uninstall.sh

Actual behavior
After rebooting, the node no longer appeared in the cluster, and the attempt to re-join failed.
The same token as before was used, since the cluster ID and token didn't change.

All 20 comments

I found this:
Node Registration
Agents will register with the server using the node cluster secret along with a randomly generated password for the node, stored at /var/lib/rancher/k3s/agent/node-password.txt. The server will store the passwords for individual nodes at /var/lib/rancher/k3s/server/cred/node-passwd, and any subsequent attempts must use the same password. If the data directory of an agent is removed the password file should be recreated for the agent, or the entry removed from the server.
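
In practice, that means dropping the stale line for the node from the server-side file before re-joining. A minimal sketch (it assumes each node has one line in node-passwd containing its name; back the file up first, and kworker01 is the node from this report):

sudo cp /var/lib/rancher/k3s/server/cred/node-passwd /var/lib/rancher/k3s/server/cred/node-passwd.bak
sudo sed -i '/kworker01/d' /var/lib/rancher/k3s/server/cred/node-passwd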

I am experiencing a similar problem. @tvuongp I tried what you said about the node password, but it didn't work for me. Did it work for you?

Never mind, I discovered my other problem by typing journalctl -u k3s-agent.service | tail on my worker:

Oct 11 22:08:02 Pi4-2GB-56 k3s[7287]: time="2019-10-11T22:08:02.874509112+01:00" level=error msg="no default routes found in \"/proc/net/route\" or \"/proc/net/ipv6_route\""

Fixed that by adding a fake gateway line to my static IP address configuration file.
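For reference, on Raspbian Buster that static configuration usually lives in /etc/dhcpcd.conf; a sketch with placeholder addresses, where the "static routers" line is the (possibly fake) gateway entry that was missing:

interface eth0
static ip_address=192.168.1.56/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1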
If your issue is solved, @tvuongp, I think you can close the issue.

I was experiencing this problem, too. It seems that once a node has left the cluster via kubectl delete node, there is no way of bringing it back, which is a serious problem. Looking at the k3s-agent logs on the worker node, I got continuous messages complaining about a wrong password or similar; unfortunately, I cannot reproduce the exact log message at the moment.

I was having the same issue. Fixed it by accessing the server and removing the node entry from the passwd file.

Thanks!

+1

+1

Thanks @tvuongp, I had the same problem; deleting the old node entries in /var/lib/rancher/k3s/agent/node-password worked for me.

I encountered the same issue. My error message in the journal is level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool". Did anybody get the same error message?
I solved my problem by joining the node with a different node-name.
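For anyone taking the same route, the node name can be set at join time. A sketch (the URL, token, and name are placeholders; this assumes the install script passes trailing arguments, such as the --node-name agent flag, through to k3s):

curl -sfL https://get.k3s.io | K3S_URL=https://kmaster:6443 K3S_TOKEN=<cluster token> sh -s - --node-name kworker01b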

I just tried to re-join the node with the same node name after removing /var/lib/rancher/k3s/server/cred/node-passwd, but it didn't work. I still get the same error level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool".
I can't find the file /var/lib/rancher/k3s/agent/node-password on my master.

Having the same issue as @DoGab. I suspect this is a different issue, however, since I don't have any entries in the node-passwd file on the server mentioned before, nor does the file exist on the agent. I have this issue with a completely new cluster.

I've got the same problem as @DoGab and @mladedav. In addition, on the master node in journalctl -u k3s I see:

[...] http: TLS handshake error from <agent_ip>:46142: remote error: tls: bad certificate

+2
