k3s master does not allow re-entry of a worker node once deleted.

Created on 7 Oct 2019 · 20 comments · Source: k3s-io/k3s

Version:
k3s version v0.9.1 (755bd1c6)
Describe the bug
After deleting a node from the k3s cluster, the node is unable to re-join, even when using the same token as before.
Hardware: Raspberry Pi 4 (4 GB), 32 GB microSD, Raspbian Buster Lite 2019-07-10.

The worker node had previously been joined twice (same IP) due to a hostname mistake; the master allowed both entries to exist in the cluster at the same time (same IP, different hostnames).
pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS     ROLES    AGE   VERSION
worker01    NotReady   worker   88d   v1.15.4-k3s.1
kworker02   Ready      worker   88d   v1.15.4-k3s.1
kmaster     Ready      master   88d   v1.15.4-k3s.1
kworker01   Ready      worker   88d   v1.15.4-k3s.1
pi@kmaster:~ $ sudo kubectl drain kworker01
node/kworker01 cordoned
error: unable to drain node "kworker01", aborting command...

There are pending nodes to be drained:
kworker01
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/svclb-traefik-9pc65
pi@kmaster:~ $ sudo kubectl drain kworker01
node/kworker01 already cordoned
error: unable to drain node "kworker01", aborting command...

There are pending nodes to be drained:
kworker01
error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/svclb-traefik-9pc65

pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS                        ROLES    AGE   VERSION
worker01    NotReady                      worker   88d   v1.15.4-k3s.1
kworker01   NotReady,SchedulingDisabled   worker   88d   v1.15.4-k3s.1
kworker03   Ready                         worker   67s   v1.15.4-k3s.1
kworker02   Ready                         worker   88d   v1.15.4-k3s.1
kmaster     Ready                         master   88d   v1.15.4-k3s.1
pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS                        ROLES    AGE    VERSION
worker01    NotReady                      worker   88d    v1.15.4-k3s.1
kworker01   NotReady,SchedulingDisabled   worker   88d    v1.15.4-k3s.1
kworker02   Ready                         worker   88d    v1.15.4-k3s.1
kmaster     Ready                         master   88d    v1.15.4-k3s.1
kworker03   Ready                         worker   101s   v1.15.4-k3s.1
pi@kmaster:~ $ sudo kubectl delete node worker01
node "worker01" deleted
pi@kmaster:~ $ sudo kubectl get nodes
NAME        STATUS                        ROLES    AGE    VERSION
kworker01   NotReady,SchedulingDisabled   worker   88d    v1.15.4-k3s.1
kworker03   Ready                         worker   2m2s   v1.15.4-k3s.1
kworker02   Ready                         worker   88d    v1.15.4-k3s.1
kmaster     Ready                         master   88d    v1.15.4-k3s.1
To Reproduce
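(Steps reconstructed from the report; the hostname and token are placeholders, and the join command assumes the standard k3s install script.)
1. Join a worker to the cluster, e.g.:
   curl -sfL https://get.k3s.io | K3S_URL=https://kmaster:6443 K3S_TOKEN=<cluster token> sh -
2. On the master, delete the node:
   sudo kubectl delete node kworker01
3. Re-run the join on the worker with the same token; the agent fails to register.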

Expected behavior
After running kubectl delete node, the worker should be able to re-join the cluster.
Should you have to manually run the uninstall script on the worker node? Should the cleanup not happen by itself?
I ran it manually on kworker01 after deleting it:

pi@kworker01:~ $ sudo /usr/local/bin/k3s-agent-uninstall.sh

+ id -u
+ [ 0 -eq 0 ]
+ /usr/local/bin/k3s-killall.sh
+ id -u
+ [ 0 -eq 0 ]
+ [ -d /var/lib/rancher/k3s/data/43998d048ba0336d929ad6cb3c2308722839fea65bdbedfd818422e2daf2e201/bin/ ]
+ export PATH=/var/lib/rancher/k3s/data/43998d048ba0336d929ad6cb3c2308722839fea65bdbedfd818422e2daf2e201/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ [ -s /etc/systemd/system/k3s-agent.service ]
+ basename /etc/systemd/system/k3s-agent.service
+ systemctl stop k3s-agent.service
+ [ -x /etc/init.d/k3s* ]
+ lsof
+ sed -e s/^[^0-9]*//g; s/ */\t/g
+ sort -n -u
+ grep -w k3s/data/[^/]*/bin/containerd-shim
+ cut -f1
+ killtree
+ [ 0 -ne 0 ]
+ do_unmount /run/k3s
+ cat /proc/self/mounts
+ awk {print $2}
+ sort -r
+ grep ^/run/k3s
+ MOUNTS=
+ [ -n ]
+ do_unmount /var/lib/rancher/k3s
+ cat /proc/self/mounts
+ awk {print $2}
+ grep ^/var/lib/rancher/k3s
+ sort -r
+ MOUNTS=
+ [ -n ]
+ ip link show
+ grep master cni0
+ sed -e s|@.*||
+ awk -F: {print $2}
+ nets=
+ ip link delete cni0
Cannot find device "cni0"
+ ip link delete flannel.1
Cannot find device "flannel.1"
+ rm -rf /var/lib/cni/
+ which systemctl
/bin/systemctl
+ systemctl disable k3s-agent
Removed /etc/systemd/system/multi-user.target.wants/k3s-agent.service.
+ systemctl reset-failed k3s-agent
+ systemctl daemon-reload
+ which rc-update
+ rm -f /etc/systemd/system/k3s-agent.service
+ rm -f /etc/systemd/system/k3s-agent.service.env
+ trap remove_uninstall EXIT
+ [ -L /usr/local/bin/kubectl ]
+ rm -f /usr/local/bin/kubectl
+ [ -L /usr/local/bin/crictl ]
+ rm -f /usr/local/bin/crictl
+ [ -L /usr/local/bin/ctr ]
+ rm -f /usr/local/bin/ctr
+ rm -rf /etc/rancher/k3s
+ rm -rf /var/lib/rancher/k3s
+ rm -f /usr/local/bin/k3s
+ rm -f /usr/local/bin/k3s-killall.sh
+ remove_uninstall
+ rm -f /usr/local/bin/k3s-agent-uninstall.sh

Actual behavior
After rebooting, the node no longer appeared in the cluster, and the attempt to re-join failed.
The same token as before was used, since the cluster ID and token didn't change.

All 20 comments

I found this:
Node Registration
Agents will register with the server using the node cluster secret along with a randomly generated password for the node, stored at /var/lib/rancher/k3s/agent/node-password.txt. The server will store the passwords for individual nodes at /var/lib/rancher/k3s/server/cred/node-passwd, and any subsequent attempts must use the same password. If the data directory of an agent is removed the password file should be recreated for the agent, or the entry removed from the server.
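
In practice, that means dropping the stale line for the node from the server-side file before re-joining. A minimal sketch (it assumes each node has one line in node-passwd containing its name; back the file up first, and kworker01 is the node from this report):

sudo cp /var/lib/rancher/k3s/server/cred/node-passwd /var/lib/rancher/k3s/server/cred/node-passwd.bak
sudo sed -i '/kworker01/d' /var/lib/rancher/k3s/server/cred/node-passwd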

I am experiencing a similar problem. @tvuongp I tried what you said about the node password, but it didn't work for me. Did it work for you?

Never mind, I discovered my other problem by typing journalctl -u k3s-agent.service | tail on my worker:

Oct 11 22:08:02 Pi4-2GB-56 k3s[7287]: time="2019-10-11T22:08:02.874509112+01:00" level=error msg="no default routes found in \"/proc/net/route\" or \"/proc/net/ipv6_route\""

Fixed that by adding a fake gateway line to my static IP address configuration file.
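For reference, on Raspbian Buster that static configuration usually lives in /etc/dhcpcd.conf; a sketch with placeholder addresses, where the "static routers" line is the (possibly fake) gateway entry that was missing:

interface eth0
static ip_address=192.168.1.56/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1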
If your issue is solved, @tvuongp, I think you can close the issue.

I was experiencing this problem, too. It seems that once a node has left the cluster via kubectl delete node, there is no way of bringing it back, which is a serious problem. Looking at the k3s-agent logs on the worker node, I got continuous messages complaining about a wrong password or similar; unfortunately, I cannot reproduce the exact log message at the moment.

I was having the same issue. Fixed it by accessing the server and removing the node entry from the passwd file.

Thanks!

+1

+1

Thanks @tvuongp, I had the same problem; deleting the old node entries in /var/lib/rancher/k3s/agent/node-password worked for me.

I encountered the same issue. My error message in the journal is level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool". Did anybody get the same error message?
I solved my problem by joining the node with a different node-name.
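For anyone taking the same route, the node name can be set at join time. A sketch (the URL, token, and name are placeholders; this assumes the install script passes trailing arguments, such as the --node-name agent flag, through to k3s):

curl -sfL https://get.k3s.io | K3S_URL=https://kmaster:6443 K3S_TOKEN=<cluster token> sh -s - --node-name kworker01b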

I just tried to re-join the node with the same node name after removing /var/lib/rancher/k3s/server/cred/node-passwd, but it didn't work. I still get the same error level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool".
I can't find the file /var/lib/rancher/k3s/agent/node-password on my master.

Having the same issue as @DoGab. I suspect this is a different issue, however, since I don't have any entries in the node-passwd file on the server mentioned before, nor does the file exist on the agent. I have this issue with a completely new cluster.

I've got the same problem as @DoGab and @mladedav. In addition, on the master node in journalctl -u k3s I see:

[...] http: TLS handshake error from <agent_ip>:46142: remote error: tls: bad certificate

+2
