K3s: Removing node doesn't remove node password

Created on 11 Sep 2019 · 16 Comments · Source: k3s-io/k3s

I'm not sure if this is the right place for this bug report: the error message I got has only one Google result, and it points to the commit message that added the password validation below, so here it is.

I have a few Raspberry Pis: a 1 and a Zero W. I installed HypriotOS on them and, after installing k3s, changed some of their hostnames. I renamed black-pearl to rpi1, removed the black-pearl node from the k3s server, then created another black-pearl on the RPi Zero W, and here comes the problem: k3s on the Zero W (black-pearl) couldn't join the cluster because the password didn't match:

k3s-agent:

level=info msg="Running load balancer 127.0.0.1:41241 ->[k3s.local:6443]"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"

I spent some time trying to fix it and noticed that the old password for black-pearl (which is now rpi1) is still in
/var/lib/rancher/k3s/server/cred/node-passwd, despite running kubectl delete node black-pearl.

It seems that removing a node should also remove the password for that node, so that another node with the same hostname (e.g. after an OS reinstall) can re-join the cluster.
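For reference, the server-side file is plain text; a quick way to list which hostnames still have entries is below. This is a sketch that assumes one password,hostname pair per line, and it works on a throwaway copy so it runs anywhere; on a real server the file is /var/lib/rancher/k3s/server/cred/node-passwd.

```shell
# Stand-in copy of /var/lib/rancher/k3s/server/cred/node-passwd
# (assumed format: one "password,hostname" pair per line).
PASSWD=$(mktemp)
printf 'aaaa,rpi1\nbbbb,black-pearl\n' > "$PASSWD"

# Print just the hostname field of each entry.
awk -F, '{print $2}' "$PASSWD"
```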

Done · kind/bug · kind/documentation

All 16 comments

Removing a Kubernetes node using kubectl is not supposed to clean up the files generated by k3s. To fully uninstall k3s from a node, you may want to use the /usr/local/bin/k3s-uninstall.sh script that should be installed on the system.

To improve the user experience, shouldn't deleting a node via kubectl also remove its hostname:password entry from /var/lib/rancher/k3s/server/cred/node-passwd? As it was my first time with k3s, it took me a while to figure out where the password is stored and why it isn't removed. I'm happy to close this if you disagree; at the least it may help other users.

It probably should; we already clean up the CoreDNS hosts entry here: https://github.com/rancher/k3s/blob/36ca6060733725953b7a4cd2b53a295d11aea684/pkg/node/controller.go#L36

The issue isn't with cleaning up the node, it is with cleaning up node-passwd on the server.

In my case, I uninstalled (via script) and removed the node via kubectl. Then upon reinstall this issue popped up.

Uninstalling again and then removing the entry from {data-dir}/server/cred/node-passwd (default /var/lib/rancher/k3s/server/cred/node-passwd) worked for me.
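As a non-interactive sketch of that manual cleanup: the snippet below drops the stale entry for a reused hostname. It assumes each node-passwd line is a password,hostname pair, and it operates on a temp copy so it runs anywhere; point PASSWD at the real file (default /var/lib/rancher/k3s/server/cred/node-passwd, run as root) to apply it for real.

```shell
# Hostname whose stale entry should be removed.
NODE="black-pearl"

# Stand-in for /var/lib/rancher/k3s/server/cred/node-passwd
# (assumed format: one "password,hostname" pair per line).
PASSWD=$(mktemp)
printf 'aaaa,rpi1\nbbbb,black-pearl\n' > "$PASSWD"

# Keep every line except the one whose hostname field matches $NODE.
grep -v ",${NODE}\$" "$PASSWD" > "${PASSWD}.tmp" && mv "${PASSWD}.tmp" "$PASSWD"

cat "$PASSWD"   # the entry for rpi1 remains; black-pearl's is gone
```

After removing the entry, the re-imaged node with the reused hostname should be able to register again without the password-rejected error.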

@ibuildthecloud I ran into this issue too, and it was really confusing.

Uninstalling k3s-agent and reinstalling had no effect.
Eventually the k3s-agent logs on the node led me to this error and to this issue.

Just to add a comment in support of doing this cleanup.

I set up a clean install of k3s on 5 raspberry pi 4s.

Unfortunately, I had to completely reimage the OS on my last node (hostname: hive-node-4). After I got it all set up again and had the node join via k3sup, I noticed it never actually joined even though the install was fresh, so the advice above to run the uninstall script doesn't help here.

I'm running kubectl from my laptop with a KUBECONFIG set and trying to get the new hive-node-4 into the cluster. But the duplicate hostname causes this issue. There definitely needs to be a better way to clean this up. I don't think reusing a hostname is an uncommon thing.

This cleanup would allow my infrastructure to be far more immutable. My first intention was having my nodes automatically join the cluster on first boot, but this causes issues after a reimage.

Can something be done via kubectl delete node?
I think the node-controller API might be OK for removing the entry from the file.
The docs suggest that this is done via a cloud controller.
https://kubernetes.io/docs/concepts/architecture/controller/#direct-control

Note: Possible RKE2 impact? Was disabled in RKE2? Investigate as this gets addressed.

I'm facing the issue mentioned by @thebouv here: I deleted nodes from the cluster and reused their hostnames with fresh VMs, but all I got was

time="2020-10-30T04:28:32.280855039Z" level=error msg="Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

I solved it by removing the old node entries:

  • SSH into the master node
  • sudo vi /var/lib/rancher/k3s/server/cred/node-passwd
  • Delete the stale node entry, then save; a few seconds later the new node appears in the cluster

@erikwilson I don't see a backport PR for this into the 1.19 branch. Is this just for 1.20?
We're already working on shipping 1.19.5, so I'm bumping this out.

@erikwilson An issue to cover RKE2 was opened here as well: https://github.com/rancher/rke2/issues/616
Not sure what this entails, i.e. whether it's just a pull-through PR or requires more work. However, we'd also like to get this fixed in RKE2 for our next patch release there (planned for 1/13/21).

Reproduced the issue using k3s version v1.19.5+k3s1: a new node with the same hostname cannot join after the node with that hostname was deleted.

kubectl get nodes 
NAME               STATUS   ROLES    AGE     VERSION
ip-172-31-16-236   Ready    <none>   9m30s   v1.19.5+k3s1
ip-172-31-29-156   Ready    master   14m     v1.19.5+k3s1
ubuntu@ip-172-31-29-156:~$ kubectl delete node ip-172-31-16-236 
node "ip-172-31-16-236" deleted

k3s-master
Dec 17 07:49:42 ip-172-31-29-156 k3s[2150]: time="2020-12-17T07:49:42.676275874Z" level=error msg="Node password validation failed for 'ip-172-31-16-236', using passwd file '/var/lib/rancher/k3s/server/cred/node-passwd'"

k3s-agent

Dec 17 07:03:50 ip-172-31-16-236 k3s[2782]: time="2020-12-17T07:03:50.833861337Z" level=error msg="Failed to retrieve agent config: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

Validated that a node with the same hostname can join after being deleted, on k3s version v1.20.0-rc4+k3s1.

kubectl get nodes -o wide
NAME               STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
ip-172-31-16-217   Ready    control-plane,master   27m   v1.20.0-rc4+k3s1   172.31.16.217   <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1
ip-172-31-17-227   Ready    <none>                 66s   v1.20.0-rc4+k3s1   172.31.17.227   <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1

kubectl delete node ip-172-31-17-227

kubectl get nodes -o wide
NAME               STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
ip-172-31-16-217   Ready    control-plane,master   28m   v1.20.0-rc4+k3s1   172.31.16.217   <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1
ip-172-31-17-227   Ready    <none>                 38s   v1.20.0-rc4+k3s1   172.31.25.36    <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1

Noticed that on the node that was added, the logs show the error below every 10 seconds.

Dec 21 16:34:50 ip-172-31-12-56 k3s[2171]: E1221 16:34:50.936837    2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
Dec 21 16:35:00 ip-172-31-12-56 k3s[2171]: E1221 16:35:00.964220    2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
Dec 21 16:35:11 ip-172-31-12-56 k3s[2171]: E1221 16:35:11.037132    2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again

The error above appeared because the node that had been deleted from Kubernetes was still running; shutting the old node down resolved it.

Steps followed:

  • Created a cluster with two nodes (master and agent)
  • Deleted the agent node using kubectl delete node
  • Uninstalled k3s and shut down the agent node
  • Changed the hostname of a new node to the hostname of the deleted node
  • Joined the new node to the cluster successfully
