K3s: Unable to bring cluster up after rebooting using v1.19.1-k3s1 w/ embedded etcd

Created on 18 Sep 2020 · 9 Comments · Source: k3s-io/k3s

Environmental Info:
K3s Version:
1.19.1-k3s1

Node(s) CPU architecture, OS, and Version:

Linux k8s-master-c 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux k8s-master-a 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux k8s-master-b 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 masters, all the same OS and arch; 0 workers

Describe the bug:
Duplicate of https://github.com/rancher/k3s/issues/2249 but this is on amd64

Steps To Reproduce:

  • Install K3s w/ embedded etcd on 3 master nodes
  • Verify kubectl output shows all nodes: k get nodes -o wide
  • Reboot all nodes
  • kubectl commands now show the error:
    Error from server (ServiceUnavailable): the server is currently unable to handle the request
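
The steps above can be sketched as a shell session. The hostnames and token value are illustrative, and the install flags follow the documented k3s v1.19 embedded-etcd setup (`--cluster-init` on the first server, `--server` on the joiners); treat this as a sketch, not the reporter's exact commands:

```shell
# First server: initialize a new embedded-etcd cluster
# (the K3S_TOKEN value and hostnames below are placeholders)
curl -sfL https://get.k3s.io | K3S_TOKEN=SECRET sh -s - server --cluster-init

# On each of the other two servers: join the first one
curl -sfL https://get.k3s.io | K3S_TOKEN=SECRET sh -s - server \
    --server https://k8s-master-a:6443

# Verify all three nodes registered
kubectl get nodes -o wide

# Reboot all three nodes at once, then retry; per this report you get:
kubectl get nodes -o wide
# Error from server (ServiceUnavailable): the server is currently
# unable to handle the request
```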

Expected behavior:
After a shutdown, all master nodes should reconnect; at the very least, one of them should start correctly.

Actual behavior:
Connection errors when talking to the nodes

Additional context / logs:
See my comments in https://github.com/rancher/k3s/issues/2249#issuecomment-694540765 and https://github.com/rancher/k3s/issues/2249#issuecomment-694941471


All 9 comments

Just to confirm, you're on Ubuntu whereas #2249 was on Raspbian?

Yes, Ubuntu 20.04.1. I think that guy is also using Ubuntu 20.04 but on arm64.

Were the nodes rebooted one by one?

@liyimeng no, I turned them all off at once. The expectation is that this would work after they are all powered back on again. I will try the work-around in https://github.com/rancher/k3s/issues/2249 after I am done tinkering with other projects.

@onedr0p they of course won't be able to come back up, since you shut off the entire cluster and forced it to lose quorum. Please google "etcd lost quorum" for the theory behind this.

I know enough about lost quorum. Maybe the Rancher folks need to provide more documentation around that, as well as ways to back up and restore etcd state with their embedded option?
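
As background on the quorum point: etcd can only serve requests while a majority of its members are up, which is why a three-node cluster that loses all three at once cannot simply resume. A tiny sketch of the arithmetic (generic Raft/etcd math, not k3s-specific code):

```python
def quorum(members: int) -> int:
    """Minimum number of live etcd members needed for a majority."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """How many members can be lost while the cluster stays available."""
    return members - quorum(members)

# A 3-server cluster needs 2 members up and tolerates losing only 1;
# shutting down all 3 simultaneously always drops below quorum.
for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```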

But in any case @liyimeng, it seems the workaround is to update the systemd unit files https://github.com/rancher/k3s/issues/2249#issuecomment-695181813 and then you can reboot or shut down all you want.
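
The exact diff lives in the linked comment; as a rough sketch of that kind of systemd change (hypothetical override file, assuming the goal is to keep restarting k3s until enough peers return to restore etcd quorum):

```ini
# /etc/systemd/system/k3s.service.d/override.conf
# Hypothetical drop-in: never give up restarting k3s, so a server that
# boots before its peers keeps retrying instead of hitting the start limit.
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always
RestartSec=5s
```

After adding a drop-in like this, `systemctl daemon-reload && systemctl restart k3s` would apply it. Again, consult the linked comment for the actual workaround.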

@onedr0p Thanks a lot! Granted, it is an experimental feature, so maybe we should not push them too hard :D. They have been busy getting 1.19.1 out the door, but your screaming does make this happen faster. LOL. Thanks for the info! I'd love to give it a try!

Yes, this is on my short list to fix, although it will probably involve more than just swapping a couple lines of code. The workaround is easy enough in the meantime.

The whole team is excited to see the embedded etcd mature quickly. There are a lot of new features here to document and kinks to iron out.

Since this is confirmed as a software issue, not an architecture or operating-system bug, I will close this issue. The main issue is https://github.com/rancher/k3s/issues/2249

