Version: k3s version v1.17.2+k3s1 (cdab19b0)
Description:
The k3s master fails to start; the log shows: "starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000024946269-0000000024946590: found 321 entries (expected 322)"
This happened after the machines were forcefully shut down (power loss). There is no information on the web about how to resolve this error or what to do next.
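The segment file named in the error belongs to the embedded dqlite store. Assuming the default k3s data directory (no --data-dir is set in the unit below), it should be possible to locate it with something like the following; the path is an assumption, not taken from this report:

# assuming the default k3s data dir; adjust if --data-dir / -d was set
sudo find /var/lib/rancher/k3s/server/db -name '0000000024946269-*' -ls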
To Reproduce:
Run a k3s server with --cluster-init (embedded dqlite) and forcefully power off the machine.
Expected behavior:
k3s starts and the cluster recovers after the power loss.
Actual behavior:
k3s exits with the fatal raft_start() error above and systemd restarts it in a loop.
Additional context:
uname -a
Linux ariana 5.4.7-sunxi64 #19.11.6 SMP Sat Jan 4 19:40:10 CET 2020 aarch64 GNU/Linux
lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster
cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
[Service]
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --cluster-init --write-kubeconfig-mode 664
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
/var/log/syslog:
...
Feb 9 00:00:12 ariana systemd[1]: Starting Lightweight Kubernetes...
Feb 9 00:00:12 ariana systemd[1]: Started Lightweight Kubernetes.
Feb 9 00:00:13 ariana k3s[3961]: time="2020-02-09T00:00:13.429349422Z" level=info msg="Starting k3s v1.17.2+k3s1 (cdab19b0)"
Feb 9 00:00:16 ariana k3s[3961]: time="2020-02-09T00:00:16.592512841Z" level=fatal msg="starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000024946269-0000000024946590: found 321 entries (expected 322)"
Feb 9 00:00:16 ariana systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Feb 9 00:00:16 ariana systemd[1]: k3s.service: Failed with result 'exit-code'.
Feb 9 00:00:21 ariana systemd[1]: k3s.service: Service RestartSec=5s expired, scheduling restart.
Feb 9 00:00:21 ariana systemd[1]: k3s.service: Scheduled restart job, restart counter is at 5380.
Feb 9 00:00:21 ariana systemd[1]: Stopped Lightweight Kubernetes.
...
Seeing the same issue. I was purposefully deleting master nodes at various intervals and hit this error on reboot after a couple of tries.
This appears to be the upstream dqlite issue: https://github.com/canonical/dqlite/issues/190
dqlite is still experimental; there does not appear to be a way to recover from this at the moment. If you need more production-ready HA, you should probably be using an external DB.
Also, a two-node dqlite cluster won't meet Raft consensus requirements (no quorum if one node goes down), so this setup probably won't ever work as expected.
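For reference, Raft quorum is floor(N/2)+1 voters, so a two-server cluster needs both nodes up, while three servers can tolerate losing one. Switching to an external datastore would look roughly like the sketch below; the MySQL endpoint, credentials, and database name are placeholders, not values from this cluster:

# illustrative only -- endpoint, user, password, and db name are placeholders
# note: --cluster-init applies only to the embedded datastore and is dropped here
ExecStart=/usr/local/bin/k3s server \
    --datastore-endpoint="mysql://user:password@tcp(db.example.com:3306)/k3s" \
    --write-kubeconfig-mode 664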