Version:
$ k3s --version
k3s version v1.17.2+k3s1 (cdab19b0)
Describe the bug
I freshly installed a 3 server, 3 agent cluster.
This was around 2020-02-19 ~ 6 am.
About 25 hours later, the k3s service on one server crashed (2020-02-20T08:06:07), while all other servers continued running.
NOTE: I am using the experimental dqlite feature.
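For context, the servers run the embedded multi-master (dqlite) mode. I set this up via my k3OS automation (linked below), but a rough equivalent using the standard install script would look like this (IPs and token are placeholders):
# first server bootstraps the embedded dqlite cluster
$ curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.17.2+k3s1" sh -s - server --cluster-init
# the other two servers join the first one
$ curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.17.2+k3s1" sh -s - server --server https://192.168.100.62:6443 --token "<node-token>"
# agents join as usual
$ curl -sfL https://get.k3s.io | K3S_URL=https://192.168.100.62:6443 K3S_TOKEN="<node-token>" sh -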
My Cluster:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
kcluster-master-01.example.com Ready master 2d2h v1.17.2+k3s1 192.168.100.62 <none> k3OS v0.9.0 5.0.0-37-generic containerd://1.3.3-k3s1
kcluster-master-02.example.com Ready master 2d2h v1.17.2+k3s1 192.168.100.67 <none> k3OS v0.9.0 5.0.0-37-generic containerd://1.3.3-k3s1
kcluster-master-03.example.com Ready master 2d2h v1.17.2+k3s1 192.168.100.66 <none> k3OS v0.9.0 5.0.0-37-generic containerd://1.3.3-k3s1
kcluster-agent-01.example.com Ready <none> 2d2h v1.17.2+k3s1 192.168.100.68 <none> k3OS v0.9.0 5.0.0-37-generic containerd://1.3.3-k3s1
kcluster-agent-02.example.com Ready <none> 2d2h v1.17.2+k3s1 192.168.100.65 <none> k3OS v0.9.0 5.0.0-37-generic containerd://1.3.3-k3s1
kcluster-agent-03.example.com Ready <none> 2d2h v1.17.2+k3s1 192.168.100.64 <none> k3OS v0.9.0 5.0.0-37-generic containerd://1.3.3-k3s1
The last 30 lines of the log from kcluster-master-03.example.com:
$ tail -n 30 /var/log/k3s-service.log
I0220 08:05:52.129507 3346 httplog.go:90] GET /api/v1/namespaces/kube-system/configmaps/k3s: (4.854292ms) 200 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b 127.0.0.1:36068]
I0220 08:05:53.370205 3346 eviction_manager.go:229] eviction manager: synchronize housekeeping
time="2020-02-20T08:06:07.732728435Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:09.902470532Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:09.896929157Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:09.904094730Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:09.904552889Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
time="2020-02-20T08:06:09.904593722Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
time="2020-02-20T08:06:09.905370241Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:09.907188764Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:09.907223308Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:10.064041279Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:10.067489076Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
I0220 08:06:10.084167 3346 container.go:532] [/kubepods/burstable] Housekeeping took 6.09953788s
I0220 08:06:10.136688 3346 httplog.go:90] GET /apis/policy/v1beta1/poddisruptionbudgets?allowWatchBookmarks=true&resourceVersion=485225&timeout=9m0s&timeoutSeconds=540&watch=true: (9m10.099369084s) 0 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b/scheduler 127.0.0.1:36184]
time="2020-02-20T08:06:10.195428733Z" level=info msg="error in remotedialer server [400]: read tcp 192.168.100.66:6443->192.168.100.66:59694: i/o timeout"
time="2020-02-20T08:06:10.200127554Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:10.202119262Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:10.202231975Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
I0220 08:06:10.302492 3346 container.go:532] [/systemd] Housekeeping took 2.276718901s
I0220 08:06:10.310717 3346 httplog.go:90] GET /apis/k3s.cattle.io/v1/addons?allowWatchBookmarks=true&resourceVersion=485272&timeout=6m3s&timeoutSeconds=363&watch=true: (6m10.07195573s) 0 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b 127.0.0.1:36068]
I0220 08:06:10.313706 3346 container.go:532] [/kubepods/besteffort] Housekeeping took 1.793505722s
I0220 08:06:10.331482 3346 container.go:532] [/kubepods] Housekeeping took 1.585620489s
containerd: signal: killed
To Reproduce
I have not yet been able to.
Actual behavior
Services crashed.
Now all masters went down.
After restarting the k3s services, they try to recover but fail the healthcheck and panic with:
7606]
panic: runtime error: index out of range [4319] with length 0
goroutine 372 [running]:
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Message).lastByte(...)
/go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:476
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Rows).Close(0xc01334c828, 0x10, 0x3899bc0)
/go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:631 +0x14c
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver.(*Rows).Close(0xc01334c800, 0x42ff6a, 0x7f6d269b9008)
/go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver/driver.go:560 +0x35
database/sql.(*Rows).close.func1()
/usr/local/go/src/database/sql/sql.go:3076 +0x3c
database/sql.withLock(0x4756540, 0xc013d0cd00, 0xc015433b20)
/usr/local/go/src/database/sql/sql.go:3184 +0x6d
database/sql.(*Rows).close(0xc01334c880, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/database/sql/sql.go:3075 +0x129
database/sql.(*Rows).Close(...)
/usr/local/go/src/database/sql/sql.go:3059
database/sql.(*Rows).Next(0xc01334c880, 0xc004eb9040)
/usr/local/go/src/database/sql/sql.go:2748 +0xb8
github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.RowsToEvents(0xc01334c880, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:233 +0xfa
github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.(*SQLLog).poll(0xc0007f1000, 0xc000d19380, 0x11cd3f)
/go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:325 +0x228
created by github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.(*SQLLog).startWatch
/go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:289 +0xfa
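For anyone triaging the failed recovery: the healthcheck failure and the embedded dqlite state can be checked with something like the following (the restart command assumes k3OS's openrc-managed service, and the db path assumes the default k3s data-dir; both are assumptions on my part):
$ service k3s restart                     # restart the k3s service (k3OS/openrc)
$ k3s kubectl get --raw /healthz          # never returns "ok" here; the server panics as above
$ ls -l /var/lib/rancher/k3s/server/db/   # embedded dqlite state (assumed default location)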
I experience the same behavior on my k3OS cluster using embedded multi-master.
I'm also seeing multi-master failure after a couple of hours. Perhaps a locking issue with dqlite?
Here are the full logs
Here is the automation that produced it, in case you want to try and reproduce