K3s: [k3s] running on [k3OS] services crashed

Created on 21 Feb 2020 · 3 comments · Source: k3s-io/k3s

Version:

$ k3s --version
k3s version v1.17.2+k3s1 (cdab19b0)

Describe the bug
I freshly installed a 3 server, 3 agent cluster.
This was around 2020-02-19 ~ 6 am.
About 25 hours later, the service on one server crashed (2020-02-20T08:06:07), while all the other servers continued running.

NOTE: I use the experimental dqlite feature.

My Cluster:

$ kubectl get nodes -o wide
NAME                             STATUS   ROLES    AGE    VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE      KERNEL-VERSION     CONTAINER-RUNTIME
kcluster-master-01.example.com   Ready    master   2d2h   v1.17.2+k3s1   192.168.100.62   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-master-02.example.com   Ready    master   2d2h   v1.17.2+k3s1   192.168.100.67   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-master-03.example.com   Ready    master   2d2h   v1.17.2+k3s1   192.168.100.66   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-agent-01.example.com    Ready    <none>   2d2h   v1.17.2+k3s1   192.168.100.68   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-agent-02.example.com    Ready    <none>   2d2h   v1.17.2+k3s1   192.168.100.65   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-agent-03.example.com    Ready    <none>   2d2h   v1.17.2+k3s1   192.168.100.64   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1

The last 30 lines of the log from kcluster-master-03.example.com:

$ tail -n 30 /var/log/k3s-service.log
I0220 08:05:52.129507    3346 httplog.go:90] GET /api/v1/namespaces/kube-system/configmaps/k3s: (4.854292ms) 200 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b 127.0.0.1:36068]
I0220 08:05:53.370205    3346 eviction_manager.go:229] eviction manager: synchronize housekeeping
time="2020-02-20T08:06:07.732728435Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:09.902470532Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:09.896929157Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:09.904094730Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:09.904552889Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
time="2020-02-20T08:06:09.904593722Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
time="2020-02-20T08:06:09.905370241Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:09.907188764Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:09.907223308Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:10.064041279Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:10.067489076Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
I0220 08:06:10.084167    3346 container.go:532] [/kubepods/burstable] Housekeeping took 6.09953788s
I0220 08:06:10.136688    3346 httplog.go:90] GET /apis/policy/v1beta1/poddisruptionbudgets?allowWatchBookmarks=true&resourceVersion=485225&timeout=9m0s&timeoutSeconds=540&watch=true: (9m10.099369084s) 0 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b/scheduler 127.0.0.1:36184]
time="2020-02-20T08:06:10.195428733Z" level=info msg="error in remotedialer server [400]: read tcp 192.168.100.66:6443->192.168.100.66:59694: i/o timeout"
time="2020-02-20T08:06:10.200127554Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:10.202119262Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:10.202231975Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
I0220 08:06:10.302492    3346 container.go:532] [/systemd] Housekeeping took 2.276718901s
I0220 08:06:10.310717    3346 httplog.go:90] GET /apis/k3s.cattle.io/v1/addons?allowWatchBookmarks=true&resourceVersion=485272&timeout=6m3s&timeoutSeconds=363&watch=true: (6m10.07195573s) 0 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b 127.0.0.1:36068]
I0220 08:06:10.313706    3346 container.go:532] [/kubepods/besteffort] Housekeeping took 1.793505722s
I0220 08:06:10.331482    3346 container.go:532] [/kubepods] Housekeeping took 1.585620489s
containerd: signal: killed

To Reproduce
I have not yet been able to reproduce it.

Actual behavior
The k3s service crashed.


All 3 comments

Now all masters have gone down.
After restarting the k3s services, they try to recover but fail the health check with:

7606]
panic: runtime error: index out of range [4319] with length 0

goroutine 372 [running]:
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Message).lastByte(...)
    /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:476
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Rows).Close(0xc01334c828, 0x10, 0x3899bc0)
    /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:631 +0x14c
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver.(*Rows).Close(0xc01334c800, 0x42ff6a, 0x7f6d269b9008)
    /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver/driver.go:560 +0x35
database/sql.(*Rows).close.func1()
    /usr/local/go/src/database/sql/sql.go:3076 +0x3c
database/sql.withLock(0x4756540, 0xc013d0cd00, 0xc015433b20)
    /usr/local/go/src/database/sql/sql.go:3184 +0x6d
database/sql.(*Rows).close(0xc01334c880, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/database/sql/sql.go:3075 +0x129
database/sql.(*Rows).Close(...)
    /usr/local/go/src/database/sql/sql.go:3059
database/sql.(*Rows).Next(0xc01334c880, 0xc004eb9040)
    /usr/local/go/src/database/sql/sql.go:2748 +0xb8
github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.RowsToEvents(0xc01334c880, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:233 +0xfa
github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.(*SQLLog).poll(0xc0007f1000, 0xc000d19380, 0x11cd3f)
    /go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:325 +0x228
created by github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.(*SQLLog).startWatch
    /go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:289 +0xfa
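The panic above ("index out of range [4319] with length 0") is the classic pattern of indexing a message buffer at an offset-derived position without first validating it against the buffer length. The sketch below is purely illustrative — the `message` type and function names are hypothetical, not the actual go-dqlite code — but it demonstrates the failure class and a bounds-checked alternative:

```go
package main

import (
	"errors"
	"fmt"
)

// message loosely mimics a wire-protocol message: a body buffer plus a
// read offset. Hypothetical type; not the real go-dqlite Message.
type message struct {
	body   []byte
	offset int
}

// lastByteUnchecked reproduces the crash pattern: it indexes body at an
// offset-derived position with no bounds check, so an empty body panics
// with "index out of range".
func lastByteUnchecked(m *message) byte {
	return m.body[m.offset-1]
}

// lastByteChecked is the defensive variant: it validates the index
// against the buffer length and returns an error instead of panicking.
func lastByteChecked(m *message) (byte, error) {
	i := m.offset - 1
	if i < 0 || i >= len(m.body) {
		return 0, errors.New("message: offset out of range")
	}
	return m.body[i], nil
}

func main() {
	// Mirrors the reported state: index far beyond an empty buffer.
	empty := &message{body: nil, offset: 4320}
	if _, err := lastByteChecked(empty); err != nil {
		fmt.Println("recovered gracefully:", err)
	}

	ok := &message{body: []byte{0x01, 0x02, 0x03}, offset: 3}
	b, _ := lastByteChecked(ok)
	fmt.Printf("last byte: 0x%02x\n", b)
}
```

In the checked variant the caller gets an error it can surface (e.g. fail the request and keep the watch loop alive) rather than a process-wide panic that takes down the whole server.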

I experience the same behavior with k3OS on my cluster using the embedded multi-master setup.

I'm also seeing multi-master failure after a couple of hours. Perhaps a locking issue with dqlite?
Here are the full logs

Here is the automation that produced it, in case you want to try to reproduce it.
