K3s: [k3s] running on [k3OS] services crashed

Created on 21 Feb 2020 · 3 comments · Source: k3s-io/k3s

Version:

$ k3s --version
k3s version v1.17.2+k3s1 (cdab19b0)

Describe the bug
I freshly installed a 3 server, 3 agent cluster.
This was around 2020-02-19 ~ 6 am.
About 25 hours later, the service on one server crashed (2020-02-20T08:06:07), while all the other servers continued running.

NOTE: I use the experimental dqlite feature.

My Cluster:

$ kubectl get nodes -o wide
NAME                             STATUS   ROLES    AGE    VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE      KERNEL-VERSION     CONTAINER-RUNTIME
kcluster-master-01.example.com   Ready    master   2d2h   v1.17.2+k3s1   192.168.100.62   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-master-02.example.com   Ready    master   2d2h   v1.17.2+k3s1   192.168.100.67   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-master-03.example.com   Ready    master   2d2h   v1.17.2+k3s1   192.168.100.66   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-agent-01.example.com    Ready    <none>   2d2h   v1.17.2+k3s1   192.168.100.68   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-agent-02.example.com    Ready    <none>   2d2h   v1.17.2+k3s1   192.168.100.65   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1
kcluster-agent-03.example.com    Ready    <none>   2d2h   v1.17.2+k3s1   192.168.100.64   <none>        k3OS v0.9.0   5.0.0-37-generic   containerd://1.3.3-k3s1

The last 30 lines of the log from kcluster-master-03.example.com:

$ tail -n 30 /var/log/k3s-service.log
I0220 08:05:52.129507    3346 httplog.go:90] GET /api/v1/namespaces/kube-system/configmaps/k3s: (4.854292ms) 200 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b 127.0.0.1:36068]
I0220 08:05:53.370205    3346 eviction_manager.go:229] eviction manager: synchronize housekeeping
time="2020-02-20T08:06:07.732728435Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:09.902470532Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:09.896929157Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:09.904094730Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:09.904552889Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
time="2020-02-20T08:06:09.904593722Z" level=error msg="Error writing ping" error="write tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
time="2020-02-20T08:06:09.905370241Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:09.907188764Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:09.907223308Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:10.064041279Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-20T08:06:10.067489076Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
I0220 08:06:10.084167    3346 container.go:532] [/kubepods/burstable] Housekeeping took 6.09953788s
I0220 08:06:10.136688    3346 httplog.go:90] GET /apis/policy/v1beta1/poddisruptionbudgets?allowWatchBookmarks=true&resourceVersion=485225&timeout=9m0s&timeoutSeconds=540&watch=true: (9m10.099369084s) 0 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b/scheduler 127.0.0.1:36184]
time="2020-02-20T08:06:10.195428733Z" level=info msg="error in remotedialer server [400]: read tcp 192.168.100.66:6443->192.168.100.66:59694: i/o timeout"
time="2020-02-20T08:06:10.200127554Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:59694->192.168.100.66:6443: i/o timeout"
time="2020-02-20T08:06:10.202119262Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:52914->192.168.100.62:6443: i/o timeout"
time="2020-02-20T08:06:10.202231975Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.100.66:55694->192.168.100.67:6443: i/o timeout"
I0220 08:06:10.302492    3346 container.go:532] [/systemd] Housekeeping took 2.276718901s
I0220 08:06:10.310717    3346 httplog.go:90] GET /apis/k3s.cattle.io/v1/addons?allowWatchBookmarks=true&resourceVersion=485272&timeout=6m3s&timeoutSeconds=363&watch=true: (6m10.07195573s) 0 [k3s/v1.17.2+k3s1 (linux/amd64) kubernetes/cdab19b 127.0.0.1:36068]
I0220 08:06:10.313706    3346 container.go:532] [/kubepods/besteffort] Housekeeping took 1.793505722s
I0220 08:06:10.331482    3346 container.go:532] [/kubepods] Housekeeping took 1.585620489s
containerd: signal: killed

To Reproduce
I have not yet been able to reproduce it.

Actual behavior
The k3s service crashed.


All 3 comments

Now all masters have gone down.
After restarting the k3s services, they try to recover but fail the health check with:

7606]
panic: runtime error: index out of range [4319] with length 0

goroutine 372 [running]:
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Message).lastByte(...)
    /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:476
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Rows).Close(0xc01334c828, 0x10, 0x3899bc0)
    /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:631 +0x14c
github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver.(*Rows).Close(0xc01334c800, 0x42ff6a, 0x7f6d269b9008)
    /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver/driver.go:560 +0x35
database/sql.(*Rows).close.func1()
    /usr/local/go/src/database/sql/sql.go:3076 +0x3c
database/sql.withLock(0x4756540, 0xc013d0cd00, 0xc015433b20)
    /usr/local/go/src/database/sql/sql.go:3184 +0x6d
database/sql.(*Rows).close(0xc01334c880, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/database/sql/sql.go:3075 +0x129
database/sql.(*Rows).Close(...)
    /usr/local/go/src/database/sql/sql.go:3059
database/sql.(*Rows).Next(0xc01334c880, 0xc004eb9040)
    /usr/local/go/src/database/sql/sql.go:2748 +0xb8
github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.RowsToEvents(0xc01334c880, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:233 +0xfa
github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.(*SQLLog).poll(0xc0007f1000, 0xc000d19380, 0x11cd3f)
    /go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:325 +0x228
created by github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog.(*SQLLog).startWatch
    /go/src/github.com/rancher/k3s/vendor/github.com/rancher/kine/pkg/logstructured/sqllog/sql.go:289 +0xfa
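The panic above ("index out of range [4319] with length 0") is the classic pattern of indexing a message buffer at an offset-derived position without first validating it against the buffer length. The sketch below is purely illustrative — the `message` type and function names are hypothetical, not the actual go-dqlite code — but it demonstrates the failure class and a bounds-checked alternative:

```go
package main

import (
	"errors"
	"fmt"
)

// message loosely mimics a wire-protocol message: a body buffer plus a
// read offset. Hypothetical type; not the real go-dqlite Message.
type message struct {
	body   []byte
	offset int
}

// lastByteUnchecked reproduces the crash pattern: it indexes body at an
// offset-derived position with no bounds check, so an empty body panics
// with "index out of range".
func lastByteUnchecked(m *message) byte {
	return m.body[m.offset-1]
}

// lastByteChecked is the defensive variant: it validates the index
// against the buffer length and returns an error instead of panicking.
func lastByteChecked(m *message) (byte, error) {
	i := m.offset - 1
	if i < 0 || i >= len(m.body) {
		return 0, errors.New("message: offset out of range")
	}
	return m.body[i], nil
}

func main() {
	// Mirrors the reported state: index far beyond an empty buffer.
	empty := &message{body: nil, offset: 4320}
	if _, err := lastByteChecked(empty); err != nil {
		fmt.Println("recovered gracefully:", err)
	}

	ok := &message{body: []byte{0x01, 0x02, 0x03}, offset: 3}
	b, _ := lastByteChecked(ok)
	fmt.Printf("last byte: 0x%02x\n", b)
}
```

In the checked variant the caller gets an error it can surface (e.g. fail the request and keep the watch loop alive) rather than a process-wide panic that takes down the whole server.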

I experience the same behavior with k3OS on my cluster using the embedded multi-master setup.

I'm also seeing multi-master failure after a couple of hours. Perhaps a locking issue with dqlite?
Here are the full logs

Here is the automation that produced it, in case you want to try to reproduce it.
