K3s: io: load closed segment 0000000000064324-0000000000064793: found 469 entries (expected 470)

Created on 14 Apr 2020 · 20 comments · Source: k3s-io/k3s

This is the fifth time this has happened to me.
Random failures after restarting with 'systemctl restart k3s'. It happens after several days of use, followed by a restart using 'systemctl restart k3s'.

Version:
k3s version v1.17.2+k3s1 (cdab19b0)
Ubuntu 18.04.3 LTS

K3s arguments:
ExecStart=/usr/local/bin/k3s \
server \
'--cluster-init' \
'--tls-san' \
'10.7.55.77' \
'K3S_KUBECONFIG_MODE=664' \

What was installed on the k3s cluster:

  • Longhorn
  • Rbac-manager

What was configured:

  • Custom Registries with custom CA

What was in the log:

sudo journalctl -u k3s | tail -n 20
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 18964 (longhorn-manage) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19021 (csi-node-driver) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19072 (longhorn-manage) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19608 (csi-provisioner) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 7526 (pause) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 17808 (npm) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 17858 (node) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 k3s[7257]: time="2020-04-14T16:26:52.507407670+07:00" level=info msg="Starting k3s v1.17.2+k3s1 (cdab19b0)"
Apr 14 16:26:53 devops02 k3s[7257]: time="2020-04-14T16:26:53.197376154+07:00" level=fatal msg="starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000000064324-0000000000064793: found 469 entries (expected 470)"
Apr 14 16:26:53 devops02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Apr 14 16:26:53 devops02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Apr 14 16:26:53 devops02 systemd[1]: Failed to start Lightweight Kubernetes.


All 20 comments

I have the same problem with my cluster on a Raspberry Pi 4B; it has effectively become single-use. Any service restart or node reboot simply destroys the cluster.

It's probably a dependency issue:
https://github.com/rancher/kine/pull/29
https://github.com/canonical/dqlite/issues/190

Similar issue here

I reinstalled everything.
I do not recommend using the built-in dqlite for production, especially with multiple masters; that will break everything fast.
Instead I hooked k3s up to PostgreSQL, which works like a charm (for now).

@mikeccuk2005 Don't close the issue, the bug isn't fixed yet.

I've manually recompiled k3s with go-dqlite v1.4.1 - will report back if it resolves the issue after a little while of uptime.

There were snapshot files in the dqlite data directory (/var/lib/rancher/k3s/server/db/).
Can I roll back?
If this issue continues, I would recommend that k3s add a feature for backup and restore.
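In the meantime, something like the following could serve as a stopgap. This is a hedged sketch, not an official k3s feature: it tars up the dqlite data directory so a corrupt segment can be rolled back later. DB_DIR is the default server db path mentioned above; BACKUP is an arbitrary destination of my choosing, and both are overridable assumptions.

```shell
#!/bin/sh
# Sketch: back up the dqlite data directory before a restart.
# Stop k3s first (systemctl stop k3s) so no segment is mid-write.
DB_DIR="${DB_DIR:-/var/lib/rancher/k3s/server/db}"
BACKUP="${BACKUP:-/tmp/k3s-db-backup.tar.gz}"

if [ -d "$DB_DIR" ]; then
    # -C keeps the archive rooted at the db directory's parent.
    tar -czf "$BACKUP" -C "$(dirname "$DB_DIR")" "$(basename "$DB_DIR")"
    echo "backup written to $BACKUP"
else
    echo "db directory $DB_DIR not found; nothing to back up"
fi
```

Restoring would just be extracting the tarball back into place while k3s is stopped, though as noted below, copying db files around like this may not be safe.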

Receiving errors like this after a little while:

avril 15 19:22:30 k3s[40012]: time="2020-04-15T19:22:30.268644397+02:00" level=error msg="error while range on /registry/longhorn.io/settings/ /registry/longhorn.io/settings/: context canceled"
avril 15 19:22:30 k3s[40012]: panic: runtime error: index out of range [2127] with length 232
avril 15 19:22:30 k3s[40012]: goroutine 19865 [running]:
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Message).lastByte(...)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:476
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Rows).Close(0xc00f036fa8, 0xc00c406a20, 0x3)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:631 +0x14c
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver.(*Rows).Close(0xc00f036f80, 0x42ff6a, 0x6dc47a0)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver/driver.go:560 +0x35
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).close.func1()
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3076 +0x3c
avril 15 19:22:30 k3s[40012]: database/sql.withLock(0x4773ee0, 0xc00ef8c000, 0xc00b44e6f8)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3184 +0x6d
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).close(0xc00f037000, 0x4718940, 0xc0001321e0, 0x0, 0x0)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3075 +0x129
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).awaitDone(0xc00f037000, 0x47d9320, 0xc00c8b6c00, 0x0, 0x0)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:2733 +0x101
avril 15 19:22:30 k3s[40012]: created by database/sql.(*Rows).initContextClose
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:2717 +0xc1

And @mikeccuk2005 - I think backup of storage and Kubernetes configuration should be done at another level. Storage can be backed up from your storage class (if local-path, find the path and copy the files away); for configuration, just save/export all the YAMLs.

Backing up and restoring using the db files seems unsafe.
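Exporting the YAMLs could be sketched roughly like this. It assumes kubectl is installed and pointed at the cluster; OUT_DIR and the use of "get all" are my own illustrative choices (note that "get all" does not cover CRDs, secrets, or configmaps, which would need extra passes).

```shell
#!/bin/sh
# Sketch: dump each namespace's core resources to one YAML file.
OUT_DIR="${OUT_DIR:-/tmp/k3s-manifests}"
mkdir -p "$OUT_DIR"

if command -v kubectl >/dev/null 2>&1; then
    for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
        # "get all" covers pods, services, deployments, etc. per namespace.
        kubectl get all -n "$ns" -o yaml > "$OUT_DIR/$ns.yaml"
    done
else
    echo "kubectl not found; skipping export"
fi
```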

And (code=killed, signal=SEGV) on others, so it doesn't solve the problem.

Is there a way to stop k3s gracefully to avoid this problem?
Currently using systemctl restart k3s is a bit aggressive, don't you think?
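One partial mitigation worth trying (an assumption on my part, not a confirmed fix) is giving k3s more time to exit cleanly before systemd escalates to SIGKILL. TimeoutStopSec is a standard systemd directive; whether k3s actually flushes its dqlite segments on SIGTERM is the untested assumption here. A drop-in at /etc/systemd/system/k3s.service.d/override.conf could look like:

```ini
# Hypothetical drop-in; only helps if k3s shuts down cleanly on SIGTERM.
[Service]
TimeoutStopSec=180
```

Run systemctl daemon-reload before the next restart so the override takes effect.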

Receiving errors like this after a little while:

```
avril 15 19:22:30 bbs-edscpu-p001.bbs.aphp.fr k3s[40012]: time="2020-04-15T19:22:30.268644397+02:00" level=error msg="error while range on /registry/longhorn.io/settings/ /registry/longhorn.io/settings/: context canceled"
avril 15 19:22:30 bbs-edscpu-p001.bbs.aphp.fr k3s[40012]: panic: runtime error: index out of range [2127] with length 232
```

This specific error should be fixed when using the latest go-dqlite, libdqlite, and libraft.

Is there a way to stop k3s gracefully to avoid this problem?
Currently using systemctl restart k3s is a bit aggressive, don't you think?

The way to avoid this problem is to upgrade these dependencies. If that still doesn't work for you, please let me know.

So I've compiled kine, dqlite-build, and k3s with RAFT_VER=v0.9.18 DQLITE_VER=v1.4.1 DQLITE_VER=v1.5.0 on arm64 and started testing on my Raspberry Pi 4B cluster with 3 master nodes. I didn't see the previous crash for about 4 hours while restarting services or shutting down nodes. But dqlite leader election is not working: if I turn off the first master (deployed with --cluster-init), the whole cluster fails with these errors:

Node 1: turned off

Node 2:

Apr 17 18:51:15 rpi-01 k3s[10100]: F0417 18:51:15.721930   10100 controllermanager.go:213] leaderelection lost
Apr 17 18:51:16 rpi-01 systemd[1]: k3s.service: Main process exited, code=exited, status=255/EXCEPTION

Node 3:

Apr 17 18:52:21 rpi-02 k3s[5600]: time="2020-04-17T18:52:21.325072836Z" level=error msg="error while range on /registry/apiextensions.k8s.io/customresourcedefinitions /registry/apiextensions.k8s.io/customresourcedefinitions: failed to create dqlite connection: no available dqlite leader server found"

I'm not totally sure about this one, but sounds like either a deployment issue of some kind on your side or k3s/kine not forming the dqlite cluster correctly.

Probably something k3s/kine devs should look at and possibly get back to me if they find dqlite bugs or have questions about how to correctly form a cluster.

I'd suggest opening a separate issue for this, though, since it seems unrelated to the original one.

Thanks for the clarification on the unrelated issue @freeekanayaka

Regarding the original issue, we've bumped the necessary dependencies here:
https://github.com/rancher/dqlite-build/pull/13/files

@mikeccuk2005 or anyone else who hit this issue: can you attempt to reproduce it with v1.18.2 (either an RC or the real release once it's out)? Since dqlite support is still experimental, this won't be considered a blocker for 1.18.2, and thus we won't be holding up the release to test it.

@SteelCrow I believe your issue is more tied to https://github.com/rancher/k3s/issues/1391

...

Can you attempt to reproduce it with v1.18.2 (either an RC or the real release once it's out)? Since dqlite support is still experimental, this won't be considered a blocker for 1.18.2, and thus we won't be holding up the release to test it.

Came here because I had the error message: load closed segment 0000000000100224-0000000000100489: found 265 entries (expected 266)

Running on a Raspberry Pi 4 (Buster),
v1.18.2+k3s1 (698e444a)

Installed a fresh 3-node cluster. I was trying to debug why everything stopped when node 0 was turned off (à la #1391), so I was turning the service on and off.
I used systemctl stop k3s / systemctl start k3s several times (probably 6-7 times) on the first master without issues before the problem occurred. Total uptime < 3 hours.

Try to remove the file 0000000000100224-0000000000100489 from your k3s data directory (not sure where that is).
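A sketch of that suggestion, with the segment quarantined rather than deleted outright. DB_DIR is the server db path mentioned earlier in this thread and SEG is the segment name from the error message; both are assumptions to adjust for your node, and taking a backup first is strongly advisable.

```shell
#!/bin/sh
# Sketch: move the closed segment that dqlite refuses to load out of
# the way, so a failed start can be reverted by moving it back.
DB_DIR="${DB_DIR:-/var/lib/rancher/k3s/server/db}"
SEG="${SEG:-0000000000100224-0000000000100489}"

if [ -f "$DB_DIR/$SEG" ]; then
    mv "$DB_DIR/$SEG" "$DB_DIR/$SEG.corrupt"
    echo "moved $SEG aside; try starting k3s again"
else
    echo "segment $SEG not found under $DB_DIR"
fi
```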

I don't know the internals of k3s exactly, so there might be an issue with how it creates the dqlite cluster, which would make it not really HA.

If it's of interest, dqlite may be replaced by etcd in https://github.com/rancher/k3s/pull/1770

We would like to close out this issue. If you strongly object to this you can leave comments - or if you wish create a separate GH issue.

dqlite and raft versions have been bumped. We are having a hard time reproducing this issue now - in fact we have not been able to reproduce the issue.

Further, as @jamesorlakin mentioned we are indeed moving to embedded etcd in K3s. You will find in our upcoming v1.19.0 release of K3s that this feature has replaced dqlite. We are planning to release v1.19.0 K3s not later than September 1, 2020 (however this is subject to change).

