Longhorn: [BUG] Service account secret mount spam on longhorn-manager restart

Created on 13 Oct 2020 · 4 Comments · Source: longhorn/longhorn

Describe the bug
On some nodes where longhorn-manager pods had been restarting frequently, the number of /run/secrets/kubernetes.io/serviceaccount tmpfs mounts kept growing. At one point a node reached 4k+ duplicate mounts, which made df take 20s to execute and slowed Docker down until its communication with the kubelet timed out.

It appears that service account mounts grow exponentially with each restart.

This doesn't happen when restarting other pods.

To Reproduce
Steps to reproduce the behavior:

1. Install Longhorn on the cluster.
2. On any node running a longhorn-manager pod, inspect the current mounts (e.g. read /etc/mtab or run df) and note how many /run/secrets/kubernetes.io/serviceaccount mounts there are.
3. Kill the longhorn-manager pod on this node and let it start again successfully.
4. Inspect the mounts on this node again and see that a duplicate mount has been added.

Expected behavior
No duplicate mounts present on the node.

Environment:

Longhorn version: 1.0.2
Kubernetes version: 1.17.9-eks
Node OS type and version: EKS-optimised AL2

Additional context
Might be related to #1757.

Probably an issue in itself, but the reason longhorn-manager restarts (without us killing it) is this panic:

E1010 09:17:42.204022       1 runtime.go:69] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x15c9880), concrete:(*runtime._type)(0x16855a0), asserted:(*runtime._type)(0x17d40c0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1beta1.Replica)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:967
/usr/local/go/src/runtime/iface.go:260
/go/src/github.com/longhorn/longhorn-manager/controller/replica_controller.go:106
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/controller.go:209
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:556
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
/usr/local/go/src/runtime/asm_amd64.s:1373
panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1beta1.Replica [recovered]
    panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1beta1.Replica
goroutine 245 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1615760, 0xc009fd4ba0)
    /usr/local/go/src/runtime/panic.go:967 +0x166
github.com/longhorn/longhorn-manager/controller.NewReplicaController.func3(0x16855a0, 0xc00d6e1020)
    /go/src/github.com/longhorn/longhorn-manager/controller/replica_controller.go:106 +0x69
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/controller.go:209
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc0000a4018, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:556 +0x17e
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc0095c5e18, 0x1, 0x0)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x79
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000a91f60)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0095c5f60, 0xdf8475800, 0x0, 0x1, 0xc000614660)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002a1a00)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x9b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0008341b0, 0xc000275a70)
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62
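
The trace shows a delete handler asserting its argument directly to *v1beta1.Replica while the informer delivered a cache.DeletedFinalStateUnknown tombstone instead. A common way to guard against this is to check for the tombstone before asserting. The sketch below illustrates that pattern only; it is not the code from longhorn-manager. Replica and DeletedFinalStateUnknown are local stand-ins (the latter mirrors the Key and Obj fields of the client-go type) so the example compiles without client-go, and replicaFromDeleteEvent is a hypothetical helper name.

```go
package main

import "fmt"

// Replica is a hypothetical stand-in for *v1beta1.Replica.
type Replica struct{ Name string }

// DeletedFinalStateUnknown is a local stand-in that mirrors the Key and Obj
// fields of client-go's cache.DeletedFinalStateUnknown.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// replicaFromDeleteEvent unwraps the object handed to a delete handler.
// It accepts either a *Replica or a tombstone wrapping one, and returns an
// error instead of panicking on anything else.
func replicaFromDeleteEvent(obj interface{}) (*Replica, error) {
	if r, ok := obj.(*Replica); ok {
		return r, nil
	}
	tombstone, ok := obj.(DeletedFinalStateUnknown)
	if !ok {
		return nil, fmt.Errorf("unexpected object type %T", obj)
	}
	r, ok := tombstone.Obj.(*Replica)
	if !ok {
		return nil, fmt.Errorf("tombstone %q holds unexpected type %T", tombstone.Key, tombstone.Obj)
	}
	return r, nil
}

func main() {
	events := []interface{}{
		&Replica{Name: "r1"},
		DeletedFinalStateUnknown{Key: "ns/r2", Obj: &Replica{Name: "r2"}},
		"not a replica",
	}
	for _, ev := range events {
		r, err := replicaFromDeleteEvent(ev)
		if err != nil {
			fmt.Println("skipping event:", err)
			continue
		}
		fmt.Println("deleted replica:", r.Name)
	}
}
```

Using the two-value form of the type assertion is what avoids the runtime.TypeAssertionError seen above; the handler can log and skip an unexpected object rather than crash the process.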

area/kubernetes area/manager bug priority/2

excieve

All 4 comments

Thanks for reporting this, we will look into this.

joshimoo on 13 Oct 2020

👍1

One more data point — when checking mounts inside the longhorn-manager container after the restart, it looks like this:

root@longhorn-manager-dqmhj:/# cat /etc/mtab |grep service
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs rw,relatime 0 0
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs ro,relatime 0 0

The duplicate appears to be the one mounted as rw.

This issue hits us quite frequently, by the way: at least one node suffers from it every week. 19 pod restarts are enough to make a node unresponsive, with 4k+ mounts.

excieve on 15 Oct 2020

👍1

This is an interesting issue.
Left a comment to subscribe.

PhanLe1010 on 15 Oct 2020

@excieve thanks for reporting. I fixed the longhorn-manager crash in https://github.com/longhorn/longhorn-manager/pull/725, which will be available when we release Longhorn v1.1. Please let us know if you are still having issues then :)

joshimoo on 22 Oct 2020

👍1
