Describe the bug
On some nodes where longhorn-manager pods have been restarting a lot, the number of /run/secrets/kubernetes.io/serviceaccount tmpfs mounts kept growing. At some point it reached 4k+ duplicate mounts, which led to df taking 20s to execute and docker slowing down until it timed out on kubelet communication.
It appears that service account mounts grow exponentially with each restart.
This doesn't happen when restarting other pods.
To Reproduce
Steps to reproduce the behavior:
1. Pick a node with a longhorn-manager pod on it, observe current mounts (e.g. /etc/mtab or execute df), note how many /run/secrets/kubernetes.io/serviceaccount mounts are there
2. Kill the longhorn-manager pod on this node and let it successfully start again
Expected behavior
No duplicate mounts present on the node.
Environment:
Additional context
Might be related to #1757.
Probably an issue in itself, but the reason longhorn-manager restarts (without us killing it) is this panic:
E1010 09:17:42.204022 1 runtime.go:69] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x15c9880), concrete:(*runtime._type)(0x16855a0), asserted:(*runtime._type)(0x17d40c0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1beta1.Replica)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:967
/usr/local/go/src/runtime/iface.go:260
/go/src/github.com/longhorn/longhorn-manager/controller/replica_controller.go:106
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/controller.go:209
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:556
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
/usr/local/go/src/runtime/asm_amd64.s:1373
panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1beta1.Replica [recovered]
panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1beta1.Replica
goroutine 245 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1615760, 0xc009fd4ba0)
/usr/local/go/src/runtime/panic.go:967 +0x166
github.com/longhorn/longhorn-manager/controller.NewReplicaController.func3(0x16855a0, 0xc00d6e1020)
/go/src/github.com/longhorn/longhorn-manager/controller/replica_controller.go:106 +0x69
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/controller.go:209
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc0000a4018, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:556 +0x17e
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc0095c5e18, 0x1, 0x0)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x79
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000a91f60)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0095c5f60, 0xdf8475800, 0x0, 0x1, 0xc000614660)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002a1a00)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x9b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0008341b0, 0xc000275a70)
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62
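For context on the trace: client-go informers can hand a delete handler a cache.DeletedFinalStateUnknown tombstone instead of the object itself when the watch missed the delete event, so a direct type assertion to the resource type panics as shown. Below is a minimal sketch of the usual defensive pattern. It is not taken from longhorn-manager and may differ from the actual fix; `Replica` and `DeletedFinalStateUnknown` are local stand-ins for `*v1beta1.Replica` and the client-go type so the example runs without dependencies, and `replicaFromDeleteEvent` is a hypothetical helper name.

```go
package main

import "fmt"

// Replica is a stand-in for *v1beta1.Replica from the trace (assumption).
type Replica struct {
	Name string
}

// DeletedFinalStateUnknown is a local stand-in with the same fields as
// client-go's cache.DeletedFinalStateUnknown (Key and Obj).
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// replicaFromDeleteEvent unwraps the argument of a delete handler without
// panicking: it accepts either the object itself or a tombstone wrapping it.
func replicaFromDeleteEvent(obj interface{}) (*Replica, error) {
	switch v := obj.(type) {
	case *Replica:
		return v, nil
	case DeletedFinalStateUnknown:
		r, ok := v.Obj.(*Replica)
		if !ok {
			return nil, fmt.Errorf("tombstone %q contained unexpected type %T", v.Key, v.Obj)
		}
		return r, nil
	default:
		return nil, fmt.Errorf("unexpected type %T in delete event", obj)
	}
}

func main() {
	events := []interface{}{
		&Replica{Name: "r1"},
		DeletedFinalStateUnknown{Key: "ns/r2", Obj: &Replica{Name: "r2"}},
		"not a replica",
	}
	for _, ev := range events {
		r, err := replicaFromDeleteEvent(ev)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("deleted replica:", r.Name)
	}
}
```

The point of the type switch is that an unexpected object is logged and skipped rather than crashing the whole manager process.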
Thanks for reporting this, we will look into this.
One more data point — when checking mounts inside the longhorn-manager container after the restart, it looks like this:
root@longhorn-manager-dqmhj:/# cat /etc/mtab |grep service
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs rw,relatime 0 0
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs ro,relatime 0 0
The duplicate one is mounted as rw it seems.
This issue hits us quite frequently btw. Every week at least one node suffers from this. 19 pod restarts are enough to make a node unresponsive with 4k+ mounts.
This is an interesting issue.
Left a comment to subscribe.
@excieve thanks for reporting. I fixed the longhorn-manager crash in https://github.com/longhorn/longhorn-manager/pull/725, which will be available when we release Longhorn v1.1. Please let us know if you are still having issues then :)