Longhorn: [Question] k3s install longhorn error

Created on 15 Jun 2020 · 7 Comments · Source: longhorn/longhorn

Describe the bug
Error installing Longhorn on k3s.

instance-manager-r-434985db                1/1     Running                 0          31m
instance-manager-e-867139e6                1/1     Running                 0          31m
instance-manager-r-2bafdd7b                1/1     Running                 0          31m
instance-manager-e-66826b2f                1/1     Running                 0          31m
instance-manager-r-a2f1d4c5                1/1     Running                 0          31m
instance-manager-e-ab3218de                1/1     Running                 0          31m
longhorn-manager-h6d48                     0/1     CrashLoopBackOff        6          32m
engine-image-ei-eee5f438-k6p5p             0/1     CrashLoopBackOff        10         32m
longhorn-ui-8486987944-msczm               0/1     CrashLoopBackOff        10         32m
engine-image-ei-eee5f438-zj49j             0/1     CrashLoopBackOff        10         32m
longhorn-manager-6n9zn                     0/1     CrashLoopBackOff        7          32m
longhorn-manager-7l269                     0/1     CrashLoopBackOff        7          32m
longhorn-driver-deployer-cd74cb75b-hhgkt   0/1     Init:CrashLoopBackOff   11         32m
engine-image-ei-eee5f438-x65mf             0/1     CreateContainerError    0          32m

To Reproduce
I just ran the command:

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

Log
The logs of longhorn-manager-h6d48:

time="2020-06-15T14:32:33Z" level=info msg="Start overwriting built-in settings with customized values"
time="2020-06-15T14:32:33Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access /var/lib/rancher/longhorn/engine-binaries/*: No such file or directory\n, error exit status 2"
time="2020-06-15T14:32:33Z" level=info msg="New upgrade leader elected: k3s-node01"
time="2020-06-15T14:32:38Z" level=info msg="New upgrade leader elected: k3s-node02"
time="2020-06-15T14:32:58Z" level=info msg="Start upgrading"
time="2020-06-15T14:32:58Z" level=info msg="No API version upgrade is needed"
time="2020-06-15T14:32:58Z" level=info msg="Finish upgrading"
E0615 14:32:58.582171       1 leaderelection.go:282] Failed to release lock: Lease.coordination.k8s.io "longhorn-manager-upgrade-lock" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0
time="2020-06-15T14:32:58Z" level=info msg="Upgrade leader lost: k3s-master"
time="2020-06-15T14:32:58Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.0 to be ready"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn Kubernetes node controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn replica controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn engine controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn volume controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn Engine Image controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn node controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn websocket controller"
time="2020-06-15T14:32:58Z" level=info msg="Start Longhorn Setting controller"
time="2020-06-15T14:32:58Z" level=info msg="Starting Longhorn instance manager controller"
time="2020-06-15T14:32:58Z" level=info msg="Start kubernetes controller"
time="2020-06-15T14:32:58Z" level=debug msg="Start monitoring instance manager instance-manager-r-a2f1d4c5"
time="2020-06-15T14:32:58Z" level=debug msg="Start monitoring instance manager instance-manager-e-ab3218de"
time="2020-06-15T14:33:04Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.0 to be ready"
time="2020-06-15T14:33:08Z" level=debug msg="Failed to check for the latest upgrade: Post \"https://longhorn-upgrade-responder.rancher.io/v1/checkupgrade\": dial tcp: lookup longhorn-upgrade-responder.rancher.io on 10.43.0.10:53: read udp 10.42.0.227:41022->10.43.0.10:53: read: connection refused"
time="2020-06-15T14:33:10Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.0 to be ready"
time="2020-06-15T14:34:58Z" level=fatal msg="Error starting manager: failed to wait for engine image longhornio/longhorn-engine:v1.0.0: Wait for engine image longhornio/longhorn-engine:v1.0.0 timed out"

The logs of engine-image-ei-eee5f438-k6p5p:

/bin/bash: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such file or directory

The logs of longhorn-ui-8486987944-msczm:

/bin/bash: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such file or directory
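Logs like the ones above can be gathered for every unhealthy pod in one pass. The following is a minimal sketch, not part of the original report: the function names `unhealthy_pods` and `collect_logs` are hypothetical, and it assumes Longhorn runs in the `longhorn-system` namespace.

```shell
#!/bin/sh
# Print the names of pods whose STATUS column is neither "Running" nor
# "Completed". Expects `kubectl get pods --no-headers` output on stdin:
#   NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Print the tail of the current and previous container logs for each
# unhealthy pod. The namespace defaults to longhorn-system (an assumption).
collect_logs() {
    ns="${1:-longhorn-system}"
    kubectl -n "$ns" get pods --no-headers | unhealthy_pods | while read -r pod; do
        echo "=== $pod ==="
        # Either call may fail (e.g. no previous container yet); keep going.
        kubectl -n "$ns" logs "$pod" --tail=50 || true
        kubectl -n "$ns" logs "$pod" --previous --tail=50 || true
    done
}
```

Calling `collect_logs` with no arguments walks the `longhorn-system` namespace. Pods stuck in an init container (such as `Init:CrashLoopBackOff`) may additionally need `-c <container>` on the `kubectl logs` call.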

Environment:

Longhorn version:
v1.0.0
Node details:

NAME         STATUS   ROLES    AGE   VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
k3s-master   Ready    master   11h   v1.18.3+k3s1   192.168.109.10   <none>        CentOS Linux 7 (Core)   3.10.0-1127.10.1.el7.x86_64   containerd://1.3.3-k3s2
k3s-node01   Ready    <none>   11h   v1.18.3+k3s1   192.168.109.11   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64        containerd://1.3.3-k3s2
k3s-node02   Ready    <none>   11h   v1.18.3+k3s1   192.168.109.12   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64        containerd://1.3.3-k3s2

area/deployment area/manager bug priority/2 require/doc require/manual-test-plan

Source

lyred193

All 7 comments

Are you running SELinux?

joshimoo on 15 Jun 2020

> Are you running SELinux?

When I disable SELinux, Longhorn works. Thank you.

lyred193 on 16 Jun 2020

👎1 👍1
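Before disabling SELinux outright, it can help to confirm that it is enforcing and that it is actually denying something. A rough sketch, assuming the standard `getenforce` and `ausearch` tools are present on the node; the helper names `selinux_mode` and `recent_denials` are made up for illustration:

```shell
#!/bin/sh
# Report the SELinux mode of this node, or "unavailable" when the
# SELinux tools are not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unavailable"
    fi
}

# Show AVC denials from the last ten minutes (needs root to read the audit log).
recent_denials() {
    ausearch -m avc -ts recent 2>/dev/null \
        || echo "no recent AVC denials found (or ausearch unavailable)"
}

mode="$(selinux_mode)"
echo "SELinux mode: $mode"
if [ "$mode" = "Enforcing" ]; then
    recent_denials
    # To test whether SELinux is the cause without disabling it permanently,
    # switch to permissive mode until the next reboot (as root):
    #   setenforce 0
fi
```

Permissive mode still logs denials, so the audit log stays useful for building a targeted policy later.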

Reproduced using master (06/16) on RHEL with SELinux enabled.

While the installation succeeded, no volumes were successfully created.

```
time="2020-06-16T18:37:41Z" level=warning msg="Error syncing Longhorn engine longhorn-system/wordpress-maria-db-e-f09ec2d5: fail to sync engine for longhorn-system/wordpress-maria-db-e-f09ec2d5: fail to start rebuild for wordpress-maria-db-r-f67ec3cb of wordpress-maria-db-e-f09ec2d5: timed out waiting for the condition"

time="2020-06-16T18:37:41Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"wordpress-maria-db-e-f09ec2d5\", UID:\"892556bc-b801-447f-b19f-4ff9cc981620\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"2363174\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Start rebuilding replica wordpress-maria-db-r-f67ec3cb with Address 10.42.1.14:10000 for wordpress-maria-db"

time="2020-06-16T18:38:01Z" level=error msg="Failed rebuilding 10.42.1.14:10000 of wordpress-maria-db: failed to add replica address='tcp://10.42.1.14:10000' to controller 'wordpress-maria-db': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn [--url 10.42.0.71:10000 add tcp://10.42.1.14:10000], output , stderr, time=\"2020-06-16T18:38:01Z\" level=fatal msg=\"Error running add replica command: failed to get replica 10.42.1.14:10000: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\"transport: Error while dialing dial tcp 10.42.1.14:10000: i/o timeout\\"\"\n, error exit status 1"

time="2020-06-16T18:38:01Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"wordpress-maria-db-e-f09ec2d5\", UID:\"892556bc-b801-447f-b19f-4ff9cc981620\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"2363174\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.1.14:10000: failed to add replica address='tcp://10.42.1.14:10000' to controller 'wordpress-maria-db': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn [--url 10.42.0.71:10000 add tcp://10.42.1.14:10000], output , stderr, time=\"2020-06-16T18:38:01Z\" level=fatal msg=\"Error running add replica command: failed to get replica 10.42.1.14:10000: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\"transport: Error while dialing dial tcp 10.42.1.14:10000: i/o timeout\\"\"\n, error exit status 1"

time="2020-06-16T18:38:01Z" level=error msg="Removed failed rebuilding replica 10.42.1.14:10000 of wordpress-maria-db"

time="2020-06-16T18:38:01Z" level=info msg="Engine wordpress-maria-db-e-f09ec2d5 is still in backoff for replica wordpress-maria-db-r-f67ec3cb rebuild failure"
```

janeczku on 17 Jun 2020

Workaround: Install a policy as follows on all LH nodes:

ausearch -c 'csi-resizer' --raw | audit2allow -M my-csiresizer
semodule -i my-csiresizer.pp
ausearch -c 'csi-provisioner' --raw | audit2allow -M my-csiprovisioner
semodule -i my-csiprovisioner.pp
ausearch -c 'csi-attacher' --raw | audit2allow -M my-csiattacher
semodule -i my-csiattacher.pp

janeczku on 17 Jun 2020

👍2
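The six commands in the workaround follow one pattern per CSI sidecar, so they can be wrapped in a loop. This is only a sketch of that idea: the function name `install_csi_policies` and the `DRY_RUN` switch are assumptions added here, while the component and module names are taken from the workaround.

```shell
#!/bin/sh
# Generate and load one local SELinux policy module per CSI sidecar,
# mirroring the manual workaround commands. Run as root on every Longhorn node.
# With DRY_RUN=1 the commands are printed instead of executed.
install_csi_policies() {
    for comp in csi-resizer csi-provisioner csi-attacher; do
        # Module names follow the workaround, e.g. "my-csiresizer".
        module="my-$(echo "$comp" | tr -d '-')"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ausearch -c '$comp' --raw | audit2allow -M $module"
            echo "semodule -i $module.pp"
        else
            ausearch -c "$comp" --raw | audit2allow -M "$module" || return 1
            semodule -i "$module.pp" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 install_csi_policies` first shows what would be executed. Since `audit2allow` builds the module from logged denials, the denials have to be in the audit log already, which means the failing components must have run at least once on that node.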

With https://github.com/longhorn/longhorn/issues/1273 fixed, we should be able to support SELinux in v1.0.1.

yasker on 22 Jul 2020

In fact @janeczku has tested #1273 and it doesn't work for him on RHEL 7.8 with SELinux. @khushboo-rancher can you try to reproduce the issue?

This issue is also related to https://github.com/rancher/rancher/issues/26789

yasker on 22 Jul 2020

Longhorn v1.0.1 deploys successfully on a k3s cluster on RHEL 7.8 with SELinux enabled.

I also tested the P1 workflows below, and they all worked fine:

Creating a volume and attaching it to a pod.
Creating a pod with a volume claim template that uses the Longhorn storage class, which creates a volume in Longhorn.
Writing to the volume.
Taking snapshots.
Reverting to a snapshot.
Deploying a WordPress app with a Longhorn-backed persistent volume.
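For anyone wanting to repeat the first few checks, a claim plus a pod that writes to it is enough. The sketch below is not the test plan that was actually run: the resource names are hypothetical, the `busybox` image is an arbitrary choice, and `longhorn` is assumed to be the StorageClass name created by the deployment manifest.

```shell
#!/bin/sh
# Print a PersistentVolumeClaim and a Pod that mounts it and writes one file.
# "longhorn" is assumed to be the StorageClass created by the Longhorn
# manifest; the resource names below are hypothetical.
smoke_test_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-smoke-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: longhorn-smoke-pod
spec:
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo hello > /data/hello.txt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: longhorn-smoke-pvc
EOF
}
```

Piping it with `smoke_test_manifest | kubectl apply -f -` creates both objects. Once the pod is running, `kubectl exec longhorn-smoke-pod -- cat /data/hello.txt` reads the file back from the volume.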

Node details:

cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"

[root@ip-xx-xx-xx-xx ec2-user]# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      31
[root@ip-xx-xx-xx-xx ec2-user]# getenforce
Enforcing

Note:
The deployment problem can be reproduced with Longhorn v1.0.0.
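Since the original report installed from the moving `master` branch, pinning the manifest to a release tag makes results like these easier to compare. A small sketch, assuming release tags are served under the same raw URL layout as the `master` manifest used in the report; the function names are hypothetical:

```shell
#!/bin/sh
# Build the raw manifest URL for a given Longhorn git ref (tag or branch).
# The URL layout comes from the install command in the report; that the
# v1.0.1 tag is served at the same path is an assumption.
longhorn_manifest_url() {
    ref="${1:-v1.0.1}"
    echo "https://raw.githubusercontent.com/longhorn/longhorn/${ref}/deploy/longhorn.yaml"
}

# Install a pinned release instead of the master branch.
install_longhorn() {
    kubectl apply -f "$(longhorn_manifest_url "$1")"
}
```

`install_longhorn v1.0.1` would apply the v1.0.1 manifest, and `install_longhorn v1.0.0` should bring back the failing state on an SELinux-enforcing node for comparison.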