Longhorn: k3s v1.19.2+k3s1 : longhorn-driver-deployer CrashLoopBackOff

Created on 7 Oct 2020 · 18 Comments · Source: longhorn/longhorn

I am getting a CrashLoopBackOff for the longhorn-driver-deployer when deploying to k3s v1.19.2+k3s1.

Installed from upstream
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

And the logs from the pod.

clemenko:clemenko k3s ( 174.138.56.187:6443 ) $ kubectl logs longhorn-driver-deployer-6b7d76659f-vjflp -n longhorn-system
time="2020-10-07T18:12:44Z" level=debug msg="Deploying CSI driver"
time="2020-10-07T18:12:44Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-10-07T18:12:45Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-10-07T18:12:46Z" level=warning msg="Proc not found: kubelet"
time="2020-10-07T18:12:46Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-07T18:12:47Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-07T18:12:48Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2020-10-07T18:12:49Z" level=warning msg="Proc not found: k3s"
time="2020-10-07T18:12:49Z" level=error msg="failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
time="2020-10-07T18:12:49Z" level=fatal msg="Error deploying driver: failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"

This works perfectly fine with k3s 1.18.
You can reproduce this by deploying k3s from the latest channel. The stable channel works.

Labels: area/deployment, backport-needed, bug, priority/1, release/note

clemenko

All 18 comments

@khushboo-rancher Can you see if you can reproduce this issue?

yasker on 8 Oct 2020

I couldn't reproduce the issue on a v1.19.2+k3s1 cluster with Ubuntu 18.04 nodes.

I tried both the Helm command and the kubectl command. In both cases Longhorn deployed and longhorn-driver-deployer came up active successfully.

@clemenko Could you please let us know the OS of the nodes you tried with?

khushboo-rancher on 8 Oct 2020

I am using Ubuntu 20.04 on DigitalOcean. The install is with kubectl.

clemenko on 8 Oct 2020

Thanks for reporting this, @clemenko. @khushboo-rancher and I have reproduced the issue with k3s v1.19.2+k3s1. We also verified that k3s v1.18 works fine.

For now, the temporary workaround is to set KUBELET_ROOT_DIR to /var/lib/kubelet in longhorn.yaml.
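
Why an explicit value sidesteps the crash can be sketched roughly as below. This is only an illustration inferred from the error message in the logs above, not the actual longhorn-manager code: `resolve_kubelet_root_dir` and its `detect` callback are hypothetical names, and the lookup order (kubelet first, then k3s) is taken from the order of the log lines.

```python
from typing import Callable, Optional, Sequence


def resolve_kubelet_root_dir(
    explicit: Optional[str],
    proc_names: Sequence[str],
    detect: Callable[[str], Optional[str]],
) -> str:
    """Hypothetical helper: prefer an explicit value, else the first detected one."""
    if explicit:
        # An explicit setting (e.g. KUBELET_ROOT_DIR) skips detection entirely,
        # so a cmdline format the detector cannot parse no longer matters.
        return explicit
    for name in proc_names:
        root_dir = detect(name)  # assumed to return None when nothing is found
        if root_dir is not None:
            return root_dir
    raise RuntimeError(
        "failed to get kubelet root dir, no related proc for root-dir detection"
    )
```

With the override set, the function returns before any process lookup happens, which would match the observation that the workaround helps even while detection is broken.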

yasker on 8 Oct 2020

👍1

The root cause is that k3s changed the argument separator in its process cmdline. Before k3s v1.19, the arguments are separated by NUL bytes (\x00):

hexdump -C cmdline
00000000  2f 75 73 72 2f 6c 6f 63  61 6c 2f 62 69 6e 2f 6b  |/usr/local/bin/k|
00000010  33 73 00 61 67 65 6e 74  00 2d 2d 6e 6f 64 65 2d  |3s.agent.--node-|
00000020  65 78 74 65 72 6e 61 6c  2d 69 70 00 31 38 2e 32  |external-ip.18.2|
00000030  31 36 2e 31 38 2e 31 35  37 00                    |16.18.157.|
0000003a

From v1.19 on, it uses ordinary spaces (\x20):

hexdump -C cmdline
00000000  2f 75 73 72 2f 6c 6f 63  61 6c 2f 62 69 6e 2f 6b  |/usr/local/bin/k|
00000010  33 73 20 61 67 65 6e 74  00 00 00 00 00 00 00 00  |3s agent........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Our driver detection script at https://github.com/longhorn/longhorn-manager/blob/6efea60312f19e6e82caee0f1632629f35ffc86a/app/get_proc_arg.go#L53 uses \000 as the separator rather than \x20, so it failed to recognize the new format.

It's a straightforward fix but we do need to consider both situations.

Also, I'm not sure when the k3s change was checked in, or whether it will only affect v1.19 and later.
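
Handling both formats could look something like the sketch below. It is written in Python purely for illustration (the real detection code is Go and may well differ), and `split_cmdline` and `find_arg` are hypothetical helper names. The idea is to treat any run of NUL bytes or spaces as one separator. One known limitation of this approach: an argument that itself contains a space would be split in two.

```python
import re
from typing import List, Optional


def split_cmdline(raw: bytes) -> List[str]:
    """Split raw cmdline bytes into arguments, accepting NUL or space separators."""
    text = raw.decode("utf-8", errors="replace")
    # Any run of NUL bytes and/or spaces counts as a single separator;
    # empty tokens (e.g. from trailing NUL padding) are dropped.
    return [tok for tok in re.split(r"[\x00 ]+", text) if tok]


def find_arg(args: List[str], name: str) -> Optional[str]:
    """Return the value given as '--name value' or '--name=value', else None."""
    for i, arg in enumerate(args):
        if arg == name and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith(name + "="):
            return arg[len(name) + 1:]
    return None


# Byte strings modeled on the two hexdumps above.
old_format = b"/usr/local/bin/k3s\x00agent\x00--node-external-ip\x0018.216.18.157\x00"
new_format = b"/usr/local/bin/k3s agent" + b"\x00" * 24

print(split_cmdline(old_format))
print(split_cmdline(new_format))
```

Both inputs should yield the same leading tokens (`/usr/local/bin/k3s`, `agent`), so an argument lookup such as `find_arg(args, "--root-dir")` would behave the same on either k3s version.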

yasker on 8 Oct 2020

🚀2

The workaround worked. Thanks!
For those playing along at home:

curl https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml | sed -e 's/#- name: KUBELET_ROOT_DIR/- name: KUBELET_ROOT_DIR/g' -e 's$#  value: /var/lib/rancher/k3s/agent/kubelet$  value: /var/lib/kubelet$g' | kubectl apply -f -

for the win

clemenko on 8 Oct 2020

👍4

FYI the k3s /proc/pid/cmdline change is due to https://github.com/rancher/k3s/pull/2072

brandond on 9 Oct 2020

@brandond
Thanks a bunch. I was looking for the reason :)

PhanLe1010 on 9 Oct 2020

Pre-merged Checklist

[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at https://github.com/longhorn/longhorn-manager/pull/699
[x] Is the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area upgrade
Issues
[x] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at
[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at

longhorn-io-github-bot on 9 Oct 2020

Verified with longhorn master - 10/12/2020

Validation - Pass

Deployed longhorn on k3s v1.19.2+k3s1 and k3s v1.18.9+k3s1 cluster successfully.
Validated the basic case of creating a volume and taking a snapshot of some data.

Logs from longhorn-driver-deployer pod

time="2020-10-12T21:33:51Z" level=debug msg="Deploying CSI driver"
time="2020-10-12T21:34:00Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:01Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:02Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:03Z" level=warning msg="Proc not found: kubelet"
time="2020-10-12T21:34:03Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:04Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:05Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:06Z" level=info msg="Proc found: k3s"
time="2020-10-12T21:34:06Z" level=info msg="Detected root dir path: /var/lib/kubelet"

khushboo-rancher on 13 Oct 2020

If anyone wants to install using the helm chart, an overrides file with:

csi:
  kubeletRootDir: /var/lib/kubelet

or directly
helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --set csi.kubeletRootDir=/var/lib/kubelet

Seemed to do the trick for me.

AntonOfTheWoods on 26 Oct 2020

👍5 🚀2

This is happening (still/again?) in k3s v1.19.3+k3s3 with the following log output:

time="2020-11-14T17:24:47Z" level=debug msg="Deploying CSI driver"
time="2020-11-14T17:24:47Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-11-14T17:24:48Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-11-14T17:24:49Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-11-14T17:24:50Z" level=warning msg="Proc not found: kubelet"
time="2020-11-14T17:24:50Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-11-14T17:24:51Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-11-14T17:24:52Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2020-11-14T17:24:53Z" level=warning msg="Proc not found: k3s"
time="2020-11-14T17:24:53Z" level=error msg="failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
time="2020-11-14T17:24:53Z" level=fatal msg="Error deploying driver: failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"

The workaround by @AntonOfTheWoods in https://github.com/longhorn/longhorn/issues/1861#issuecomment-716459507 worked for me, but shouldn’t auto detection work as of @khushboo-rancher's comment https://github.com/longhorn/longhorn/issues/1861#issuecomment-707376267?

morremeyer on 14 Nov 2020

@morremeyer The fix is only available on the master branch. It has not been backported to older versions yet.

@yasker When will we backport this fix to the older versions?

PhanLe1010 on 14 Nov 2020

@PhanLe1010 Oh, totally missed that, thanks! Just so I get it right, not backporting it would mean it would only be available with the v1.20.x+k3sy releases?

morremeyer on 14 Nov 2020

@morremeyer Not backporting the fix would mean that it would only be available in the Longhorn v1.1.0 release and later releases. However, we will backport this to older Longhorn versions so that users don't have to apply the workaround.

PhanLe1010 on 14 Nov 2020

❤1 👍1

@PhanLe1010 Backporting means we need to create a new release v1.0.3, which we decided not to do last time since it's very close to the v1.1.0 release.

yasker on 16 Nov 2020

Thank you @yasker

PhanLe1010 on 16 Nov 2020

@morremeyer I am very sorry for providing the wrong information. The correct information is in https://github.com/longhorn/longhorn/issues/1861#issuecomment-727656009 . Backporting means that we would provide a patch release for older versions. However, we decided not to do it this time because it's very close to the v1.1.0 release.

PhanLe1010 on 16 Nov 2020
