Longhorn: k3s v1.19.2+k3s1 : longhorn-driver-deployer CrashLoopBackOff

Created on 7 Oct 2020 · 18 Comments · Source: longhorn/longhorn

I am getting a CrashLoopBackOff for the longhorn-driver-deployer when deploying to k3s v1.19.2+k3s1.

Installed from upstream
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

And the logs from the pod.

clemenko:clemenko k3s ( 174.138.56.187:6443 ) $ kubectl logs longhorn-driver-deployer-6b7d76659f-vjflp -n longhorn-system
time="2020-10-07T18:12:44Z" level=debug msg="Deploying CSI driver"
time="2020-10-07T18:12:44Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-10-07T18:12:45Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-10-07T18:12:46Z" level=warning msg="Proc not found: kubelet"
time="2020-10-07T18:12:46Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-07T18:12:47Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-07T18:12:48Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2020-10-07T18:12:49Z" level=warning msg="Proc not found: k3s"
time="2020-10-07T18:12:49Z" level=error msg="failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
time="2020-10-07T18:12:49Z" level=fatal msg="Error deploying driver: failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"

This works perfectly fine with k3s 1.18.
You can reproduce it by deploying k3s from the latest channel. The stable channel works.

area/deployment backport-needed bug priority/1 release/note

All 18 comments

@khushboo-rancher Can you see if you can reproduce this issue?

I couldn't reproduce the issue on v1.19.2+k3s1 cluster with ubuntu 18.04 nodes.

I tried with both the Helm command and the kubectl command. In both cases Longhorn deployed and longhorn-driver-deployer came up active.

@clemenko Could you please let us know the OS of the nodes you tried with?

I am using Ubuntu 20.04 on digitalocean. The install is with kubectl.

Thanks for reporting this, @clemenko. @khushboo-rancher and I have reproduced the issue with k3s v1.19.2+k3s1. We also verified that k3s v1.18 works fine.

For now, the temporary workaround is to set KUBELET_ROOT_DIR to /var/lib/kubelet in the Longhorn deployment YAML.

The root cause is that k3s changed the command-line separator. Before k3s v1.19, the arguments in cmdline were separated by NUL bytes (\x00):

hexdump -C cmdline
00000000  2f 75 73 72 2f 6c 6f 63  61 6c 2f 62 69 6e 2f 6b  |/usr/local/bin/k|
00000010  33 73 00 61 67 65 6e 74  00 2d 2d 6e 6f 64 65 2d  |3s.agent.--node-|
00000020  65 78 74 65 72 6e 61 6c  2d 69 70 00 31 38 2e 32  |external-ip.18.2|
00000030  31 36 2e 31 38 2e 31 35  37 00                    |16.18.157.|
0000003a

As of v1.19, it uses normal spaces (\x20), followed by NUL padding:

hexdump -C cmdline
00000000  2f 75 73 72 2f 6c 6f 63  61 6c 2f 62 69 6e 2f 6b  |/usr/local/bin/k|
00000010  33 73 20 61 67 65 6e 74  00 00 00 00 00 00 00 00  |3s agent........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Our driver detection script at https://github.com/longhorn/longhorn-manager/blob/6efea60312f19e6e82caee0f1632629f35ffc86a/app/get_proc_arg.go#L53 uses \000 instead of \x20 as the separator, so it failed to recognize the new format.

It's a straightforward fix but we do need to consider both situations.
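
To illustrate what "consider both situations" could look like, here is a minimal Python sketch (not the actual longhorn-manager code, which is written in Go). It treats both NUL bytes and spaces as separators, using the two hexdumps above as sample input. The function names `split_cmdline` and `find_arg` are hypothetical. Note the assumption that no argument value contains a space; with the space-separated layout that case cannot be told apart anyway.

```python
import re

# One or more separator bytes: NUL (layout before k3s v1.19) or space (v1.19).
_SEPARATORS = re.compile(r"[\x00 ]+")


def split_cmdline(raw):
    """Split raw /proc/<pid>/cmdline bytes into a list of arguments.

    Empty pieces produced by trailing NUL padding are dropped.
    """
    text = raw.decode("utf-8", errors="replace")
    return [arg for arg in _SEPARATORS.split(text) if arg]


def find_arg(args, name):
    """Return the value of `name`, given as `name value` or `name=value`.

    Returns None when the argument is not present.
    """
    for i, arg in enumerate(args):
        if arg == name and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith(name + "="):
            return arg[len(name) + 1:]
    return None


if __name__ == "__main__":
    # Sample bytes copied from the two hexdumps in this comment.
    old_style = b"/usr/local/bin/k3s\x00agent\x00--node-external-ip\x0018.216.18.157\x00"
    new_style = b"/usr/local/bin/k3s agent" + b"\x00" * 24

    print(split_cmdline(old_style))
    print(split_cmdline(new_style))
    print(find_arg(split_cmdline(old_style), "--node-external-ip"))
```

Both layouts yield `/usr/local/bin/k3s` and `agent` as the first two arguments, so a lookup for the process name or for a flag behaves the same on k3s v1.18 and v1.19.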

Also, I'm not sure when that change was checked in, or whether it only affects v1.19 going forward.

The workaround worked. Thanks!
For those playing along at home:

curl https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml | sed -e 's/#- name: KUBELET_ROOT_DIR/- name: KUBELET_ROOT_DIR/g' -e 's$#  value: /var/lib/rancher/k3s/agent/kubelet$  value: /var/lib/kubelet$g' | kubectl apply -f -

for the win
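
For anyone wondering what the sed one-liner actually changes, here is a rough Python sketch of the same two substitutions applied to a manifest fragment. The sample fragment and its indentation are assumptions made up for illustration (the real longhorn.yaml may be laid out differently), and `enable_kubelet_root_dir` is a hypothetical helper name.

```python
def enable_kubelet_root_dir(manifest):
    """Uncomment KUBELET_ROOT_DIR and point it at /var/lib/kubelet.

    Mirrors the two sed expressions: the first uncomments the env var
    name, the second uncomments the value line and swaps in the new path.
    """
    manifest = manifest.replace(
        "#- name: KUBELET_ROOT_DIR",
        "- name: KUBELET_ROOT_DIR",
    )
    manifest = manifest.replace(
        "#  value: /var/lib/rancher/k3s/agent/kubelet",
        "  value: /var/lib/kubelet",
    )
    return manifest


if __name__ == "__main__":
    # Hypothetical fragment; only the two commented lines come from the sed patterns.
    sample = (
        "        env:\n"
        "        #- name: KUBELET_ROOT_DIR\n"
        "        #  value: /var/lib/rancher/k3s/agent/kubelet\n"
    )
    print(enable_kubelet_root_dir(sample))
```

Because each replacement string is exactly one character shorter than the pattern it replaces (the `#`), the uncommented `value:` line ends up aligned under `name:`, which keeps the YAML valid.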

FYI the k3s /proc/pid/cmdline change is due to https://github.com/rancher/k3s/pull/2072

@brandond
Thanks a bunch. I was looking for the reason :)

Pre-merged Checklist

  • [x] Does the PR include the explanation for the fix or the feature?

  • [x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at https://github.com/longhorn/longhorn-manager/pull/699

  • [x] Are the reproduce steps/test steps documented?

  • [x] Which areas/issues this PR might have potential impacts on?
    Area upgrade
    Issues

  • [x] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

  • [x] If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at

  • [x] if labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at

  • [x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at
    The automation test case PR is at

  • [x] if labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at

  • [x] if labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

Verified with longhorn master - 10/12/2020

Validation - Pass

Deployed longhorn on k3s v1.19.2+k3s1 and k3s v1.18.9+k3s1 cluster successfully.
Validated the basic case of creating a volume and taking a snapshot of some data.

Logs from longhorn-driver-deployer pod

time="2020-10-12T21:33:51Z" level=debug msg="Deploying CSI driver"
time="2020-10-12T21:34:00Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:01Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:02Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:03Z" level=warning msg="Proc not found: kubelet"
time="2020-10-12T21:34:03Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:04Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:05Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:06Z" level=info msg="Proc found: k3s"
time="2020-10-12T21:34:06Z" level=info msg="Detected root dir path: /var/lib/kubelet"

If anyone wants to install using the helm chart, an overrides file with:

csi:
  kubeletRootDir: /var/lib/kubelet

or directly
helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --set csi.kubeletRootDir=/var/lib/kubelet

Seemed to do the trick for me.

This is happening (still/again?) in k3s v1.19.3+k3s3 with the following log output:

time="2020-11-14T17:24:47Z" level=debug msg="Deploying CSI driver"
time="2020-11-14T17:24:47Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-11-14T17:24:48Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-11-14T17:24:49Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-11-14T17:24:50Z" level=warning msg="Proc not found: kubelet"
time="2020-11-14T17:24:50Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-11-14T17:24:51Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-11-14T17:24:52Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2020-11-14T17:24:53Z" level=warning msg="Proc not found: k3s"
time="2020-11-14T17:24:53Z" level=error msg="failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
time="2020-11-14T17:24:53Z" level=fatal msg="Error deploying driver: failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"

The workaround by @AntonOfTheWoods in https://github.com/longhorn/longhorn/issues/1861#issuecomment-716459507 worked for me, but shouldn’t auto detection work as of @khushboo-rancher's comment https://github.com/longhorn/longhorn/issues/1861#issuecomment-707376267?

@morremeyer The fix is only available on the master branch. It has not been backported to the older versions yet.

@yasker When will we backport this fix to the older versions?

@PhanLe1010 Oh, totally missed that, thanks! Just so I get it right: not backporting it would mean it would only be available with the v1.20.x+k3sy releases?

@morremeyer Not backporting the fix would mean that it would only be available in the Longhorn v1.1.0 release and the releases after that. However, we will backport this to older Longhorn versions so that users don't have to apply the workaround.

@PhanLe1010 Backporting means we need to create a new release, v1.0.3, which we decided not to do last time since it's very close to the v1.1.0 release.

Thank you @yasker

@morremeyer I am very sorry for providing the wrong information. The correct information is in https://github.com/longhorn/longhorn/issues/1861#issuecomment-727656009 . Backporting means that we would provide a patch release for the older version. However, we decided not to do it this time because it's very close to the v1.1.0 release.
