I am getting a CrashLoopBackOff for the longhorn-driver-deployer when deploying to k3s v1.19.2+k3s1.
Installed from upstream:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml
Here are the logs from the pod:
clemenko:clemenko k3s ( 174.138.56.187:6443 ) $ kubectl logs longhorn-driver-deployer-6b7d76659f-vjflp -n longhorn-system
time="2020-10-07T18:12:44Z" level=debug msg="Deploying CSI driver"
time="2020-10-07T18:12:44Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-10-07T18:12:45Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-10-07T18:12:46Z" level=warning msg="Proc not found: kubelet"
time="2020-10-07T18:12:46Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-07T18:12:47Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-07T18:12:48Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2020-10-07T18:12:49Z" level=warning msg="Proc not found: k3s"
time="2020-10-07T18:12:49Z" level=error msg="failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
time="2020-10-07T18:12:49Z" level=fatal msg="Error deploying driver: failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
This works perfectly fine with k3s 1.18.
You can reproduce this by deploying k3s from the latest channel. The stable channel works.
@khushboo-rancher Can you see if you can reproduce this issue?
I couldn't reproduce the issue on v1.19.2+k3s1 cluster with ubuntu 18.04 nodes.
I tried with both Helm command and kubectl command, I was able to deploy the longhorn and longhorn-driver-deployer comes up active successfully.
@clemenko Could you please let us know the OS of the nodes you tried with?
I am using Ubuntu 20.04 on digitalocean. The install is with kubectl.
Thanks for reporting this, @clemenko. @khushboo-rancher and I have reproduced the issue with k3s v1.19.2+k3s1. We also verified that k3s v1.18 works fine.
For now, the temporary workaround is to set KUBELET_ROOT_DIR to /var/lib/kubelet in the Longhorn deployment YAML.
The root cause is that k3s changed the separator used in its process command line. Before k3s v1.19, the cmdline arguments are separated by NUL bytes (\00):
hexdump -C cmdline
00000000 2f 75 73 72 2f 6c 6f 63 61 6c 2f 62 69 6e 2f 6b |/usr/local/bin/k|
00000010 33 73 00 61 67 65 6e 74 00 2d 2d 6e 6f 64 65 2d |3s.agent.--node-|
00000020 65 78 74 65 72 6e 61 6c 2d 69 70 00 31 38 2e 32 |external-ip.18.2|
00000030 31 36 2e 31 38 2e 31 35 37 00 |16.18.157.|
0000003a
From v1.19 on, it uses normal spaces (\20):
hexdump -C cmdline
00000000 2f 75 73 72 2f 6c 6f 63 61 6c 2f 62 69 6e 2f 6b |/usr/local/bin/k|
00000010 33 73 20 61 67 65 6e 74 00 00 00 00 00 00 00 00 |3s agent........|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
Our driver detection script at https://github.com/longhorn/longhorn-manager/blob/6efea60312f19e6e82caee0f1632629f35ffc86a/app/get_proc_arg.go#L53 uses \000 (NUL) as the separator rather than \x20 (space), so it failed to recognize the new format.
It's a straightforward fix, but we do need to handle both formats.
Also, I'm not sure when the k3s change was checked in, or whether it will affect only v1.19 going forward.
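To illustrate the two formats, here is a minimal Python sketch of a parser that accepts both. It is not the actual fix in longhorn-manager (that code is Go); split_cmdline and find_arg are hypothetical helper names, and the rule "a single NUL-delimited field containing spaces is the space-joined form" is an assumption that would misparse a lone argument that legitimately contains a space.

```python
from typing import List, Optional


def split_cmdline(raw: bytes) -> List[str]:
    """Split the raw bytes of /proc/<pid>/cmdline into arguments."""
    # Trailing NULs are either the final terminator or padding.
    fields = [f for f in raw.rstrip(b"\x00").split(b"\x00") if f]
    # Assumption: one NUL-delimited field that contains spaces is the
    # space-joined form, so split it on whitespace instead.
    if len(fields) == 1 and b" " in fields[0]:
        fields = fields[0].split()
    return [f.decode("utf-8", errors="replace") for f in fields]


def find_arg(args: List[str], name: str) -> Optional[str]:
    """Return the value of --name given as '--name=value' or '--name value'."""
    flag = "--" + name
    for i, arg in enumerate(args):
        if arg.startswith(flag + "="):
            return arg[len(flag) + 1:]
        if arg == flag and i + 1 < len(args):
            return args[i + 1]
    return None


# Byte strings modeled on the two hexdumps above.
old_style = b"/usr/local/bin/k3s\x00agent\x00--node-external-ip\x0018.216.18.157\x00"
new_style = b"/usr/local/bin/k3s agent" + b"\x00" * 24

print(split_cmdline(old_style))
print(split_cmdline(new_style))
```

Both inputs come out as a list of separate arguments, so a lookup such as find_arg(args, "root-dir") behaves the same on either k3s version.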
The workaround worked. Thanks!
for those playing along at home :
curl https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml | sed -e 's/#- name: KUBELET_ROOT_DIR/- name: KUBELET_ROOT_DIR/g' -e 's$# value: /var/lib/rancher/k3s/agent/kubelet$ value: /var/lib/kubelet$g' | kubectl apply -f -
for the win
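If you would rather patch the manifest in a script than with sed, the same edit can be sketched in Python. This is only an illustration: enable_kubelet_root_dir is a made-up helper name, and the sample fragment below (including its indentation) is an assumption modeled on the commented lines the sed command targets, not a copy of longhorn.yaml.

```python
def enable_kubelet_root_dir(manifest: str, root_dir: str = "/var/lib/kubelet") -> str:
    """Uncomment the KUBELET_ROOT_DIR env entry and set its value."""
    out = []
    expect_value = False  # True only for the line right after the name line
    for line in manifest.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if stripped.startswith("#") and stripped[1:].strip() == "- name: KUBELET_ROOT_DIR":
            # Drop the '#', keep the original indentation.
            out.append(indent + stripped[1:])
            expect_value = True
            continue
        if expect_value and stripped.startswith("#") and stripped[1:].lstrip().startswith("value:"):
            body = stripped[1:]
            pad = body[: len(body) - len(body.lstrip())]
            # Drop the '#', keep the padding, and replace the value.
            out.append(indent + pad + "value: " + root_dir)
            expect_value = False
            continue
        expect_value = False
        out.append(line)
    return "\n".join(out) + ("\n" if manifest.endswith("\n") else "")


# Sample fragment; the indentation is assumed for illustration.
sample = (
    "        env:\n"
    "        #- name: KUBELET_ROOT_DIR\n"
    "        #  value: /var/lib/rancher/k3s/agent/kubelet\n"
)
print(enable_kubelet_root_dir(sample), end="")
```

Only the commented value line directly after the KUBELET_ROOT_DIR name line is touched, so other commented entries in the manifest are left alone.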
FYI the k3s /proc/pid/cmdline change is due to https://github.com/rancher/k3s/pull/2072
@brandond
Thanks a bunch. I am looking for the reason :)
[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at https://github.com/longhorn/longhorn-manager/pull/699
[x] Is the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area upgrade
Issues
[x] If the fix introduces the code for backward compatibility Has an separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at
[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at
Verified with longhorn master - 10/12/2020
Validation - Pass
Deployed longhorn on k3s v1.19.2+k3s1 and k3s v1.18.9+k3s1 cluster successfully.
Validated the basic case of creating a volume and taking a snapshot of some data.
Logs from longhorn-driver-deployer pod
time="2020-10-12T21:33:51Z" level=debug msg="Deploying CSI driver"
time="2020-10-12T21:34:00Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:01Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:02Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-10-12T21:34:03Z" level=warning msg="Proc not found: kubelet"
time="2020-10-12T21:34:03Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:04Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:05Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-10-12T21:34:06Z" level=info msg="Proc found: k3s"
time="2020-10-12T21:34:06Z" level=info msg="Detected root dir path: /var/lib/kubelet"
If anyone wants to install using the helm chart, an overrides file with:
csi:
  kubeletRootDir: /var/lib/kubelet
or directly
helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --set csi.kubeletRootDir=/var/lib/kubelet
Seemed to do the trick for me.
This is happening (still/again?) in k3s v1.19.3+k3s3 with the following log output:
time="2020-11-14T17:24:47Z" level=debug msg="Deploying CSI driver"
time="2020-11-14T17:24:47Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-11-14T17:24:48Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-11-14T17:24:49Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-11-14T17:24:50Z" level=warning msg="Proc not found: kubelet"
time="2020-11-14T17:24:50Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-11-14T17:24:51Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2020-11-14T17:24:52Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2020-11-14T17:24:53Z" level=warning msg="Proc not found: k3s"
time="2020-11-14T17:24:53Z" level=error msg="failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
time="2020-11-14T17:24:53Z" level=fatal msg="Error deploying driver: failed to get arg root-dir. Need to specify \"--kubelet-root-dir\" in your Longhorn deployment yaml.: failed to get kubelet root dir, no related proc for root-dir detection, error out"
The workaround by @AntonOfTheWoods in https://github.com/longhorn/longhorn/issues/1861#issuecomment-716459507 worked for me, but shouldn’t auto detection work as of @khushboo-rancher's comment https://github.com/longhorn/longhorn/issues/1861#issuecomment-707376267?
@morremeyer The fix is only available on master branch. It is not back ported to the older versions yet.
@yasker When will we back port this fix to the older versions?
@PhanLe1010 Oh, totally missed that, thanks! Just so I get it right, not backporting it would mean it would only be available with the v.1.20.x+k3sy releases?
@morremeyer Not backporting the fix would mean that it is only available in the Longhorn v1.1.0 release and later releases. However, we will backport it to older Longhorn versions so that users don't have to apply the workaround.
@PhanLe1010 Backporting means we need to create a new release, v1.0.3, which we decided not to do last time since it's very close to the v1.1.0 release.
Thank you @yasker
@morremeyer I am very sorry for providing the wrong information. The correct information is in https://github.com/longhorn/longhorn/issues/1861#issuecomment-727656009. Backporting would mean providing a patch release for the older version. However, we decided not to do it this time because it's very close to the v1.1.0 release.