Longhorn: [BUG] Engine image stuck in deploying `Text file busy`

Created on 17 Mar 2020 · 18 comments · Source: longhorn/longhorn

Describe the bug
Longhorn engine stuck in deploying. We have a 5-node cluster and recently upgraded to v0.8.0. All seemed fine, but the Longhorn engine is stuck in deploying.

One of the engine-image pods is running successfully, but the rest are crash-looping with the log message cp: cannot create regular file '/data/longhorn': Text file busy
Looking at the filesystem, it seems the ones that have failed aren't creating an engine in /var/lib/rancher/longhorn/engine-binaries, but I'm not sure if that's the reason. I have tried removing the container image and pulling it again, but without success. Any ideas?

We are using rancher v2.3.5

area/deployment bug

dstuart

👍3

All 18 comments

@dstuart How long has it been like that? Normally the issue should resolve itself after a while. Also, which engine image caused the trouble? If it's v0.7.0, you can scale down the workload, which should detach the volume; then you can upgrade the engine and scale the workload up again to complete the upgrade process.

yasker on 17 Mar 2020

I left it for 3 hours and it never righted itself; it had about 35 restarts. I am using v0.8.0. I tried a redeploy but couldn't scale up and down. I also changed it to deploy on a single node, which seemed to work in that it said the engine was running, but the Longhorn UI still said deploying and I couldn't create a PVC.

dstuart on 17 Mar 2020

I managed to get it running on one of the nodes, but as it requires 100% availability, it still doesn't work. On the nodes that didn't have the engine directory created on container deploy, i.e. /var/lib/rancher/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0, I tried to tar it up from my one working node and move it onto all the other nodes, but it didn't work and still complained about "cp: cannot create regular file '/data/longhorn': Text file busy".

If I could understand what it's trying to do with the /data/longhorn file, it might be clearer whether there is something wrong with my setup. Otherwise I guess I am left with the option of deleting the deployments, removing the PVCs, and trying to uninstall/reinstall Longhorn, which is a little painful.

dstuart on 17 Mar 2020

@dstuart /data/longhorn is the Longhorn engine binary that needs to be installed to /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0, so it can be used by the instance managers to start engine/replica processes. I think some running workloads are using the binary, though if that's the case, I don't know why this binary image has been deployed twice.

Can you send us a support bundle? You can find the link in the footer of the UI. After that, can you try to scale down all the Kubernetes workloads to zero? This should result in all the volumes being detached. Then nothing should be using the engine binary, and the installation should continue.

Btw, in v0.8.0, the directory for the engine binaries should be under /var/lib/longhorn instead of /var/lib/rancher/longhorn, since we changed the root directory to remove the rancher part in v0.8.0.

yasker on 18 Mar 2020

So I ended up fixing it by going into the correct folder (/var/lib/longhorn), removing the engine folder for v0.8.0, and redeploying the container. Not sure why or how the engine got corrupted, but they look exactly the same, with the same permissions and the same sizes. Would you want a backup of them?

Thanks for your help!

dstuart on 18 Mar 2020

@dstuart That's great. To help us reproduce the issue and fix the root cause, it would be a great help if you could send us a support bundle (even now that it's been worked around). You can send it to [email protected].

I will keep this issue open until we figure out the root cause.

yasker on 18 Mar 2020

This bug happens only in v0.8.0. Steps to reproduce:

1. Deploy Longhorn v0.8.0.
2. Create and attach a volume.
3. Delete the related DaemonSet of engine image v0.8.0:

~ kubectl -n longhorn-system delete ds engine-image-ei-e10d6bf5

4. Check the log of the newly created engine-image pods. They should crash with the error log:

~ kubectl -n longhorn-system logs engine-image-ei-e10d6bf5-xxxxx
cp: cannot create regular file '/data/longhorn': Text file busy

Cause:
In v0.7.0, there is only one matching engine binary for each instance manager, so instance managers don't need to use the binary in /var/lib/rancher/longhorn/engine-binaries/... directly.
Instead, they can use /usr/local/bin/longhorn to launch processes.
https://github.com/longhorn/longhorn-manager/blob/96edc510d24a65a6ebb3dbe39480913bf3c61cd4/controller/replica_controller.go#L337

In v0.8.0, one set of instance managers can launch multiple versions of engine/replica processes, so they always need to use the binaries in /var/lib/longhorn/engine-binaries/.
As a result, engine image pods will fail to copy/overwrite the binary in that directory if there are running processes launched from the binary.
https://github.com/longhorn/longhorn-manager/blob/master/engineapi/instance_manager.go#L104

Possible solutions:

1. The binary will be cleaned up only when the engine image object deletion timestamp is set.
   The deletion of the engine image object means there is no active volume using the binary and it's safe to delete the binary.
2. Use cp -n to avoid overwriting the in-use binary during the engine image pod creation.

shuo-wu on 25 Mar 2020
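
The failure mode described in the cause analysis above can be reproduced without Longhorn. The following is a minimal sketch, assuming a Linux host with a writable, executable temp directory; the copied `sleep` binary is only a stand-in for the engine binary at /data/longhorn, and all paths are throwaway assumptions rather than anything Longhorn creates.

```shell
# Assumes Linux; the paths are throwaway stand-ins, not Longhorn's real ones.
workdir=$(mktemp -d)
busy="$workdir/sleep"                 # plays the role of /data/longhorn
cp "$(command -v sleep)" "$busy"

"$busy" 30 &                          # a process is now executing the file
pid=$!
sleep 1                               # give the child time to exec

# Overwriting in place means opening the file for writing, which the
# kernel refuses with ETXTBSY while the file is being executed.
overwrite_status=0
LC_ALL=C cp "$(command -v sleep)" "$busy" 2>"$workdir/err" || overwrite_status=$?
cat "$workdir/err"

kill "$pid"
wait "$pid" 2>/dev/null || true
```

On a typical Linux system the second cp should fail with a "Text file busy" message like the one in the engine-image pod logs, because the destination is still being executed.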

cp -n will not be enough if the image uses a tag that gets reused, e.g. the master tag.

If there is an existing binary, we can calculate the md5/sha1 and only replace it if it's different from what we're about to install, so at least it won't be an issue during the production deployment (since we will not reuse the tag).

Though I don't understand the first solution.

yasker on 27 Mar 2020

The key point of the 1st solution is decoupling the binary cleanup from the engine image pod/DaemonSet deletion/restart, since a pod deletion/restart doesn't mean that we need to remove the related binary. The only situation in which we need to do the cleanup is when the engine image object is deleted and the binary is no longer used.

shuo-wu on 27 Mar 2020

But how is it related to this issue? Is the issue caused by cp failing or by the cleanup failing? I assume it's the cp?

yasker on 27 Mar 2020

Here is the logic:
The cause of this issue is that the restarted/re-launched engine image pod is trying to overwrite the binary --> The binary was not cleaned up correctly during the previous pod shutdown --> The cleanup failed because the binary was in use...

OK. It seems that we still need to avoid overwriting the binary with cp.

shuo-wu on 27 Mar 2020

Just thought about it, we don't need md5 or sha1. Just running a binary diff will be enough.

yasker on 27 Mar 2020
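
The compare-before-copy idea from the comments above could look roughly like the sketch below. This is an illustration only and not necessarily how the fix was implemented in Longhorn: `install_if_changed` is a hypothetical helper name, `cmp -s` stands in for the "binary diff", and the demo files are temporary stand-ins for the engine binary.

```shell
# install_if_changed SRC DST (hypothetical helper):
# copy SRC over DST only when DST is missing or its contents differ.
install_if_changed() {
    src=$1
    dst=$2
    if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
        echo "unchanged"              # identical bytes: leave the in-use file alone
    else
        cp "$src" "$dst" && echo "installed"
    fi
}

# Demo with throwaway files instead of a real engine binary.
demo=$(mktemp -d)
printf 'v1' > "$demo/src"
first=$(install_if_changed "$demo/src" "$demo/dst")    # dst missing
second=$(install_if_changed "$demo/src" "$demo/dst")   # same content
printf 'v2' > "$demo/src"
third=$(install_if_changed "$demo/src" "$demo/dst")    # content changed
echo "$first $second $third"
```

With this approach a restarted pod skips the copy when the installed binary already matches, so it never tries to write to a file that running processes are executing. If the contents differ while the binary is in use, the cp would still fail, which matches the caveat about reused tags raised earlier in the thread.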

Test case 1:

1. Clean up all engine binaries on all nodes before launching Longhorn:

rm -rf /var/lib/longhorn/engine-binaries/*

2. Deploy Longhorn.
3. Create a pod with a Longhorn volume, then keep writing data into the volume.
4. Delete the engine image DaemonSet. E.g., for longhornio/longhorn-engine:master, it's:

kubectl -n longhorn-system delete ds engine-image-ei-605a0f3e

5. Check that data writing still works.
6. Wait for the engine image to become ready, then check the data and whether volume snapshot/backup works fine.

Test case 2:

1. Create an invalid/fake engine binary file on each node. If the engine image you are using is longhornio/longhorn-engine:master, the creation looks like:

# touch /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
# ll /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
-rw-r--r-- 1 root root 0 Mar 30 05:16 /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn

2. Deploy Longhorn.
3. Check that Longhorn launches successfully and that Longhorn volumes work fine.
4. Check that the file on each node has been replaced by a real/valid engine binary:

# ll /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
-rwxr-xr-x 1 root root 26304136 Mar 26 06:38 /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn*
# /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
NAME:
   longhorn - A new cli application

USAGE:
   longhorn [global options] command [command options] [arguments...]

VERSION:
   fdea55d

COMMANDS:
   controller
   replica
   sync-agent
   sync-agent-server-reset
   start-with-replicas, start
   add-replica, add
   ls-replica, ls
   rm-replica, rm
   replica-rebuild-status, rebuild-status
   snapshots, snapshot
   backups, backup
   expand
   journal
   info
   frontend
   version
   help, h                                 Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --url value    (default: "http://localhost:9501")
   --debug
   --help, -h     show help
   --version, -v  print the version

shuo-wu on 30 Mar 2020

Hey guys. While we are waiting for an update with a proper, systematic fix from the Longhorn developers, I've managed to make a quick fix. I just patched the engine-image-ei-xxxxxxx DaemonSet:

was:

...
    spec:
      containers:
      - args:
        - -c
        - cp /usr/local/bin/longhorn* /data/ && echo installed && trap 'rm
          /data/longhorn* && echo cleaned up' EXIT && sleep infinity
        command:
        - /bin/bash
        image: longhornio/longhorn-engine:v0.8.0
...

now:

...
    spec:
      containers:
      - args:
        - -c
        - cp /usr/local/bin/longhorn* /data/ | true && echo installed && trap 'rm
          /data/longhorn* && echo cleaned up' EXIT && sleep infinity
        command:
        - /bin/bash
        image: longhornio/longhorn-engine:v0.8.0
...

So I've skipped the cp exit-code check by adding '| true' after cp, and Longhorn is up and running.

cyboman32 on 4 Apr 2020

👍2
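
A note on the shell semantics behind this workaround: `cp ... | true` is a pipeline, so its exit status is that of `true`, which is why the `&&` chain continues even when cp fails. The more common spelling for ignoring a failure is `cp ... || true`. The sketch below compares the two; `failing_cp` is a hypothetical stand-in for the failing cp, and the last step assumes bash is installed (the DaemonSet command itself runs under /bin/bash).

```shell
# Hypothetical stand-in for the failing `cp` in the DaemonSet command.
failing_cp() { return 1; }

# `cmd | true` is a pipeline: its exit status is that of the last command.
failing_cp | true
pipe_status=$?

# `cmd || true` is an OR-list: `true` runs only when `cmd` fails.
failing_cp || true
or_status=$?

# Under bash's `pipefail` option the pipeline form reports the failure again.
pipefail_status=0
bash -c 'set -o pipefail; false | true' || pipefail_status=$?

echo "pipe=$pipe_status or=$or_status pipefail=$pipefail_status"
```

Both forms hide the cp failure in the default configuration, so either one lets the pod reach `sleep infinity`. Neither replaces the binary, though: the pod simply keeps running alongside whatever file is already in /data/.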

Validation: PASSED

meldafrawi on 6 Apr 2020

Ran into the same problem, was able to temporarily fix and recover the failed deployment by manually editing the engine-image-xyz daemonset, setting the following command:

-c "cp /usr/local/bin/longhorn* /data/ | true && echo installed && trap 'rm /data/longhorn* && echo cleaned up' EXIT && sleep infinity"

The part added is the | true

Version: v0.7.0

miraculixx on 16 Oct 2020

@miraculixx which version are you using? This should have been fixed as of the v0.8.1 release.

yasker on 17 Oct 2020

@yasker thank you, I forgot to mention it and have updated the comment. Unfortunately, the release upgrade was not done in time on this cluster.

miraculixx on 17 Oct 2020
