Describe the bug
Longhorn engine stuck in deploying. We have a 5-node cluster and recently upgraded to v0.8.0. Everything seemed fine, but the Longhorn engine is stuck in deploying.
One of the engine-image pods is running successfully, but the rest keep crashing and backing off with the log message cp: cannot create regular file '/data/longhorn': Text file busy
Looking at the filesystem, it seems the ones that failed aren't creating an engine in /var/lib/rancher/longhorn/engine-binaries, but I'm not sure if that's the reason. I have tried removing the container image and pulling it again, but with no success. Any ideas?
We are using rancher v2.3.5
@dstuart How long has it been like that? Normally the issue should resolve itself after a while. Also, which engine image caused the trouble? If it's v0.7.0, you can scale down the workload, which should detach the volume; then you can upgrade the engine and scale the workload up again to complete the upgrade process.
I left it for about 3 hours and it never righted itself; it had about 35 restarts. I am using v0.8.0. I tried a redeploy but couldn't scale up and down. I also changed it to deploy on a single node, which seemed to work in that it said the engine was running, but the Longhorn UI still showed it as deploying and I couldn't create a PVC.
I managed to get it running on one of the nodes, but since it requires 100% availability, that doesn't seem to be enough. On the nodes where the engine directory wasn't created on container deploy, i.e. /var/lib/rancher/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0, I tried to tar it up from my one working node and copy it onto all the other nodes, but that didn't work and it still complained about "cp: cannot create regular file '/data/longhorn': Text file busy".
If I could understand what it's trying to do with the /data/longhorn file, it might be clearer whether something is wrong with my setup. Otherwise I guess I am left with deleting the deployments, removing the PVCs, and uninstalling/reinstalling Longhorn, which is a little painful.
@dstuart /data/longhorn is the Longhorn engine binary that needs to be installed to /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0, so that the instance managers can use it to start engine/replica processes. I think some running workloads are using the binary, though if that's the case, I don't know why this binary image has been deployed twice.
Can you send us a support bundle? You can find the link in the footer of the UI. After that, can you try scaling all the Kubernetes workloads down to zero? This should detach all the volumes. Then nothing should be using the engine binary, and the installation should continue.
Btw, in 0.8.0, the directory for the engine binaries should be at /var/lib/longhorn instead of /var/lib/rancher/longhorn since we've changed the root directory to remove rancher part in v0.8.0.
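For anyone who wants to check whether something on a node is still executing the installed engine binary before retrying, here is a minimal sketch that scans /proc. It assumes Linux, root access on the node, and that the binary is the file named longhorn inside the engine directory, as the paths in this thread suggest. find_binary_users is a hypothetical helper name, not a Longhorn command, and the path reported for a process running inside a container may differ from the host path if the directory is mounted elsewhere in that container.

```shell
#!/bin/sh
# Sketch: list PIDs whose executable is the given file (Linux /proc only).
# find_binary_users is a hypothetical helper, not part of Longhorn.
find_binary_users() {
  target=$1
  for link in /proc/[0-9]*/exe; do
    # readlink fails for processes we may not inspect or that already exited.
    exe=$(readlink "$link" 2>/dev/null) || continue
    if [ "$exe" = "$target" ]; then
      pid=${link#/proc/}
      echo "${pid%/exe}"
    fi
  done
}

# Example (path following the v0.8.0 layout described in this thread):
# find_binary_users /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0/longhorn
```

An empty result suggests nothing visible from the host is running that exact path; it does not prove the file is free to overwrite.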
So I ended up fixing it by going into the correct folder (/var/lib/longhorn), removing the engine folder for v0.8.0, and redeploying the container. Not sure why or how the engine got corrupted; the binaries look exactly the same, with the same permissions and sizes. Would you want a backup of them?
Thanks for your help!
@dstuart That's great. To help us reproduce the issue and fix the root cause, it would be a great help if you could send us a support bundle (even now that it has been worked around). You can send it to [email protected].
I will keep this issue open until we figure out the root cause.
This bug happens only in v0.8.0. Steps to reproduce:
~ kubectl -n longhorn-system delete ds engine-image-ei-e10d6bf5
~ kubectl -n longhorn-system logs engine-image-ei-e10d6bf5-xxxxx
cp: cannot create regular file '/data/longhorn': Text file busy
Cause:
In v0.7.0 there is only one matching engine binary per instance manager, so instance managers don't need to use the binary in /var/lib/rancher/longhorn/engine-binaries/... directly.
Instead, they can use /usr/local/bin/longhorn to launch processes.
https://github.com/longhorn/longhorn-manager/blob/96edc510d24a65a6ebb3dbe39480913bf3c61cd4/controller/replica_controller.go#L337
In v0.8.0, one set of instance managers can launch multiple versions of engine/replica processes, so they always need to use the binaries in /var/lib/longhorn/engine-binaries/.
As a result, engine image pods fail to copy/overwrite the binary in that directory if there are running processes launched from it.
https://github.com/longhorn/longhorn-manager/blob/master/engineapi/instance_manager.go#L104
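The failure mode described above can be reproduced outside Longhorn. The following is a rough sketch, assuming Linux and a standalone sleep binary on the PATH; the file named longhorn here is only a stand-in (a copy of sleep), and the temporary directory is arbitrary. It starts the copied binary, tries to overwrite it in place with cp, and then retries once the process has exited.

```shell
#!/bin/sh
# Sketch: reproduce "Text file busy" outside Longhorn (assumes Linux).
# "longhorn" below is only a stand-in: a copy of the system's sleep binary.
workdir=$(mktemp -d)
sleep_bin=$(command -v sleep)
cp "$sleep_bin" "$workdir/longhorn"

# Keep the copied binary executing in the background.
"$workdir/longhorn" 30 &
pid=$!
sleep 1   # give the child time to start executing the file

# cp opens the existing destination for writing; the kernel normally
# refuses that while the file is being executed.
if cp "$sleep_bin" "$workdir/longhorn" 2>"$workdir/err"; then
  busy_status=0
else
  busy_status=$?
fi
echo "overwrite while running: exit $busy_status"
cat "$workdir/err"

# Once nothing executes the file any more, the same cp goes through.
kill "$pid" 2>/dev/null || true
wait "$pid" 2>/dev/null || true
cp "$sleep_bin" "$workdir/longhorn"
idle_status=$?
echo "overwrite after exit: exit $idle_status"
rm -rf "$workdir"
```

On a typical Linux node the first attempt should fail with the same "Text file busy" message the engine image pods log, which matches the maintainers' advice to detach all volumes before the installation can continue.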
Possible solutions:
cp -n to avoid overwriting the in-use binary during the engine image pod creation.
cp -n will not be enough if the image is e.g. master tag.
If there is an existing binary, we can calculate the md5/sha1 and only replace it if it's different from what we're about to install, so at least it won't be an issue during the production deployment (since we will not reuse the tag).
Though I don't understand the first solution.
The key point of the 1st solution is decoupling the binary cleanup from the engine image pod/DaemonSet deletion/restart, since a pod deletion/restart doesn't mean that we need to remove the related binary. The only situation in which we need to do the cleanup is when the engine image object is deleted and the binary is no longer used.
But how is it related to this issue? Is the issue caused by cp failed or cleanup failed? I assume it's the cp?
Here is the logic:
The cause of this issue is that the restarted/re-launched engine image pod is trying to overwrite the binary --> The binary was not cleaned up correctly during the previous pod shutdown --> The cleanup failed because the binary was in use...
OK. It seems that we still need to avoid overwriting the binary with cp.
Just thought about it: we don't need md5 or sha1. Running a binary diff will be enough.
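A minimal sketch of the "compare first, copy only if different" idea discussed above, using cmp for the binary comparison. This is an illustration of the approach, not the change that actually shipped; install_if_changed is a hypothetical helper name, and the example paths are the ones the engine image container uses in this thread.

```shell
#!/bin/sh
# install_if_changed SRC DEST_DIR
# Hypothetical helper: copy SRC into DEST_DIR only when the installed copy
# is missing or differs byte for byte (cmp -s is silent; exit 0 = identical).
install_if_changed() {
  src=$1
  dest=$2/$(basename "$src")
  if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
    echo "unchanged, skipping copy"
    return 0
  fi
  cp -p "$src" "$dest" && echo "installed"
}

# Example with the paths used by the engine image container:
# install_if_changed /usr/local/bin/longhorn /data
```

This only avoids the failure when the existing binary is identical, which covers a pod restart with an unchanged image. If the contents differ and the old binary is still running, the cp would still hit "Text file busy".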
Test case 1:
rm -rf /var/lib/longhorn/engine-binaries/*
longhornio/longhorn-engine:master, it's:
kubectl -n longhorn-system delete ds engine-image-ei-605a0f3e
ready then check the data and if volume snapshot/backup works fine.
Test case 2:
longhornio/longhorn-engine:master, the creation is like:
# touch /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
# ll /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
-rw-r--r-- 1 root root 0 Mar 30 05:16 /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
# ll /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
-rwxr-xr-x 1 root root 26304136 Mar 26 06:38 /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn*
# /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn
NAME:
longhorn - A new cli application
USAGE:
longhorn [global options] command [command options] [arguments...]
VERSION:
fdea55d
COMMANDS:
controller
replica
sync-agent
sync-agent-server-reset
start-with-replicas, start
add-replica, add
ls-replica, ls
rm-replica, rm
replica-rebuild-status, rebuild-status
snapshots, snapshot
backups, backup
expand
journal
info
frontend
version
help, h Shows a list of commands or help for one command
GLOBAL OPTIONS:
--url value (default: "http://localhost:9501")
--debug
--help, -h show help
--version, -v print the version
Hey guys. While we wait for an update with a proper systematic fix from the Longhorn developers, I've managed to put together a quick fix. I just patched the engine-image-ei-xxxxxxx DaemonSet:
was:
...
    spec:
      containers:
      - args:
        - -c
        - cp /usr/local/bin/longhorn* /data/ && echo installed && trap 'rm
          /data/longhorn* && echo cleaned up' EXIT && sleep infinity
        command:
        - /bin/bash
        image: longhornio/longhorn-engine:v0.8.0
...
now:
...
    spec:
      containers:
      - args:
        - -c
        - cp /usr/local/bin/longhorn* /data/ | true && echo installed && trap 'rm
          /data/longhorn* && echo cleaned up' EXIT && sleep infinity
        command:
        - /bin/bash
        image: longhornio/longhorn-engine:v0.8.0
...
So I skipped the cp exit-code check by adding '| true' after cp, and Longhorn is up and running.
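For readers wondering why this works: a pipeline reports the exit status of its last command, so piping cp into true always yields success unless pipefail is enabled. The sketch below uses false as a stand-in for the failing cp and compares the plain chain, the '| true' form from the patch, and the more conventional '|| true' form, which is not what the patch above uses but has the same effect on the exit status.

```shell
#!/bin/sh
# `false` stands in for a failing cp.
false && echo "installed (plain)"
plain_status=$?            # 1: the && chain stops at the failure

false | true && echo "installed (| true)"
pipe_status=$?             # 0: a pipeline reports its last command's status

false || true && echo "installed (|| true)"
or_status=$?               # 0: `|| true` runs only when the left side fails

echo "$plain_status $pipe_status $or_status"   # prints: 1 0 0
```

Either form only hides the copy failure so the pod can reach "installed"; it does not verify that the binary already in /data is the right one.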
Validation: PASSED
Ran into the same problem. I was able to temporarily fix it and recover the failed deployment by manually editing the engine-image-xyz DaemonSet and setting the following command:
-c "cp /usr/local/bin/longhorn* /data/ | true && echo installed && trap 'rm /data/longhorn* && echo cleaned up' EXIT && sleep infinity"
The part added is the | true
Version: v0.7.0
@miraculixx which version are you using? This should have been fixed as of the v0.8.1 release.
@yasker thank you, I forgot to mention it and have updated the comment. Unfortunately the release upgrade was not done in time on this cluster.