Longhorn: Failed to install app longhorn. Error: UPGRADE FAILED: transport is closing

Created on 23 Sep 2019 · 33 Comments · Source: longhorn/longhorn

Rolling Longhorn back from 0.6.0 to 0.5.0 does not work:

Failed to install app longhorn. Error: UPGRADE FAILED: transport is closing

area/deployment area/rancher bug

aleksey005

All 33 comments

It's a Rancher Bug. See:
https://github.com/rancher/rancher/issues/23017

shuo-wu on 23 Sep 2019

Is there any way to recover from this without waiting for the second next stable Rancher release?

rbq on 24 Sep 2019

Upgrading from 0.5.0 to 0.6.1 works fine, but the error message is not clear.

If you delete it, everything is fine.

aleksey005 on 24 Sep 2019

@aleksey005 Delete what exactly, the catalog-installed app or the resources it provisions? Does that disconnect/corrupt mounted volumes? Will it pick up the existing volumes after reinstalling?

rbq on 24 Sep 2019

@rbq I backed up the Longhorn volumes to S3, then deleted the failed application from Apps - Longhorn, then restored the volumes through Longhorn's backup restore procedure.

aleksey005 on 24 Sep 2019

@rbq Important: if you use this method, you need to record the volume names and reuse the same names when restoring; then everything goes well.

aleksey005 on 24 Sep 2019

@rbq Deleting the app will remove all the data, make sure you have backups before doing it.

We're investigating possible ways to work around the Rancher issue, stay tuned.

yasker on 24 Sep 2019

I have the same problem.

jocarren on 25 Sep 2019

Steps to reproduce this issue:

1. Set up the environment before reproducing: Rancher v2.2.7, Kubernetes v1.13.9, one Kubernetes cluster with 3 nodes.
2. Set the chart repo URL to https://github.com/shuo-wu/charts.git with branch test as the test chart. Refresh the chart before reproducing.
3. Use kubectl apply -f https://raw.githubusercontent.com/shuo-wu/longhorn/test/deploy/pv-test.yaml to create the environment for triggering the bug later.
4. Launch Longhorn v0.5.0 from the catalog and wait for it to become active.
5. Upgrade to v0.6.0 (which is the faulty version in this chart). Wait until the error message Failed to install app longhorn-system. Error: UPGRADE FAILED: timed out waiting for the condition appears on the app detail page (typically 5~10 minutes).
6. Try to roll back to the previous revision (v0.5.0). The error message Failed to install app longhorn-system. Error: UPGRADE FAILED: transport is closing is shown, and we can neither upgrade nor roll back at this point.

shuo-wu on 26 Sep 2019

We're still validating the workaround.

yasker on 26 Sep 2019

@yasker Thank you!

I experimented with the workaround and managed to revert to 0.5. My Helm ConfigMaps seem to be called longhorn.v${n} though, without the -system part.
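For anyone who wants to check the release ConfigMap names on their own cluster, here is a minimal sketch (not taken from the Longhorn or Rancher docs). It assumes the releases are stored the Helm v2 way, as ConfigMaps labelled OWNER=TILLER and NAME=<release>. The namespace and release name are parameters you have to supply, and both function names are hypothetical.

```shell
# List the Helm v2 release ConfigMaps for one release, one name per line.
# $1 = namespace that holds the release ConfigMaps (assumption: you know it)
# $2 = release name, e.g. "longhorn" or "longhorn-system"
list_release_configmaps() {
  kubectl -n "$1" get cm -l "OWNER=TILLER,NAME=$2" -o name
}

# Read ConfigMap names such as "configmap/longhorn.v3" on stdin and print
# the highest revision number found after the trailing ".v".
latest_revision() {
  sed -n 's/.*\.v\([0-9][0-9]*\)$/\1/p' | sort -n | tail -n 1
}
```

Something like `list_release_configmaps longhorn-system longhorn | latest_revision` would then print the highest stored revision, provided the ConfigMaps really live in that namespace.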

Then I tried to upgrade to 0.6.1 and it failed with:

Failed to install app longhorn. Error: UPGRADE FAILED: no ConfigMap with the name "longhorn-default-setting" found

I definitely didn't delete that one:

$ kubectl -n longhorn-system get cm
NAME                                           DATA   AGE
external-attacher-leader-io-rancher-longhorn   0      212d
longhorn-default-setting                       1      2d23h

Anyway, I then managed to revert to 0.5 again, this time by killing the workloads, removing the Helm ConfigMap, and finally killing the respawning Longhorn 0.6.1 uninstaller workload several times, as it kept logging errors about missing API permissions.

[edit:] Don't try this at home. The deployment showed up as okay, but actually wasn't.

rbq on 26 Sep 2019

@rbq Saw your edit. What's wrong with your deployment?

yasker on 26 Sep 2019

This error message, Failed to install app longhorn. Error: UPGRADE FAILED: no ConfigMap with the name "longhorn-default-setting" found, is actually caused by a Helm bug: if resources introduced by the new version already exist in your cluster before upgrading, this kind of error is triggered when you try to upgrade to that version.

For Longhorn, the new version here means v0.6.0 or v0.6.1. The failed v0.6.0 upgrade introduces 2 new resources: a ConfigMap named longhorn-default-setting and a CRD named instancemanagers.longhorn.rancher.io. To upgrade to v0.6.1, you also need to force-delete these resources, in addition to removing the Helm ConfigMaps and Longhorn workloads:

kubectl -n longhorn-system delete cm longhorn-default-setting
kubectl patch -p '{"metadata":{"finalizers": null}}' crd instancemanagers.longhorn.rancher.io
kubectl delete crd instancemanagers.longhorn.rancher.io

The missing API permissions issue is caused by the uninstaller of the new version (v0.6.1) running on the old Longhorn system. You can remove it directly:

kubectl -n longhorn-system delete jobs longhorn-uninstall
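The commands above could be wrapped in a small script so they can be previewed before anything is deleted. This is only a sketch: the `run` helper, the DRY_RUN variable, and the function names are hypothetical additions, and the kubectl invocations are the ones quoted in this comment. It does not remove the Helm ConfigMaps or the Longhorn workloads, which still has to be done separately.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Delete the two resources left behind by the failed v0.6.0 upgrade.
# $1 = Longhorn namespace (defaults to longhorn-system).
# The chain stops at the first command that fails.
cleanup_failed_upgrade() {
  ns="${1:-longhorn-system}"
  crd="instancemanagers.longhorn.rancher.io"
  run kubectl -n "$ns" delete cm longhorn-default-setting &&
    run kubectl patch -p '{"metadata":{"finalizers": null}}' crd "$crd" &&
    run kubectl delete crd "$crd"
}

# Remove the v0.6.1 uninstaller job that keeps running on the old system.
remove_stale_uninstaller() {
  run kubectl -n "${1:-longhorn-system}" delete jobs longhorn-uninstall
}
```

Running `DRY_RUN=1` first and then calling `cleanup_failed_upgrade` only prints the three commands, which makes it easy to confirm the namespace and resource names before running them for real.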

shuo-wu on 26 Sep 2019

@yasker The UI couldn't connect to the nodes and provisioning volumes didn't work. I then saw that there were still some workloads with version 0.6.1 running and removed those. That seemed to trigger Longhorn going on a killing spree, setting all volumes to “Deleting …” without any way to abort. I ended up uninstalling Longhorn, manually removing the rest, patching away a bunch of finalizers, deleting resources, moving the leftover data on the nodes aside and re-installing 0.6.1.

rbq on 27 Sep 2019

Thanks @rbq. How did you trigger the uninstaller? It shouldn't be triggered during either the upgrade or the rollback process, unless there is something we aren't aware of.

yasker on 28 Sep 2019

Also, a quick update: we're still working on the workaround. The previous steps have some issues and gaps, and we're currently validating the updated version. We should be able to release it next week.

yasker on 28 Sep 2019

On a single-node cluster, Longhorn is installed and updated without failures.

aleksey005 on 28 Sep 2019

Validation: PASSED

Steps to test:

https://github.com/longhorn/longhorn/wiki/Longhorn-v0.6.0-Upgrade:-Workaround-for-recovering-from-a-rollback-failure-in-Rancher

meldafrawi on 3 Oct 2019

We will release the fix with v0.6.2 soon.

yasker on 3 Oct 2019

We will keep the issue open until Rancher fixes https://github.com/rancher/rancher/issues/21070

yasker on 3 Oct 2019

The workaround is now available at https://github.com/longhorn/longhorn/wiki/Longhorn-v0.6.0-Upgrade:-Workaround-for-recovering-from-a-rollback-failure-in-Rancher , along with the Longhorn v0.6.2 release.

yasker on 8 Oct 2019

Rancher v2.3 has been fixed to mitigate the issue. See https://github.com/rancher/rancher/issues/21070 for details.

yasker on 1 Nov 2019

@yasker The issue still exists 😢
Longhorn: 1.0.0
Rancher: 2.4.4
Kubernetes deployed onto Ubuntu 18.04 via RKE 1.1.2
UPD: Setting the Helm Wait option to false fixes the problem of the app freezing in the installing state and then becoming failed.

TemaSM on 13 Jun 2020

@yasker I just upgraded Longhorn from 0.8.0 to 1.0.0 in the Rancher UI and it is stuck in the installing state, just like for @TemaSM.

I did the upgrade on Rancher 2.2.12, and I have since upgraded to Rancher 2.4.5 (2.2.12 -> 2.3.0 -> 2.3.8 -> 2.4.5). I also upgraded k8s to v1.18.6.

The Longhorn state is still stuck at installing. It is still functioning and all of my volumes are working fine, but I can't upgrade Longhorn to 1.0.1.

cwt on 30 Jul 2020

@cwt @TemaSM Is there any more specific failure information you can share? For example, @cwt, since you mentioned that Longhorn is stuck in the installing state, can you check the workload page to see if any workload isn't in a healthy state?
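If the Rancher workload page isn't handy, roughly the same check can be done from the command line. The following is a sketch under assumptions: it treats a pod as healthy only when it is Completed, or Running with all containers ready, and it assumes Longhorn runs in the longhorn-system namespace. The function names are hypothetical.

```shell
# Read `kubectl get pods --no-headers` output on stdin (columns: NAME READY
# STATUS ...) and print "NAME STATUS READY" for every pod that is neither
# Completed nor Running with all of its containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    healthy = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
    if (!healthy) print $1, $3, $2
  }'
}

# $1 = Longhorn namespace (defaults to longhorn-system).
check_longhorn_pods() {
  kubectl -n "${1:-longhorn-system}" get pods --no-headers | unhealthy_pods
}
```

Empty output from `check_longhorn_pods` would correspond to all workloads being green on the workload page.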

yasker on 30 Jul 2020

@yasker Workloads are all green. On the app page -> longhorn-system I got this message:

Failed to install app longhorn-system. Error: UPGRADE FAILED: a release named longhorn-system is in use, cannot re-use a name that is still in use

As I said previously, everything seems to be working fine: I can create volumes, snapshots, and backups. All of my pods that mount Longhorn volumes are working fine too. It's just that the status is installing and I can't upgrade to 1.0.1.

cwt on 31 Jul 2020

@cwt It would be hard for us to figure this out from the Longhorn side, since it seems related to some pre-existing conditions. I found a related bug at https://github.com/helm/helm/issues/4174 which you might find useful.

yasker on 1 Aug 2020

@yasker Thanks for your help anyway. Since I already have backups of all volumes, I plan to remove Longhorn and reinstall it. I think that should fix my problem.

cwt on 1 Aug 2020

... is there any more specific failure information you can share? ...

-> No specific or detailed info. Everything is just as @cwt described:

... as I said previously, everything seems to be working fine: I can create volumes, snapshots, and backups. All of my pods that mount Longhorn volumes are working fine too. It's just that the status is installing ...

@yasker Anyway, thanks for your time! I decided to look into Longhorn later, when it becomes more stable.

TemaSM on 9 Aug 2020

@TemaSM :) Longhorn is stable. I have been using it for a long time and have very rarely seen any glitches. The only things I would like to see improved are better performance and behavior closer to a regular file system, that is, so that applications can delete data.

aleksey005 on 9 Aug 2020

@aleksey005 I mean, stable for my needs. For example, I currently need to automate mounting all of Longhorn's block devices to the nodes' file systems. If you have any tips on doing this, I would appreciate them.

TemaSM on 26 Aug 2020

@TemaSM Why are you asking this question in a closed issue? Open a new one or make a feature request for this functionality. Culture comes first.
https://github.com/longhorn/longhorn/issues/906#issuecomment-678664652

aleksey005 on 26 Aug 2020

@TemaSM We don't currently support automatically mounting the block devices needed for Longhorn. Feel free to create an issue for the enhancement so we can track it.

yasker on 27 Aug 2020
