I'm tracking some Helm issues / PRs relating to this repo here. I've summarized some troubleshooting steps and information about Helm releases below.
Helm hooks used in this chart are k8s resources version-controlled by Helm but, in our case, simply created before upgrades. These are supposed to be deleted automatically, but thanks to bugs in Helm this isn't the case. To clean up this mess:
```shell
# The script will delete resources that were meant to be temporary.
# The bug that caused this is fixed in version 0.7b1 of the Helm chart.
NAMESPACE=<YOUR-NAMESPACE>
resource_types="daemonset,serviceaccount,clusterrole,clusterrolebinding,job"
for bad_resource in $(kubectl get $resource_types --namespace $NAMESPACE | grep '/pre-pull' | awk '{print $1}'); do
    kubectl delete $bad_resource --namespace $NAMESPACE --now
done
kubectl delete $resource_types --selector hub.jupyter.org/deletable=true --namespace $NAMESPACE --now
```
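To preview what the loop above would match before deleting anything, the same grep/awk filter can be tried against sample output. The resource names below are invented for illustration, not real cluster output:

```shell
# Made-up sample of what `kubectl get <types>` might print in such a cluster.
sample_output='NAME                         DESIRED   AGE
daemonset.apps/pre-puller    3         2d
job.batch/pre-pull-complete  1         2d
serviceaccount/hub           1         2d'

# Same filter as the cleanup loop: keep lines mentioning /pre-pull,
# then print the first column (the type/name that kubectl delete expects).
echo "$sample_output" | grep '/pre-pull' | awk '{print $1}'
```

Running this prints only the two pre-pull resources, which is exactly the set the loop would hand to `kubectl delete`.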
Helm has some bugs that can cause multiple revisions to be considered as deployed, and this can cause errors during upgrades such as "... not found ...". Helm keeps track of revision state in configmaps, so we can manually clean some of these up.
```shell
# get an overview of the releases
helm list
RELEASE_NAME=<YOUR-RELEASE>
# get an overview of the revisions
helm history $RELEASE_NAME
# check if you have multiple revisions in a DEPLOYED status (a bug)
kubectl get cm -n kube-system --selector "NAME=$RELEASE_NAME,STATUS in (DEPLOYED)"
kubectl delete cm -n kube-system <list all but the most recent DEPLOYED revision configmaps separated with spaces>
# optional cleanup of other revisions of this release
kubectl delete cm -n kube-system --selector "NAME=$RELEASE_NAME,STATUS in (FAILED,SUPERSEDED,DELETED)"
```
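To figure out "all but the most recent DEPLOYED revision", note that Helm 2 names its revision configmaps `<release>.v<revision>`, so you can sort on the revision number and drop the newest. A sketch with made-up configmap names:

```shell
# Made-up example: suppose three revision configmaps of a release named
# "jhub" are stuck in DEPLOYED status in kube-system.
deployed_cms='jhub.v3
jhub.v7
jhub.v5'

# Sort numerically on the revision number (the field after the "v") and
# drop the newest; what remains is the list to pass to `kubectl delete cm`.
echo "$deployed_cms" | sort -t v -k 2 -n | sed '$d'
```

Here that keeps `jhub.v3` and `jhub.v5` for deletion while preserving the most recent revision, `jhub.v7`.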
--force flag in helm upgrade

When a change to a k8s resource is made that kubectl is not allowed to patch, Helm must delete the resource and then add it anew. That is when we need the --force flag. When upgrading between 0.6 and 0.7 of the Helm chart, we will need the --force flag.
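As a sketch, such an upgrade command could look like the following. The release name, namespace, and values file are placeholders, not taken from this thread:

```shell
# Assemble an upgrade command with --force so Tiller may delete and
# recreate resources it cannot patch in place. All names are placeholders.
RELEASE=jhub
NAMESPACE=jhub
cmd="helm upgrade $RELEASE jupyterhub/jupyterhub --version 0.7.0 --namespace $NAMESPACE --values config.yaml --force"
echo "$cmd"
```

Note that --force is destructive for resources that get recreated, so it should only be used when the upgrade actually requires it.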
The annotation helm.sh/hook-delete-policy=before-hook-creation specifies that Tiller should delete the previous hook before the new hook is launched. It avoids "... already exist ..." errors and is something we should recommend for stability.
helm#4384: support for k8s 1.11 not established. (Resolved in 2.11.)
The hook-image-puller daemonset isn't deleted on an upgrade failure. With #758 this will be circumvented by using the before-hook-creation hook delete policy, so this is no longer important.
helm#3811 fixes a bug which allows us to relax the yamllint config.
helm#3837 resolves a yaml detail. With it, we no longer need to pipe through | trimSuffix '\n' when using the toYaml helper in order to please the yaml linter. _Update_: the PR was reverted; it can hopefully be introduced again in Helm 3.0.0.
helm#3811 resolves another yaml-linting detail for (#625)
The | trimSuffix '\n' introduced in #625 can perhaps be removed with Helm 3.0.0 and forward (#3837, #3888).

I am running Helm/Tiller v2.8.1. I've got a number of pods that I cannot get rid of. I delete them and they return. I am not too clear on image pre-puller states. Will I need to delete the current deployment and redeploy for these pods to be removed?

@jgerardsimcock Helm 2.8.2 introduced a bugfix that finally makes the automatic deletion of resources created by Helm hooks (temporary resources created before a helm upgrade) work as intended. To solve the issue long term, we need Helm 2.8.2!
The pods will reappear because they are controlled by a DaemonSet, whose purpose is to make sure there is a pod on every node. Delete the DaemonSet itself and it will remove its pods.
So in summary, I recommend the following:
kubectl delete ds,sa,clusterrole,clusterrolebinding,job --selector hub.jupyter.org/deletable=true

/cc @tracek
I misremembered: the reason for wanting Helm 2.8.2 that I was thinking of was actually kubernetes/helm#3539. It will save us from "... already exist ..." errors that can come from having had some failed helm upgrades in the past.
The issue that I was thinking should have been resolved in 2.8.2, which caused objects not to be removed, is still unmerged in kubernetes/helm#3540. So for now we will need to manually clean things up with step 2 above when a helm upgrade fails during the hook phase; it won't affect us in general though.
@yuvipanda, with the before-hook-creation delete policy, we could have one single image puller daemonset that would be able to do the work of both the continuous-image-puller as well as the hook-image-puller at the same time, I think... I would need to consider whether it works even when switching the hook/continuous enabled flags in all possible manners. It might end up lowering and increasing the complexity at the same time; I'm not confident which option would end up the most robust.
hook|continuous|Single DS setup idea...
-|-|-
enabled|enabled|Use the before-hook-creation option for hook-delete-policy
enabled|disabled|Use the current hook-image-puller DS annotations
disabled|enabled|Use the current continuous-image-puller DS annotations (none)
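As a sketch of the first row's setup, the delete policy is an annotation in the hook resource's metadata. The manifest below is illustrative (names and hook values are made up, not copied from the chart's actual templates):

```shell
# Write an illustrative hook manifest; all names here are invented examples.
cat > /tmp/image-puller-hook.yaml <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-puller
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
EOF

# Confirm the delete policy annotation that lets a new hook replace the old one.
grep 'hook-delete-policy' /tmp/image-puller-hook.yaml
```

With before-hook-creation, Tiller removes the previous hook resource just before creating the new one, which is what would let a single DS survive flag flips between the hook and continuous modes.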
I am having the same problem, with hook pods hanging around that I cannot delete. Solution 2 returned a "No resources found" for me. Is there anything new on this front regarding the removal of hook-image-puller pods that will not delete? I am on Helm 2.8.2.
@jgerardsimcock, did you ever figure out a solution to this problem?
@tjcrone there are so many bugs associated with this, as far as I recall, that I've lost track, but the summary to avoid trouble is:
helm delete <asdf> --purge and start fresh, but some resources created by helm hooks won't be cleaned up. You may be able to find and delete them all with kubectl delete ds,sa,clusterrole,clusterrolebinding,job --selector hub.jupyter.org/deletable=true (also add your namespace as a flag: --namespace asdf).

Thanks @consideRatio! I added the namespace to the delete command and it worked. (I should have known to add this, duh.) I have also disabled the hook prepuller in my config file. Problem solved!
@consideRatio haha - I'm now maintaining the cluster that @jgerardsimcock put together for our research team and found this thread while trying to upgrade a different deployment! Can confirm that this was super helpful :) Thank you all so much for the incredible work!
Closing this as outdated since we now recommend using Helm 3, but I'll mention that a general troubleshooting section could be nice; it's on a todo list I'm building up.