Describe the bug
helm upgrade ... with new values (same chart version) fails with Error: UPGRADE FAILED: pre-upgrade hooks failed: Timeout: request did not complete within requested timeout 30s. Subsequent upgrade attempts all fail at different points with different error messages. Following this failure, all attempts at using kubectl fail with Unable to connect to the server: EOF, and similarly all helm commands fail with Error: Kubernetes cluster unreachable. This connection issue resolves itself after ~30 min.
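For context, the invocation looked roughly like this (release name and namespace are taken from the debug output below; values.yaml is just a placeholder for my actual values file, and the stable repo was already added):

$ helm upgrade prom stable/prometheus-operator -n monitoring --version 8.5.7 -f values.yaml
Error: UPGRADE FAILED: pre-upgrade hooks failed: Timeout: request did not complete within requested timeout 30s
$ kubectl get pods -n monitoring
Unable to connect to the server: EOF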
If it's of any use, here are the failures from retrying the upgrade command several times after the first failure, with --debug:
upgrade.go:280: [debug] warning: Upgrade "prom" failed: pre-upgrade hooks failed: Delete https://.../apis/rbac.authorization.k8s.io/v1/namespaces/monitoring/roles/prom-prometheus-operator-admission: http2: server sent GOAWAY and closed the connection; LastStreamID=107, ErrCode=NO_ERROR, debug=""
upgrade.go:280: [debug] warning: Upgrade "prom" failed: pre-upgrade hooks failed: warning: Hook pre-upgrade prometheus-operator/templates/prometheus-operator/admission-webhooks/job-patch/job-createSecret.yaml failed: Timeout: request did not complete within requested timeout 30s
upgrade.go:280: [debug] warning: Upgrade "prom" failed: pre-upgrade hooks failed: warning: Hook pre-upgrade prometheus-operator/templates/prometheus-operator/admission-webhooks/job-patch/psp.yaml failed: Timeout: request did not complete within requested timeout 30s
upgrade.go:225: [debug] creating upgraded release for prom
Error: UPGRADE FAILED: create: failed to create: etcdserver: request timed out
upgrade.go:225: [debug] creating upgraded release for prom
Error: UPGRADE FAILED: create: failed to create: Internal error occurred: resource quota evaluates timeout
$ helm version
version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-13T11:52:47Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:09:08Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl version --context=docker-desktop
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-13T11:52:47Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:02:12Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Which chart:
stable/prometheus-operator v8.5.7
What happened:
Upgrading my release with additional values failed. Subsequent attempts to run the same upgrade command also failed, each at a different point in the upgrade and with a different error message. Additionally, these failures left the cluster completely unresponsive via kubectl.
What you expected to happen:
Upgrade works, or if it fails, kubectl still works.
How to reproduce it (as minimally and precisely as possible):
This part I'm not sure of. The most minimal cluster I hit this issue on was the one Docker Desktop for Mac provides, with ingress-nginx added. I can provide the values.yaml I'm using if that would help; a rough sketch of the setup is below.
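Sketch only (the monitoring namespace already existed, the stable repo was already added, ingress-nginx was installed beforehand, and values.yaml is a placeholder for my actual values file):

$ kubectl config use-context docker-desktop
$ helm install prom stable/prometheus-operator -n monitoring --version 8.5.7 -f values.yaml
# edit values.yaml, then re-run with the same chart version:
$ helm upgrade prom stable/prometheus-operator -n monitoring --version 8.5.7 -f values.yaml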
Anything else we need to know:
I previously ran multiple upgrades without any failures. And once I was able to connect to my cluster again, I even managed to uninstall and reinstall the problematic release successfully with the exact same values that failed on the upgrade.
Is there something I should be doing differently when applying upgrades in the future?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I'm also experiencing this issue
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Welcome to the new OSS world. Bots are closing unsolved issues.
Looking forward to seeing bots solve and close issues.
Bot: "I think it counts as activity"
I'm experiencing this exact issue too!