Describe the bug
Unable to upgrade stable/prometheus-operator after installing stable/prometheus-redis-exporter and updating additionalPrometheusRulesMap
Version of Helm and Kubernetes:
helm 2.14.3
kubernetes : v1.13.6-gke.13
version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.6-gke.13", GitCommit:"fcbc1d20b6bca1936c0317743055ac75aef608ce", GitTreeState:"clean", BuildDate:"2019-06-19T20:50:07Z", GoVersion:"go1.11.5b4", Compiler:"gc", Platform:"linux/amd64"}
Which chart:
stable/prometheus-operator
What happened:
helm upgrade prometheus --namespace monitoring -f prometheus-operator.yaml stable/prometheus-operator
2019/08/08 17:15:51 Warning: Merging destination map for chart 'prometheus-operator'. The destination item 'remoteWrite' is a table and ignoring the source 'remoteWrite' as it has a non-table value of: []
2019/08/08 17:15:51 Warning: Merging destination map for chart 'prometheus-operator'. The destination item 'remoteRead' is a table and ignoring the source 'remoteRead' as it has a non-table value of: []
UPGRADE FAILED
Error: failed to create resource: Timeout: request did not complete within requested timeout 30s
Error: UPGRADE FAILED: failed to create resource: Timeout: request did not complete within requested timeout 30s
What you expected to happen:
Successful upgrade.
How to reproduce it (as minimally and precisely as possible):
helm install --name prometheus \
--namespace monitoring \
-f prometheus-operator.yaml \
stable/prometheus-operator
prometheus-operator.yaml contains values for additionalPrometheusRulesMap
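A minimal sketch of what the additionalPrometheusRulesMap section can look like (the rule group and expression below are illustrative placeholders, not the exact ones from my file):
additionalPrometheusRulesMap:
  redis-rules:
    groups:
    - name: redis.rules
      rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 5m
        labels:
          severity: critical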
helm install \
--name redis \
--namespace monitoring \
-f redis-exporter.yaml \
stable/prometheus-redis-exporter
redis-exporter.yaml contains values for redisAddress and persistence
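For reference, redisAddress takes a Redis URL, something like (placeholder host):
redisAddress: redis://my-redis.default.svc:6379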
Update additionalPrometheusRulesMap in prometheus-operator.yaml, then run:
helm upgrade prometheus --namespace monitoring -f prometheus-operator.yaml stable/prometheus-operator
Anything else we need to know:
I suspect the issue is that you're running this in a private GKE cluster:
When Google configure the control plane for private clusters, they automatically configure VPC peering between your Kubernetes cluster's network and a separate Google managed project. In order to restrict what Google are able to access within your cluster, the firewall rules configured restrict access to your Kubernetes pods. This means that in order to use the webhook component with a GKE private cluster, you must configure an additional firewall rule to allow the GKE control plane access to your webhook pod.
You can read more information on how to add firewall rules for the GKE control plane nodes in the GKE docs
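A sketch of such a rule with gcloud (the network, master CIDR and node tag are placeholders to replace with your cluster's values; the webhook pod listens on 8443):
gcloud compute firewall-rules create gke-master-to-prometheus-webhook \
  --network <cluster-network> \
  --source-ranges <master-ipv4-cidr> \
  --target-tags <gke-node-tag> \
  --allow tcp:8443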
Alternatively, you can disable the hooks by setting prometheusOperator.admissionWebhooks.enabled=false.
Looks like this is the GCP solution you're looking for:
https://github.com/helm/charts/issues/16249#issuecomment-520795222
I have had the same issue occur on EKS also.
helm version
Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
You said I can disable the webhooks, but I do not know the purpose of those. What purpose do they serve? How will it affect my setup if I disable them?
Apologies if it is a silly question. I know next to nothing about all this.
The webhooks validate PrometheusRule resources at admission time. Without them, an invalid resource is still created but Prometheus will not load it, and if the Prometheus container restarts it goes into a crash loop. That was the behaviour before this feature was added.
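As a hypothetical illustration (not a resource from this thread): with the webhook enabled, a PrometheusRule containing a malformed PromQL expression like the one below is rejected when it is applied, instead of being stored and then breaking Prometheus when it reloads its rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: broken-rule-example
  namespace: monitoring
spec:
  groups:
  - name: example
    rules:
    - alert: BrokenAlert
      # unbalanced parenthesis makes this invalid PromQL
      expr: sum(rate(http_requests_total[5m])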
@vsliouniaev I created a firewall rule that allows communication from the control plane to all nodes over 8443, then I got this error:
Error: failed to create resource: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-prometheus-oper-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: no service port "443" found for service "prometheus-prometheus-oper-operator"
So I allowed 443 in the firewall as well, but the service itself does not listen on 443 so that's of no use. What I fail to understand is why the webhook is trying to connect over 443 if the service is exposed on 8443?
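For anyone hitting the same mismatch: in the admissionregistration API, the webhook's clientConfig.service.port defaults to 443 when it is not set, so either the operator Service has to expose a 443 port that maps to the pod's 8443, or the webhook configuration has to point at the port the Service actually exposes. Comparing the two shows where the mismatch is:
kubectl -n monitoring get service prometheus-prometheus-oper-operator -o yaml
kubectl get validatingwebhookconfigurations -o yaml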
I gave up on all this and decided I will disable the webhooks but setting prometheusOperator.admissionWebhooks.enabled=false does not help on this particular release.
Created another release with webhooks disabled in the first place and that seems to work fine.
Something is making it hard to disable webhooks on the existing release.
I deleted the existing release and made a new release on my cluster. Now everything works. I'd close this issue, but it would be better if someone explains this behaviour before closing it, so I'm keeping it open.
There are two admission webhook configurations, both controlled by prometheusOperator.admissionWebhooks.enabled. If you turn this on, the resources get created, if you turn this off, they are not (and are removed by Helm if they are in the cluster already)
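You can check whether they are present in the cluster with something like:
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations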
Just tested this again to confirm that this is the behaviour I am getting, using
$ helm upgrade prom-op stable/prometheus-operator --set prometheusOperator.admissionWebhooks.enabled=false
When I posted the comment above, I had tried exactly that and still got: no service port "443" found for service "prometheus-prometheus-oper-operator"
I can't try again because I purged the release that was giving me this error. A new release worked just fine, both with and without admission webhooks. I might try to find some time this weekend to reproduce this behavior just to help you find whether there's really a bug here or if it was just some misconfiguration.
I managed to get around the issue (different error message though: server could not find the requested resource) by manually deleting the webhooks - didn't need to delete the entire release.
EDIT: in my case, the problem was on our end, see comment below this one
kubectl delete MutatingWebhookConfiguration (name)-prometheus-o-admission
kubectl delete ValidatingWebhookConfiguration (name)-prometheus-o-admission
The upgrade re-created them both so that should be ok. Still no idea what the root cause is though, the operator logs don't show much. (Maybe the order of the operations performed during the upgrade causes clashes with existing resources?)
I looked into it a bit further, turns out we had a values file that used old images of prometheus-operator, from before webhook support was added to it (which was in 0.31).
So problem on our end, not the chart. If anyone else runs into "server could not find the requested resource" with webhooks (which basically means the call returns 404 - that would have been a much more useful error message), might be worth checking what image version is used on the operator pod.
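For example (the deployment name varies with the release and chart naming, so treat this as a sketch):
kubectl -n monitoring get deployment <release>-prometheus-oper-operator -o jsonpath='{.spec.template.spec.containers[0].image}'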
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
If you have previously installed without disabling the webhook, you must first manually delete the existing objects of the following kinds:
kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io   # delete all of the listed objects
kubectl get MutatingWebhookConfiguration   # delete all of the listed objects
and after that run:
helm install --name prometheus-operator stable/prometheus-operator \
--set prometheusOperator.admissionWebhooks.enabled=false \
--set prometheusOperator.admissionWebhooks.patch.enabled=false \
--set prometheusOperator.tlsProxy.enabled=false
Is this issue also related to the following error messages I am getting during helm upgrade?
It's also on a GKE private cluster.
client.go:440: [debug] Looks like there are no changes for Service "chart1-prometheus-operator-operator"
client.go:440: [debug] Looks like there are no changes for Service "chart1-prometheus-operator-prometheus"
client.go:440: [debug] Looks like there are no changes for DaemonSet "chart1-prometheus-node-exporter"
client.go:440: [debug] Looks like there are no changes for Deployment "chart1-prometheus-operator-operator"
client.go:205: [debug] error updating the resource "prometheus-operator-test-customer-rule-file":
cannot patch "prometheus-operator-test-customer-rule-file" with kind PrometheusRule: Timeout: request did not complete within requested timeout 30s
client.go:205: [debug] error updating the resource "chart1-prometheus-operator-alertmanager.rules":
cannot patch "chart1-prometheus-operator-alertmanager.rules" with kind PrometheusRule: Timeout: request did not complete within requested timeout 30s
client.go:205: [debug] error updating the resource "chart1-prometheus-operator-etcd":
cannot patch "chart1-prometheus-operator-etcd" with kind PrometheusRule: Timeout: request did not complete within requested timeout 30s
....
....
and the upgrade is just stuck.