Describe the bug
Unable to upgrade stable/prometheus-operator after installing stable/prometheus-redis-exporter and updating additionalPrometheusRulesMap
Version of Helm and Kubernetes:
helm 2.14.3
kubernetes : v1.13.6-gke.13
version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.6-gke.13", GitCommit:"fcbc1d20b6bca1936c0317743055ac75aef608ce", GitTreeState:"clean", BuildDate:"2019-06-19T20:50:07Z", GoVersion:"go1.11.5b4", Compiler:"gc", Platform:"linux/amd64"}
Which chart:
stable/prometheus-operator
What happened:
helm upgrade prometheus --namespace monitoring -f prometheus-operator.yaml stable/prometheus-operator
2019/08/08 17:15:51 Warning: Merging destination map for chart 'prometheus-operator'. The destination item 'remoteWrite' is a table and ignoring the source 'remoteWrite' as it has a non-table value of: []
2019/08/08 17:15:51 Warning: Merging destination map for chart 'prometheus-operator'. The destination item 'remoteRead' is a table and ignoring the source 'remoteRead' as it has a non-table value of: []
UPGRADE FAILED
Error: failed to create resource: Timeout: request did not complete within requested timeout 30s
Error: UPGRADE FAILED: failed to create resource: Timeout: request did not complete within requested timeout 30s
What you expected to happen:
Successful upgrade.
How to reproduce it (as minimally and precisely as possible):
helm install --name prometheus \
--namespace monitoring \
-f prometheus-operator.yaml \
stable/prometheus-operator
prometheus-operator.yaml contains values for additionalPrometheusRulesMap
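A minimal sketch of what the additionalPrometheusRulesMap section can look like (the rule group and expression below are illustrative placeholders, not the exact ones from my file):
additionalPrometheusRulesMap:
  redis-rules:
    groups:
    - name: redis.rules
      rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 5m
        labels:
          severity: critical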
helm install \
--name redis \
--namespace monitoring \
-f redis-exporter.yaml \
stable/prometheus-redis-exporter
redis-exporter.yaml contains values for redisAddress and persistence
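For reference, redisAddress takes a Redis URL, something like (placeholder host):
redisAddress: redis://my-redis.default.svc:6379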
Update additionalPrometheusRulesMap in prometheus-operator.yaml, then run:
helm upgrade prometheus --namespace monitoring -f prometheus-operator.yaml stable/prometheus-operator
Anything else we need to know:
I suspect the issue is that you're running this in a private GKE cluster:
When Google configure the control plane for private clusters, they automatically configure VPC peering between your Kubernetes cluster's network and a separate Google managed project. In order to restrict what Google are able to access within your cluster, the firewall rules configured restrict access to your Kubernetes pods. This means that in order to use the webhook component with a GKE private cluster, you must configure an additional firewall rule to allow the GKE control plane access to your webhook pod.
You can read more information on how to add firewall rules for the GKE control plane nodes in the GKE docs
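A sketch of such a rule with gcloud (the network, master CIDR and node tag are placeholders to replace with your cluster's values; the webhook pod listens on 8443):
gcloud compute firewall-rules create gke-master-to-prometheus-webhook \
  --network <cluster-network> \
  --source-ranges <master-ipv4-cidr> \
  --target-tags <gke-node-tag> \
  --allow tcp:8443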
Alternatively, you can disable the hooks by setting prometheusOperator.admissionWebhooks.enabled=false.
Looks like this is the GCP solution you're looking for:
https://github.com/helm/charts/issues/16249#issuecomment-520795222
I have had the same issue occur on EKS also.
helm version
Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
You said I can disable the webhooks, but I do not know the purpose of those. What purpose do they serve? How will it affect my setup if I disable them?
Apologies if it is a silly question. I know next to nothing about all this.
The webhooks validate PrometheusRule resources at admission time. Without them, an invalid resource is still created but Prometheus will not load it, and if the Prometheus container restarts it goes into a crash loop. That was the behaviour before this feature was added.
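As a hypothetical illustration (not a resource from this thread): with the webhook enabled, a PrometheusRule containing a malformed PromQL expression like the one below is rejected when it is applied, instead of being stored and then breaking Prometheus when it reloads its rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: broken-rule-example
  namespace: monitoring
spec:
  groups:
  - name: example
    rules:
    - alert: BrokenAlert
      # unbalanced parenthesis makes this invalid PromQL
      expr: sum(rate(http_requests_total[5m])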
@vsliouniaev I created a firewall rule that allows communication from the control plane to all nodes over 8443, then I got this error:
Error: failed to create resource: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-prometheus-oper-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: no service port "443" found for service "prometheus-prometheus-oper-operator"
So I allowed 443 in the firewall as well, but the service itself does not listen on 443 so that's of no use. What I fail to understand is why the webhook is trying to connect over 443 if the service is exposed on 8443?
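For anyone hitting the same mismatch: in the admissionregistration API, the webhook's clientConfig.service.port defaults to 443 when it is not set, so either the operator Service has to expose a 443 port that maps to the pod's 8443, or the webhook configuration has to point at the port the Service actually exposes. Comparing the two shows where the mismatch is:
kubectl -n monitoring get service prometheus-prometheus-oper-operator -o yaml
kubectl get validatingwebhookconfigurations -o yaml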
I gave up on all this and decided I will disable the webhooks but setting prometheusOperator.admissionWebhooks.enabled=false does not help on this particular release.
Created another release with webhooks disabled in the first place and that seems to work fine.
Something is making it hard to disable webhooks on the existing release.
I deleted the existing release and made a new release on my cluster. Now everything works. I'd close this issue, but it would be better if someone explains this behaviour before closing it, so I'm keeping it open.
There are two admission webhook configurations, both controlled by prometheusOperator.admissionWebhooks.enabled. If you turn this on, the resources get created, if you turn this off, they are not (and are removed by Helm if they are in the cluster already)
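You can check whether they are present in the cluster with something like:
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations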
Just tested this again to confirm that this is the behaviour I am getting, using
$ helm upgrade prom-op stable/prometheus-operator --set prometheusOperator.admissionWebhooks.enabled=false
When I posted the comment above, I had tried exactly that and still got: no service port "443" found for service "prometheus-prometheus-oper-operator"
I can't try again because I purged the release that was giving me this error. A new release worked just fine, both with and without admission webhooks. I might try to find some time this weekend to reproduce this behavior just to help you find whether there's really a bug here or if it was just some misconfiguration.
I managed to get around the issue (different error message though: server could not find the requested resource) by manually deleting the webhooks - didn't need to delete the entire release.
EDIT: in my case, the problem was on our end, see comment below this one
kubectl delete MutatingWebhookConfiguration (name)-prometheus-o-admission
kubectl delete ValidatingWebhookConfiguration (name)-prometheus-o-admission
The upgrade re-created them both so that should be ok. Still no idea what the root cause is though, the operator logs don't show much. (Maybe the order of the operations performed during the upgrade causes clashes with existing resources?)
I looked into it a bit further, turns out we had a values file that used old images of prometheus-operator, from before webhook support was added to it (which was in 0.31).
So problem on our end, not the chart. If anyone else runs into "server could not find the requested resource" with webhooks (which basically means the call returns 404 - that would have been a much more useful error message), might be worth checking what image version is used on the operator pod.
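For example (the deployment name varies with the release and chart naming, so treat this as a sketch):
kubectl -n monitoring get deployment <release>-prometheus-oper-operator -o jsonpath='{.spec.template.spec.containers[0].image}'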
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
If you have previously installed without disabling the webhook, you must first manually delete the existing objects of the following kinds:
kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io   # delete all of the listed objects
kubectl get MutatingWebhookConfiguration   # delete all of the listed objects
and after that run:
helm install --name prometheus-operator stable/prometheus-operator \
--set prometheusOperator.admissionWebhooks.enabled=false \
--set prometheusOperator.admissionWebhooks.patch.enabled=false \
--set prometheusOperator.tlsProxy.enabled=false
Is this issue also related to the following error messages I am getting during helm upgrade?
It's also on a GKE private cluster.
client.go:440: [debug] Looks like there are no changes for Service "chart1-prometheus-operator-operator"
client.go:440: [debug] Looks like there are no changes for Service "chart1-prometheus-operator-prometheus"
client.go:440: [debug] Looks like there are no changes for DaemonSet "chart1-prometheus-node-exporter"
client.go:440: [debug] Looks like there are no changes for Deployment "chart1-prometheus-operator-operator"
client.go:205: [debug] error updating the resource "prometheus-operator-test-customer-rule-file":
cannot patch "prometheus-operator-test-customer-rule-file" with kind PrometheusRule: Timeout: request did not complete within requested timeout 30s
client.go:205: [debug] error updating the resource "chart1-prometheus-operator-alertmanager.rules":
cannot patch "chart1-prometheus-operator-alertmanager.rules" with kind PrometheusRule: Timeout: request did not complete within requested timeout 30s
client.go:205: [debug] error updating the resource "chart1-prometheus-operator-etcd":
cannot patch "chart1-prometheus-operator-etcd" with kind PrometheusRule: Timeout: request did not complete within requested timeout 30s
....
....
and the upgrade is just stuck.