Describe the bug
PrometheusRules don't pass the webhook checks
Version of Helm and Kubernetes:
Client: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
Which chart:
stable/prometheus-operator v6.4.3
What happened:
After upgrading to the latest version of the prometheus-operator chart, any release that contains a PrometheusRule fails when trying to pass the admission control webhooks.
Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
What you expected to happen:
The rules are valid, so they should pass validation and the release should not fail.
How to reproduce it (as minimally and precisely as possible):
Run the latest version of the chart, 6.4.3.
Anything else we need to know:
root@my-shell-95cb5df57-f9tnp:/# curl https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/mutate -k
request has no body
The apiserver is the component making the request, so I'm wondering if you have something in your cluster that's preventing this from happening.
Googling around for GKE and admission hooks, I've come across this article indicating a firewall issue between the masters and the regular nodes: https://www.revsys.com/tidbits/jetstackcert-manager-gke-private-clusters/
You can simply disable the admission webhooks: prometheusOperator.admissionWebhooks.enabled=false
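For example (a sketch; prometheus-operator as the release name is an assumption, adjust to your own release):
helm upgrade prometheus-operator stable/prometheus-operator \
  --set prometheusOperator.admissionWebhooks.enabled=false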
I have just done a test with a new GCP cluster and I don't see this behaviour; however, it's possible that this is still a problem on a cluster that was provisioned earlier. This issue appears to be related to what you are seeing, any chance you could validate it?
https://github.com/kubernetes/kubernetes/issues/79739
I suspect the issue is that you're running this in a private GKE cluster:
When Google configure the control plane for private clusters, they automatically configure VPC peering between your Kubernetes cluster's network and a separate Google managed project. In order to restrict what Google are able to access within your cluster, the firewall rules configured restrict access to your Kubernetes pods. This means that in order to use the webhook component with a GKE private cluster, you must configure an additional firewall rule to allow the GKE control plane access to your webhook pod.
You can read more about how to add firewall rules for the GKE control plane nodes in the GKE docs.
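As a quick check (a sketch; the cluster name is a placeholder), you can list the firewall rules GKE created for the cluster and see whether any of them allows the master CIDR range to reach the nodes on tcp:8443:
gcloud compute firewall-rules list --filter="name~^gke-clustername"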
Alternatively, you can disable the hooks by setting prometheusOperator.admissionWebhooks.enabled=false.
Thanks @vsliouniaev, sorry for the late reply but I still haven't had time to test it. Will do it today and let you know.
I'm just updating the issue as I find possible causes. Big thanks to the folks here for pointing me in the right direction: https://github.com/coreos/prometheus-operator/issues/2711
I ran the script present in the doc and I could deploy Prometheus-operator this time. I think that fixed it. I'm going to run some more tests, but I'm pretty confident the solution works.
Just a tiny issue I had with the script: while creating the list of tags it was appending an extra ,,,,, at the end.
I added a quick and simple sed to fix that, other people may find it useful.
#!/bin/bash
# Adjust these to match your cluster.
CLUSTER_NAME=clustername
CLUSTER_REGION=europe-west1

# Look up the cluster's VPC network, the control-plane (master) CIDR block and the node pool
# target tags; the sed strips the run of extra commas appended to the tag list.
VPC_NETWORK=$(gcloud container clusters describe $CLUSTER_NAME --region $CLUSTER_REGION --format='value(network)')
MASTER_IPV4_CIDR_BLOCK=$(gcloud container clusters describe $CLUSTER_NAME --region $CLUSTER_REGION --format='value(privateClusterConfig.masterIpv4CidrBlock)')
NODE_POOLS_TARGET_TAGS=$(gcloud container clusters describe $CLUSTER_NAME --region $CLUSTER_REGION --format='value[terminator=","](nodePools.config.tags)' --flatten='nodePools[].config.tags[]' | sed 's/,\{2,\}//g')

echo $VPC_NETWORK
echo $MASTER_IPV4_CIDR_BLOCK
echo $NODE_POOLS_TARGET_TAGS

# Allow the GKE control plane to reach the admission webhook pods on port 8443.
gcloud compute firewall-rules create "allow-apiserver-to-admission-webhook-8443" \
  --allow tcp:8443 \
  --network="$VPC_NETWORK" \
  --source-ranges="$MASTER_IPV4_CIDR_BLOCK" \
  --target-tags="$NODE_POOLS_TARGET_TAGS" \
  --description="Allow apiserver access to admission webhook pod on port 8443" \
  --direction INGRESS
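Optionally, you can verify the new rule afterwards (not part of the original script):
gcloud compute firewall-rules describe allow-apiserver-to-admission-webhook-8443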
Confirmed, I was able to release applications using prometheus rules.
@vsliouniaev thanks a lot for your help!
Thanks a lot for that @amalucelli! If you could add that to the docs in the chart, that would be awesome!
The error log posted by @amartorelli shows that the call hits the service on port 443. How does enabling 8443 help in this case?
@acondrat because the call is made to the operator Service, which listens on 443 and then forwards to the pod on port 8443.
@allamand Can you please confirm what forward means in this case? Do you mean that the operator Service redirects the API server to call the pod directly on POD_IP:POD_PORT? So the first call goes to the Service on port 443, which is allowed, and the second call goes to the pod on port 8443, which is not allowed and requires an extra firewall rule.
thanks!
Yes, that's it. The pod only listens on 8443, not 443, but the Service listens on 443, not 8443.
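To illustrate the mapping, here is a minimal sketch of the Service (not the chart's exact manifest; the selector label is an assumption):
apiVersion: v1
kind: Service
metadata:
  name: prometheus-operator-operator
  namespace: monitoring
spec:
  selector:
    app: prometheus-operator-operator
  ports:
    - name: https
      port: 443        # the apiserver dials the Service on this port
      targetPort: 8443 # the Service forwards to the operator pod on this port
So even though the webhook configuration points at port 443 of the Service, the firewall rule has to allow the control plane to reach the pods on 8443.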
If you have run the chart before without disabling the webhook, you must manually delete the leftover objects of the following kinds:
kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io   # then delete all listed objects
kubectl get MutatingWebhookConfiguration   # then delete all listed objects
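For example, the deletion step could look like this (a sketch; the resource names are assumptions based on the chart's defaults, so check the output of the get commands above and adjust accordingly):
kubectl delete validatingwebhookconfiguration prometheus-operator-admission
kubectl delete mutatingwebhookconfiguration prometheus-operator-admission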
and after that run:
helm install --name prometheus-operator stable/prometheus-operator \
--set prometheusOperator.admissionWebhooks.enabled=false \
--set prometheusOperator.admissionWebhooks.patch.enabled=false \
--set prometheusOperator.tlsProxy.enabled=false
@BahmaniAlireza this worked in that it made things run, but none of the alert rules were created.
That was due to kube version restrictions in the chart.
Using the following values mostly works with GKE, aside from kube-proxy:
prometheus-operator:
  coreDns:
    enabled: false
  defaultRules:
    create: true
  kubelet:
    enabled: true
    serviceMonitor:
      https: false
  kubeControllerManager:
    enabled: false
  kubeDns:
    enabled: true
  kubeEtcd:
    enabled: false
  kubeScheduler:
    enabled: false
  kubeTargetVersionOverride: "1.15.999"
  prometheusOperator:
    admissionWebhooks:
      enabled: false
      patch:
        enabled: false
    tlsProxy:
      enabled: false
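A hedged usage sketch: the top-level prometheus-operator: key suggests these values feed a parent (umbrella) chart that pulls in stable/prometheus-operator as a dependency; the release name, chart path and values file name below are assumptions:
helm upgrade --install monitoring ./my-umbrella-chart -f gke-values.yaml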
In my case the GKE firewall rule was already added, but I had a deny-all network policy for everything in the namespace. So I granted access and PrometheusRules started working :)
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-admission-webhook-access
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      "app": "prometheus-operator"
  ingress:
    - from: []
      ports:
        - port: 8443
I took the policy example from the Elasticsearch docs; possibly we need to add it to the prometheus-operator documentation:
https://github.com/elastic/cloud-on-k8s/pull/2524/files
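To apply and verify the policy (a sketch; the file name is an assumption):
kubectl apply -f allow-admission-webhook-access.yaml
kubectl -n monitoring describe networkpolicy allow-admission-webhook-access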
Good luck.