Cert-manager: FailedDiscoveryCheck (403) with cert-manager Webhook

Created on 25 Sep 2019 · 27 comments · Source: jetstack/cert-manager

Describe the bug:
I'm trying to deploy an on-prem k8s cluster and I want to use cert-manager for the certificates. When I try to create a ClusterIssuer, it says:

Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": the server is currently unable to handle the request

When I run kubectl get apiservice, it returns the following error:
failing or missing response from https://<internal-svc-ip>:443/apis/webhook.certmanager.k8s.io/v1beta1: bad status from https://<internal-svc-ip>:443/apis/webhook.certmanager.k8s.io/v1beta1: 403

Expected behaviour:
Issuer is created when I run kubectl apply

Steps to reproduce the bug:

  • Create namespace cert-manager
  • Deploy using the manifest YAML
  • Try to create an Issuer following the example in the documentation; I also tried the test resources (roughly the shape sketched below)
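
For context, the "test resources" above refers to the verification resources from the docs; under the 0.10-era certmanager.k8s.io/v1alpha1 API group they look roughly like this (a sketch, field names may differ slightly by version):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-test
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: Issuer
metadata:
  name: test-selfsigned
  namespace: cert-manager-test
spec:
  selfSigned: {}
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: selfsigned-cert
  namespace: cert-manager-test
spec:
  commonName: example.com
  secretName: selfsigned-cert-tls
  issuerRef:
    name: test-selfsigned
EOF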

Anything else we need to know?:

Environment details:

  • Kubernetes version (e.g. v1.10.2): 1.15.3
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): on-prem
  • cert-manager version (e.g. v0.4.0): 0.10
  • Install method (e.g. helm or static manifests): static manifest at https://github.com/jetstack/cert-manager/releases/download/v0.10.0/cert-manager.yaml
  • YAML file:
    apiVersion: certmanager.k8s.io/v1alpha1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-staging
    spec:
      acme:
        email: <my-mail>
        server: https://acme-staging-v02.api.letsencrypt.org/directory
        privateKeySecretRef:
          # Secret resource used to store the account's private key.
          name: example-clusterissuer-key
        # Add a single challenge solver, HTTP01 using nginx
        solvers:
        - http01:
            ingress:
              class: nginx

I also have installed nginxinc/kubernetes-ingress.

/kind bug

Labels: lifecycle/stale, triage/support

All 27 comments

I met the same error. My k8s version is 1.16.0.

Have you confirmed that your cluster passes conformance tests? You can run them using Sonobuoy.

Specifically, you should make sure you've followed the instructions under https://kubernetes.io/docs/tasks/access-kubernetes-api/configure-aggregation-layer/.
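
For anyone unfamiliar with Sonobuoy, a conformance run looks roughly like this (a sketch; subcommand flags vary slightly between Sonobuoy versions):

# Rough shape of a Sonobuoy conformance run
sonobuoy run --wait                 # runs the default end-to-end/conformance plugin
results=$(sonobuoy retrieve)        # downloads the results tarball and prints its path
sonobuoy results "$results"         # summarises passed/failed tests
sonobuoy delete --wait              # cleans up the sonobuoy namespace when done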

Exact same issue here, but on a fresh GKE installation. I followed each step in the docs a few times just to be sure, with the same results every time. Here's what I'm seeing at the moment (it seems the apiservice cannot connect to the webhook, which is always restarted twice for reasons I don't know):

$ kubectl get pods -n cert-manager -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP          NODE                                           NOMINATED NODE   READINESS GATES
cert-manager-57c65cb5f5-2lpg8              1/1     Running   0          36m   10.1.1.14   gke-ewoz-gke-ewoz-default-pool-a1910d12-ll3x   <none>           <none>
cert-manager-cainjector-6f868ccdf6-m2nd2   1/1     Running   0          36m   10.1.0.17   gke-ewoz-gke-ewoz-default-pool-75797d79-n9vt   <none>           <none>
cert-manager-webhook-5896b5fb5c-9mpnh      1/1     Running   2          36m   10.1.0.18   gke-ewoz-gke-ewoz-default-pool-75797d79-n9vt   <none>           <none>
$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
...
  labels:
...
    helm.sh/chart: cert-manager-v0.10.0
...
status:
  conditions:
  - lastTransitionTime: "2019-09-26T14:51:43Z"
    message: 'no response from https://10.1.0.18:6443: Get https://10.1.0.18:6443:
      net/http: request canceled while waiting for connection (Client.Timeout exceeded
      while awaiting headers)'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

$ kubectl logs cert-manager-webhook-5896b5fb5c-9mpnh -n cert-manager
flag provided but not defined: -v
Usage of tls:
  -tls-cert-file string

I0926 14:52:07.441421       1 secure_serving.go:116] Serving securely on [::]:6443
I0926 15:21:03.378203       1 log.go:172] http: TLS handshake error from 10.1.1.10:37256: remote error: tls: unknown certificate authority
I0926 15:21:35.195409       1 log.go:172] http: TLS handshake error from 10.1.1.10:37298: remote error: tls: unknown certificate authority
I0926 15:22:04.276655       1 log.go:172] http: TLS handshake error from 10.1.1.10:37352: remote error: tls: unknown certificate authority
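
Given the "unknown certificate authority" handshake errors above, one quick check (not mentioned in this comment) is whether the cainjector has populated the CA bundle the apiserver uses to trust the webhook; resource names as in the v0.10 manifests:

# Should print a non-trivial byte count once the cainjector has injected the CA
kubectl get apiservice v1beta1.webhook.certmanager.k8s.io \
  -o jsonpath='{.spec.caBundle}' | wc -c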

@skuro are you using "private GKE nodes" by any chance?

@munnerz you are right, but I think I've now properly configured the firewall, so that the following changed to green status:

$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
...
status:
  conditions:
  - lastTransitionTime: "2019-09-26T17:25:06Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available

After that everything started to work as expected.

@skuro could you please share what firewall changes you made? I'm having this same issue on a GKE private cluster. Thanks!

@otakumike sure thing, here it is. Given the logs and error messages I knew the port had to be 6443 and the source addresses those of the k8s master, hence:

# 1) Retrieve the network tag automatically given to the worker nodes
# NOTE: this only works if you have only one cluster in your GCP project. You will have to manually inspect the result of this command to find the tag for the cluster you want to target
WORKER_NODES_TAG=$(gcloud compute instances list --format='text(tags.items[0])' --filter='metadata.kubelet-config:*' | grep tags | awk '{print $2}' | sort | uniq)

# 2) Take note of the VPC network in which you deployed your cluster
# NOTE this only works if you have only one network in which you deploy your clusters
NETWORK=$(gcloud compute instances list --format='text(networkInterfaces[0].network)' --filter='metadata.kubelet-config:*' | grep networks | awk -F'/' '{print $NF}' | sort | uniq)

# 3) Create the firewall rule targeting the tag above
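#    (172.16.0.0/28 below is this cluster's control-plane/master CIDR; yours may differ)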
gcloud compute firewall-rules create k8s-cert-manager \                                                                                                           
  --source-ranges 172.16.0.0/28 \
  --target-tags $WORKER_NODES_TAG  \
  --allow TCP:6443 --network $NETWORK

Thanks @skuro. Turns out I already had that rule but had forgotten about it. My problem must be somewhere else, but thanks again for the response :)

@munnerz I'll try to download Sonobuoy and run its tests. It's strange because the cluster is new (I only have nginxinc/kubernetes-ingress) and I followed the documentation.

In my case, the webhook has restarted twice, and the cainjector now has 9 restarts (after 5 days of usage).

I have the same problem:
Environment:

  • Cloud provider: GKE
  • Kubernetes version: v1.13.7-gke.8
  • Helm version: 2.14.3
  • Cert-manager version: v0.10.0
  • Nginx-ingress version: 0.25.1

Everything is in the same namespace.

Same problem with kubeadm on AWS. Kubernetes: 1.16.0

This problem was hugely annoying. I also found Helm had problems with older resources left over from v0.8: until I completely cleared things out, it would still report using the old API with the following error:

"Internal error occurred: failed calling admission webhook", even though that admission webhook had been deprecated.

I did the following and that fixed it for me:

  • Purged the Helm install
    helm delete <name> --purge --tls --tiller-namespace cert-manager

  • Made certain the following was in the RBAC permissions set for the user executing cert-manager:

    - apiGroups: ["webhook.certmanager.k8s.io"]
      resources: ["*"]
      verbs: ["*"]

  • Did a static manifest install

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml

  • Did a static manifest uninstall (warning! This clears your namespace, etc...backup your cert-manager settings first)

kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml

NOTE: For Backup/Restore see this link.

  • Re-installed using Helm after rebuilding the namespace/permissions from my yaml scripts.

In my case I feel like this is largely a problem of the upgrade path not being smooth and of not taking care to remove all the old resources.

So the solution is to create a user with full permissions and use Helm?

In my case I don't use Helm, and I used the cert-manager manifest with a new cluster... It only has the pods required to work correctly. I followed the KTHW documentation (and the official one).

Just to clarify, you should not need to create any additional RBAC resources in order to make the webhook work.

Issues like this stem from communication problems between the Kubernetes apiserver and the webhook component, and you can follow the 'chain' of communication like so:

  • The webhook runs in the cluster in the cert-manager namespace
  • A Kubernetes APIService resource exposes the webhook as a part of the Kubernetes API
  • A Kubernetes ValidatingWebhookConfiguration resource tells the apiserver to talk to the webhook via the APIService resource (i.e. it loops back and talks to itself) in order to validate resources.

If any part of that communication flow doesn't work, you'll see errors as you've described.

Typically, and as some people have noted above, this falls down at the "A Kubernetes APIService resource exposes the webhook as a part of the Kubernetes API" step: the Kubernetes apiserver is unable to communicate with the webhook.

This can be caused by many things, but for example, on GKE this is caused by firewall rules blocking communication to the Kubernetes 'worker' nodes from the control plane. This is remediated by adding additional firewall rules to grant this permission.
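
For the GKE case, the source range to allow is the cluster's control-plane (master) CIDR; a sketch of looking it up with gcloud before creating a rule like the one shown earlier (cluster and zone names are placeholders):

# Look up the control-plane CIDR of a private GKE cluster
# (use --region instead of --zone for regional clusters)
gcloud container clusters describe my-cluster --zone my-zone \
  --format='value(privateClusterConfig.masterIpv4CidrBlock)'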

On AWS, it really depends on how you've configured your VPCs/security groups and how you've configured networking. Notably though, you must configure your control plane so that it can communicate with pod/service IPs from the 'apiserver' container/network namespace.

You'll also run into this issue if you try and deploy metrics-server too, as this is deployed in a similar fashion.

Looks like it is the same as https://github.com/istio/istio/issues/10637. I build my clusters with Terraform and I was able to solve the linked issue by adding the following security group rule:

resource "aws_security_group_rule" "node_control_plane_https" {
  description              = "Allow HTTPS from control plane to nodes"
  from_port                = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.node.id
  source_security_group_id = aws_security_group.control_plane.id
  to_port                  = 443
  type                     = "ingress"
}

I will test later whether this solves this issue here, too.

I'll try to open port 443 through all nodes, but I think it is used by the nginx Ingress resource. Maybe the master node needs this port for the webhook?

I'm also using certs for the communication/auth between nodes. Is it possible that this webhook needs a valid certificate for auth? How can I configure it?

I wasn't suggesting the need to create an RBAC install. What I saw is that the upgrade doesn't occur correctly for an RBAC install without updating the permissions, many of which changed because the APIService endpoint changed. This would be relevant to a fresh install not working.

My second point was that the manifests don't correctly remove old behavior, so I needed to run a Helm purge as well as a reversal using a static manifest before installing, to clear some of the incorrect items. This seems to be the case for the issues people are having with or without RBAC involved.

In my case on a fresh GKE cluster (v1.13.7-gke.24) with kubectl (v1.11.1 or v1.14.3) it seems to just be a matter of waiting.

After I first apply the static manifest:

kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml

If I try to create any ClusterIssuer right away, I get:

Error from server (NotFound): error when deleting "cluster/platform/cert-manager/2_issuers.yaml": the server could not find the requested resource (delete clusterissuers.certmanager.k8s.io letsencrypt-staging)
Error from server (NotFound): error when deleting "cluster/platform/cert-manager/2_issuers.yaml": the server could not find the requested resource (delete clusterissuers.certmanager.k8s.io letsencrypt-prod)

This seems to correspond with:

$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
endpoints for service/cert-manager-webhook in "cert-manager" have no addresses

But if I wait a few seconds, that eventually changes to:

$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
all checks passed

And at that point if I try again to apply my ClusterIssuer manifest it works. This stops me from being able to kubectl apply -Rf my whole cert-manager + issuers manifests in one go.

Isn't there some way to let me declare everything at once and have the issuers work when they're ready? Isn't that the k8s way?

Update: Workaround

This workaround gets it done for me for now:

kubectl apply -Rf cert-manager/manifest.yaml
# work around https://github.com/jetstack/cert-manager/issues/2109
until [ "$(kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')" == "True" ];
do echo "Waiting for v1beta1.webhook.certmanager.k8s.io..." && sleep 1
done
kubectl apply -Rf cert-manager/issuers.yaml
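
If your kubectl is recent enough, kubectl wait can express the same polling loop more compactly (a sketch using the same resource and condition names as above):

# Block until the webhook APIService reports Available=True, or time out after 2 minutes
kubectl wait --for=condition=Available \
  apiservice/v1beta1.webhook.certmanager.k8s.io --timeout=120s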

@themightychris my apiservice is returning HTTP 403.

Logs are the following:
Cluster doesn't provide requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.

In the ConfigMap I only have client-ca-file. This is something that @munnerz mentioned.

I need to create a specific certificate for the webhook, but I don't know its name or where to add it.
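
That log line points at the API aggregation layer not being fully configured rather than at cert-manager itself. A minimal check, assuming a kubeadm-style layout (the certificate paths are only examples and will differ on hand-built clusters such as KTHW):

# The aggregation layer publishes its request-header CA here; an empty result means it isn't configured
kubectl -n kube-system get configmap extension-apiserver-authentication \
  -o jsonpath='{.data.requestheader-client-ca-file}' | head -c 100

# On the control-plane node, kube-apiserver needs flags along these lines
# (see the configure-aggregation-layer docs linked above):
#   --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
#   --requestheader-allowed-names=front-proxy-client
#   --requestheader-extra-headers-prefix=X-Remote-Extra-
#   --requestheader-group-headers=X-Remote-Group
#   --requestheader-username-headers=X-Remote-User
#   --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
#   --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key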

Have you confirmed that your cluster passes conformance tests? You can run them using Sonobuoy.

Specifically, you should make sure you've followed the instructions under https://kubernetes.io/docs/tasks/access-kubernetes-api/configure-aggregation-layer/.

I'm seeing this as well in EKS when trying to use a custom CNI. For metrics-server, I put the pods backing its APIService on the host network and that resolved the issue:

https://github.com/helm/charts/blob/c4d3dde988271fddf80c00bd9281453202234b9d/stable/metrics-server/templates/metrics-server-deployment.yaml#L38-L40

Can we get something like this for the cert-manager chart? Manually adding this to the deployment after the install makes the APIService become available:

kubectl get apiservice v1beta1.webhook.cert-manager.io
NAME                              SERVICE                             AVAILABLE   AGE
v1beta1.webhook.cert-manager.io   cert-manager/cert-manager-webhook   True        2d20h
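
For reference, "manually adding this" can be approximated with a patch along these lines; this is only a sketch that assumes the default deployment name from the static manifests and that port 6443 is free on the node's host network:

# Put the webhook pod on the host network so the control plane can reach it
# even when it cannot route to pod IPs on the custom CNI
kubectl -n cert-manager patch deployment cert-manager-webhook --type merge -p \
  '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'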

Just stumbled upon this. It seems to be related to #2340. I also have a private cluster with GKE and adding an ingress firewall rule granting access from the master API CIDR range to port 6443 resolved the issue for me.

This is also documented here

It's probably worth mentioning the following: after creating the firewall rule, I ran kubectl apply -f test-resources.yml, watched it create the resources with no errors, and confirmed "certificate issued successfully" for the test resources. I then deleted the firewall rule, deleted the test-resources.yml resources, and re-created them successfully without the firewall rule; in the meantime, the webhook seemed to keep working fine.

Only by removing the entire Helm chart and re-adding it could I again see the initial error (given the firewall rule was no longer in place).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

I'm going to close this now as I don't think there's anything for us to do - we have explicit notes in the documentation around using private clusters/apiservers :)

If you think there's somewhere we could improve, please open an issue describing the improvement to make and/or create a PR over at github.com/jetstack/cert-manager.

If anyone here is attempting to terraform this and running into this issue, a decent solution is to manually tag your nodes like so:

resource "google_container_cluster_nodepool" "utilities_nodepool" {
  ...
  node_config {
    ...
    tags = [var.cert_manager_node_network_tag] 
    ...
}

and create a firewall rule utilising the tag like so:

resource "google_compute_firewall" "cert-manager-firewall-rule" {
  name                   = "cert-manager-firewall-rule"
  project                = var.project_id
  network                = var.vpc_network_name
  source_ranges          = var.master_cidr
  target_tags            = [var.cert_manager_node_network_tag]

  allow {
    protocol = "tcp"
    ports = ["6443"] 
  }
}

Of course, you'll need to ensure that cert-manager is installed only on those nodes, but that can be done easily with node selectors/affinity and taints.
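
As a rough illustration of that last point: when installing via Helm, the chart exposes nodeSelector and tolerations values for the controller (and, in newer chart versions, for the webhook). The label and taint names below are hypothetical and assume you have labelled and tainted the nodes yourself; note that Kubernetes node labels are separate from the GCP network tag used by the firewall rule above.

# values.yaml sketch -- verify the exact keys against your chart version's values.yaml
nodeSelector:
  dedicated: cert-manager
tolerations:
  - key: dedicated
    operator: Equal
    value: cert-manager
    effect: NoSchedule
webhook:
  nodeSelector:
    dedicated: cert-manager
  tolerations:
    - key: dedicated
      operator: Equal
      value: cert-manager
      effect: NoSchedule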

@raeballz thanks for the details 😄 Would you be able to make a PR to our 'Compatibility' docs with this info, for others to use in future? I think it'd really help to have some examples like this! https://cert-manager.io/docs/installation/compatibility/#gke

You can find the Markdown document to edit here: https://github.com/cert-manager/website/blob/master/content/en/docs/installation/compatibility.md

Ran into this about a year ago and just ran into it again. It took me ages to figure it out by eventually stumbling onto this same thread again.

The official docs which I found via Google:
https://cert-manager.io/docs/faq/webhook/
https://cert-manager.io/docs/installation/compatibility/

All they say is:

In order to use the webhook component with a GKE private cluster, you must configure an additional firewall rule to allow the GKE control plane access to your webhook pod.

With no mention of WHICH firewall rule needs to be added. Now that I look at my previous rules, I can see that I had to add 6443 last time. Hopefully next year's me finds this comment again, so here's the error as a search tag:

Error from server (InternalError): error when creating "/data/helm/certmanager/config/cluster_issuer.yaml": Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": the server is currently unable to handle the request
