Describe the bug:
I'm trying to deploy an on-prem k8s cluster and I want to use cert-manager for the certificates. When I try to create a ClusterIssuer, it says:
Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": the server is currently unable to handle the request
When I run kubectl get apiservice, it returns the following error:
failing or missing response from https://<internal-svc-ip>:443/apis/webhook.certmanager.k8s.io/v1beta1: bad status from https://<internal-svc-ip>:443/apis/webhook.certmanager.k8s.io/v1beta1: 403
Expected behaviour:
Issuer is created when I run kubectl apply
Steps to reproduce the bug:
Install cert-manager, then apply the ClusterIssuer manifest below.
Anything else we need to know?:
Environment details:
cert-manager v0.10.0 installed from https://github.com/jetstack/cert-manager/releases/download/v0.10.0/cert-manager.yaml
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: <my-mail>
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # Secret resource used to store the account's private key.
      name: example-clusterissuer-key
    # Add a single challenge solver, HTTP01 using nginx
    solvers:
    - http01:
        ingress:
          class: nginx
I also have installed nginxinc/kubernetes-ingress.
/kind bug
I'm hitting the same error. My k8s version is 1.16.0.
Have you confirmed that your cluster passes conformance tests? You can run them using Sonobuoy.
Specifically, you should make sure you've followed the instructions under https://kubernetes.io/docs/tasks/access-kubernetes-api/configure-aggregation-layer/.
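For anyone unfamiliar with Sonobuoy, a rough sketch of that conformance run looks like the following (commands from the Sonobuoy CLI; verify the flags against your installed version):
# Run the conformance tests and block until they finish
sonobuoy run --wait
# Check overall pass/fail status and download the full results tarball
sonobuoy status
sonobuoy retrieve
# Clean up when done
sonobuoy delete --wait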
Same exact issue here, but on a fresh GKE installation. I followed each step in the docs a few times just to be sure, with the same results every time. Here's what I'm seeing at the moment (it seems the apiservice cannot connect to the webhook, which always gets restarted twice for reasons I don't know):
$ kubectl get pods -n cert-manager -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager-57c65cb5f5-2lpg8 1/1 Running 0 36m 10.1.1.14 gke-ewoz-gke-ewoz-default-pool-a1910d12-ll3x <none> <none>
cert-manager-cainjector-6f868ccdf6-m2nd2 1/1 Running 0 36m 10.1.0.17 gke-ewoz-gke-ewoz-default-pool-75797d79-n9vt <none> <none>
cert-manager-webhook-5896b5fb5c-9mpnh 1/1 Running 2 36m 10.1.0.18 gke-ewoz-gke-ewoz-default-pool-75797d79-n9vt <none> <none>
$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
...
labels:
...
helm.sh/chart: cert-manager-v0.10.0
...
status:
conditions:
- lastTransitionTime: "2019-09-26T14:51:43Z"
message: 'no response from https://10.1.0.18:6443: Get https://10.1.0.18:6443:
net/http: request canceled while waiting for connection (Client.Timeout exceeded
while awaiting headers)'
reason: FailedDiscoveryCheck
status: "False"
type: Available
$ kubectl logs cert-manager-webhook-5896b5fb5c-9mpnh -n cert-manager
flag provided but not defined: -v
Usage of tls:
-tls-cert-file string
I0926 14:52:07.441421 1 secure_serving.go:116] Serving securely on [::]:6443
I0926 15:21:03.378203 1 log.go:172] http: TLS handshake error from 10.1.1.10:37256: remote error: tls: unknown certificate authority
I0926 15:21:35.195409 1 log.go:172] http: TLS handshake error from 10.1.1.10:37298: remote error: tls: unknown certificate authority
I0926 15:22:04.276655 1 log.go:172] http: TLS handshake error from 10.1.1.10:37352: remote error: tls: unknown certificate authority
@skuro are you using "private GKE nodes" by any chance?
@munnerz you are right, but I think I've now properly configured the firewall, so that the following changed to green status:
$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
...
status:
conditions:
- lastTransitionTime: "2019-09-26T17:25:06Z"
message: all checks passed
reason: Passed
status: "True"
type: Available
After that everything started to work as expected.
@skuro could you please share what firewall changes you made? I'm having this same issue on a GKE private cluster. Thanks!
@otakumike sure thing, here it is. Given the logs and error messages I knew the port had to be 6443 and the source addresses those of the k8s master, hence:
# 1) Retrieve the network tag automatically given to the worker nodes
# NOTE: this only works if you have only one cluster in your GCP project. You will have to manually inspect the result of this command to find the tag for the cluster you want to target
WORKER_NODES_TAG=$(gcloud compute instances list --format='text(tags.items[0])' --filter='metadata.kubelet-config:*' | grep tags | awk '{print $2}' | sort | uniq)
# 2) Take note of the VPC network in which you deployed your cluster
# NOTE this only works if you have only one network in which you deploy your clusters
NETWORK=$(gcloud compute instances list --format='text(networkInterfaces[0].network)' --filter='metadata.kubelet-config:*' | grep networks | awk -F'/' '{print $NF}' | sort | uniq)
# 3) Create the firewall rule targeting the tag above
gcloud compute firewall-rules create k8s-cert-manager \
--source-ranges 172.16.0.0/28 \
--target-tags $WORKER_NODES_TAG \
--allow TCP:6443 --network $NETWORK
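To double-check the rule afterwards, the standard describe command works (same rule name as created above):
# 4) (Optional) Verify the rule was created and targets the expected tag
gcloud compute firewall-rules describe k8s-cert-manager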
Thanks @Skuro. Turns out I already had that rule but had forgotten about it. My problem must be somewhere else, but thanks again for the response :)
@munnerz I'll try to download Sonobuoy and launch their tests. It's strange because the cluster is new (I only have nginxinc-kubernetes-ingress) and I followed the documentation.
In my case, the webhook has restarted twice, and now the cainjector has 9 restarts (5 days of usage).
I have the same problem:
Environment:
Everything is in the same namespace.
Same problem with kubeadm on AWS. Kubernetes: 1.16.0
This problem was hugely annoying. I also found Helm had problems between the older v0.8 resources and the updated ones, because until I totally cleared everything out it would still report using the old API with the following error:
"Internal error occurred: failed calling admission webhook" when the admission webhook was deprecated.
I did the following and that fixed it for me:
Purged the Helm install
helm delete <name> --purge --tls --tiller-namespace cert-manager
Made certain the following was in the RBAC permissions set for the user executing cert-manager:
- apiGroups: ["webhook.certmanager.k8s.io"]
  resources: ["*"]
  verbs: ["*"]
Ran a static manifest reversal, then did a fresh static manifest install:
kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml
NOTE: For Backup/Restore see this link.
In my case I feel like this is largely a problem with the upgrade path not being smooth and with having to take care to remove all the old stuff.
So the solution is to create a user with full permissions and use Helm?
In my case I don't use Helm; I used the cert-manager manifest on a new cluster... It only has the required pods to work correctly. I followed the KTHW (Kubernetes The Hard Way) documentation (and the official one).
Just to clarify, you should not need to create any additional RBAC resources in order to make the webhook work.
Issues like this stem from communication problems between the Kubernetes apiserver and the webhook component, and you can follow the 'chain' of communication: the apiserver resolves the APIService resource, which points at the webhook Service, which in turn routes to the webhook pod.
If any part of that communication flow doesn't work, you'll see errors as you've described.
Typically, and as some people have noted above, this falls down at the APIService level (the APIService resource that exposes the webhook as part of the Kubernetes API): the Kubernetes apiserver is unable to communicate with the webhook.
This can be caused by many things, but for example, on GKE this is caused by firewall rules blocking communication to the Kubernetes 'worker' nodes from the control plane. This is remediated by adding additional firewall rules to grant this permission.
On AWS, it really depends on how you've configured your VPCs/security groups and how you've configured networking. Notably though, you must configure your control plane so that it can communicate with pod/service IPs from the 'apiserver' container/network namespace.
You'll also run into this issue if you try and deploy metrics-server too, as this is deployed in a similar fashion.
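As a rough illustration of walking that chain (not from the original comment; the resource names assume the default v0.10 install in the cert-manager namespace):
# 1) Is the aggregated API considered Available by the apiserver?
kubectl get apiservice v1beta1.webhook.certmanager.k8s.io
# 2) Does the webhook Service have endpoints backing it?
kubectl -n cert-manager get endpoints cert-manager-webhook
# 3) Is the webhook pod itself healthy, or logging TLS/serving errors?
kubectl -n cert-manager logs deploy/cert-manager-webhook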
Looks like it is the same as https://github.com/istio/istio/issues/10637. I build my clusters with Terraform and I was able to solve the linked issue by adding the following security group rule:
resource "aws_security_group_rule" "node_control_plane_https" {
description = "Allow HTTPS from control plane to nodes"
from_port = 443
protocol = "tcp"
security_group_id = aws_security_group.node.id
source_security_group_id = aws_security_group.control_plane.id
to_port = 443
type = "ingress"
}
I will test later whether this solves this issue here, too.
I'll try to open port 443 across all nodes, but I think it is used by the nginx Ingress resource. Maybe the master node needs this port for the webhook?
I'm also using certs for the communication/auth between nodes. Is it possible that this webhook needs a valid certificate for auth? How can I configure it?
I wasn't suggesting the need to create an RBAC install; what I saw is that the upgrade doesn't occur correctly for an RBAC install without updating the permissions, many of which changed because the APIService endpoint changed. This would be relevant to a fresh install not working.
The second point was that the manifests don't correctly remove old behavior. So I needed to run a helm purge as well as a reversal using a static manifest before installing, to clear some of the incorrect items. This seems to be the case for the issues people are having with/without RBAC involved.
In my case on a fresh GKE cluster (v1.13.7-gke.24) with kubectl (v1.11.1 or v1.14.3) it seems to just be a matter of waiting.
After I first apply the static manifest:
kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.10.1/cert-manager.yaml
If I try to create any ClusterIssuer right away, I get:
Error from server (NotFound): error when deleting "cluster/platform/cert-manager/2_issuers.yaml": the server could not find the requested resource (delete clusterissuers.certmanager.k8s.io letsencrypt-staging)
Error from server (NotFound): error when deleting "cluster/platform/cert-manager/2_issuers.yaml": the server could not find the requested resource (delete clusterissuers.certmanager.k8s.io letsencrypt-prod)
This seems to correspond with:
$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
endpoints for service/cert-manager-webhook in "cert-manager" have no addresses
But if I wait a few seconds, that eventually changes to:
$ kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
all checks passed
And at that point if I try again to apply my ClusterIssuer manifest it works. This stops me from being able to kubectl apply -Rf my whole cert-manager + issuers manifests in one go.
Isn't there some way to let me declare everything at once and have the issuers work when they're ready? Isn't that the k8s way?
This workaround gets it done for me for now:
kubectl apply -Rf cert-manager/manifest.yaml
# work around https://github.com/jetstack/cert-manager/issues/2109
until [ "$(kubectl get apiservice v1beta1.webhook.certmanager.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')" == "True" ];
do echo "Waiting for v1beta1.webhook.certmanager.k8s.io..." && sleep 1
done
kubectl apply -Rf cert-manager/issuers.yaml
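A possibly tidier variant of the same wait, assuming your kubectl version supports kubectl wait on resources with status conditions:
# Block until the aggregated API reports Available, then apply the issuers
kubectl wait --for=condition=Available apiservice/v1beta1.webhook.certmanager.k8s.io --timeout=120s
kubectl apply -Rf cert-manager/issuers.yaml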
@themightychris my apiservice is returning HTTP 403.
Logs are the following:
Cluster doesn't provide requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
In the ConfigMap I only have client-ca-file. This is something that @munnerz mentioned.
It seems I need to create a specific certificate for the webhook, but I don't know what to name it or where to add it.
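For what it's worth, the keys that ConfigMap actually contains can be listed with plain kubectl; per the aggregation-layer docs linked above, requestheader-client-ca-file is populated from the kube-apiserver's --requestheader-* flags:
# Show the aggregation-layer auth ConfigMap; check whether requestheader-client-ca-file
# is present alongside client-ca-file
kubectl -n kube-system describe configmap extension-apiserver-authentication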
Have you confirmed that your cluster passes conformance tests? You can run them using Sonobuoy.
Specifically, you should make sure you've followed the instructions under https://kubernetes.io/docs/tasks/access-kubernetes-api/configure-aggregation-layer/.
I'm seeing this as well in EKS when trying to use a custom CNI. For metrics-server, I put the pods on the host network and that resolved the issue.
Can we get something like this for the cert-manager chart? Manually adding the same host-network setting to the webhook Deployment after the install makes the APIService go Available:
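The exact snippet isn't preserved in this thread, but a minimal sketch of that kind of change, assuming the stock Deployment name and namespace (not the commenter's original patch), could look like:
# Hedged sketch: run the webhook on the host network so the EKS control plane can
# reach it even when the custom CNI's pod IPs aren't routable from the apiserver.
# dnsPolicy is adjusted so in-cluster DNS keeps working with hostNetwork.
kubectl -n cert-manager patch deployment cert-manager-webhook --type merge -p \
  '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'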
kubectl get apiservice v1beta1.webhook.cert-manager.io
NAME SERVICE AVAILABLE AGE
v1beta1.webhook.cert-manager.io cert-manager/cert-manager-webhook True 2d20h
Just stumbled upon this. It seems to be related to #2340. I also have a private cluster with GKE and adding an ingress firewall rule granting access from the master API CIDR range to port 6443 resolved the issue for me.
This is also documented here
It's probably worth mentioning the following: after creating the firewall rule, running kubectl apply -f test-resources.yml, watching it create the resources with no errors, and confirming "certificate issued successfully" for the test resources, I deleted the firewall rule, deleted the test-resources.yml resources, and re-created them successfully without the firewall rule. In the meantime, the webhook seemed to keep working fine.
Only by removing the entire Helm chart and re-adding it could I see the initial error again (given the firewall rule was no longer in place).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
I'm going to close this now as I don't think there's anything for us to do - we have explicit notes in the documentation around using private clusters/apiservers :)
If you think there's somewhere we could improve, please open an issue describing the improvement to make and/or create a PR over at github.com/jetstack/cert-manager.
If anyone here is attempting to terraform this and running into this issue, a decent solution is to manually tag your nodes like so:
resource "google_container_cluster_nodepool" "utilities_nodepool" {
...
node_config {
...
tags = [var.cert_manager_node_network_tag]
...
}
and create a firewall rule utilising the tag like so:
resource "google_compute_firewall" "cert-manager-firewall-rule" {
name = "cert-manager-firewall-rule"
project = var.project_id
network = var.vpc_network_name
source_ranges = var.master_cidr
target_tags = [var.cert_manager_node_network_tag]
allow {
protocol = "tcp"
ports = ["6443"]
}
}
Of course, you'll need to ensure that cert-manager is installed only on those nodes, but that can be done easily with node selectors and taints (a hypothetical example follows below).
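A hypothetical sketch of pinning the install to that node pool with the Helm chart. The value names (nodeSelector, webhook.nodeSelector, cainjector.nodeSelector), the GKE node-pool label, and the pool name are assumptions; verify them against your chart version:
# Hypothetical: schedule the cert-manager components onto the tagged node pool
# via a values file; chart value names assumed, check against your chart version.
cat > cert-manager-values.yaml <<'EOF'
nodeSelector:
  cloud.google.com/gke-nodepool: utilities
webhook:
  nodeSelector:
    cloud.google.com/gke-nodepool: utilities
cainjector:
  nodeSelector:
    cloud.google.com/gke-nodepool: utilities
EOF
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager -f cert-manager-values.yaml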
@raeballz thanks for the details! Would you be able to make a PR to our 'Compatibility' docs with this info, for others to use in future? I think it'd really help to have some examples like this! https://cert-manager.io/docs/installation/compatibility/#gke
You can find the Markdown document to edit here: https://github.com/cert-manager/website/blob/master/content/en/docs/installation/compatibility.md
Ran into this about a year ago and ran into it again just now. It took me ages to figure out, by eventually stumbling on this same thread again.
The official docs which I found via Google:
https://cert-manager.io/docs/faq/webhook/
https://cert-manager.io/docs/installation/compatibility/
All they say is:
In order to use the webhook component with a GKE private cluster, you must configure an additional firewall rule to allow the GKE control plane access to your webhook pod.
With no mention of WHICH firewall rule needs to be added. Now that I look at my previous rules, I can see that I had to allow port 6443 last time. Hopefully next year I find myself here again via this tag:
Error from server (InternalError): error when creating "/data/helm/certmanager/config/cluster_issuer.yaml": Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": the server is currently unable to handle the request