Describe the bug:
Installing cert-manager ends with
webhook fails to start MountVolume.SetUp failed for volume "certs" : secret "cert-manager-webhook-webhook-tls" not found
Expected behaviour:
No errors, pods start without errors
Steps to reproduce the bug:
Simply install cert-manager via Helm or the static manifests
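A minimal reproduction, as a sketch (assumes the jetstack Helm repo is already added; the chart version matches the one used in this report):

```shell
# Sketch of the install that triggers the failure (Helm repo assumed added).
kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v0.12.0
# The webhook pod then hangs in ContainerCreating with the FailedMount event.
kubectl -n cert-manager get pods
```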
Anything else we need to know?:
Installation result with helm
helm ls --namespace cert-manager
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
cert-manager cert-manager 2 2019-12-16 18:40:14.296856384 +0100 CET deployed cert-manager-v0.12.0 v0.12.0
and the pods
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-784bc9c58b-xq25x 1/1 Running 0 20m
cert-manager cert-manager-cainjector-85fbdf788-d8s5l 0/1 CrashLoopBackOff 9 28m
cert-manager cert-manager-webhook-76f9b64b45-brpp5 0/1 ContainerCreating 0 28m
default multitool 1/1 Running 0 88m
ingress-nginx default-http-backend-67cf578fc4-lr5jw 1/1 Running 0 32h
ingress-nginx nginx-ingress-controller-7gczj 1/1 Running 0 32h
ingress-nginx nginx-ingress-controller-x5j2x 1/1 Running 0 32h
kube-system calico-kube-controllers-5fd6f588f8-jhtl5 1/1 Running 1 107m
kube-system calico-node-82s74 1/1 Running 0 92m
kube-system calico-node-qv7fg 1/1 Running 0 92m
kube-system coredns-5c59fd465f-nlwcw 1/1 Running 0 32h
kube-system coredns-5c59fd465f-z8jvg 1/1 Running 0 32h
kube-system coredns-autoscaler-d765c8497-hrkzk 1/1 Running 0 32h
kube-system metrics-server-64f6dffb84-5mwrk 1/1 Running 0 32h
kube-system rke-coredns-addon-deploy-job-mldcf 0/1 Completed 0 32h
kube-system rke-ingress-controller-deploy-job-wxvt7 0/1 Completed 0 32h
kube-system rke-metrics-addon-deploy-job-szd4v 0/1 Completed 0 32h
kube-system rke-network-plugin-deploy-job-d9cbg 0/1 Completed 0 32h
and there is definitely no such secret cert-manager-webhook-webhook-tls
kubectl get secret -n cert-manager
NAME TYPE DATA AGE
cert-manager-cainjector-token-m65nj kubernetes.io/service-account-token 3 18m
cert-manager-token-rzmdx kubernetes.io/service-account-token 3 18m
cert-manager-webhook-token-59qnz kubernetes.io/service-account-token 3 18m
Pod details cert-manager-cainjector
kubectl describe pod cert-manager-cainjector-6659d6844d-mpxc7 -n cert-manager
Name: cert-manager-cainjector-6659d6844d-mpxc7
Namespace: cert-manager
Priority: 0
Node: x.x.x.x/192.168.100.2
Start Time: Tue, 17 Dec 2019 17:55:34 +0100
Labels: app=cainjector
app.kubernetes.io/instance=cert-manager
app.kubernetes.io/managed-by=Tiller
app.kubernetes.io/name=cainjector
helm.sh/chart=cert-manager-v0.12.0
pod-template-hash=6659d6844d
Annotations: cni.projectcalico.org/podIP: 10.42.111.203/32
Status: Running
IP: 10.42.111.203
IPs:
IP: 10.42.111.203
Controlled By: ReplicaSet/cert-manager-cainjector-6659d6844d
Containers:
cert-manager:
Container ID: docker://674aeca3b8baed3c230c349e9bfea0f50b3cc287adddb6733e282e306712ed49
Image: quay.io/jetstack/cert-manager-cainjector:v0.12.0
Image ID: docker-pullable://quay.io/jetstack/cert-manager-cainjector@sha256:9ff6923f6c567573103816796df283d03256bc7a9edb7450542e106b349cf34a
Port: <none>
Host Port: <none>
Args:
--v=2
--leader-election-namespace=kube-system
State: Terminated
Reason: Error
Exit Code: 255
Started: Tue, 17 Dec 2019 17:56:11 +0100
Finished: Tue, 17 Dec 2019 17:56:41 +0100
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Tue, 17 Dec 2019 17:55:38 +0100
Finished: Tue, 17 Dec 2019 17:56:08 +0100
Ready: False
Restart Count: 1
Environment:
POD_NAMESPACE: cert-manager (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from cert-manager-cainjector-token-lhz85 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cert-manager-cainjector-token-lhz85:
Type: Secret (a volume populated by a Secret)
SecretName: cert-manager-cainjector-token-lhz85
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned cert-manager/cert-manager-cainjector-6659d6844d-mpxc7 to x.x.x.x
Normal Pulled 9s (x2 over 42s) kubelet, x.x.x.x Container image "quay.io/jetstack/cert-manager-cainjector:v0.12.0" already present on machine
Normal Created 8s (x2 over 41s) kubelet, x.x.x.x Created container cert-manager
Normal Started 8s (x2 over 41s) kubelet, x.x.x.x Started container cert-manager
Warning BackOff <invalid> kubelet, x.x.x.x Back-off restarting failed container
Pod details cert-manager-webhook
kubectl describe pod cert-manager-webhook-547567b88f-b7fzk -n cert-manager
Name: cert-manager-webhook-547567b88f-b7fzk
Namespace: cert-manager
Priority: 0
Node: x.x.x.x/192.168.100.1
Start Time: Tue, 17 Dec 2019 17:55:36 +0100
Labels: app=webhook
app.kubernetes.io/instance=cert-manager
app.kubernetes.io/managed-by=Tiller
app.kubernetes.io/name=webhook
helm.sh/chart=cert-manager-v0.12.0
pod-template-hash=547567b88f
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/cert-manager-webhook-547567b88f
Containers:
cert-manager:
Container ID:
Image: quay.io/jetstack/cert-manager-webhook:v0.12.0
Image ID:
Port: <none>
Host Port: <none>
Args:
--v=2
--secure-port=10250
--tls-cert-file=/certs/tls.crt
--tls-private-key-file=/certs/tls.key
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:6080/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:6080/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAMESPACE: cert-manager (v1:metadata.namespace)
Mounts:
/certs from certs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from cert-manager-webhook-token-lf56p (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
certs:
Type: Secret (a volume populated by a Secret)
SecretName: cert-manager-webhook-tls
Optional: false
cert-manager-webhook-token-lf56p:
Type: Secret (a volume populated by a Secret)
SecretName: cert-manager-webhook-token-lf56p
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned cert-manager/cert-manager-webhook-547567b88f-b7fzk to y.y.y.y
Warning FailedMount <invalid> kubelet, y.y.y.y Unable to attach or mount volumes: unmounted volumes=[certs], unattached volumes=[cert-manager-webhook-token-lf56p certs]: timed out waiting for the condition
Warning FailedMount <invalid> (x9 over 118s) kubelet, y.y.y.y MountVolume.SetUp failed for volume "certs" : secret "cert-manager-webhook-tls" not found
possible related issues (mostly closed)
Environment details:
Kubernetes version: v1.16.2
Cloud provider or hardware: baremetal
cert-manager version: 0.10.0, 0.11.0 and 0.12.0
Install method: helm and static manifests
/kind bug
Can confirm, currently no way of getting this into an operational state.
You are also running k8s v.1.16?
No, actually running on 1.17.0.
Seems to not be related to k8s then, is it?
However, I did spin up a 1.16 cluster and had the same issue, not sure. Maybe issue started with changes made from k8s 1.16, will start digging into it in a few days.
same problem here on gke with kubernetes 1.14
@g0blin79 with which version of cert-manager? Did you try other versions?
@papanito 0.12.0 but same problem with 0.11.x
So I tried some more stuff today, no success. Rolling everything back inside the actual cluster and doing everything as the docs say didn't help at all.
Local cluster using VMs also didn't help. All on the latest k8s.
Has anyone else tried something that helped by any chance?
I had it working for quite a few times in the past, I can't grasp what is going wrong now...
Neither can I. I had it working before as well, with 1.10 as well as 1.11
Same problem with cert-manager 0.8.1 on kube 1.15.3 in an AWS kops cluster. I'm currently debugging this so it's a fresh cluster where the only thing running on it is cert-manager.
Installed with (basically):
wget https://github.com/jetstack/cert-manager/releases/download/v0.8.1/cert-manager.yaml
kubectl apply --validate=false -f cert-manager.yaml
# unable to recognize "namespaces/cert-manager": no matches for kind "ClusterIssuer" in version "certmanager.k8s.io/v1alpha1"
# some kind of race? usually applying it twice just works 🤡
kubectl apply --validate=false -f cert-manager.yaml
I'm getting things like
Error from server (InternalError): error when creating "namespaces/cert-manager": Internal error occurred: failed calling webhook "clusterissuers.admission.certmanager.k8s.io": the server is currently unable to handle the request
When I inspect the webhook pod I see:
MountVolume.SetUp failed for volume "certs" : secret "cert-manager-webhook-webhook-tls" not found
Back-off restarting failed container
Here's some logs: logs-from-webhook-in-cert-manager-webhook-dfcbcc64b-6tg7k.txt
Some standouts:
flag provided but not defined: -v
Usage of tls:
-tls-cert-file string
W1229 17:43:31.745737 1 authentication.go:262] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:cert-manager:cert-manager-webhook" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
F1229 17:43:31.746273 1 cmd.go:42] configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:cert-manager:cert-manager-webhook" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
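Following the hint in the warning above, a rolebinding along these lines might grant the webhook's service account the missing read access (the binding name here is an arbitrary placeholder; this is a guess at the log's suggested fix, not a confirmed solution):

```shell
# Hypothetical fix based on the log's own suggestion; the rolebinding name
# "cert-manager-webhook-auth-reader" is made up for this sketch.
kubectl create rolebinding cert-manager-webhook-auth-reader \
  --namespace kube-system \
  --role=extension-apiserver-authentication-reader \
  --serviceaccount=cert-manager:cert-manager-webhook
```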
Obviously looks like some kind of permissions error, because I do see the "cert-manager-webhook-webhook-tls" secret in the cert-manager namespace
NAME TYPE DATA AGE
cert-manager-cainjector-token-dwlzt kubernetes.io/service-account-token 3 26m
cert-manager-token-4f2dr kubernetes.io/service-account-token 3 26m
cert-manager-webhook-ca kubernetes.io/tls 3 25m
cert-manager-webhook-token-8lcnh kubernetes.io/service-account-token 3 26m
cert-manager-webhook-webhook-tls kubernetes.io/tls 3 25m
default-token-q65fw kubernetes.io/service-account-token 3 26m
Still looking
@austinpray interesting! At least your secrets are being created. Many people here have issues with the secrets themselves, so it seems you're one step ahead of us. Have you tried setting the permissions of the service account manually and retrying? Maybe that gets it working!
Interesting indeed, in my case I never saw the secrets created. Mine is also a fresh cluster, set up from scratch - using rke - and does not run anything at the moment. Also, the installation with e.g. Helm does not give any indication that something went wrong
helm install \
cert-manager \
--namespace cert-manager \
--version v0.12.0 \
jetstack/cert-manager
NAME: cert-manager
LAST DEPLOYED: Mon Dec 30 15:48:19 2019
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
cert-manager has been deployed successfully!
In order to begin issuing certificates, you will need to set up a ClusterIssuer
or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).
More information on the different types of issuers and how to configure them
can be found in our documentation:
https://docs.cert-manager.io/en/latest/reference/issuers.html
For information on how to configure cert-manager to automatically provision
Certificates for Ingress resources, take a look at the `ingress-shim`
documentation:
Does anyone know where to find the source for the webhook? Is it open sourced somewhere or is it just the built image?
Apparently the controller is responsible
The webhook's 'webhookbootstrap' controller is responsible for creating these
secrets with no manual intervention needed
So looking at the code it seems it's created here:
https://github.com/jetstack/cert-manager/blob/master/cmd/controller/app/controller.go#L242
Don't know if that helps you @filipweidemann
Thank you @papanito. Yes that helps, until someone with an actual solution comes up I just want to try and work it out as well. Fingers crossed. :)
By the way, are you guys running high availability clusters by any chance? Because I am and I want to rule out that my API servers are somehow messing with it :P
@austinpray
@papanito
@g0blin79
Nope simple cluster with 2 nodes at the moment
@filipweidemann yep I've got 3 masters and 3 nodes
Okay so that's also a dead end...
Thanks for the replies though 👍
Far-fetched, but could it be related to the underlying OS? What are you guys using? My nodes run on Debian 10.
Debian 10 on my nodes as well @papanito
For me the issue was fixed after the proper label was added to the cert-manager namespace. I had the label
cert-manager.io/disable-validation: "true"
but in my case (helm chart version 0.8.1) it had to be
certmanager.k8s.io/disable-validation: "true"
Once that was done, the missing certificates for the cluster were finally generated - first cert-manager-webhook-ca (took around a minute) and then cert-manager-webhook-webhook-tls (also not immediately).
To be sure that you have similar problem, check list of your issuers and certificates:
kubectl describe issuers -n cert-manager
kubectl describe certificates -n cert-manager
The source that clarified it for me was https://cert-manager-munnerz.readthedocs.io/en/stable/admin/resource-validation-webhook.html
It says that the webhook is enabled if the cert-manager namespace carries this new label name, which can be added with
kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true
I struggled with cert-manager for 2 days - deleting, cleaning, reinstalling, adding certificates already generated in another cluster - but nothing helped until this label was assigned...
Most important - do not hurry and track status of certificates via
kubectl describe certificate --namespace cert-manager
Damn, I see. They changed the label prefix! I am not sure whether I used the cert-manager.io one; I may have used the old deprecated one for version 0.12
Gonna give this a shot again @dladlk
Doesn't help with the 0.12 version of cert-manager. Same error, no webhook-tls secret is being created.
I just removed one worker node, ran kubeadm reset and bootstrapped a single node control plane on it, tainted the master and deployed cert-manager, and guess what? All secrets are being created as they should. There is something going on, but I am not sure that this is a bug anymore or even related to cert-manager...
@dakale thanks for the hint. I've added the label, however still the same problem. Also deleting the complete namespace and re-create cert-manager again did not help. Will have a look at what @filipweidemann mentioned
Just an idea: could it all be related to some networking issues?
Because even though the single node control plane with tainted master stuff worked, adding a node and scheduling the cert-manager pods on the actual worker node literally creates the exact same issues...
Are you guys running Calico?
Yes, running Calico with NFT enabled for the Calico daemonset according to projectcalico/calico#2322
....
Environment:
FELIX_IPTABLESBACKEND: NFT
....
Alright. I had calico running as well, but also tried Flannel in the meantime, same result.
I was getting sick and tired of wasting my time trying to understand what the whole chain of errors leading up to this one actually looks like, and I don't want to waste more time. So here's something I tried a few minutes ago out of straight-up anger, and it worked (hacky, but it works):
First of all, taint one of your master nodes (doesn't matter if you're running HA clusters or single node control planes), then assign a label to the tainted master, something like kubectl label node yourmaster schedule-certmanager=true.
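Spelled out, those node commands might look like the following sketch (node name is a placeholder; note that if the master carries the usual NoSchedule taint, a nodeSelector alone won't let pods land there, so this sketch removes that taint rather than adding one):

```shell
# "yourmaster" is a placeholder; assumes the conventional master taint key.
# Remove the NoSchedule taint so workloads can be scheduled on the master:
kubectl taint node yourmaster node-role.kubernetes.io/master:NoSchedule-
# Label it so a matching nodeSelector can target this node:
kubectl label node yourmaster schedule-certmanager=true
```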
After you've done that, download the desired version of the cert-manager manifest you want to deploy, and add a nodeSelector to all 3 deployment resources inside this manifest, looking something like this:
kind: Deployment
spec:
  template:
    spec:
      containers:
      ...
      volumes:
      ...
      nodeSelector:
        schedule-certmanager: "true"
After you've done this, kubectl apply -f <your-yaml-file> and watch it finally coming to life...
I hope no one finding this issue with a deeper understanding of this whole chain throws up when they see it, but hey, it's working for now.
However, I'd gladly appreciate any attention from maintainers or alike, so we should keep this issue open. Something is strange if the secrets are only being created on master nodes...
Thanks @filipweidemann for your input, this saved my day ;-) However, I figured that tainting may not be necessary; I did the following:
deleted namespace cert-manager
kubectl delete ns cert-manager --force --grace-period=0
created/modified the manifest according to your suggestion
helm template cert-manager jetstack/cert-manager --namespace cert-manager > cert-manager.yml
Then add nodeSelector to deployments in cert-manager.yml
labeled the master node
kubectl label node <master node name> schedule-certmanager=true
created ns cert-manager (no additional labels added)
kubectl create ns cert-manager
applied manifest
kubectl apply -f cert-manager.yml
Result
kubectl -n cert-manager get pods
NAME READY STATUS RESTARTS AGE
cert-manager-55798cbfdf-mtbz6 1/1 Running 0 3m38s
cert-manager-cainjector-5b5d88b76b-drgbm 1/1 Running 0 3m38s
cert-manager-webhook-656f59b5d5-zn6sb 1/1 Running 0 3m38s
@papanito Thanks, now it works on raspberry as well
Thanks goes all to @filipweidemann he figured it out
Good catch @papanito, didn't know tainting was optional.
We migrated to another cloud host and surprise, everything is working now, even without the fix.
If anyone still experiences issues, even after deploying to the masters, keep in mind that your infrastructure could also be responsible.
Also experiencing this on a GKE 1.15 cluster using v0.12. I tried copying the secret from another cluster, which solves this issue, but I'm now facing problems further down the pipeline, and I'm not sure if they are related to this or not.
Is there anybody from cert-manager team looking into it?
I'm not sure if this is the issue anyone else in this thread is running into, but I was able to solve this error by deploying everything into the cert-manager namespace and adding the following to the Helm chart's values.yaml:
---
global:
  rbac:
    create: true
  leaderElection:
    namespace: cert-manager
Hi,
First of all, thanks to the maintainers for the time and effort put into this OSS project.
I have been dealing with this issue for the past few days, banging my head against a wall as to why things didn't work as they should. Some context:
I have 2 clusters, both on GCP, one being production and the other a scaled-down version for staging/testing. I had successfully deployed v0.12 to staging with no issues, but was facing this particular issue on the production cluster. I had tried copying the secret from staging to production, which seemed to solve this issue, but was facing other problems further down the pipeline, where CertificateRequests and Orders were not being created automatically by Certificates and Issuers/ClusterIssuers.
Stuff I tried:
In the end, here's what I learned, and how it fixed the problem for me:
At the time of my experiments above, I was using Helm v3, without having explicitly migrated from Helm 2 to 3. As Helm 3 does not detect Helm 2 releases, I was not aware that there was a Helm 2-installed version of cert-manager on my production cluster. Even with all the installs/uninstalls above, something must have survived and was most likely causing issues.
So, the solution for me was:
Hope this helps someone else
@tiagojsag hmm, you used Helm 2? Did you also try the manifests? I was just wondering if there might be a problem with Helm 2 vs. Helm 3. However, @filipweidemann and I also tried to install cert-manager using the manifests, which did not work either - unless #2477 plays into it.
Just guessing and putting my thought here
Hi,
I had successfully installed v0.12 using Helm 3 on my staging cluster - where there was no "hidden" v0.8 cert-manager, like on my prod cluster. So while I can't be sure this is the cause, IMO the fault lies with the old v0.8 installed with Helm 2 that I had on my production cluster, and not necessarily with Helm 3.
But then again, I did run multiple uninstalls on my prod cluster, so I have no idea what may have been left behind that would cause this issue... :/
BTW, stupid detail that may help (or may be totally useless) when debugging:
On my staging cluster, where I am using helm 3, this happens:
$ kubectl get clusterissuers
No resources found in default namespace.
$ kubectl get clusterissuer.cert-manager.io
NAME READY AGE
letsencrypt-staging True 6d2h
However, on my production cluster, where I am using helm 2, both commands return the expected list of resources
Thanks @ioben, your solution works for me with Helm v2. I have no idea why setting global.leaderElection.namespace="cert-manager" resolves the earlier issue of the missing cert-manager-webhook-tls secret.
helm install --name my-release --namespace cert-manager jetstack/cert-manager --version "v0.12.0" --set global.leaderElection.namespace="cert-manager" --set global.podSecurityPolicy.enabled=true
Taking a look through this, it seems like a lot of the issues here are caused by multiple different versions of cert-manager installed, often due to upgrading from Helm 2 to 3.
When uninstalling cert-manager, please follow the instructions here fully: https://cert-manager.io/docs/installation/uninstall/
This should fully remove all old resources. After that, it should be safe to install the latest version of cert-manager using any of our supported install methods (Helm 2, 3, or static manifests).
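As a sketch, for a static-manifest install the cleanup in those docs roughly amounts to the following (the manifest URL mirrors the version discussed in this thread; check the linked page for your exact version and for any namespace stuck in Terminating):

```shell
# Hedged uninstall sketch for a static-manifest install of v0.12.0.
kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v0.12.0/cert-manager.yaml
# CRDs are not always removed by the step above; delete them explicitly.
kubectl delete crd \
  certificaterequests.cert-manager.io certificates.cert-manager.io \
  challenges.acme.cert-manager.io clusterissuers.cert-manager.io \
  issuers.cert-manager.io orders.acme.cert-manager.io
kubectl delete namespace cert-manager
```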
Just in case somebody hits this issue and needs to work with web proxies… Check your settings for http_proxy, https_proxy, and no_proxy. In my case some escape characters (\) caused the cert-manager-webhook not to enter the Running state.
I had something like this:
- name: NO_PROXY
value: int.company.com\,localhost\,127.0.0.1\,10.0.0.0/8\,172.16.0.0/12\,192.168.0.0/16\,100.64.0.0/10
instead of
- name: NO_PROXY
value: int.company.com,localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,100.64.0.0/10
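A quick way to sanitize such a value locally, as an illustrative sketch (the variable names are made up):

```shell
# Strip stray backslashes from an escaped NO_PROXY value (illustration only).
bad='int.company.com\,localhost\,127.0.0.1\,10.0.0.0/8'
fixed=$(printf '%s' "$bad" | tr -d '\\')
echo "$fixed"   # int.company.com,localhost,127.0.0.1,10.0.0.0/8
```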