Cert-manager: Documenting "context deadline exceeded" errors relating to the webhook

Created on 4 Nov 2019  ·  66 Comments  ·  Source: jetstack/cert-manager

Describe the bug:

When I try to create a ClusterIssuer I get the following error

kubectl apply -f cert-issuer-letsencrypt-dev.yml
Error from server (InternalError): error when creating "cert-issuer-letsencrypt-dev.yml":
Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": 
Post https://kubernetes.default.svc:443/apis/webhook.certmanager.k8s.io/v1beta1/mutations?timeout=30s: 
context deadline exceeded

Expected behaviour:

Creation of ClusterIssuer works without errors

Steps to reproduce the bug:

Install cert-manager as follows

kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.10/deploy/manifests/00-crds.yaml
kubectl create namespace cert-manager
kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install \
  --name cert-manager \
  --namespace cert-manager \
  --version v0.10.1 \
  jetstack/cert-manager

Then I apply the following manifest

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dev
  namespace: cert-manager
spec:
  acme:
    # The ACME server URL
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: [email protected]
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef: 
      name: letsencrypt-dev
    # Enable the HTTP-01 challenge provider
    # http01: {}
    solvers:
    - dns01:
        cloudflare:
          email: [email protected]
          apiKeySecretRef:
            name: cloudflare-api-key-secret
            key: api-key

Anything else we need to know?:

Environment details:

  • Kubernetes version (e.g. v1.10.2): 1.15
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): baremetal
  • cert-manager version (e.g. v0.4.0): 0.10.1
  • Install method (e.g. helm or static manifests): helm

/kind bug

area/deploy  good first issue  kind/documentation  priority/important-longterm

Most helpful comment

Nope, still stuck and this sucks

All 66 comments

I've removed cert-manager 0.10.1 and added 0.11.0 but still get the same error

Error from server (InternalError): error when creating "cert-issuer-letsencrypt-dev.yml": 
Internal error occurred: failed calling webhook "webhook.cert-manager.io": 
Post https://kubernetes.default.svc:443/apis/webhook.cert-manager.io/v1beta1/mutations?timeout=30s: 
context deadline exceeded

@papanito Have you found any solution to this? I'm in the exact same situation.

Nope, still stuck and this sucks

@munnerz do you have a hint maybe?

@papanito I completely disabled the webhook to move past the issue, but this is not recommended as it can expose the cluster to misconfiguration errors and break cert-manager.
Check this link: https://docs.cert-manager.io/en/latest/getting-started/webhook.html#disable-the-webhook-component
It is working fine with the webhook disabled though.
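
For reference, that workaround was a chart value on the 0.10/0.11-era charts; a minimal sketch, assuming your chart version still exposes webhook.enabled (it was removed around v0.14, as noted further down this thread):

helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v0.10.1 \
  --set webhook.enabled=false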

@hampos thanks for the hint will have a look at it

Seems that kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true is obsolete now; at least there is no mention of it in the latest docs.

I've got the same issue on a k3s cluster currently.
The strange thing is this worked fine on another k3s cluster yesterday but I used helm2.

@papanito out of curiosity what version of helm did you use?

@sub6as I used helm 2 as well

@papanito Was that with or without tiller?

I've just been able to install 0.13 in a k3s cluster with helm2 (and tiller) but I wasn't able to do the same with helm3 in the same cluster yesterday.

@sub6as with tiller - as far as I know helm2 requires tiller....

any updates on this issue?

any updates on this issue?
I"m also seeing this, don't know what causes the issue. still looking for a solution.

Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

any updates on this issue?

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

What OS are you running your clusters on?
For me cert-manager works fine in the meantime; I've set up my cluster completely new on Debian 10, but I had to switch back from nftables to legacy iptables in order to get networking working. This might be affecting you as well?
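
For anyone who wants to try the same switch on Debian 10, a sketch of the standard update-alternatives approach (run on every node, then reboot or restart the kubelet/CNI so the rules are recreated with the legacy backend):

# Switch Debian 10 from the nftables backend to legacy iptables
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo update-alternatives --set arptables /usr/sbin/arptables-legacy
sudo update-alternatives --set ebtables /usr/sbin/ebtables-legacy
sudo reboot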

@papanito
CentOS 7
helm 3
k8s 1.17.2

@javachen so I guess my guess was wrong then :-(

@papanito I checked coredns and then it works

@javachen Meaning what? Was something wrong with coredns, and after fixing it the issue is gone?

Also having this issue, with k3s:

Error from server (InternalError): error when creating "lstage.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.kube-system.svc:443/mutate?timeout=30s: context deadline exceeded
[sseneca@alarm-master ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2+k3s1", GitCommit:"cdab19b09a84389ffbf57bebd33871c60b1d6b28", GitTreeState:"clean", BuildDate:"2020-01-27T18:08:16Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/arm64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2+k3s1", GitCommit:"cdab19b09a84389ffbf57bebd33871c60b1d6b28", GitTreeState:"clean", BuildDate:"2020-01-27T18:08:16Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/arm64"}
[sseneca@alarm-master ~]$ helm version
version.BuildInfo{Version:"v3.0.3", GitCommit:"ac925eb7279f4a6955df663a0128044a8a6b7593", GitTreeState:"clean", GoVersion:"go1.13.6"}

I'm running Arch Linux ARM, and crucially I'm pretty certain I have no firewalls at all right now.

Yup, we ran into the same issue on a Kubernetes 1.17.1 cluster.

Versions:

  • Kubernetes 1.17.0/1
  • Cilium 1.7.0-rc3 (Known Good, even if RC)
  • cert-manager 0.13.0
  • Helm v3.0.3

We also verified our network as up and running and working.

E0207 18:46:58.922930       1 controller.go:131] cert-manager/controller/issuers "msg"="re-queuing item  due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=4s: context deadline exceeded" "key"="livedab-staging/fdab-le"

EDIT: We restarted the webhook pod MULTIPLE times; afterwards it works. It seems that this may be a problem with concurrently starting pods.
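
For anyone wanting to try the same restart, a one-liner sketch (assuming the chart's default labels, which the manifests later in this thread also use):

# Delete the webhook pod; its Deployment recreates it immediately
kubectl -n cert-manager delete pod -l app.kubernetes.io/name=webhook
kubectl -n cert-manager get pods -w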

Unfortunately restarting the pod didn't fix the issue for me.

Edit: Well never mind, I restarted all the cert-manager related pods and now it worked. Strange

I somehow managed to work around this issue by downgrading to v0.11; everything seems to be working properly.
https://docs.cert-manager.io/en/release-0.11/

Same issue. Deployed with the K8s manifest (not helm).

Error from server (InternalError): error when creating "staging-letsencrypt.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.32.0.4:10250: i/o timeout

The error mentions the correct IP and port, so I think coredns and the Services are alright.

For the webhook pod, there is no log output after

I0218 13:55:19.631368       1 main.go:64]  "level"=0 "msg"="enabling TLS as certificate file flags specified"  
I0218 13:55:19.631867       1 server.go:121]  "level"=0 "msg"="listening for insecure healthz connections"  "address"=":6080"
I0218 13:55:19.632088       1 server.go:133]  "level"=0 "msg"="listening for secure connections"  "address"=":10250"
I0218 13:55:19.632310       1 tls_file_source.go:142]  "level"=0 "msg"="detected private key or certificate data on disk has changed. reloading certificate"  

I have done this before, on another AWS EKS cluster. Compared to this time, the only networking difference is using the Weave Net CNI instead of the AWS VPC CNI.

And the pod seems to be working:

/ # curl -vk https://10.32.0.4:10250/mutate?timeout=30s
*   Trying 10.32.0.4:10250...
* TCP_NODELAY set
* Connected to 10.32.0.4 (10.32.0.4) port 10250 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=cert-manager.system
*  start date: Feb 18 13:54:21 2020 GMT
*  expire date: Feb 17 13:54:21 2021 GMT
*  issuer: O=cert-manager.system; CN=cert-manager.webhook.ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
> GET /mutate?timeout=30s HTTP/1.1
> Host: 10.32.0.4:10250
> User-Agent: curl/7.67.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Date: Tue, 18 Feb 2020 14:57:14 GMT
< Content-Length: 0
< 
* Connection #0 to host 10.32.0.4 left intact

I am installing this today with the same issue; I am perplexed, as I have many other installs:

root@bastion:/# curl -vk https://cert-manager-webhook.devops.svc:443/mutate?timeout=30s
* Expire in 0 ms for 6 (transfer 0x558fbf2f8f50)
* Expire in 1 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 0 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 2 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 1 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 1 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 4 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 3 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 3 ms for 1 (transfer 0x558fbf2f8f50)
* Expire in 4 ms for 1 (transfer 0x558fbf2f8f50)
*   Trying 10.106.156.103...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x558fbf2f8f50)
* Connected to cert-manager-webhook.devops.svc (10.106.156.103) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=cert-manager.system
*  start date: Feb 24 01:01:47 2020 GMT
*  expire date: Feb 23 01:01:47 2021 GMT
*  issuer: O=cert-manager.system; CN=cert-manager.webhook.ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
> GET /mutate?timeout=30s HTTP/1.1
> Host: cert-manager-webhook.devops.svc
> User-Agent: curl/7.64.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/1.1 400 Bad Request
< Date: Mon, 24 Feb 2020 02:47:38 GMT
< Content-Length: 0
<
* Connection #0 to host cert-manager-webhook.devops.svc left intact

The service looks good and is responding. I can confirm on the other side:

cert-manager-webhook-578b69fd94-lqczb cert-manager 2020-02-24T03:07:14.721845662Z E0224 03:07:14.721720       1 server.go:289]  "msg"="failed to decode request body" "error"="couldn't get version/kind; json parse error: unexpected end of JSON input"

the webhook pod is reporting:

+ cert-manager-webhook-578b69fd94-rtpqp › cert-manager
cert-manager-webhook-578b69fd94-rtpqp cert-manager 2020-02-24T01:02:13.997886347Z I0224 01:02:13.997759       1 main.go:64]  "msg"="enabling TLS as certificate file flags specified"
cert-manager-webhook-578b69fd94-rtpqp cert-manager 2020-02-24T01:02:13.99813673Z I0224 01:02:13.998084       1 server.go:126]  "msg"="listening for insecure healthz connections"  "address"=":6080"
cert-manager-webhook-578b69fd94-rtpqp cert-manager 2020-02-24T01:02:13.998224741Z I0224 01:02:13.998180       1 server.go:138]  "msg"="listening for secure connections"  "address"=":10250"
cert-manager-webhook-578b69fd94-rtpqp cert-manager 2020-02-24T01:02:13.998263584Z I0224 01:02:13.998240       1 server.go:155]  "msg"="registered pprof handlers"
cert-manager-webhook-578b69fd94-rtpqp cert-manager 2020-02-24T01:02:13.998463416Z I0224 01:02:13.998416       1 tls_file_source.go:144]  "msg"="detected private key or certificate data on disk has changed. reloading certificate"

nothing exciting there.

I have tried deleting all of the pods without success.

It is definitely a problem with the pod, as I see cert-manager trying to contact the webhook with the same issue, and the error you see when applying the ClusterIssuer is an error from the kube-apiserver.

I checked the IP of the Service against a ping from inside a container; they are the same. I also tried creating two replicas and giving them resources (100m CPU and 128Mi memory), and it is still failing.

I am disabling the webhook on this cluster for now, I need to move on.

edits: more notes and information, some typos fixed.

I resolved my issue. May not apply to everyone, but still.

During cert creation, the API server accesses the webhook. But in my case, the API server cannot access pods in the overlay network. So I have the webhook running in hostNetwork mode. Now the error is gone.

Where are you setting this option? hostNetwork?

I found it, in the webhook spec. Thanks

webhook:
  hostNetwork: true
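
For a helm install, the equivalent is a values override — a sketch, assuming your chart version exposes webhook.hostNetwork (newer charts do; otherwise edit the webhook Deployment directly):

helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set webhook.hostNetwork=true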

I have the same issue with:

  • Kubernetes version (e.g. v1.10.2): v1.17.0
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): baremetal (openstack queen)
  • cert-manager version (e.g. v0.4.0): 0.13.1
  • Install method (e.g. helm or static manifests): helm 2.16.3

I think it is a timing problem, because it was working when I installed it "manually", but not anymore in my Terraform configuration script. Quite rough, but I resolved it by adding a delay after the install and just before applying the issuer:

sudo helm install \
  --name cert-manager \
  --namespace cert-manager \
  --version v0.13.1 \
  jetstack/cert-manager
sleep 1m
kubectl apply -f staging-issuer.yaml
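
A less rough alternative to the fixed sleep is to wait for the webhook Deployment to report Available — a sketch, using the chart's default deployment name (note the webhook can still take a few extra seconds to load its serving certificate after becoming Available):

kubectl -n cert-manager wait --for=condition=Available \
  deployment/cert-manager-webhook --timeout=120s
kubectl apply -f staging-issuer.yaml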

I resolved my issue. May not apply to everyone, but still.

During cert creation, the API server accesses the webhook. But in my case, the API server cannot access pods in the overlay network. So I have the webhook running in hostNetwork mode. Now the error is gone.

This problem did not occur in my local cluster, but it did occur in a self-built cluster on Huawei Cloud. I solved it by deploying cert-manager-webhook onto the k8s apiserver node (node-selector -> master server). My guess is that the kube-apiserver needs to talk to the Pod, but the Pod by default sits in my Huawei Cloud network, and the apiserver cannot reach cert-manager-webhook via external routing on that network. This is probably equivalent in effect to the webhook -> hostNetwork change.
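
That node-selector approach would look roughly like this as helm values — a sketch, assuming a kubeadm-style master label and taint (verify the labels on your control-plane node first):

webhook:
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule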

@papanito i check coredns and then it works

Can you please explain what you did with coredns to correct the problem?

I have run into this problem with:
Kubernetes 1.17.4
cert-manager 0.13.1/0.14.0
installed with and without helm
with and without hostNetwork: true
with Flannel and Calico CNIs
restarted all pods multiple times.

@papanito could you please explain what you did with coredns to get it working.

@h00pz the question goes to @javachen, as he mentioned:

@papanito I checked coredns and then it works

Really don't know what that means

Folks, I tracked it down to using _--service-dns-domain="k8.example.com"_ in my kubeadm init.

For the record, I had exactly the same under k3s v1.17.3+k3s1 installed without traefik, plus cert-manager 0.14 installed either with the regular manifests or with helm.

After several full cluster removals/recreations, I managed to make it work by:

  • installing the nginx controller after cert-manager (not sure if it was the cause)
  • waiting at each step until everything was up
  • using the regular cert-manager helm installer 0.14 instructions without any customization

I'm currently facing this on k3s, even with a completely clean system and "waiting at each step".

I wasn't sure if it was some kind of network timeout or blocking but I confirmed that wasn't the case because certbot worked absolutely fine :(

Unfortunately "Context deadline exceeded" doesn't help for debugging. There also used to be the option in the helm chart to not use the webhook and that is gone now 👎 so even though I have a working setup on one machine, I cannot with the new machine set up the same unless I go with older versions (and first I have to work out _which_ version removed that option from the helm chart Looking at the releases told me that, v0.14.0 made it mandatory).

Interestingly for me, v0.13.0 does not solve the problem but does improve the error message.

E0325 16:10:08.780439       1 sync.go:81] cert-manager/controller/clusterissuers "msg"="error setting up issuer" "error"="Get https://acme-v02.api.letsencrypt.org/directory: dial tcp: i/o timeout" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt" "resource_namespace"=""

I have a suspicion that some people are getting this because Cloudflare is returning an IPv6 address in its response for the API DNS lookup. This is going to hurt a lot of people, because there are all the layers of interaction, and IPv6 + Kubernetes networking is always pain.

Certainly I am not getting I/O timeouts when running curl on the host machine, but according to that message the failure happens inside the pod. Of course, following best security practices, there's no shell in v0.14.0's container, so I can't test that theory without building a different container, which then is not the official one. Let's see how deep I can follow this rabbit hole...

The haiku about DNS is true :

It’s not DNS
There’s no way it’s DNS
It was DNS

So, for those who are in this boat and confused as heck by it: check that you can run DNS queries from inside your pods. "Context deadline exceeded" in my case indicated the pod couldn't look up the ACME API endpoint.
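
A quick way to run that check is a throwaway busybox pod; a sketch (busybox 1.28 is a common choice here, since some newer tags have nslookup quirks):

# Can pods resolve external names (the ACME endpoint)?
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup acme-v02.api.letsencrypt.org
# Can pods resolve in-cluster Service names?
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup cert-manager-webhook.cert-manager.svc.cluster.local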

Also in my case, this was any outbound traffic to the internet at all from pods being blocked.

Furthermore, for those who land here where cert-manager is the first thing they set up on a cluster and are bitten by this: k3s on Debian 10 requires you to use the legacy iptables backend (this may apply to other k8s distros, but definitely does to k3s) - https://github.com/coredns/coredns/issues/2693

I still submit that Context deadline exceeded is a poor error message and something more helpful here would be good.

I also have the same issue. I have a baremetal cluster and I deployed cert-manager versions 0.14.1 and 0.14.0 using helm, the official chart.
Everything is working fine, all pods in the cert-manager namespace are up, and there are no errors in the logs.
But when I try to create an Issuer I get the error:

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.96.12.89:443: i/o timeout

I tried to use hostNetwork: true in the pod spec, but I got the error:

"msg"="error running server" "error"="listen tcp :10250: bind: address already in use"

I have the same issue since Kubernetes 1.17; the last Kubernetes version where it worked was 1.16. I use CentOS 7, Calico with VXLAN, and a self-signed root CA, so my problem is not nftables or the ACME API. I tried different versions of cert-manager and deployment types (helm2, helm3, yaml), same error.

Finally I've found what happened; I just resolved an issue with the Weave CNI...
Well... sorry guys.

How are you testing whether the Let's Encrypt API is reachable from cert-manager?
And from which pod should I do it?

cert-manager-5d9cd85cbb-96dfb             1/1     Running   0          19m
cert-manager-cainjector-95c885477-bk5r8   1/1     Running   0          19m
cert-manager-webhook-6ff9487489-5z57m     1/1     Running   0          19m

I think I'm facing exactly the same problem:

Events:
  Type     Reason                Age                    From          Message
  ----     ------                ----                   ----          -------
  Warning  ErrVerifyACMEAccount  2m13s (x7 over 7m23s)  cert-manager  Failed to verify ACME account: context deadline exceeded
  Warning  ErrInitIssuer         2m13s (x7 over 7m23s)  cert-manager  Error initializing issuer: context deadline exceeded

Or is it something different?

For me it was an issue with Debian 10 and iptables; see here: https://discuss.kubernetes.io/t/kubernetes-compatible-with-debian-10-buster/7853

Would anyone be able to open a PR against the documentation/compatibility page to advise users of what to check if they see this message? Specifically it'd be good to call out/link to information about nftables, as there are clearly quite a few people running into this!

https://github.com/cert-manager/website/blob/master/content/en/docs/installation/compatibility.md

/kind documentation
/remove-kind bug
/area deploy
/priority important-longterm

@munnerz I think that's something I could do

Hi guys. I have the same issue. Any fixes yet?

One of the problems I found is with wrapper charts. To reproduce, wrap cert-manager in a cert-manager-wrapped chart (e.g. to add a custom ClusterIssuer).

The webhook Service will be named cert-manager-wrapped-webhook, but the CRDs will point to the original name cert-manager-webhook, which does not exist.

@munnerz did you see my PR? Is that about ok?

I can confirm that I have exactly the same issue. My environment:

EKS v1.16.8
CNI: Calico
Cert-Manager: v0.15.1 installed using Helm

I'm getting errors like:

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: Address is not allowed

I'm not 100% sure, but I suspect an issue with the connection from the API server to the webhook (Calico creates a new subnet; not sure if the API server is able to access it)...

I resolved my issue. May not apply to everyone, but still.

During cert creation, the API server accesses the webhook. But in my case, the API server cannot access pods in the overlay network. So I have the webhook running in hostNetwork mode. Now the error is gone.

How did you install? I am trying a helm install, with Weave, on EKS, and I am getting the same errors: failed calling webhook "webhook.cert-manager.io". The chart has hostNetwork set to false, and it seems most of the instructions on how to get it to work use older versions. I tried forking the charts and making the change, but then there was some type of image dependency. What was your method?

I am having the same issues as @andrewkaczynski

If anyone is interested, we were experiencing these issues in our deployment, and after some debugging I determined it was because while our VM supported the size of frames leaving the boxes, the actual underlying fabric was losing network packets before they arrived at the worker node (where the webhook Pod was running).

This was fairly easy to prove out using tcpdump on both sides of the connection, and seeing that larger application data packets weren't ever hitting eth0 on the worker.

If this is an MTU problem for you, a potential solution is to check the settings of every interface in your stack (Calico has recommendations for different environments); you can change the tunl0 device MTU easily enough.

Edit to add: the default IPIP configuration we were using added enough encapsulation that it _just_ pushed the packet size over what was acceptable. This manifested itself as other issues.
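
If you suspect the same, a crude way to check for an MTU/encapsulation problem without tcpdump is to ping between nodes with the don't-fragment bit set; a sketch, assuming a 1500-byte interface MTU (replace <worker-node-ip> with a real node address):

# 1472 = 1500 - 20 (IP header) - 8 (ICMP header); should succeed on a clean 1500 MTU path
ping -M do -s 1472 <worker-node-ip>
# IPIP adds 20 bytes of overhead, so encapsulated traffic effectively needs this size to pass
ping -M do -s 1452 <worker-node-ip>
# Check what the tunnel device is actually using
ip link show tunl0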

We are running a fresh install of 1.17.7 via kubeadm, using the Flannel VXLAN CNI. We're also seeing the following error:

Error from server (InternalError): error when creating "./selfsigned-issuer.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Doing some googling, I believe this is due to DNS. If I exec onto one of my NGINX pods (using this pod arbitrarily, nothing special about it) and try to resolve the above address, I get this:

nslookup cert-manager-webhook.cert-manager.svc
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN
** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN

However, when you append .cluster.local (the full domain name), it'll resolve just fine:

bash-5.0$ nslookup cert-manager-webhook.cert-manager.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10:53


Name:   cert-manager-webhook.cert-manager.svc.cluster.local
Address: 10.111.57.175

And as you can see, this is the ClusterIP of my webhook Service:

NAMESPACE              NAME                               TYPE        CLUSTER-IP
cert-manager           cert-manager-webhook               ClusterIP   10.111.57.175

So this is where I get confused... I read that .svc is the equivalent (or rather, the short version) of .svc.cluster.local. Why is it not working? Is this configurable? Reading a different issue, someone had to re-create their cluster and supply some DNS options to Kubespray. However, I'm not using Kubespray.

Appreciate any help, thanks.

Edit: Here is the other issue I was referring to: https://github.com/jetstack/cert-manager/issues/2640
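
On the .svc question: the short form is not a complete DNS name by itself; it only resolves when the client expands it using the search domains in the pod's /etc/resolv.conf, and nslookup does not always apply search-domain expansion. It is worth checking what the pod actually has (a sketch; replace <nginx-pod> with your pod name):

kubectl exec -it <nginx-pod> -- cat /etc/resolv.conf
# Typically shows something like:
#   nameserver 10.96.0.10
#   search default.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5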

Hello,

Just wanted to let everyone know that I have it working now. Some information on our cluster:

Cluster Version: 1.17.7
CNI: Flannel, VXLAN
Provisioner: kubeadm
Cert-Manager version: 0.11.1

What worked for me is the following guide here: https://docs.cert-manager.io/en/release-0.11/getting-started/install/kubernetes.html

It is absolutely important that nothing is lingering around from your old deploy. Run kubectl get crd and delete all (new and old) cert-manager CRDs.

Run kubectl get apiservice and make sure there is nothing related to certificates

Running kubectl get cert or kubectl get clusterissuer should say something along the lines of "This resource type does not exist" (I don't have the exact error, but you get the point).
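
A sketch of that cleanup, assuming the certmanager.k8s.io-era resource names (adjust the delete command to whatever the first two commands actually list):

# Find leftovers from previous installs
kubectl get crd | grep certmanager
kubectl get apiservice | grep certmanager
# Delete them, e.g.:
kubectl delete crd certificates.certmanager.k8s.io certificaterequests.certmanager.k8s.io \
  challenges.certmanager.k8s.io clusterissuers.certmanager.k8s.io \
  issuers.certmanager.k8s.io orders.certmanager.k8s.io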

Great. Now install the 0.11.1 CRDs:

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v0.11.1/cert-manager.yaml

Now install cert-manager 0.11.1. Make sure you install 0.11.1, not 0.11.0.... That version doesn't seem to work either.

Great. Now make sure your ClusterIssuers and Certificates are using the apiVersion: cert-manager.io/v1alpha2.

My Suspicions:

When installing 1.15.11, looking at the kube-apiserver logs, it appears that it's trying to communicate with the webhook service by using its DNS name (https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s). As I said above, this short version of the DNS name does not work for some reason. Maybe it's a kubeadm thing.

When using 0.11.1, it tries to communicate via IP address instead, and I suppose this is what is making it work.

Something important that I found during my research, the kube-apiserver can't actually resolve cluster DNS. The /etc/resolv.conf is inherited from the master node. This is designed intentionally because apparently kube-apiserver is the source of truth for DNS.

Something I don't understand: from a node, why can't you ping a Service ClusterIP? You can do it for any pod on any node, but not for Services. So I don't get how the kube-apiserver is making calls to the webhook.

Sorry for rambling. Please let me know if you're still struggling. I can try to help.

I resolved my issue. May not apply to everyone, but still.
During cert creation, the API server accesses the webhook. But in my case, the API server cannot access pods in the overlay network. So I have the webhook running in hostNetwork mode. Now the error is gone.

This problem did not occur in my local cluster, but it did occur in a self-built cluster on Huawei Cloud. I solved it by deploying cert-manager-webhook onto the k8s apiserver node (node-selector -> master server). My guess is that the kube-apiserver needs to talk to the Pod, but the Pod by default sits in my Huawei Cloud network, and the apiserver cannot reach cert-manager-webhook via external routing on that network. This is probably equivalent in effect to the webhook -> hostNetwork change.

I ran into the same problem with a self-built cluster on the Huawei Cloud network. How exactly did you go about fixing it?

I can confirm that I have exactly the same issue. My environment:

EKS v1.16.8
CNI: Calico
Cert-Manager: v0.15.1 installed using Helm

I'm getting errors like:

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: Address is not allowed

I'm not 100% sure, but I suspect an issue with the connection from the API server to the webhook (Calico creates a new subnet; not sure if the API server is able to access it)...

I got this problem too.

I am having the same issues too.

I can confirm that I have exactly the same issue. My environment:
EKS v1.16.8
CNI: Calico
Cert-Manager: v0.15.1 installed using Helm
I'm getting errors like:
Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded
Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: Address is not allowed
I'm not 100% sure, but I suspect an issue with the connection from the API server to the webhook (Calico creates a new subnet; not sure if the API server is able to access it)...

I got this problem too.

I resolved it using the doc: https://cert-manager.io/docs/installation/compatibility/#aws-eks

@wallrj is working on this in https://github.com/cert-manager/website/issues/321.
Let's move the conversation there.

/close

@meyskens: Closing this issue.

In response to this:

@wallrj is working on this in https://github.com/cert-manager/website/issues/321.
Let's move the conversation there.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I can confirm that I have exactly the same issue. My environment:

EKS v1.16.8
CNI: Calico
Cert-Manager: v0.15.1 installed using Helm

I'm getting errors like:

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Error from server (InternalError): error when creating "ClusterIssuerDns.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: Address is not allowed

I'm not 100% sure, but I suspect an issue with the connection from the API server to the webhook (Calico creates a new subnet; not sure if the API server is able to access it)...

Did you find any solution with Calico?

My solution was: download the cert-manager manifest (i.e. https://github.com/jetstack/cert-manager/releases/download/v1.1.0/cert-manager.yaml), insert the following block after each "containers:" declaration in the manifest (at the same indentation level, since these are pod-spec fields), and apply it:

dnsPolicy: "None"
dnsConfig:
  nameservers:
  - 8.8.8.8
  - 8.8.4.4

And my solution was making these changes in the YAML file:

Adding hostNetwork: true to the webhook spec, and changing the securePort and its related ports to something other than 10250 (like 10666, which I chose 😄; also don't forget to change the related Service). Here are the changes:

...
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: webhook
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/name: webhook
  name: cert-manager-webhook
  namespace: cert-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: webhook
      app.kubernetes.io/instance: cert-manager
      app.kubernetes.io/name: webhook
  template:
    metadata:
      labels:
        app: webhook
        app.kubernetes.io/component: webhook
        app.kubernetes.io/instance: cert-manager
        app.kubernetes.io/name: webhook
    spec:
      hostNetwork: true
      containers:
      - args:
        - --v=2
        - --secure-port=10666
        - --dynamic-serving-ca-secret-namespace=$(POD_NAMESPACE)
        - --dynamic-serving-ca-secret-name=cert-manager-webhook-ca
        - --dynamic-serving-dns-names=cert-manager-webhook,cert-manager-webhook.cert-manager,cert-manager-webhook.cert-manager.svc,$(NODE_NAME)
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: quay.io/jetstack/cert-manager-webhook:v1.1.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: 6080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cert-manager
        ports:
        - containerPort: 10666
          name: https
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 6080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
      serviceAccountName: cert-manager-webhook
---
...
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: webhook
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/name: webhook
  name: cert-manager-webhook
  namespace: cert-manager
spec:
  ports:
  - name: https
    port: 443
    targetPort: 10666
  selector:
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/name: webhook
  type: ClusterIP
...

Why is this issue closed? I am facing the same issue with the helm chart of cert-manager v1.1.0:

main.go:38] cert-manager "msg"="error executing command" "error"="listen tcp :10250: bind: address already in use"

@mostafa8026 seems to have corrected the issue by changing port 10250. Why is this even a port issue?
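
On why it is a port issue: with hostNetwork: true the webhook binds its secure port in the node's network namespace, where the kubelet already listens on 10250, hence the bind: address already in use error. A sketch of the fix via helm values, assuming your chart version exposes webhook.securePort (recent charts do):

helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set webhook.hostNetwork=true \
  --set webhook.securePort=10260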
