Cert-manager: i/o timeout from apiserver when connecting to webhook on k3s

Created on 16 Apr 2020 · 16 Comments · Source: jetstack/cert-manager


Describe the bug:
I was following the steps here: https://cert-manager.io/docs/installation/kubernetes/

Expected behaviour:
The test resources should be created successfully, but instead I got the following error when testing the installation:

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.43.18.211:443: i/o timeout

Steps to reproduce the bug:

I followed the steps in the link above and got the error when testing the installation.

Anything else we need to know?:
The installation step was successful, as I verified as follows:

kubectl get pods --namespace cert-manager

NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-cainjector-79f4496665-7gptd   1/1     Running   0          11m
cert-manager-57cdd66b-7xvj2                1/1     Running   0          11m
cert-manager-webhook-6d57dbf4f-28zjc       1/1     Running   0          11m

I'm using VMs from Google Cloud.
Environment details:

  • Kubernetes version (e.g. v1.10.2): v1.17.4
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): k3s
  • cert-manager version (e.g. v0.4.0): v0.14.0
  • Install method (e.g. helm or static manifests): Helm

/kind bug

area/deploy triage/support

Most helpful comment

Same issue here using k3s. Resolved it using the flag --flannel-backend host-gw during the k3s setup. So it looks like something is wrong in the default flannel setup, but I didn't investigate further

All 16 comments

I have the same scenario. I installed the latest version of cert-manager (0.15-alpha.1). When trying to create the issuer and certificate from test-resources.yaml I get the following errors. The status of everything is up and running, but I am still facing this:


Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Hey @sangnguyen7

How have you deployed k3s/what environment is it deployed into? This error indicates that your apiserver is unable to route traffic to the cert-manager webhook pod, which is a required component.

This is something that is required to be working in order for Kubernetes conformance tests to pass as far as I'm aware, so this indicates that somewhere along the line your cluster is not configured correctly.

Are you able to run Sonobuoy to check and ensure your cluster is set up properly? This will hopefully help you to pinpoint what's going on 😄
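For reference, a conformance check with Sonobuoy looks roughly like this (a sketch; the sonobuoy CLI is installed separately and these are its standard subcommands):

sonobuoy run --wait            # run the conformance plugins and block until they finish
results=$(sonobuoy retrieve)   # download the results tarball, capture its filename
sonobuoy results $results      # print a pass/fail summary
sonobuoy delete --wait         # clean up the sonobuoy namespace when done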

/triage support
/area deploy
/remove-kind bug

Thanks @munnerz. No, I have not tried the tool yet. I will double-check with it.

I had the same issue in k8s, not k3s. I resolved it by switching from flannel to calico.

Same issue here using k3s. Resolved it using the flag --flannel-backend host-gw during the k3s setup. So it looks like something is wrong in the default flannel setup, but I didn't investigate further
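For anyone trying this, the flag is passed when installing k3s; a minimal sketch using the official install script (single-server setup assumed):

# install k3s with the flannel backend switched from vxlan to host-gw
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-backend=host-gw" sh -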

Are there any related issues in the k3s repository that this could be linked to?

@munnerz, it seems the issue is related to the network setup and not related to cert-manager, if there is nobody else having other issues, you can close this issue. Thanks!

@alecunsolo, same issue in k3s. I changed the flannel backend from vxlan to host-gw, but it doesn't work.

  • Method to change the flannel backend:
vim /etc/systemd/system/k3s.service
  • What I modified:
ExecStart=/usr/local/bin/k3s server --flannel-backend host-gw
  • Check my k3s net-conf:
[root@bowser1704 ~]# cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
{
    "Network": "10.42.0.0/16",
    "Backend": {
        "Type": "host-gw"
    }
}
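(For the unit-file edit to take effect, the service also has to be reloaded and restarted; a minimal sketch in case anyone is following along:)

systemctl daemon-reload   # pick up the edited k3s.service unit
systemctl restart k3s     # restart k3s with the new flannel backend flag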

But it still doesn't work.

[root@bowser1704 ~]# kubectl apply -f cert-manager/cluster-issuer.yaml
Error from server (InternalError): error when creating "cert-manager/cluster-issuer.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded

Do you have any suggestions?
thanks.


Maybe related to rancher/k3s#1266 and rancher/k3s#1958.

It took a few days to fix this problem.
It may be related to coreos/flannel#1268, or to vxlan bugs.

  1. Super slow access to the service IP from the host, maybe a 60s delay
  2. Can't reach pods via the service cluster IP unless you access them from the node the pod is running on
  • So, one method to resolve this issue is to schedule cert-manager-webhook onto the node you are on by using spec.nodeSelector (see the sketch after this list).
  • Another method is changing the flannel backend from vxlan to host-gw, but that doesn't seem to work for me.
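A minimal sketch of the nodeSelector workaround, patching the webhook deployment directly (the node name k3s-node-1 is a placeholder; a manual patch like this will also be reverted on the next Helm upgrade unless the same value is set in the chart's values):

# find the node you want to pin the webhook to, then patch its nodeSelector
kubectl get nodes -o wide
kubectl -n cert-manager patch deployment cert-manager-webhook --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"k3s-node-1"}}}}}'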

Maybe related to rancher/k3s#1638.
I fixed it with this command:

ethtool -K flannel.1 tx-checksum-ip-generic off
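To confirm the offload flag actually changed (a quick check, not specific to cert-manager):

ethtool -k flannel.1 | grep tx-checksum
# expect: tx-checksum-ip-generic: off

Note that ethtool -K settings do not survive a reboot, so the workaround has to be reapplied (for example from a systemd unit or udev rule) to be permanent.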

Partly comment, partly question:

We're running k3s (v1.17.7+k3s1, flannel, AWS AMI2), but the master is separated from the cluster by a firewall that allows port 6443 from the workers towards the master (EDIT: the master runs no pods and is not part of the cluster workload). Until now we had been using the "no-webhook" version of cert-manager (up to 0.13) without problems, but after upgrading yesterday we get the same error when we try to list certs using kubectl on the master:

# /usr/local/bin/kubectl get cert --all-namespaces
Error from server: conversion webhook for cert-manager.io/v1alpha2, Kind=Certificate failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: context deadline exceeded 

From a pod running on the cluster I can retrieve the certs (I tested the pod on different nodes, it always works):

curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  https://kubernetes.default.svc/apis/cert-manager.io/v1alpha2/certificates

Any idea why this would happen?
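One way to narrow this down is to confirm the webhook Service has endpoints and is reachable from inside the cluster (a sketch using the chart's default names; the HTTP status code doesn't matter here, only that the TLS connection succeeds):

kubectl -n cert-manager get svc,endpoints cert-manager-webhook
# throwaway pod; -k skips TLS verification since we only test connectivity
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk -o /dev/null -w '%{http_code}\n' https://cert-manager-webhook.cert-manager.svc:443/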

UPDATE: I moved my master server into the same security group and started the k3s agent on the master node, and it started working. So for anyone Googling this: if the master node is separated from the cluster and not running the k3s agent, it doesn't appear to be able to contact the webhook server.

Having the same issue with our EKS cluster. Wondering if there is a fix to stop the pod from restarting on the timeout and instead just log an error gracefully?

Error from server (InternalError): error when creating "clusterIssuer.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: context deadline exceeded
I'm getting the above error.

This problem may be caused by the CNI. After I modified the MTU of calico, the problem was solved.

"mtu": 1440-> "mtu": 1420,

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "datastore_type": "kubernetes",
      "nodename": "k3s-operator-1",
      "mtu": 1420,
      "ipam": {
          "type": "calico-ipam"
      },
      "policy": {
          "type": "k8s"
      },
      "kubernetes": {
          "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
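One note on the MTU change: the CNI configuration is read when a pod's network is set up, so pods that already exist keep the old MTU until they are recreated. Recreating the webhook pod after the edit picks up the new value (a sketch, assuming the default deployment name):

kubectl -n cert-manager rollout restart deployment cert-manager-webhook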
