Describe the bug:
Unable to pass the HTTP-01 "self check" when the Ingress Service uses NodePort and the public IP sits on an HAProxy (tcp mode) outside the Kubernetes cluster. We can simulate the check from the cert-manager container (via kubectl exec) by fetching the /.well-known/acme-challenge/... URL with curl, and it succeeds. The same request also succeeds from outside the cluster.
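For completeness, this is roughly how we run the manual check; the namespace, pod name and challenge token below are placeholders for the real values and depend on how cert-manager was installed:
# Fetch the challenge from inside the cert-manager pod (placeholder pod name and token)
kubectl -n cert-manager exec -it cert-manager-xxxxxxxxxx-yyyyy -- \
  curl -v http://www.example.com/.well-known/acme-challenge/<token>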
Logs:
helpers.go:188 Found status change for Certificate "myip-secret" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-08-29 14:36:25.387757463 +0000 UTC m=+2049.620517469
sync.go:244 Error preparing issuer for certificate pwe/pwe-secret: http-01 self check failed for domain "www.example.com"
controller.go:190 certificates controller: Re-queuing item "default/myip-secret" due to error processing: http-01 self check failed for domain "www.example.com"
We replaced the real domain name with www.example.com in this bug report.
cert-manager works only when the public IP is on the Kubernetes cluster and the Ingress Service uses the LoadBalancer type.
Expected behaviour:
The self check should pass when the Ingress Service uses NodePort.
Steps to reproduce the bug:
cat <<EOF > /root/nginx-ingress.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress
  namespace: nginx-ingress
spec:
  externalTrafficPolicy: Local
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
    name: http
    nodePort: 31080
  - port: 443
    targetPort: 443
    protocol: TCP
    name: https
    nodePort: 31443
  selector:
    app: nginx-ingress
EOF
cat <<EOF > /root/letsencrypt-staging.yml
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  # Adjust the name here accordingly
  name: letsencrypt-staging
spec:
  acme:
    # The ACME server URL
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: [email protected]
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-staging-private-key
    # Enable the HTTP-01 challenge provider
    http01: {}
EOF
cat <<EOF > /root/myip-ingress.yml
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myip-ingress
  annotations:
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "nginx"
    certmanager.k8s.io/cluster-issuer: letsencrypt-staging
spec:
  tls:
  - hosts:
    - www.example.com
    secretName: myip-secret
  rules:
  - host: www.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: myip-svc
          servicePort: 80
EOF
# Nginx ingress
kubectl apply -f https://raw.githubusercontent.com/nginxinc/kubernetes-ingress/master/install/common/ns-and-sa.yaml
kubectl apply -f https://raw.githubusercontent.com/nginxinc/kubernetes-ingress/master/install/common/default-server-secret.yaml
kubectl apply -f https://raw.githubusercontent.com/nginxinc/kubernetes-ingress/master/install/common/nginx-config.yaml
kubectl apply -f https://raw.githubusercontent.com/nginxinc/kubernetes-ingress/master/install/rbac/rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/nginxinc/kubernetes-ingress/master/install/daemon-set/nginx-ingress.yaml
kubectl create -f /root/nginx-ingress.yaml
# CertManager
kubectl create -f https://raw.githubusercontent.com/jetstack/cert-manager/master/contrib/manifests/cert-manager/with-rbac.yaml
kubectl create -f /root/letsencrypt-staging.yml
# MyApp
kubectl run myip --image=cloudnativelabs/whats-my-ip --replicas=1 --port=8080
kubectl expose deployment myip --name=myip-svc --port=80 --target-port=8080
kubectl create -f /root/myip-ingress.yml
openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -keyout /root/tls.key -out /root/tls.crt -subj "/CN=www.example.com"
kubectl create secret tls myip-secret --key /root/tls.key --cert /root/tls.crt
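To watch the self check while reproducing, it can help to inspect the Certificate that cert-manager creates for the Ingress (named myip-secret here, per the logs above) and to follow the controller logs. The namespace and deployment name below assume the default with-rbac.yaml install and may differ in your setup:
# Inspect the Certificate resource and its events
kubectl describe certificate myip-secret
# Follow cert-manager's controller logs (install-dependent names)
kubectl -n cert-manager logs deploy/cert-manager -f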
Anything else we need to know?:
It is not clear to us what exactly the self check expects to find: the fetch of the /.well-known key succeeds (confirmed via Wireshark), yet the self check runs again and again and keeps failing. More detail about the reason for the failure would be great.
Wireshark captured data - request from Cluster Node to HA proxy:
GET /.well-known/acme-challenge/B2tNUfzfPgK_VOF7AAQEktKaikWxwBQlD0uL77d0N8k HTTP/1.1
Host: pwe.kube.freebox.cz
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
HTTP/1.1 200 OK
Server: nginx/1.15.2
Date: Wed, 29 Aug 2018 14:42:26 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 87
Connection: keep-alive
B2tNUfzfPgK_VOF7AAQEktKaikWxwBQlD0uL77d0N8k.6RElade5K0jHqS1ysziuv2Gm3_LgD-D9APNRg5k8sak
Environment details:
/kind bug
The same happens to me. I'm setting up an HA cluster and this is blocking us from moving our apps.
We installed it with the Helm chart. Is there any workaround that would let us continue deploying our infrastructure?
Fixed in my case. The problem was that the nginx configuration on the load balancer was redirecting connections from port 80 to 443.
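For anyone hitting the same thing, a minimal sketch of the kind of exception that avoids it, assuming an nginx reverse proxy in front of the cluster (the backend address and NodePort are placeholders):
# Sketch: keep /.well-known/acme-challenge/ reachable over plain HTTP
# instead of redirecting everything on port 80 to 443
server {
    listen 80;
    server_name www.example.com;

    location /.well-known/acme-challenge/ {
        proxy_pass http://10.0.0.11:31080;   # placeholder node IP and NodePort
    }

    location / {
        return 301 https://$host$request_uri;
    }
}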
The same here.
I have an HA cluster with an nginx reverse proxy (the DNS entry points at it) and I forward the HTTP/HTTPS ports to the public IPs of the Kubernetes nodes.
My Kubernetes cluster then runs the ingress-nginx controller configured like this:
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  type: NodePort
  ports:
This way, when I use cert-manager to get my cert, I always get a self check error (by the way, all ACME challenges respond correctly if I fetch them manually, both inside and outside the cluster).
If I point my DNS entry at one of the Kubernetes nodes' public IPs instead, everything works and the certificate is issued (but that is a big SPOF if the node the DNS entry points at goes down).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
The same happens here, but with DNAT from a public IP to an internal MetalLB load-balancer configuration.
I found out that the problem was that the cluster wasn't able to resolve the DNS. I solved that and it worked.
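If you suspect the same, a quick way to check is to resolve the public hostname from inside the cluster, for example with a throwaway pod (the image and hostname are just examples):
# Resolve the domain from inside the cluster; it should return an IP
# that is actually reachable from the pods
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup www.example.com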
Solved this myself too after a long time of messing about. The self check is kind of tricky depending on your network configuration. The cert-manager resolver tries to connect to itself to verify that Let's Encrypt can access the data at /.well-known/acme-challenge/. This is often deceptively complicated in many networks: it requires the resolver to be able to connect to itself via what usually resolves to a public IP address. Do a wget/curl to the /.well-known/acme-challenge/ URL from the resolver container to see if it succeeds. In my case, I had to set up hairpin NAT at the router.
Is it a good idea to optionally skip self-check?
I'm going to close this issue out as it seems to be more related to network configuration than anything else. Let's Encrypt needs to be able to access your Ingress controller on port 80 in order to validate challenges, and exposing your ingress controller to the public internet (either via a LoadBalancer service or a NodePort) is outside the scope of cert-manager itself. We just need port 80 to work 😄
Port 80 isn't the issue; that's a given. The IP address is, though. Any installation behind NAT is likely going to fail without a hairpin config. If the self check can't be disabled, maybe mention this in the docs?
Let's Encrypt needs to be able to access your Ingress controller on port 80 in order to validate challenges
I guess this means "Cloudflare Always Use HTTPS" was causing this for me. Perhaps a note about requiring port 80 and plain HTTP access to the domain would be good here: https://docs.cert-manager.io/en/latest/getting-started/troubleshooting.html
Same issue here. I would like to disable the self check or be able to provide the load balancer's IP address, because of hairpinning.
The problem is Kubernetes networking when you use a LoadBalancer provided by the hosting provider. I use DigitalOcean. Kubernetes does not route traffic through the LB's public interface, so nothing adds the PROXY protocol header or SSL when you configure those outside Kubernetes. I use the PROXY protocol, and the moment I enable it and update nginx to handle it, everything works except cert-manager, which fails because it tries to connect to the public domain name and that fails. It works from my computer, since I am outside and the LB adds the needed headers, but not from within the cluster.
cert-manager is not at fault here, but switches to make the validator send the PROXY protocol, or to disable validation for that domain, would help a lot.
With curl, if I do this (from inside the cluster):
curl -I https://myhost.domain.com
it fails.
If I do (from inside the cluster):
curl -I https://myhost.domain.com --haproxy-protocol
it works.
I was informed by the DigitalOcean team that there is a fix for this behavior. They added an annotation for the nginx-ingress controller Service that makes Kubernetes use the load balancer's hostname instead of its public IP, which tricks Kubernetes into thinking the address is not "ours" and routes traffic out through the LB.
https://github.com/digitalocean/digitalocean-cloud-controller-manager/blob/master/docs/controllers/services/examples/README.md#accessing-pods-over-a-managed-load-balancer-from-inside-the-cluster
This is it (I only added the annotation):
kind: Service
apiVersion: v1
metadata:
  name: nginx-ingress-controller
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-hostname: "hello.example.com"
@MichaelOrtho Hi, do you know if a similar workaround exists for Scaleway? I am testing their managed Kubernetes and am having the same problem. Thanks
@vitobotta I have found on Scaleway you need to restart coredns and it will usually succeed.
@AlexsJones Not for me. I had to add the annotation below
"service.beta.kubernetes.io/scw-loadbalancer-use-hostname": "true"
...
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress
  namespace: nginx-ingress
spec:
  externalTrafficPolicy: Local
  type: NodePort
...
After changing externalTrafficPolicy: Local to externalTrafficPolicy: Cluster, I was able to perform self check.
The reason being, the pod with the certificate issuer wound up on a different node than the load balancer, so it couldn't talk to itself through the ingress.
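For reference, the same change as a one-liner against the NodePort Service from the original report (assuming the nginx-ingress namespace and Service name used above); note that externalTrafficPolicy: Cluster no longer preserves client source IPs:
# Switch the ingress Service from Local to Cluster traffic policy
kubectl -n nginx-ingress patch service nginx-ingress \
  -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'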
Hi all, I ran into the same issue. I've recently published hairpin-proxy which works around the issue, specifically for cert-manager self-checks. https://github.com/compumike/hairpin-proxy
It uses CoreDNS rewriting to intercept traffic that would be heading toward the external load balancer. It then adds a PROXY line to requests originating from within the cluster. This allows cert-manager's self-check to pass.
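For the curious, the DNS half of that approach amounts to a CoreDNS rewrite rule along these lines (the hostname and the in-cluster target are placeholders; hairpin-proxy generates the real entries automatically):
# Corefile excerpt: answer queries for the public hostname with an in-cluster proxy service
rewrite name www.example.com hairpin-proxy.hairpin-proxy.svc.cluster.local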
@munnerz I think you misunderstood the problem here. You wrote:
I'm going to close this issue out as it seems to be more related to network configuration than anything else. Let's Encrypt needs to be able to access your Ingress controller on port 80 in order to validate challenges, and exposing your ingress controller to the public internet (either via a LoadBalancer service or a NodePort) is outside the scope of cert-manager itself. We just need port 80 to work smile
The problem is not that Let's Encrypt can't reach the LoadBalancer; the problem is that cert-manager's self check can't reach it. The connection from LE to the LoadBalancer is fine thanks to destination NAT. cert-manager inside the cluster, however, resolves the domain name to the external IP, and that fails in DNAT scenarios.
@munnerz there is already a whole project just for fixing this issue. Is there really no option to just disable self-checks?
Here is another possible solution:
You can use CoreDNS to serve overriding DNS records: create host aliases for the domains and point them at the internal cluster IPs, then serve these host/IP pairs via the hosts plugin:
hosts {
    fallthrough
}
in your CoreDNS config (a concrete example follows below). This way the internal IP addresses are used inside your cluster. You just have to maintain another list (or automate it with a custom operator or script).
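As an illustration, a Corefile excerpt with example entries; the IP below is a placeholder for whatever address reaches your ingress controller from inside the cluster:
hosts {
    # placeholder: in-cluster ingress address mapped to the public hostname
    10.96.10.20 www.example.com
    fallthrough
}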
In DNAT scenarios, just set externalIPs on an ingress Service to your external IP addresses.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-ext
  namespace: nginx-ingress
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
    name: http
  - port: 443
    targetPort: 443
    protocol: TCP
    name: https
  selector:
    app: nginx-ingress-ext
  externalIPs:
  - 11.22.33.44
Kubernetes, in a mostly standard iptables-based setup, creates iptables rules that redirect cluster-internal requests for external IPs to the appropriate Services:
$ sudo iptables-save | grep 11.22.33.44
-A KUBE-SERVICES -d 11.22.33.44/32 -p tcp -m comment --comment "nginx-ingress/nginx-ingress-ext:http external IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 11.22.33.44/32 -p tcp -m comment --comment "nginx-ingress/nginx-ingress-ext:http external IP" -m tcp --dport 80 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-VMPDTJD5TKOUD6KL
-A KUBE-SERVICES -d 11.22.33.44/32 -p tcp -m comment --comment "nginx-ingress/nginx-ingress-ext:http external IP" -m tcp --dport 80 -m addrtype --dst-type LOCAL -j KUBE-SVC-VMPDTJD5TKOUD6KL
-A KUBE-SERVICES -d 11.22.33.44/32 -p tcp -m comment --comment "nginx-ingress/nginx-ingress-ext:https external IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 11.22.33.44/32 -p tcp -m comment --comment "nginx-ingress/nginx-ingress-ext:https external IP" -m tcp --dport 443 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-SUC36V4R4VKNMIWK
-A KUBE-SERVICES -d 11.22.33.44/32 -p tcp -m comment --comment "nginx-ingress/nginx-ingress-ext:https external IP" -m tcp --dport 443 -m addrtype --dst-type LOCAL -j KUBE-SVC-SUC36V4R4VKNMIWK