Cert-manager: Propagation check failed, wrong service used by cm-acme-http-solver

Created on 20 Nov 2019 · 28 Comments · Source: jetstack/cert-manager

Describe the bug:
The propagation check fails because the Ingress created for cm-acme-http-solver points to the wrong service, so ACME cannot validate the domain and deliver the certificate.

Expected behaviour:
Challenge success and certificate delivered

Steps to reproduce the bug:
Follow https://docs.cert-manager.io/en/latest/tutorials/acme/http-validation.html , but with multiple subdomains at the same time.

Anything else we need to know?:
I moved from cert-manager 0.8 to 0.11 a few weeks ago. Everything worked fine, including for newly added subdomains. For the past few days, new subdomains have failed to validate.

Environment details:

  • Kubernetes version : v1.13.10
  • Cloud-provider : Azure AKS
  • cert-manager version : v0.11.0
  • Install method : helm

/kind bug

kubectl describe challenge:

Name:         tls-secret-1495667673-716095195-908999738
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  acme.cert-manager.io/v1alpha2
Kind:         Challenge
Metadata:
  Creation Timestamp:  2019-11-20T10:00:11Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation:  1
  Owner References:
    API Version:           cert-manager.io/v1alpha2
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Order
    Name:                  tls-secret-1495667673-716095195
    UID:                   81e087b8-0b7c-11ea-95a0-7e8b3f31c3c5
  Resource Version:        10962589
  Self Link:               /apis/acme.cert-manager.io/v1alpha2/namespaces/default/challenges/tls-secret-1495667673-716095195-908999738
  UID:                     8a97cf77-0b7c-11ea-95a0-7e8b3f31c3c5
Spec:
  Authz URL:  https://acme-v02.api.letsencrypt.org/acme/authz-v3/1322577887
  Dns Name:   domain20.convertigo.net
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   Issuer
    Name:   letsencrypt-prod
  Key:      BZiefIaNveMd0bwXbjywYExT6wGHdETnJLs5D6iZOAY.zIJBhOqgURIGfuNbqfatmAXt5je_GyTDV34tQ02Xqmw
  Solver:
    Http 01:
      Ingress:
        Class:  nginx
  Token:        BZiefIaNveMd0bwXbjywYExT6wGHdETnJLs5D6iZOAY
  Type:         http-01
  URL:          https://acme-v02.api.letsencrypt.org/acme/chall-v3/1322577887/aUEJmg
  Wildcard:     false
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for http-01 challenge propagation: wrong status code '503', expected '200'
  State:       pending
Events:        <none>
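
The "wrong status code '503', expected '200'" reason above can be reproduced by hand. The following is a minimal sketch, not cert-manager's actual implementation: it assumes the check is a plain HTTP GET on the well-known path that expects a 200 response whose body is the challenge Key. The function names are hypothetical.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def challenge_url(dns_name: str, token: str) -> str:
    """Build the well-known HTTP-01 URL for a domain and token."""
    return f"http://{dns_name}/.well-known/acme-challenge/{token}"


def evaluate_response(status: int, body: str, expected_key: str) -> str:
    """Classify a response: 'ok', or a short reason for the failure."""
    if status != 200:
        return f"wrong status code '{status}', expected '200'"
    if body.strip() != expected_key:
        return "unexpected response body"
    return "ok"


def check_challenge(dns_name: str, token: str, expected_key: str,
                    timeout: float = 10.0) -> str:
    """Fetch the challenge URL and report 'ok' or why it would fail."""
    url = challenge_url(dns_name, token)
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode()
            return evaluate_response(resp.status, body, expected_key)
    except HTTPError as err:
        # Non-2xx answers (such as the 503 from the ingress controller) land here.
        return evaluate_response(err.code, "", expected_key)
    except URLError as err:
        # DNS failure, connection refused, timeout, ...
        return f"request failed: {err.reason}"
```

Calling check_challenge with the Dns Name, Token and Key from the Spec above should return "ok" once the Ingress routes to a solver Service that actually exists.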

I have a cm-acme-http-solver-wmpps Ingress with:

  "spec": {
    "rules": [
      {
        "host": "domain20.convertigo.net",
        "http": {
          "paths": [
            {
              "path": "/.well-known/acme-challenge/BZiefIaNveMd0bwXbjywYExT6wGHdETnJLs5D6iZOAY",
              "backend": {
                "serviceName": "cm-acme-http-solver-9mddf",
                "servicePort": 8089
              }
            }
          ]
        }
      }
    ]
  },

It refers to a Service cm-acme-http-solver-9mddf that doesn't exist, but I do have a cm-acme-http-solver-6c7r2. Is this normal?
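
A mismatch like this can be spotted mechanically. Here is a rough sketch, assuming the extensions/v1beta1 Ingress layout shown above (backend.serviceName) and the parsed JSON output of `kubectl get ingress -o json` and `kubectl get svc -o json`. The function name is hypothetical.

```python
def dangling_backends(ingresses: dict, services: dict) -> list:
    """Return (namespace, ingress, host, path, serviceName) tuples for
    every Ingress backend whose Service does not exist."""
    existing = {
        (svc["metadata"].get("namespace", "default"), svc["metadata"]["name"])
        for svc in services.get("items", [])
    }
    missing = []
    for ing in ingresses.get("items", []):
        meta = ing["metadata"]
        ns = meta.get("namespace", "default")
        for rule in ing.get("spec", {}).get("rules", []):
            # "http" may be absent or null on a rule, hence the "or {}".
            for path in (rule.get("http") or {}).get("paths", []):
                name = path.get("backend", {}).get("serviceName")
                if name and (ns, name) not in existing:
                    missing.append(
                        (ns, meta["name"], rule.get("host"), path.get("path"), name)
                    )
    return missing
```

Load the two kubectl outputs with json.load and pass them in; any tuple returned is an Ingress path that would answer 503. Newer Ingress API versions name the field differently, so the lookup would need adjusting there.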

Do you need any other information, or do you know of a workaround?

Thanks!

All 28 comments

I deleted the Challenge's Ingress that pointed to the wrong service, and a new, valid Ingress was created!
The certificate is now good.
I'll leave the ticket open for a few days in case it happens again.

@nicolas-albert we are experiencing the exact same issue. Ingress created by cert-manager points to an acme solver service that doesn't exist in the namespace (another one exists with a different name)

This looks like a bug.

We tend to issue multiple requests like this (for many subdomains at a time), and most times they appear to succeed, but there's always the odd one that gets stuck like this.

@greywolve try removing the service; a new one should be created with the right name. It worked for us, but it would be nice if you could confirm that.

@nicolas-albert that does work for us too.

Though in our case, we often issue multiple requests for certs like this per day (for different domains), so inevitably some get stuck and require manual intervention like that - not ideal.
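
When many certificates are requested per day, the stuck ones can at least be listed automatically. A minimal sketch, assuming the parsed output of `kubectl get challenge --all-namespaces -o json` and the status.state, status.reason and spec.dnsName fields that appear as State, Reason and Dns Name in the describe output earlier in this thread. The function name and the default search string are hypothetical.

```python
def stuck_challenges(challenges: dict, needle: str = "wrong status code") -> list:
    """Return (namespace, name, dnsName, reason) for every pending
    Challenge whose reason contains `needle`."""
    stuck = []
    for ch in challenges.get("items", []):
        status = ch.get("status", {})
        reason = status.get("reason", "")
        if status.get("state") == "pending" and needle in reason:
            meta = ch["metadata"]
            stuck.append((
                meta.get("namespace", "default"),
                meta["name"],
                ch.get("spec", {}).get("dnsName"),
                reason,
            ))
    return stuck
```

This only reports candidates for manual intervention; it does not fix anything by itself.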

We have hit this issue once, but we don't create many subdomains for now.
In a few weeks, we will be requesting several subdomains per day.
I hope cert-manager 0.12 fixes this, but it isn't released yet.
Have you tested the beta?

0.12 seems out? https://github.com/jetstack/cert-manager/releases/tag/v0.12.0

We've just upgraded, so I'm holding thumbs that it will sort out this issue.

0.12 doesn't appear to fix it. We've upgraded, and certificates still get stuck with the Ingress pointing to a service that doesn't exist anymore.

Bad news :-/
Are you using AKS too?

Same here, AKS too (on a newly created cluster).

I upgraded to cert-manager v0.12 and created a new subdomain: same error.
But now, when I delete the wrongly named service, another wrong service is created and the validation fails again.
Do you have the same issue with v0.12, @greywolve?

I finally got it to work on the second attempt, by changing the service name directly in the Ingress.
This is very annoying, and it's strange that only 3 of us are reporting it here.
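
For anyone who wants to script this workaround, here is a minimal sketch that builds a JSON Patch for the solver Ingress. It assumes the extensions/v1beta1 layout shown earlier in the thread (backend.serviceName). The function name is hypothetical, and the Ingress and Service names in the usage note are the ones from the original report.

```python
import json


def backend_patch(ingress: dict, wrong: str, right: str) -> str:
    """Build a JSON Patch (RFC 6902) document that replaces every
    backend naming `wrong` with `right`."""
    ops = []
    for r, rule in enumerate(ingress.get("spec", {}).get("rules", [])):
        for p, path in enumerate((rule.get("http") or {}).get("paths", [])):
            if path.get("backend", {}).get("serviceName") == wrong:
                ops.append({
                    "op": "replace",
                    "path": f"/spec/rules/{r}/http/paths/{p}/backend/serviceName",
                    "value": right,
                })
    return json.dumps(ops)
```

The resulting string can be passed to `kubectl patch ingress cm-acme-http-solver-wmpps --type=json -p '<patch>'`, with cm-acme-http-solver-9mddf as the wrong name and cm-acme-http-solver-6c7r2 as the right one. An empty list means nothing matched.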

Yes @nicolas-albert, we (along with @greywolve) are still seeing this issue on 0.12.

In case the info is at all useful, we are issuing the certs via labels on an ingress, and they use http01 challenges via nginx-ingress. As stated before, this works some of the time, but some other times we get this bug.

Heyo! I got the same issue on a K3S cluster with traefik as Ingress. Once I edited the Ingress object to point towards the correct service it resolved itself quickly. Also version 0.12

Thanks for the reports here - it seems like some people are running into this fairly frequently. Has anyone been able to isolate the issue and come up with a reproducible example so we can begin investigating?

Alternatively, if you can provide as much info as possible about when you've seen this occur (i.e. immediately after an upgrade, or after a controller restart, or after manually modifying some other resource, etc.) that'd be great and really help to begin working out what is going on!

Hard to come up with a reproducible example because it doesn't seem to happen every time, only sometimes.

I'll do my best to give you as much alternative info as possible. Going to see if I can provoke it again, and record all the logs etc.

I use a single Ingress for multiple subdomains.
One configuration file is generated and applied once.
New domains are added by a regeneration of a big configuration file.
Here is a sample of our configuration:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
 name: c8o-ingress
 annotations:
   kubernetes.io/ingress.class: nginx
   cert-manager.io/issuer: letsencrypt-prod
   nginx.ingress.kubernetes.io/rewrite-target: /
   nginx.ingress.kubernetes.io/affinity: "cookie"
   nginx.ingress.kubernetes.io/session-cookie-name: "route"
   nginx.ingress.kubernetes.io/session-cookie-hash: "sha1"
   nginx.ingress.kubernetes.io/proxy-body-size: 500m
   nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
   nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
 tls:
 - hosts:
   - 'domain1.convertigo.net'
   - 'domain2.convertigo.net'
   - 'domain3.convertigo.net'
   - ...
   secretName: tls-secret
 rules:
 - host: domain1.convertigo.net
   http:
     paths:
     - path: /
       backend:
         serviceName: c8o-front-domain1
         servicePort: 80
 - ....
---
apiVersion: cert-manager.io/v1alpha2
kind: Issuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
 name: tls-secret
spec:
 secretName: tls-secret
 dnsNames:
 - 'domain1.convertigo.net'
 - 'domain2.convertigo.net'
 - 'domain3.convertigo.net'
 - ...

 acme:
   config:
   - http01:
       ingressClass: nginx
     domains:
     - 'domain1.convertigo.net'
     - 'domain2.convertigo.net'
     - 'domain3.convertigo.net'
     - ...

 issuerRef:
   name: letsencrypt-prod
   kind: Issuer

If you tell me exactly what you want to see, I can try adding another domain and retrieve the logs from cert-manager (from all 3 pods?).

Thanks for looking into this.

We upgraded to Helm v3 yesterday, and since then, we haven't been able to reproduce this bug. I'm not sure if this was caused by us using a pre v3 Helm or not. @nicolas-albert did you install cert-manager with Helm, and if so, which version?

In summary: cert-manager v0.12.0 with Helm v3 appears to fix this issue, but we'll keep you updated if we spot this happening again.

@munnerz @nicolas-albert @schemen we've had this bug pop up again on the weekend. I spent some time digging, and I now have a decent hypothesis on what's happening. I created a new issue to describe the bug here: https://github.com/jetstack/cert-manager/issues/2442 .

I have the same problem, cert-manager v0.10.1, Helm v2.14.2

It's really easy to reproduce when you have lots of domains on your certificate. Let's Encrypt allows up to 100 alternate names, and some of our certs have more than 50. It's trivial to reproduce this problem with that many domains.

Should be fixed by https://github.com/jetstack/cert-manager/pull/2460
I'm waiting for the v0.13 release to test it, unless there is a simple way to use this patch now.

We are currently using the patch by temporarily editing the cert manager deployment to use https://hub.docker.com/r/oliverpowell84/cert-manager-controller/tags for the controller image @nicolas-albert

@greywolve Thanks! I've just switched to it, created one domain, and it works!
I hope the next domains will be OK too :)

How can I apply this patch now?

I am using the following command to deploy cert-manager:
helm install cert-manager \
--namespace cert-manager \
--version v0.12.0 \
jetstack/cert-manager

Bare-metal k8s v1.16.3, kind: Certificate with 10 domains.
I have the same problem with cert-manager 0.12.

Everything works fine with cert-manager v0.13.

For me the problem still exists with kubernetes/ingress-nginx and cert-manager v0.13.0/v0.13.1:

Name:         httpd-sample-tls-2519976466-3240946659-870949068
Namespace:    httpd
Labels:       <none>
Annotations:  <none>
API Version:  acme.cert-manager.io/v1alpha2
Kind:         Challenge
Metadata:
  Creation Timestamp:  2020-02-26T12:24:48Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation:  1
  Owner References:
    API Version:           acme.cert-manager.io/v1alpha2
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Order
    Name:                  httpd-sample-tls-2519976466-3240946659
    UID:                   4a28e6c7-b051-4675-a7f9-df4ee8a162ed
  Resource Version:        8605295
  Self Link:               /apis/acme.cert-manager.io/v1alpha2/namespaces/httpd/challenges/httpd-sample-tls-2519976466-3240946659-870949068
  UID:                     19001ea2-e58c-47c9-abad-531d93d46841
Spec:
  Authz URL:  https://api.internal.de/authZ/5e52fe3887546608acde70f2
  Dns Name:   httpd.internal.de
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   ClusterIssuer
    Name:   acme-qs
  Key:      rw1YLophKqrq_EOnlr4slkcj7RyezOnJhrHNPIHfqrEiwZggtlE_w0mFNCMRohS5xSjhTT-D4MOt_L4Rh7qSTU_5kYMAe8L1JgxGDh_VWO8jD-iTZ-rra4k8fbg1V_6N6J6QTGfDkny2VeVE_lpV-NwBPkvyg6Q2PVdbbpsHqes.ccsI09oPobbrh46E1NUwvXp3E4ATQhowtbbjUlSz8Ro
  Solver:
    http01:
      Ingress:
        Class:  nginx
  Token:        rw1YLophKqrq_EOnlr4slkcj7RrHNPIHfqrEiwZggtlE_w0mFNCMRohS5xSjhTT-D4MOt_L4Rh7qSTU_5kYMAe8L1JgxGDh_VWO8jD-iTZ-rra4k8fbg1V_6N6J6QTGfDkny2VeVE_lpV-NwBPkvyg6Q2PVdbbpsHqes
  Type:         http-01
  URL:          https://api.internal.de/chalZ/5e52fe3887546608acde70f3
  Wildcard:     false
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for http-01 challenge propagation: wrong status code '503', expected '200'
  State:       pending
Events:
  Type    Reason     Age   From          Message
  ----    ------     ----  ----          -------
  Normal  Started    24s   cert-manager  Challenge scheduled for processing
  Normal  Presented  24s   cert-manager  Presented challenge using http-01 challenge mechanism

curl httpd.internal.de/.well-known/acme-challenge/rw1YLophKqrq_EOnlr4slkcj7RyezOnJhrHNPIHfqrEiwZggtlE_w0mFNCMRohS5xSjhTT-D4MOt_L4Rh7qSTU_5kYMAe8L1JgxGDh_VWO8jD-iTZ-rra4k8fbg1V_6N6J6QTGfDkny2VeVE_lpV-NwBPkvyg6Q2PVdbbpsHqes
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx/1.17.8</center>
</body>
</html>

Since #2460 has merged and #2442 closed, I think this issue can also be closed.

@macevil if you are still experiencing this problem, could you open an issue with some info explaining exactly what you're seeing, i.e. including output for all of kubectl get clusterissuer,issuer,certificate,certificaterequest,order,challenge,pod,svc,ing -o yaml -n {relevant_namespace}, as well as a copy of logs from cert-manager too (the more the better - we can grep through them ourselves, and sometimes issues can be hard to pin-point exactly where/when they began).

Hi,
I ran into the same issue, but with a gateway timeout.

Unfortunately, the cm-acme-http-solver pod is only ever created on one node (u1), although I have 10 nodes.
I deleted the solver pod on u1, but the scheduler recreated it on the same node.

So, can I set a nodeSelector so that the solver is only created on u2 or u3?

Thank you!

@macevil I added a node so the pods could start, this fixed it for me!
