Cert-manager: Cert-manager gets tripped up by certain DNS configurations

Created on 3 Nov 2018 · 8 Comments · Source: jetstack/cert-manager

Describe the bug:
To be honest, I'm not familiar enough with cert-manager's logging to know exactly what's going on. What I'm noticing, though, is that since adding a second domain to the certificate request, cert-manager takes a very long time to issue certificates and often gets itself banned for an hour due to Let's Encrypt's failed validation limit of 5 per hour.

We're using DNS01 challenge requests with wildcard certificates, and we're using cert-manager's built-in Cloudflare integration.

When looking at the logs, it appears that cert-manager is having a hard time getting both domains' ACME challenge DNS records to propagate, and essentially gets itself into a loop. When watching the DNS records, they appear to have a TTL of 120 seconds, but cert-manager only waits 60 seconds, which might be why the DNS records haven't propagated yet.

It also seems like cert-manager scraps the DNS records if they haven't propagated as fast as it wants them to, and then starts the process over. I would expect it to simply leave the DNS records alone and wait for them to propagate.
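To make the "leave the records alone and just wait" behavior concrete, here is a minimal Python sketch of a polling loop. It is not cert-manager's code: `wait_for_txt`, its parameters, and the injected `lookup` and `sleep` callables are all hypothetical names chosen for illustration, and waiting at least one full TTL before giving up is an assumption on my part.

```python
import time
from typing import Callable, Set


def wait_for_txt(
    lookup: Callable[[str], Set[str]],
    fqdn: str,
    expected: str,
    ttl_seconds: int = 120,
    poll_seconds: int = 10,
    max_wait_seconds: int = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `expected` shows up among the TXT values for `fqdn`.

    The record is never deleted or re-created here; the loop only waits.
    Returns True once the value is visible, False when the deadline passes.
    """
    # Assumption: wait at least one full record TTL before giving up.
    deadline = max(max_wait_seconds, ttl_seconds)
    waited = 0
    while True:
        if expected in lookup(fqdn):
            return True
        if waited >= deadline:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake resolver: the value becomes visible on the third poll.
answers = iter([set(), set(), {"token-value"}])
propagated = wait_for_txt(
    lambda name: next(answers),
    "_acme-challenge.example.com.",
    "token-value",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(propagated)
```

Injecting `lookup` and `sleep` keeps the sketch independent of any particular DNS library and lets it run instantly in a test.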

Of course, this is all speculation based on what I'm seeing in the logs and the DNS records it creates. Here are the logs it's producing in CSV format for closer examination by someone more familiar with cert-manager.

I'm also not sure how it's hitting the failed validation limit with Let's Encrypt, because it seems to stop itself with the self-check.

Also, it's worth mentioning that it always _does_ eventually work. It can just sometimes take several hours to issue a single certificate, especially if it gets itself banned.

Expected behaviour:
A certificate to be provisioned in a couple of minutes without getting itself stuck in a loop.

Steps to reproduce the bug:
Not totally sure; maybe it's due to having two wildcard certificates? Maybe the DNS records of these two domains are set up in an abnormal way? Possibly Cloudflare made a change and DNS records simply aren't propagating as fast as they used to. Not sure how exactly to reproduce, sorry.

Anything else we need to know?:
I can provide additional details as needed to help figure out what's going on.

Environment details:

  • Kubernetes version (e.g. v1.10.2): v1.11.2
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): GKE
  • cert-manager version (e.g. v0.4.0): v0.5.0
  • Install method (e.g. helm or static manifests): Static manifests

/kind bug


All 8 comments

Here's the error message it logs after getting temporarily rate limited by Let's Encrypt. Currently, it seems to always get rate limited before it can successfully issue a certificate; but again, it somehow always ends up successfully issuing one within a few hours.

Error preparing issuer for certificate namespace-name/ingress-tls: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/
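As an illustration of how a client could avoid running into this limit, here is a small Python sketch of a sliding-window failure budget. The 5-per-hour figure comes from the issue description above; the `FailureBudget` class, its method names, and the exact window semantics are assumptions for illustration and do not describe how cert-manager or Let's Encrypt actually count failures.

```python
from collections import deque
from typing import Deque


class FailureBudget:
    """Tracks failed validations inside a sliding time window."""

    def __init__(self, limit: int = 5, window_seconds: float = 3600.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()

    def _expire(self, now: float) -> None:
        # Drop failures that are at least one full window old.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()

    def record_failure(self, now: float) -> None:
        self._expire(now)
        self._failures.append(now)

    def may_attempt(self, now: float) -> bool:
        """True while fewer than `limit` failures fall inside the window."""
        self._expire(now)
        return len(self._failures) < self.limit


# Demo: five failures in quick succession exhaust the budget.
budget = FailureBudget()
for t in range(5):
    budget.record_failure(float(t))
print(budget.may_attempt(10.0))    # budget is used up
print(budget.may_attempt(3600.0))  # the oldest failure has aged out
```

Timestamps are passed in explicitly so the behavior is easy to check without waiting an hour.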

Additionally, I was able to confirm that the TXT record that it creates with Cloudflare has a TTL of 120 seconds. However, the logs seem to indicate cert-manager is only waiting 60 seconds:

Waiting DNS record TTL (60s) to allow propagation of DNS record for domain "example.com."

Upon further testing, it appears that the configuration of the DNS for the two domains is what's tripping up cert-manager. Here's the setup that's problematic:

example1.com

Type | Host | Content
:--- | :--- | :---
A | load-balancer.example1.com | 1.2.3.4
CNAME | *.example1.com | load-balancer.example1.com
CNAME | example1.com | load-balancer.example1.com

example2.com

Type | Host | Content
:--- | :--- | :---
CNAME | *.example2.com | load-balancer.example1.com
CNAME | example2.com | load-balancer.example1.com

_Note: having a CNAME at the root is permissible with Cloudflare due to their CNAME flattening feature._

By changing the setup of example2.com to the below table, the issue is fixed and certificates are provisioned in a timely manner.

Type | Host | Content
:--- | :--- | :---
A | load-balancer.example2.com | 1.2.3.4
CNAME | *.example2.com | load-balancer.example2.com
CNAME | example2.com | load-balancer.example2.com

And while we've implemented this change for now, this duplicates the A record unnecessarily for secondary domains that use the same stack.

@WesCossick I'm seeing similar behavior with our attempts to use the staging server in 0.4.1. On Slack someone mentioned that #837 may be related.

In our case, we have:

| Type | Host | Content |
|-------|------|-------|
| A | mydomain.com | 1.2.3.4 |
| CNAME | *.mydomain.com | mydomain.com |

For this I get an endless "dns-01 self check failed for domain" error, along with "another authorization for domain is in progress".

Log snippet:

I1106 22:36:35.500237       1 dns.go:79] Checking DNS propagation for "mydomain.com" using name servers: [10.31.240.10:53]
I1106 22:36:35.514030       1 dns.go:86] DNS record for "mydomain.com" not yet propagated
I1106 22:36:35.514160       1 dns.go:73] Presenting DNS01 challenge for domain "mydomain.com"
I1106 22:36:36.811516       1 helpers.go:188] Found status change for Certificate "crt-wildcard" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-11-06 22:36:36.811500755 +0000 UTC m=+874.351559328
I1106 22:36:36.811569       1 sync.go:244] Error preparing issuer for certificate prod-app/crt-wildcard: [dns-01 self check failed for domain "mydomain.com", another authorization for domain "mydomain.com" is in progress]
E1106 22:36:36.811593       1 sync.go:165] [prod-app/crt-wildcard] Error getting certificate 'crt-wildcard': secret "crt-wildcard" not found
E1106 22:36:36.818995       1 controller.go:190] certificates controller: Re-queuing item "prod-app/crt-wildcard" due to error processing: [dns-01 self check failed for domain "mydomain.com", another authorization for domain "mydomain.com" is in progress]

The first one just validated after 42 minutes:

kubectl describe certificate crt-wildcard -n prod-app
Name:         crt-wildcard
Namespace:    prod-app
Labels:       ksonnet.io/component=cert-manager.crtWildcard
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"certmanager.k8s.io/v1alpha1","kind":"Certificate","metadata":{"annotations":{},"labels":{"ksonnet.io/component":"cert-manager.crtWildcar...
API Version:  certmanager.k8s.io/v1alpha1
Kind:         Certificate
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-06T22:22:43Z
  Generation:          1
  Resource Version:    19303520
  Self Link:           /apis/certmanager.k8s.io/v1alpha1/namespaces/prod-app/certificates/crt-wildcard
  UID:                 7b396564-e212-11e8-9125-42010a800099
Spec:
  Acme:
    Config:
      Dns 01:
        Provider:  prod-dns
      Domains:
        *.mydomain.com
        mydomain.com
  Common Name:  *.mydomain.com
  Dns Names:
    mydomain.com
  Issuer Ref:
    Kind:       ClusterIssuer
    Name:       letsencrypt-staging
  Secret Name:  crt-wildcard
Status:
  Acme:
    Order:
      Challenges:
        Authz URL:  https://acme-staging-v02.api.letsencrypt.org/acme/authz/Hiu-HTgAcffanOqHM4QOS0ZATy_SDry3vpCbzUEO_jM
        Dns 01:
          Provider:  prod-dns
        Domain:      mydomain.com
        Key:         pHW9FXo_OCouetzAwdB2ah84ObzTjanBoGfr8wtf7cg
        Token:       KSEUZDo5hEjGGXt7cXzEqcfb3v7eo9ASEciA8s_C2Ig
        Type:        dns-01
        URL:         https://acme-staging-v02.api.letsencrypt.org/acme/challenge/Hiu-HTgAcffanOqHM4QOS0ZATy_SDry3vpCbzUEO_jM/192644049
        Wildcard:    false
      URL:           https://acme-staging-v02.api.letsencrypt.org/acme/order/7299857/12624110
  Conditions:
    Last Transition Time:  2018-11-06T23:04:26Z
    Message:               another authorization for domain "mydomain.com" is in progress
    Reason:                ValidateError
    Status:                False
    Type:                  Ready
Events:
  Type    Reason          Age              From          Message
  ----    ------          ----             ----          -------
  Normal  CreateOrder     46m              cert-manager  Created new ACME order, attempting validation...
  Normal  DomainVerified  1m (x2 over 4m)  cert-manager  Domain "mydomain.com" verified with "dns-01" validation
  Normal  IssueCert       1m               cert-manager  Issuing certificate...
  Normal  CertObtained    1m               cert-manager  Obtained certificate from ACME server
  Normal  CertIssued      1m               cert-manager  Certificate issued successfully

@rosskevin After reading #837, I see what's going on, at least for our company's case since we're using v0.5.0. #670, which seems to have been released in v0.5.0, introduced logic that rewrites the FQDN when it thinks the ACME challenge subdomain exists and that it's a CNAME.

The problem, though, is that we're using wildcard CNAMEs, which means every subdomain is going to look like it exists when you query the nameservers. So when cert-manager checks if the acme-challenge subdomain exists, our DNS providers will say it does and that its CNAME content is the content of the wildcard CNAME.

Then it rewrites the FQDN and that trips it up. I'm not sure why that only delays the provisioning process rather than breaking it outright. And I'm still unsure why it thinks the TTL is 60s when it's actually 120s in Cloudflare.
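To illustrate the effect, here is a toy Python model of the two example2.com setups from earlier in the thread. It is not cert-manager's resolver logic: `lookup_cname` and `follow_cnames` are hypothetical helpers, and the wildcard handling is simplified to a single label (real wildcard matching has more rules).

```python
from typing import Dict, Optional, Tuple

Records = Dict[Tuple[str, str], str]

# The problematic setup: example2.com's wildcard points into example1.com.
PROBLEMATIC: Records = {
    ("load-balancer.example1.com", "A"): "1.2.3.4",
    ("*.example1.com", "CNAME"): "load-balancer.example1.com",
    ("*.example2.com", "CNAME"): "load-balancer.example1.com",
}

# The setup that fixed it: the wildcard stays inside example2.com.
FIXED: Records = {
    ("load-balancer.example2.com", "A"): "1.2.3.4",
    ("*.example2.com", "CNAME"): "load-balancer.example2.com",
}


def lookup_cname(name: str, records: Records) -> Optional[str]:
    """Return the CNAME target for `name`, falling back to a wildcard."""
    exact = records.get((name, "CNAME"))
    if exact is not None:
        return exact
    # A wildcard only answers for names that have no records of their own.
    if any(owner == name for owner, _ in records):
        return None
    parent = name.split(".", 1)[1] if "." in name else ""
    return records.get(("*." + parent, "CNAME"))


def follow_cnames(name: str, records: Records, max_hops: int = 8) -> str:
    """Follow CNAMEs from `name` until a name without a CNAME is reached."""
    for _ in range(max_hops):
        target = lookup_cname(name, records)
        if target is None:
            return name
        name = target
    raise RuntimeError("CNAME chain too long")


print(follow_cnames("_acme-challenge.example2.com", PROBLEMATIC))
# -> load-balancer.example1.com (a name in a different zone)
print(follow_cnames("_acme-challenge.example2.com", FIXED))
# -> load-balancer.example2.com (still inside example2.com)
```

In the problematic setup, the challenge name appears to exist because of the wildcard, and following it lands on a name in example1.com rather than example2.com.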

I switched to Cloudflare and 0.5.0, and a certificate for the following setup was issued successfully within 5 or so minutes on staging.

| Type | Host | Content |
|-------|------|-------|
| A | mydomain.com | 1.2.3.4 |
| CNAME | *.mydomain.com | mydomain.com |

@rosskevin This issue is only a problem if the wildcard CNAME record points to a different domain.

We fixed this by adding _acme-challenge.example2.com CNAME cert-manager-hack.example2.com. (i.e. at the same level as the wildcard, pointing to the same zone)

This means when cert-manager follows that specific CNAME it stays in an updateable zone.
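A rough way to picture "stays in an updateable zone" is a longest-suffix match of the CNAME target against the zones the DNS credentials can write to. The Python sketch below assumes, purely for illustration, that only example2.com is writable; `owning_zone` is a hypothetical helper, not a cert-manager function.

```python
from typing import Iterable, Optional


def owning_zone(fqdn: str, zones: Iterable[str]) -> Optional[str]:
    """Return the longest zone that `fqdn` falls under, or None."""
    name = fqdn.rstrip(".").lower()
    best: Optional[str] = None
    for zone in zones:
        candidate = zone.rstrip(".").lower()
        if name == candidate or name.endswith("." + candidate):
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


# Assumption: the API credentials can only update this zone.
UPDATABLE_ZONES = ["example2.com"]

# Target reached through the wildcard CNAME in the problematic setup.
print(owning_zone("load-balancer.example1.com", UPDATABLE_ZONES))
# Target of the explicit _acme-challenge CNAME from the workaround.
print(owning_zone("cert-manager-hack.example2.com.", UPDATABLE_ZONES))
```

With the explicit record in place, the name that ends up holding the TXT record belongs to a zone that can be updated, whereas the wildcard target does not.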

@WesCossick We ran into the issue when the wildcard CNAME pointed to the same domain, at the same level as the wildcard. Not sure what the difference in our configuration is, but it seems more or less identical.
