Describe the bug:
To be honest, I'm not familiar enough with cert-manager's logging to know exactly what's going on. What I'm noticing, though, is that since adding a second domain to the certificate request, cert-manager takes a very long time to issue certificates and often gets itself banned for an hour due to Let's Encrypt's failed validation limit of 5 per hour.
We're using DNS01 challenge requests with wildcard certificates, and we're using cert-manager's built-in Cloudflare integration.
When looking at the logs, it appears that cert-manager is having a hard time getting both domains' ACME challenge DNS records to propagate, and essentially gets itself into a loop. When watching the DNS records, it appears they have a TTL of 120 seconds, but cert-manager is only waiting 60 seconds, which might be why the DNS records haven't propagated yet.
It also seems like cert-manager scraps the DNS records if they haven't propagated as fast as it wants them to, and then starts the process over. What I would expect is for it to simply leave the DNS records alone and wait for them to propagate.
Of course, this is all speculation based on what I'm seeing in the logs and the DNS records it creates. Here are the logs it's producing in CSV format for closer examination by someone more familiar with cert-manager.
I'm also not sure how it's hitting the failed validation limit with Let's Encrypt, because it seems to stop itself with the self-check.
Also, it's worth mentioning that it always _does_ eventually work. It can just sometimes take several hours to issue a single certificate, especially if it gets itself banned.
Expected behaviour:
A certificate to be provisioned in a couple of minutes without getting itself stuck in a loop.
Steps to reproduce the bug:
Not totally sure; maybe it's due to having two wildcard certificates? Maybe the DNS records of these two domains are set up in an abnormal way? Possibly Cloudflare made a change and DNS records simply aren't propagating as fast as they used to. Not sure how exactly to reproduce, sorry.
Anything else we need to know?:
I can provide additional details as needed to help figure out what's going on.
Environment details:
/kind bug
Here's the error message it logs after getting temporarily rate limited by Let's Encrypt. Currently, it seems to always get rate limited before it can successfully issue a certificate; but again, it somehow always ends up successfully issuing one within a few hours.
Error preparing issuer for certificate namespace-name/ingress-tls: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/
Additionally, I was able to confirm that the TXT record that it creates with Cloudflare has a TTL of 120 seconds. However, the logs seem to indicate cert-manager is only waiting 60 seconds:
Waiting DNS record TTL (60s) to allow propagation of DNS record for domain "example.com."
Upon further testing, it appears that the configuration of the DNS for the two domains is what's tripping up cert-manager. Here's the setup that's problematic:
example1.com:

Type | Host | Content
:--- | :--- | :---
A | load-balancer.example1.com | 1.2.3.4
CNAME | *.example1.com | load-balancer.example1.com
CNAME | example1.com | load-balancer.example1.com

example2.com:

Type | Host | Content
:--- | :--- | :---
CNAME | *.example2.com | load-balancer.example1.com
CNAME | example2.com | load-balancer.example1.com
_Note: having a CNAME at the root is permissible with Cloudflare due to their CNAME flattening feature._
By changing the setup of example2.com to the below table, the issue is fixed and certificates are provisioned in a timely manner.
Type | Host | Content
:--- | :--- | :---
A | load-balancer.example2.com | 1.2.3.4
CNAME | *.example2.com | load-balancer.example2.com
CNAME | example2.com | load-balancer.example2.com
And while we've implemented this change for now, this duplicates the A record unnecessarily for secondary domains that use the same stack.
@WesCossick I'm seeing similar behavior with our attempts to use the staging server in 0.4.1. On Slack, someone mentioned that #837 may be related.
In our case, we have:
| Type | Host | Content |
|-------|------|-------|
| A | mydomain.com | 1.2.3.4 |
| CNAME | *.mydomain.com | mydomain.com |
With this setup I get an endless loop of `dns-01 self check failed for domain` errors, together with `another authorization for domain is in progress`.
Log snippet:
I1106 22:36:35.500237 1 dns.go:79] Checking DNS propagation for "mydomain.com" using name servers: [10.31.240.10:53]
I1106 22:36:35.514030 1 dns.go:86] DNS record for "mydomain.com" not yet propagated
I1106 22:36:35.514160 1 dns.go:73] Presenting DNS01 challenge for domain "mydomain.com"
I1106 22:36:36.811516 1 helpers.go:188] Found status change for Certificate "crt-wildcard" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-11-06 22:36:36.811500755 +0000 UTC m=+874.351559328
I1106 22:36:36.811569 1 sync.go:244] Error preparing issuer for certificate prod-app/crt-wildcard: [dns-01 self check failed for domain "mydomain.com", another authorization for domain "mydomain.com" is in progress]
E1106 22:36:36.811593 1 sync.go:165] [prod-app/crt-wildcard] Error getting certificate 'crt-wildcard': secret "crt-wildcard" not found
E1106 22:36:36.818995 1 controller.go:190] certificates controller: Re-queuing item "prod-app/crt-wildcard" due to error processing: [dns-01 self check failed for domain "mydomain.com", another authorization for domain "mydomain.com" is in progress]
The first one just validated, after 42 minutes:
kubectl describe certificate crt-wildcard -n prod-app
Name: crt-wildcard
Namespace: prod-app
Labels: ksonnet.io/component=cert-manager.crtWildcard
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"certmanager.k8s.io/v1alpha1","kind":"Certificate","metadata":{"annotations":{},"labels":{"ksonnet.io/component":"cert-manager.crtWildcar...
API Version: certmanager.k8s.io/v1alpha1
Kind: Certificate
Metadata:
Cluster Name:
Creation Timestamp: 2018-11-06T22:22:43Z
Generation: 1
Resource Version: 19303520
Self Link: /apis/certmanager.k8s.io/v1alpha1/namespaces/prod-app/certificates/crt-wildcard
UID: 7b396564-e212-11e8-9125-42010a800099
Spec:
Acme:
Config:
Dns 01:
Provider: prod-dns
Domains:
*.mydomain.com
mydomain.com
Common Name: *.mydomain.com
Dns Names:
mydomain.com
Issuer Ref:
Kind: ClusterIssuer
Name: letsencrypt-staging
Secret Name: crt-wildcard
Status:
Acme:
Order:
Challenges:
Authz URL: https://acme-staging-v02.api.letsencrypt.org/acme/authz/Hiu-HTgAcffanOqHM4QOS0ZATy_SDry3vpCbzUEO_jM
Dns 01:
Provider: prod-dns
Domain: mydomain.com
Key: pHW9FXo_OCouetzAwdB2ah84ObzTjanBoGfr8wtf7cg
Token: KSEUZDo5hEjGGXt7cXzEqcfb3v7eo9ASEciA8s_C2Ig
Type: dns-01
URL: https://acme-staging-v02.api.letsencrypt.org/acme/challenge/Hiu-HTgAcffanOqHM4QOS0ZATy_SDry3vpCbzUEO_jM/192644049
Wildcard: false
URL: https://acme-staging-v02.api.letsencrypt.org/acme/order/7299857/12624110
Conditions:
Last Transition Time: 2018-11-06T23:04:26Z
Message: another authorization for domain "mydomain.com" is in progress
Reason: ValidateError
Status: False
Type: Ready
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreateOrder 46m cert-manager Created new ACME order, attempting validation...
Normal DomainVerified 1m (x2 over 4m) cert-manager Domain "mydomain.com" verified with "dns-01" validation
Normal IssueCert 1m cert-manager Issuing certificate...
Normal CertObtained 1m cert-manager Obtained certificate from ACME server
Normal CertIssued 1m cert-manager Certificate issued successfully
@rosskevin After reading #837, I see what's going on, at least for our company's case since we're using v0.5.0. #670, which seems to have been released in v0.5.0, introduced logic that rewrites the FQDN when it thinks the ACME challenge subdomain exists and that it's a CNAME.
The problem, though, is that we're using wildcard CNAMEs, which means every subdomain is going to look like it exists when you query the nameservers. So when cert-manager checks if the acme-challenge subdomain exists, our DNS providers will say it does and that its CNAME content is the content of the wildcard CNAME.
Then it rewrites the FQDN and that trips it up. I'm not sure why that only delays the provisioning process rather than breaking it outright. And I'm still unsure why it thinks the TTL is 60s when it's actually 120s in Cloudflare.
I switched to Cloudflare and 0.5.0, and with the following records the certificate was created successfully within 5 or so minutes on staging.
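To make the wildcard effect concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour described above, not cert-manager's actual code: the `lookup` and `rewrite_fqdn` helpers are hypothetical, the record table is copied from the problematic setup, and wildcard matching is simplified to a single label.

```python
# Toy record table mirroring the problematic setup (not a real DNS query).
RECORDS = {
    "load-balancer.example1.com": ("A", "1.2.3.4"),
    "*.example1.com": ("CNAME", "load-balancer.example1.com"),
    "example1.com": ("CNAME", "load-balancer.example1.com"),
    "*.example2.com": ("CNAME", "load-balancer.example1.com"),
    "example2.com": ("CNAME", "load-balancer.example1.com"),
}


def lookup(name, records):
    """Return the record for name: exact match first, then a one-label wildcard."""
    if name in records:
        return records[name]
    _, _, parent = name.partition(".")
    return records.get("*." + parent)


def rewrite_fqdn(fqdn, records):
    """Hypothetical stand-in for the rewrite step: if fqdn looks like a CNAME,
    return the CNAME target, otherwise return fqdn unchanged."""
    record = lookup(fqdn, records)
    if record is not None and record[0] == "CNAME":
        return record[1]
    return fqdn


challenge = "_acme-challenge.example2.com"
rewritten = rewrite_fqdn(challenge, RECORDS)
print(rewritten)  # load-balancer.example1.com
```

In this model no `_acme-challenge` record was ever created, yet the lookup still reports a CNAME because the wildcard answers for every otherwise missing subdomain, so the challenge name ends up rewritten to a host under example1.com.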
| Type | Host | Content |
|-------|------|-------|
| A | mydomain.com | 1.2.3.4 |
| CNAME | *.mydomain.com | mydomain.com |
@rosskevin This issue is only a problem if the wildcard CNAME record points to a different domain.
We fixed this by adding `_acme-challenge.example2.com CNAME cert-manager-hack.example2.com.` (i.e. at the same level as the wildcard, pointing to the same zone).
This means when cert-manager follows that specific CNAME it stays in an updateable zone.
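A rough Python sketch of why this workaround helps, under two assumptions: an explicit record takes precedence over a wildcard at the same level, and only a single CNAME hop is followed for the challenge name. The `lookup` and `in_zone` helpers and the record table are illustrative, not taken from cert-manager.

```python
def lookup(name, records):
    """Return the record for name: exact match first, then a one-label wildcard."""
    if name in records:
        return records[name]
    _, _, parent = name.partition(".")
    return records.get("*." + parent)


def in_zone(name, zone):
    """True if name is the zone apex or a subdomain of it."""
    return name == zone or name.endswith("." + zone)


ZONE = "example2.com"
records = {
    "*.example2.com": ("CNAME", "load-balancer.example1.com"),
    "example2.com": ("CNAME", "load-balancer.example1.com"),
}
challenge = "_acme-challenge.example2.com"

# Without the workaround, the wildcard answers for the challenge name.
before = lookup(challenge, records)[1]

# With the workaround, the explicit CNAME shadows the wildcard.
records[challenge] = ("CNAME", "cert-manager-hack.example2.com")
after = lookup(challenge, records)[1]

print(before, in_zone(before, ZONE))  # load-balancer.example1.com False
print(after, in_zone(after, ZONE))    # cert-manager-hack.example2.com True
```

The explicit record does not stop the CNAME from being followed; it only ensures the target is a name inside example2.com, where the TXT record can be written.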
@WesCossick We ran into the issue when the wildcard CNAME pointed to the same domain, at the same level as the wildcard. I'm not sure what the difference between our configurations is, but they seem more or less identical.