Cert-manager: DNS01 CNAME support breaks wildcard support for nginx ingress

Created on 17 Aug 2018 · 19 Comments · Source: jetstack/cert-manager

Describe the bug:

We have a wildcard domain pointing at an nginx ingress controller, which basically means that the wildcard domain resolves to an Elastic Load Balancer.

When trying to create the _acme-challenge record in the wildcarded domain, it sees the CNAME to the ELB and then tries to update the DNS in the ELB's domain (us-west-2.elb.amazonaws.com).

I0817 02:04:31.075401       1 logger.go:73] Calling GetAuthorization
I0817 02:04:31.203160       1 logger.go:98] Calling DNS01ChallengeRecord
I0817 02:04:31.203193       1 prepare.go:279] Cleaning up old/expired challenges for Certificate staging/staging-phoenix-my-tls
I0817 02:04:31.203206       1 logger.go:68] Calling GetChallenge
I0817 02:04:31.436572       1 wait.go:66] Updating FQDN: _acme-challenge.example.com. with it's CNAME: ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com.
I0817 02:04:31.075401       1 logger.go:73] Calling GetAuthorization
I0817 02:04:31.203160       1 logger.go:98] Calling DNS01ChallengeRecord
I0817 02:04:31.203193       1 prepare.go:279] Cleaning up old/expired challenges for Certificate staging/staging-wildcard-tls
I0817 02:04:31.203206       1 logger.go:68] Calling GetChallenge
I0817 02:04:31.436572       1 wait.go:66] Updating FQDN: _acme-challenge.example.com. with it's CNAME: ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com.
I0817 02:04:31.436589       1 dns.go:93] Checking DNS propagation for "example.com" using name servers: [100.64.0.10:53]
I0817 02:04:31.472333       1 dns.go:100] DNS record for "example.com" not yet propagated
I0817 02:04:31.472460       1 dns.go:83] Presenting DNS01 challenge for domain "example.com"
I0817 02:04:31.481949       1 wait.go:66] Updating FQDN: _acme-challenge.example.com. with it's CNAME: ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com.
I0817 02:04:31.841210       1 helpers.go:201] Found status change for Certificate "staging-wildcard-tls" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-08-17 02:04:31.841201695 +0000 UTC m=+9743.366825451
I0817 02:04:31.841235       1 sync.go:276] Error preparing issuer for certificate staging/staging-wildcard-tls: Failed to determine Route 53 hosted zone ID: Zone us-west-2.elb.amazonaws.com. not found in Route 53 for domain ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com.
E0817 02:04:31.841254       1 sync.go:197] [staging/staging-wildcard-tls] Error getting certificate 'staging-wildcard-tls': secret "staging-wildcard-tls" not found 
E0817 02:04:31.854121       1 controller.go:180] certificates controller: Re-queuing item "staging/staging-wildcard-tls" due to error processing: Failed to determine Route 53 hosted zone ID: Zone us-west-2.elb.amazonaws.com. not found in Route 53 for domain ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com.

Expected behaviour:
The _acme-challenge TXT record is created in the wildcarded domain (example.com in the above)

Steps to reproduce the bug:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    certmanager.k8s.io/acme-challenge-type: dns01
    certmanager.k8s.io/acme-dns01-provider: route53
    certmanager.k8s.io/cluster-issuer: letsencrypt-staging
    kubernetes.io/ingress.class: nginx-external
  name: staging-frontend
spec:
  rules:
  - host: '*.example.com'
    http:
      paths:
      - backend:
          serviceName: staging-frontend
          servicePort: http
  tls:
  - hosts:
    - '*.example.com'
    secretName: staging-wildcard-tls

Applying the Ingress above, with a suitable nginx ingress controller pointing at an AWS ELB, should do the trick.

Anything else we need to know?:

The CNAME behaviour was introduced in #670. The commit message is sufficient to understand the motivation behind the change, and there's plenty of support for the change within #670. As such, I don't know how best to fix this so that my use case is supported without breaking the use case that motivated #670.

cc: @gurvindersingh

Environment details:

  • Kubernetes version (e.g. v1.10.2): 1.9.7
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): AWS
  • cert-manager version (e.g. v0.4.0): quay.io/jetstack/cert-manager-controller:canary
  • Install method (e.g. helm or static manifests): static manifest

/kind bug

All 19 comments

It looks like #670 isn't yet in any actual releases (I'm using canary as that's what the default static manifests pointed me at) so I'll revert to v0.4.1 for now

I had this issue as well testing master.

Using the CNAME to create the text record should really be some kind of option, or at least it should test which domain we can create records in and use that one
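A rough illustration of the "test which domain we can create records in" idea, not cert-manager's actual code: given the zones the DNS provider account hosts, use the CNAME target only when one of those zones can hold it, and otherwise fall back to the original name. The function names and the `hosted_zones` list are hypothetical.

```python
def find_zone(fqdn, hosted_zones):
    """Return the longest hosted zone that fqdn belongs to, or None."""
    matches = [z for z in hosted_zones if fqdn == z or fqdn.endswith("." + z)]
    return max(matches, key=len) if matches else None


def pick_record_name(original_fqdn, cname_target, hosted_zones):
    """Prefer the CNAME target, but only if we host a zone that can hold it."""
    for candidate in (cname_target, original_fqdn):
        if candidate is not None and find_zone(candidate, hosted_zones):
            return candidate
    raise LookupError("no hosted zone found for " + original_fqdn)


# Names taken from the report above; the hosted zone list is an assumption.
ELB = "ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com."
name = pick_record_name("_acme-challenge.example.com.", ELB, ["example.com."])
```

With only `example.com.` hosted, the ELB name is rejected and the TXT record would go to `_acme-challenge.example.com.`, which is the behaviour the report expects.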

@willthames I think we can use a config option to enable or disable the CNAME support. The default can be disabled to keep the behavior same as earlier. This code can be put under that condition check.

Hm - so from my understanding, we should be following CNAMEs for _acme-challenge.example.com instead of for example.com itself.

This would in turn, resolve your issue.

I don't think the validation process will actually even work if we're resolving CNAMEs for example.com in your example, as we'd not be proving ownership of the domain.

@gurvindersingh would you mind clarifying the intent of the original PR? 😀

@cpu do you have any idea how we should handle CNAMEs? I assume only CNAME records set on _well-known.example.com should be followed?

On Tue, 21 Aug 2018 at 10:39, Gurvinder Singh wrote:

@willthames I think we can use a config option to enable or disable the CNAME support. The default can be disabled to keep the behavior same as earlier. This code (https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/util/dns.go#L23-L29) can be put under that condition check.

@munnerz the code changes in PR #670 follow the CNAME for _acme-challenge.example.com, not example.com, so I'm not sure that is the problem.

@munnerz reading the bug report more carefully, it seems to me the current code is doing what it is supposed to do. As @willthames has a wildcard domain CNAMEd to an AWS LB, the code sees that there is a CNAME even for _acme-challenge and updates the FQDN to use that. So the solution is to have a config option to enable this feature or not, depending on your cluster setup.

Thanks for taking a look @gurvindersingh.

That makes sense then - so you have *.example.com pointed at an ELB, which implies _acme-challenge.example.com is CNAMEd too.

This is a tricky one - from what I can see, there's no way for us to detect this, as wildcards are a DNS provider feature and not part of the DNS spec.

I think it's also fair that some users may want some domains to follow the CNAME, and some to not.

One solution, I'd guess, is if you were to provide an explicit CNAME for _acme-challenge.example.com to some other domain (e.g. acme-challenge.acme.example.com).
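To make the wildcard effect and the explicit-record workaround concrete, here is a toy in-memory lookup. It is only a sketch: real wildcard matching is done by the DNS server and covers more cases than this single-label version, and `lookup_cname` and the zone dictionaries are made up for illustration.

```python
def lookup_cname(fqdn, zone):
    """Return the CNAME target for fqdn; an exact record beats the wildcard."""
    exact = zone.get(fqdn)
    if exact is not None:
        return exact
    # Swap the left-most label for "*" and retry, which is roughly how a
    # wildcard answers for names that have no record of their own.
    _, _, parent = fqdn.partition(".")
    return zone.get("*." + parent)


ELB = "ab06d0c81742111e8b745062d6efc4d9-1815477658.us-west-2.elb.amazonaws.com."

# Only the wildcard exists: the challenge name inherits the ELB CNAME.
wildcard_only = {"*.example.com.": ELB}

# An explicit record for the challenge name takes precedence over the wildcard.
with_explicit = dict(wildcard_only)
with_explicit["_acme-challenge.example.com."] = "acme-challenge.acme.example.com."

inherited = lookup_cname("_acme-challenge.example.com.", wildcard_only)
overridden = lookup_cname("_acme-challenge.example.com.", with_explicit)
```

In the first zone the challenge name resolves to the ELB, which is what trips up the Route 53 zone lookup in the report. In the second it resolves to the explicitly chosen domain instead.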

How do you think we can best represent this configuration option to users?

I was thinking of keeping things simple. The earlier behavior is what people are used to in some setups, so we can have a config option, e.g. enable-acme-cname, which users enable if they want CNAME replacement for the _acme-challenge part; otherwise the earlier behavior stays the same.

If at a later stage people want more granular control, with different behavior for different domains, then we can think about adding domain-specific CNAME logic.

I tried v0.5.0 and canary (master-5602) and it didn't work in either of those. I guess, as mentioned above, we have to wait for #670 to make it into a release.
I've reverted to v0.4.1 using the same helm chart without any changes other than the tag, just doing a helm upgrade (fingers crossed I've not put myself in a world of hurt longer term).
I am no longer seeing the CNAME-related "cannot find ZoneID for xyzxyz.elb.amazon.com domain" error message.

Same worked for me as well! 🤔

#670 feels like a major breaking change. I had installed 0.5.0 via helm on a new cluster and was rolling merrily along until somebody added a wildcard entry in DNS for my top-level domain. Then suddenly no certs could be issued. Luckily I noticed before I blew my LE API limit (not sure if that would be an issue, but I didn't want to find out). It took me forever to figure out what the issue was. Removing the wildcard from DNS for now allowed the certs to be issued.

Just to be clear, I am using the nginx ingress controller for now, but I'm not even using wildcards in my ingresses yet. I feel like people with a good working setup are going to either upgrade or if already on 0.5.0 have someone add a DNS entry that causes their certs not to be re-issued.

Please let me know if I'm way off on this.

#1035 has more information about a very much delayed resolution of ACME certificates in 0.4.1. In my case it succeeded after 45m. I'm not sure it is this exact issue, but it does seem to be in the same area.

I agree with @keithlayne on this one. The change from #670 causes cert-manager to essentially break down when the user has a wildcard CNAME record. Reference #1035.

I completely see why #670 was needed, but I'd argue that a wildcard CNAME record is at least as common a use case, if not more so. Therefore, an option like the one @gurvindersingh proposed is imperative.

We were tripped up by this bug as well after upgrading to 0.5.0. The cluster.example.com DNS zone is hosted in Azure DNS and has a wildcard CNAME pointing to cluster-example.trafficmanager.net.

After creating a certificate for foo.cluster.example.com we see this:

cluster.example.com zone (cert-manager 0.4.x)

| Name | Type | Value |
|---|---|---|
| * | CNAME | cluster-example.trafficmanager.net |
| _acme-challenge.foo | TXT | Zm9vYmFy... |

cluster.example.com zone (cert-manager 0.5.0)

| Name | Type | Value |
|---|---|---|
| * | CNAME | cluster-example.trafficmanager.net |
| cluster-example.trafficmanager.net | TXT | Zm9vYmFy... |

trafficmanager.net is Azure's global load balancer, so we can't create the _acme-challenge records on that domain.

I also downgraded to v0.4.1 for now. This would be awesome if you could toggle as suggested.

So I've done some research here, and have found that if someone has a CNAME record configured for _acme-challenge.example.com (including via *.example.com), the ACME server will check both example.com and acme.insecure.com (the domain that the CNAME points at) for the TXT record, and if either has one, it will validate the challenge as successful.

This implies to me that we need to allow users to configure how cert-manager behaves.

I see two options going forward, and I'd love to hear feedback on either:

1) add a followCNAME option to issuer.spec.acme.dns01.solvers[] (defaults to false). If set to true, when cert-manager encounters a CNAME record it will traverse the CNAME and update the zone it points at (and check that domain during self checking).

2) utilise the certificate.spec.acme.config.domains[] field to allow users to configure this:

apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: testcrt-acme
spec:
  acme:
    config:
    - domains:
      - example.com
      dns01:
        provider: cloudflare
  dnsNames:
  - example.com

The above would cause cert-manager to update _acme-challenge.example.com directly with a TXT record

apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: testcrt-acme
spec:
  acme:
    config:
    - domains:
      - acme.insecure.com
      dns01:
        provider: cloudflare
  dnsNames:
  - example.com

In order to achieve option (2), we'll need to attempt a CNAME lookup for every domain listed in dnsNames in order to determine which domains are valid substitutions for example.com.

Over time, we want to remove the certificate.spec.acme configuration from Certificate resources anyway, which makes this simpler for end-users (as they will only need to request dnsNames: ["example.com"] and not have to think about how that'll be solved)
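For comparison, option (1) could look roughly like the sketch below. It only models the decision, not cert-manager's implementation: the `Solver` class, its `follow_cname` field (standing in for the proposed followCNAME setting) and the injected `resolve_cname` callable are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Solver:
    provider: str
    follow_cname: bool = False  # stands in for followCNAME; off by default


def record_name(domain, solver, resolve_cname):
    """Return the name the challenge TXT record would be written to."""
    fqdn = "_acme-challenge." + domain + "."
    if not solver.follow_cname:
        # Default: ignore any CNAME, including one inherited from a wildcard.
        return fqdn
    target = resolve_cname(fqdn)
    return target if target else fqdn


# A stand-in resolver that always answers with the delegated challenge domain.
def fake_resolver(fqdn):
    return "acme.insecure.com."


default_name = record_name("example.com", Solver("cloudflare"), fake_resolver)
followed_name = record_name("example.com", Solver("cloudflare", True), fake_resolver)
```

Keeping the default at false means setups with a wildcard CNAME keep working untouched, while users who delegate challenges to another zone opt in per solver.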

I'd rather go with option (1), since followCNAME is an explicit option, whereas option (2) appears more like an implicit, derived functionality to me.

I agree with @timuthy here, I saw you also already took that approach @munnerz.

This is now fixed as part of #1136 😄
