Cert-manager: Cert-manager can't find GoogleCloud subdomain.

Created on 28 Mar 2019 · 13 Comments · Source: jetstack/cert-manager

Describe the bug:
cert-manager cannot find a Google Cloud DNS subdomain zone.

I have a zone, a.foobar.com, which is managed in Cloud DNS.
I want to create SSL certificates for x.a.foobar.com and y.a.foobar.com,
but cert-manager attempts to find the domain foobar.com instead.

Expected behaviour:
cert-manager attempts to find the domain a.foobar.com.

Steps to reproduce the bug:

kubectl logs cert-manager-54f65df574-mvmmf --namespace=cert-manager

E0328 04:39:16.988310       1 controller.go:208] challenges controller: Re-queuing item "default/dev-superset-tls-591878805-1" due to error processing: No matching GoogleCloud domain found for domain foobar.com.
E0328 04:39:17.062437       1 controller.go:208] challenges controller: Re-queuing item "default/dev-superset-tls-591878805-0" due to error processing: No matching GoogleCloud domain found for domain foobar.com.
I0328 05:09:16.988625       1 controller.go:206] challenges controller: syncing item 'default/dev-superset-tls-591878805-1'
I0328 05:09:16.988826       1 logger.go:103] Calling Discover
I0328 05:09:17.062725       1 controller.go:206] challenges controller: syncing item 'default/dev-superset-tls-591878805-0'
I0328 05:09:17.062857       1 logger.go:103] Calling Discover
I0328 05:09:17.178510       1 dns.go:89] Presenting DNS01 challenge for domain "x.a.foobar.com"
I0328 05:09:17.181176       1 dns.go:89] Presenting DNS01 challenge for domain "y.a.foobar.com"
E0328 05:09:18.445717       1 controller.go:208] challenges controller: Re-queuing item "default/dev-superset-tls-591878805-1" due to error processing: No matching GoogleCloud domain found for domain foobar.com.
E0328 05:09:18.506470       1 controller.go:208] challenges controller: Re-queuing item "default/dev-superset-tls-591878805-0" due to error processing: No matching GoogleCloud domain found for domain foobar.com.

Anything else we need to know?:
foobar.com. is managed in Route 53
a.foobar.com is managed in Cloud DNS

Environment details:

  • Kubernetes version (e.g. v1.10.2): v1.11.7-gke.12
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): GKE
  • cert-manager version (e.g. v0.4.0): v0.7
  • Install method (e.g. helm or static manifests): static manifests

/kind bug


All 13 comments

I don't know why, but I can get a wildcard certificate for *.a.foobar.com and a.foobar.com.

Same issue here, for a Cloud DNS subdomain delegated to another Cloud DNS zone. I am able to e.g. dig an A record in the subdomain so the delegate NS record is definitely set up properly.

This appears to be the same issue described at the bottom of https://github.com/jetstack/cert-manager/issues/728 - is this a regression? I would try v0.4.1 but it doesn't appear to be hosted on the jetstack helm repo any more:

$ helm search -l jetstack/cert-manager
NAME                    CHART VERSION   APP VERSION     DESCRIPTION
jetstack/cert-manager   v0.7.0          v0.7.0          A Helm chart for cert-manager
jetstack/cert-manager   v0.7.0-beta.0   v0.7.0-beta.0   A Helm chart for cert-manager
jetstack/cert-manager   v0.7.0-alpha.1  v0.7.0-alpha.0  A Helm chart for cert-manager
jetstack/cert-manager   v0.6.0          v0.6.0          A Helm chart for cert-manager
jetstack/cert-manager   v0.5.2          v0.5.2          A Helm chart for cert-manager

Edit: perhaps it took some time for the records to propagate, but it looks like it's working for me now. Not sure what else it could be; the only other thing I changed was adding the following flags, because I'm on a split-horizon DNS setup:
--dns01-recursive-nameservers=8.8.8.8:53,8.8.4.4:53 --dns01-recursive-nameservers-only=true
Now my other issue is that cert-manager tries to update the wrong zone ID, but that's an unrelated problem...
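
For Helm installs, both flags from this comment can probably be passed through the chart's extraArgs value, the same value used in the helm command in this thread. A minimal sketch, assuming Helm's --set list syntax, where the comma inside the nameserver list has to be escaped with a backslash so that it is not read as a list separator. The function name is hypothetical; the release name, namespace, and chart version are copied from that helm command.

```shell
# Hypothetical wrapper; nothing runs until the function is called.
upgrade_with_public_resolvers() {
  # '\,' keeps both resolver addresses inside one list element;
  # the unescaped ',' separates the two cert-manager flags.
  helm upgrade --install cert-manager jetstack/cert-manager \
    --wait \
    --version v0.7.2 \
    --namespace cert-manager \
    --set 'extraArgs={--dns01-recursive-nameservers=8.8.8.8:53\,8.8.4.4:53,--dns01-recursive-nameservers-only=true}'
}
```

Single-quoting the --set value also keeps the shell from interpreting the braces or the backslash.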

I also encountered the same issue when using a Cloud DNS domain, which delegates to another CloudDNS subdomain. In my case, I was able to get everything working with only this flag: --set extraArgs={--dns01-recursive-nameservers-only=true}. Here's an example of the complete helm script:

helm upgrade --install \
  --wait \
  --version v0.7.2 \
  --set extraArgs={--dns01-recursive-nameservers-only=true} \
  --namespace cert-manager \
  "cert-manager" \
  jetstack/cert-manager

Hope this helps anyone else that encounters this!

Interestingly, this worked fine in the first project I used it in, project-1.example.com. Then, when I used the same parent domain for a second project, project-2.example.com, I got this error in project 2.

I am still debugging and making sure I haven't missed something. I can't think why this would change anything.

I need to try to recreate this, but deleting the zone in the other project and creating a new subdomain (to get around any DNS caching) worked as expected.

I have the same issue with route53:

  • foobar.com. is managed elsewhere (not by me)
  • a.foobar.com is managed in my route53

cert-manager's log:

I0804 10:01:26.673949       1 dns.go:101] Presenting DNS01 challenge for domain "a.foobar.com"
E0804 10:01:26.763464       1 controller.go:215] cert-manager/controller/challenges "msg"="re-queuing item  due to error processing" "error"="Failed to determine Route 53 hosted zone ID: Zone foobar.com. not found in Route 53 for domain _acme-challenge.a.foobar.com." "key"="prod/a-foobar-com-cert-3307869370-0"

challenge status:

Status:
  Presented:   false
  Processing:  true
  Reason:      Failed to determine Route 53 hosted zone ID: Zone foobar.com. not found in Route 53 for domain _acme-challenge.a.foobar.com.
  State:       pending
Events:
  Type     Reason        Age                  From          Message
  ----     ------        ----                 ----          -------
  Normal   Started       3m16s                cert-manager  Challenge scheduled for processing
  Warning  PresentError  50s (x6 over 3m16s)  cert-manager  Error presenting challenge: Failed to determine Route 53 hosted zone ID: Zone foobar.com. not found in Route 53 for domain _acme-challenge.a.foobar.com.

cert-manager v0.8.1

So cert-manager doesn't take your FQDN blindly and try to manage the zone for you with CloudDNS. What it does is it takes your FQDN and then searches from left to right for a SOA record.

So if you have a root domain cars.com hosted in Route53, with an NS record pointing to cool.cars.com hosted in CloudDNS, both have SOA records. Cert manager is searching for the SOA of cool.cars.com first, but for some reason cert-manager skips it and sees cars.com SOA and then stops. This definitely worked in the past from what I've seen.

It's this block of code that causes the issue:

https://github.com/jetstack/cert-manager/blame/79711c5e3454b846fd661ecc2b5788a8efb7a920/pkg/issuer/acme/dns/util/wait.go#L313-L349

  1. It splits your domain into labels.
  2. It searches through the candidate domains from left to right.
  3. It tries to find an SOA record for each one.
  4. If an SOA record appears, it tries to find the matching managed zone; if that zone exists, everything is OK.
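
The steps above can be approximated from a shell to see which zone a resolver reports for a given name. This is a rough sketch, not cert-manager's Go code: it assumes dig is installed, treats any non-empty answer to an SOA query as "zone found", and does not handle CNAMEs. The function names candidate_zones and find_zone_apex are hypothetical.

```shell
# Print every candidate zone for an FQDN, most specific first.
candidate_zones() {
  local name
  name=${1%.}                      # strip a trailing dot if present
  while [ -n "$name" ]; do
    printf '%s.\n' "$name"
    case "$name" in
      *.*) name=${name#*.} ;;      # drop the leftmost label
      *)   name="" ;;              # last label reached
    esac
  done
}

# Print the first candidate whose SOA query returns an answer.
find_zone_apex() {
  local zone
  for zone in $(candidate_zones "$1"); do
    if [ -n "$(dig +short SOA "$zone")" ]; then
      printf '%s\n' "$zone"
      return 0
    fi
  done
  return 1
}
```

For example, find_zone_apex _acme-challenge.x.a.foobar.com should print the delegated zone if the resolver sees its SOA. If it prints the parent zone instead, the resolver in use is not seeing the child zone, which would match the error in this issue.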

So why does this code skip the first SOA? Well in our case it was because we had _multiple_ zones we were delegating and I put my NS record of interest in the wrong one.

For example, we have zones cars.com, cool.cars.com, and really.cool.cars.com. cool.cars.com has an NS record in cars.com which makes it an authority for cool.cars.com. I incorrectly put the NS record for really.cool.cars.com in cars.com. This creates a conflict because:

  • cars.com is an SOA for really.cool.cars.com
  • cool.cars.com is an SOA for cool.cars.com

So when we were querying for red.really.cool.cars.com we would sometimes get an SOA record starting at cool.cars.com (which didn't have our NS record to really.cool.cars.com) and sometimes we would get the SOA for really.cool.cars.com correctly.

The way to correct this was to move the really.cool.cars.com NS record from cars.com to cool.cars.com.

cars.com -> NS -> cool.cars.com -> really.cool.cars.com

when before we had

cars.com -> NS -> really.cool.cars.com
                |-> NS -> cool.cars.com

The way to debug this is to just dig really.cool.cars.com several times and see if you get the same SOA record.
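
A small loop makes the repeated dig easier to read. This is a sketch under a few assumptions: dig is installed, the SOA is queried directly with +short, and the function name soa_variants is hypothetical. A caching resolver may hide the inconsistency, so more tries or a different resolver may be needed.

```shell
# Query the SOA for a name several times and count the distinct answers.
soa_variants() {
  local name tries i answer
  name=$1
  tries=${2:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    answer="$(dig +short SOA "$name")"
    # Record empty answers too, so they show up in the tally.
    printf '%s\n' "${answer:-no SOA in answer}"
    i=$((i + 1))
  done | sort | uniq -c
}
```

Running soa_variants really.cool.cars.com 10 prints one line per distinct answer with its count; more than one line suggests the delegation is ambiguous.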

Just on my 403 issue: there is an internal tracking bug at Google, and I am just trying to get the exact replication steps. It seems that you can only use a service account once; if you use the same service account with a new key, the Cloud DNS API won't accept the call. If you give the service account a different name, or clear all of its keys, it seems to work.

I can check your code in a bit, but if you could let me know which version of the API you are using that would be a great help.

I'm running into this as well, but when trying to generate a wildcard certificate.

I have two zones in Google Cloud:

stage.mydomain.com
mydomain.com

I can generate a wildcard fine for:

*.mydomain.com

but it fails for:

*.stage.mydomain.com

I think it's adding the entry in the wrong zone or something? Is there a way to fix this?

Adding the TXT record to the correct zone generates the cert successfully, but I had to move it manually.
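
To see where the challenge record actually landed, the TXT record can be queried directly, optionally against one specific name server. A minimal sketch, assuming dig is installed; the function name challenge_txt is hypothetical, and the name server argument is whatever server your zone lists, not a value from this thread.

```shell
# Look up the DNS01 challenge TXT record for a domain.
# A leading "*." is dropped, because the challenge record for a wildcard
# lives at _acme-challenge.<base domain>.
challenge_txt() {
  local domain server
  domain=${1#\*.}
  server=${2:-}
  if [ -n "$server" ]; then
    dig +short TXT "_acme-challenge.${domain}" "@${server}"
  else
    dig +short TXT "_acme-challenge.${domain}"
  fi
}
```

An empty result from the name servers of the zone you expect, while the challenge is pending, suggests the record was written to a different zone.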

Maybe I don't need two zones for the same domain? I have a zone for the domain and then also a zone for the subdomain. Maybe I need 1 zone for both the domain and the subdomain.

https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/util/wait.go#L306

If cert-manager performs the traversal while the zones are not yet fully set up, it may resolve to the wrong FQDN and cache that result until the process terminates. So if you get the strange "No matching GoogleCloud domain found for domain" error, try deleting the cert-manager pod and waiting for it to respawn.

IMO, the cache implementation should have a TTL for each entry.
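
One way to restart the pod is sketched here. It assumes the Deployment is named cert-manager in the cert-manager namespace (inferred from the pod name in the logs of this issue) and that kubectl is version 1.15 or newer, which is where rollout restart is available. On older clients, deleting the pod by name has the same effect. The function name is hypothetical.

```shell
# Restart cert-manager so that its in-memory state is rebuilt.
# Usage: restart_cert_manager [namespace] [deployment]
restart_cert_manager() {
  local ns deploy
  ns=${1:-cert-manager}
  deploy=${2:-cert-manager}
  kubectl rollout restart "deployment/${deploy}" --namespace "$ns" &&
    kubectl rollout status "deployment/${deploy}" --namespace "$ns"
}
```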

Thanks! Restarting the cert-manager pod solved my issue.
In my case, I created a new DNS zone in one GCP project, then deployed cert-manager, and only after that added the delegation to that zone in another project. So cert-manager kept throwing this error until I restarted it.

As @Freyert points out very well in https://github.com/jetstack/cert-manager/issues/1507#issuecomment-547876015, I think this issue can be resolved by properly configuring your DNS hierarchy to point to the correct zone. I don't think there's an inherent issue in our resolution logic, rather it is working as intended.

/area acme/dns01
