Many organizations which host their own DNS servers use split-DNS resolution: internal recursive resolvers return different results than external resolvers. This means that attempts to pre-check the TXT records for a DNS01 challenge using recursive resolvers from /etc/resolv.conf will fail to return results available on external authoritative servers. It also means that even the identity of the authoritative nameservers for the domain may be different.
As a result, the PreCheckDNS func will fail in these scenarios. What is needed is an option to disable pre-check on DNS01 challenges altogether.
As a workaround, if you are using acme as a library, you can substitute your own implementation of acme.PreCheckDNS (godoc) that does nothing. Not optimal, and not useful for the CLI, but it may work for you.
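A minimal sketch of that workaround, for illustration only. It assumes acme.PreCheckDNS is a package-level variable holding a function of the form func(fqdn, value string) (bool, error); to stay self-contained it uses a local stand-in named preCheckDNS instead of importing the library, and defaultPreCheck is a hypothetical placeholder for the real check. Verify the exact type against the godoc for your version.

```go
package main

import (
	"errors"
	"fmt"
)

// preCheckFunc is the ASSUMED shape of the acme.PreCheckDNS hook.
type preCheckFunc func(fqdn, value string) (bool, error)

// defaultPreCheck is a hypothetical stand-in for the library's real
// propagation check, which fails behind a split-DNS resolver.
func defaultPreCheck(fqdn, value string) (bool, error) {
	return false, errors.New("could not determine authoritative nameservers")
}

// noopPreCheck reports the record as propagated without doing any lookup.
func noopPreCheck(fqdn, value string) (bool, error) {
	return true, nil
}

// preCheckDNS is a local stand-in for the acme.PreCheckDNS variable.
var preCheckDNS preCheckFunc = defaultPreCheck

func main() {
	// With the real library this would read: acme.PreCheckDNS = noopPreCheck
	preCheckDNS = noopPreCheck

	ok, err := preCheckDNS("_acme-challenge.example.com.", "token-digest")
	fmt.Println(ok, err) // true <nil>
}
```

Note that with the check disabled nothing waits for propagation any more, so the challenge may be submitted before the provider serves the record.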
I never did like a pre check at all because of this. Remind me again why we don't simply let the acme challenge itself tell us whether it was successful?
The problem is that almost all cloud DNS providers have a delay between when their API accepts a change and when the nameservers actually serve the correct records. This can be a few seconds, or potentially more if their infrastructure is "less optimized".
The safe thing is to pre-verify that the nameservers serve the correct record before asking the ACME server to validate, because the ACME server will just check once and fail.
@captncraig the problem here (see #544) is that the actual verification is just not working well enough.
In my case, LEGO makes far too many mistakes in guessing the SOA record / primary DNS and in the end fails at it.
Meanwhile, both reference implementations, certbot and acme.sh, work flawlessly in both environments. So what I have right now is a LEGO lib that is broken in several ways for DNS-01 in exactly those cases where you need it most:
I guess most other people would/can use HTTP without any issue, so they have no need for a cloud DNS provider or anything else.
For me LEGO is breaking the core aspects of what DNS-01 is needed for, and that's really a bummer. I would really love to see how we could fix that, maybe just:
Proposal:
Go back to the roots of the reference implementations: let the user define a static, safe timeout (e.g. 60 seconds), then tell LE to "check for the TXT record" after that timeout. Since LE is not using any private DNS servers, it will never fail to determine the SOA record publicly, and thus the whole process will never fail (and it does not for certbot/acme.sh).
The downside is waiting longer than may actually be needed, due to the "safe timeout".
Alternatively, make it possible to bootstrap LEGO with both strategies: the current one (default?) and the new one, "timeout based / no DNS precheck".
Sadly running into this as well, with a K8s cluster running Traefik where the virtual machines use a custom vnet with its own DNS. It does create the TXT records in Azure DNS, but attempts to verify them against the DNS in the custom vnet instead.
I fixed this using typetransparent DNS servers. While this sounds dead easy, it's not; a lot of other aspects become way more complex in this mode.
"Having more private zones you forward to private DNS servers needs changes"
You will need to have stub-zones with first: yes.
If you have several private zones and one of them is authoritative, this is what gets NS lookups to work: a forward-zone will stop when it cannot resolve the entry (which is the case for NS) and will not recurse to a public DNS server. With first: yes, the lookup fails, your initial DNS server then tries to resolve it itself, cannot, and because of typetransparent resolves it using a public DNS server and finds the SOA records.
Conclusion
This is the single biggest issue with LEGO, and it holds me back from recommending this library or Traefik to anybody, especially a company. None of this has any real impact on private usage or simple networks, but for any bigger company it is a roadblock.
Not sure why LEGO's implementation tries to be extra clever here and ends up being extra broken in exactly those environments where DNS-01 is really mandatory: corporate private networks.
Hopes
Just remove this entirely and use the same approach as the other reference implementations of ACME, be it certbot, acme.sh or others. Those work in corporate networks without any flaws or extra configuration.
Closed by #700