Ubuntu Server 20.04
caddy version
or paste commit SHA)v2.1.1 h1:X9k1+ehZPYYrSqBvf/ocUgdLSRIuiNiMo7CvyGUQKeA=
go version
)go version go1.14.6 linux/amd64a
licdb.sonnenwagen.org {
respond "Welcome to TLS"
tls {
dns cloudflare ###YOUR-KEY###
}
}
I've tried setting up the DNS challenge with Cloudflare (build using xcaddy
) as it is mentioned in the wiki.
The problem is that I'm behind a corporate firewall that has it's own DNS server and blocks all inbound UDP traffic.
That means, that requests like dig -t TXT @8.8.8.8 google.com
time out (except for the company own DNS server).
I've verified that this request doesn't work on any other PC in the same network, which means its no configuration issue on my side.
Because of that Caddy is unable to check that the DNS entry for the acme challenge exist.
2020/08/05 19:16:52 [INFO] [licdb.sonnenwagen.org] acme: Waiting for DNS record propagation.
2020/08/05 19:16:54 [INFO] [licdb.sonnenwagen.org] acme: Cleaning DNS-01 challenge
2020/08/05 19:16:56 [INFO] Deactivating auth: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/89380421
2020/08/05 19:16:56 [ERROR] error: one or more domains had a problem:
[licdb.sonnenwagen.org] time limit exceeded: last error: read udp 137.226.33.170:59178->108.162.193.233:53: i/o timeout (challenge=dns-01 remaining=[])
The fix is fairly easy. Using TCP instead of UDP for the dig request works perfectly (even for cloudflare DNS server):
dig -t TXT @8.8.8.8 +tcp google.com
It would be nice to have an option in the Caddyfile to enable DNS lookup over TCP instead of UDP
I've no idea how to setup your firewall to block that kind of traffic, but if you do that, you'll have this issue.
Update:
I assume that you use DialContext
in your code in order to specify a custom DNS server. I've found that replacing UDP with TCP in that context works. If you can just pinpoint me into the right direction, where you make the DNS request, I might be able to change it myself for a short fix on my side.
Sample functioning go file
package main
import (
"context"
"net"
"time"
)
func main() {
r := &net.Resolver{
PreferGo: true,
Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
d := net.Dialer{
Timeout: time.Millisecond * time.Duration(10000),
}
return d.DialContext(ctx, "tcp", "8.8.8.8:53")
},
}
ip, _ := r.LookupHost(context.Background(), "google.com")
print(ip[0])
}
Thanks for the issue! And for your willingness to help with finding a fix.
We just barely replaced our underlying ACME implementation (including the DNS challenge solver) so things are a little different on master than they are in 2.1.1.
Our cert management library is CertMagic: https://github.com/caddyserver/certmagic
Which in turn uses ACMEz now: https://github.com/mholt/acmez
Although, I think for this feature, ACMEz is irrelevant, since this doesn't have anything to do with ACME itself. But when we switched from lego we had to bring in-house a bunch of DNS functions, which now live in CertMagic: https://github.com/caddyserver/certmagic/blob/master/dnsutil.go
The solvers.go file in CertMagic does authoritative DNS lookups to wait for propagation (without this, DNS challenges often fail immediately): https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/solvers.go#L307-L336
So, this might actually be a fix that needs to go in CertMagic, and the only change needed in Caddy would be a way to configure DNS lookups using TCP. What do you think?
Edit: actually, I don't really know what's going on yet (I'd need to look into it more closely when I have time), but the code we brought in from lego might already be using TCP if UDP fails? https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/dnsutil.go#L146-L147 (I like the idea of an automatic failover rather than exposing another config parameter)
The code you mentioned is a failover but of a different kind. UDP request only support a maximum return length of 512 bytes and is truncated afterwards. The full answer can then be retrieved using TCP.
dig & co all use this mechanism.
I don't know if it would make sense to have an automatic switch to TCP if UDP didn't succeed. My case seems very rare and most times it will double the timeout times if the DNS server really isn't available as it now does two requests instead of one
Edit: Some info
Why we prefer UDP: https://stackoverflow.com/a/40063433/5759814
Why is there a failover code: https://serverfault.com/a/404843
I would even suggest something else: We could just do the complete request over TCP and remove the failover code. When checking for the existence of the entry we are not interested in speed, but reliability and we also don't have issues with overloading the DNS server, because it all happens so rarely. A firewall that blocks TCP but not UDP traffic seems to be very unlikely to me (but what do I know about firewalls)
Just a small question that has nothing to do with this issue: If I want to compile certmagic from my own repo and use it in Caddy, how would go with that? Can I specify the path using xcaddy, do I need to change the dependencies, …?
I don't know if it would make sense to have an automatic switch to TCP if UDP didn't succeed. My case seems very rare and most times it will double the timeout times if the DNS server really isn't available as it now does two requests instead of one
Is this such a bad cost though for the simplicity and functionality wins? If the DNS server really is unavailable, the DNS challenge will fail anyways, so how much will an extra few milliseconds hurt?
I would even suggest something else: We could just do the complete request over TCP and remove the failover code. When checking for the existence of the entry we are not interested in speed, but reliability and we also don't have issues with overloading the DNS server, because it all happens so rarely. A firewall that blocks TCP but not UDP traffic seems to be very unlikely to me (but what do I know about firewalls)
We could, but if UDP will work 90-99% of the time (just guessing), we might as well try it first, and then we can use TCP if it fails? Rather than slowing things down for everyone?
Just a small question that has nothing to do with this issue: If I want to compile certmagic from my own repo and use it in Caddy, how would go with that? Can I specify the path using xcaddy, do I need to change the dependencies, …?
Use a replace
directive: https://github.com/golang/go/wiki/Modules#when-should-i-use-the-replace-directive
In your go.mod it would be something like replace github.com/caddyserver/certmagic => /path/to/yours
With xcaddy, it's similar with the --replace
flag: --replace github.com/caddyserver/certmagic=/path/to/yours
(note we don't use >
so as to avoid accidental shell redirection)
If I've your okay I would do the change. I would just add a case in the if statement in dnsutil.go:146 to include the case, where err is a timeout.
Shouldn't be too hard.
Sure! Sounds simple and effective, without exposing any more config surface. Thanks!
Update. I'm unable to debug the code at dnsutil.go:146 because it isn't the one being triggered (even with caddy and certmanager on current commit of master)...
I don't know if it would be a good idea to allow there for every err
with err != nil
and I've no idea how to know if the err
is a timeout or not / how it is structured.
Hmm, so we'll need to figure out where the error you're experiencing comes from, right? The error comes from the Go std lib... might need to add some simple tracing (e.g. return fmt.Errorf("doing this or that: %w", err)
) to pinpoint it.
I know where it comes from (I've searched the go modules folder with recursive grep. I ❤️ the shell)
The error messages are generated by the www.github.com/go-acme/lego module in challenges/dns01/dns_challenge.go
@Gansgar Ah, okay. FYI we no longer use lego -- make sure you're using the latest code from the master branch of both caddy and certmagic (we did borrow some of their DNS code though, so you'll find the code in dnsutil.go
pretty similar to what's in lego).
Yeah I'm right now a bit stunned. After cleaning the pkg folder lego wasn't downloaded anymore with the relevant code. (go-acme is still in the go.sum though)
I've discovered that you're using a similar phrase in certmagic, but changing this code piece doesn't change the outcome and the printf is also a bit different. I've no idea how that all works rn. I'm even doing an md5 of my caddy file after each build to verify that I actually use the new version
Do a go mod tidy
and go-acme/lego
should no longer appear in go.sum.
I've discovered that you're using a similar phrase in certmagic, but changing this code piece doesn't change the outcome and the printf is also a bit different.
Hmm, try putting some traces in higher up in the code until they disappear, that's how you can know where the code path diverges from what you expected.
Okay. I've found the one responsible. If I remove the plugin "github.com/caddy-dns/cloudflare" then "lego" disappears in the "go.sum".
It might well be, because caddy v2.0.0 is a requirement in the go.mod
inside the cloudflare module:
https://github.com/caddy-dns/cloudflare/blob/master/go.mod#L6
@Gansgar Ah, I guess that needs to be updated. I've just pushed a commit to that repo that should use the latest versions of its two dependencies (although only the caddy one mattered I think).
Got change. Lego isn't used anymore
But just a small question regarding the go replace statement: It doesn't work recursivly. Doesn't it?
So when add a replace statement in caddy/go.mod
and cloudflare/go.mod
revers back to caddy
, it now doesn't use my local caddy/go.mod
but the one in the caddy git. Is that correct?
But just a small question regarding the go replace statement: It doesn't work recursivly. Doesn't it?
Correct, it does not. It only makes the replacement for that module directly. I seem to recall having to set up two or three replace statements sometimes... kind of annoying :( (but I can see why it is that way)
Yeah. I'm right now on my trip in recursing through the whole dependency tree and chaning almost everything...
But I'm not giving up ^^
Okay. So the thing is, I've now add the line return d.Err("missing YOU token")
inside https://github.com/caddy-dns/cloudflare/blob/master/cloudflare.go#L31 in order to crash the parser. I know that the statement is inside the compiled file through grep, but the parsing still doesn't crash.
But go.sum
in caddy doesn't include the cloudflare repo. But it must be imported somewhere.
It's like, if the repo github.com/caddy-dns/cloudflare has nothing to do with the actual cloudflare dns check and is just a stud
(You shouldn't need more than a few replace statements -- if you need more than... 3.... maybe 4... that's probably more than I've ever set up. Just so you know.)
Another great sanity check is to add a line of code like lksajdfl;ksjdflkjsdflkjsdflkjlkj
and see what happens during compilation.
Yeah. it says that is invalid during compilation. As expected
The thing is, that nobody (certmagic, libdns/cloudflare nor caddy) has any strings matching the log output. So it must be somewhere else.
I've also tried to include panic(42) everywhere in order to crash the program, but nada.
Current guess: The actuall DNS Challenge is handled by somebody else, but not by caddy code, even though you want it to handle it correctly.
Hmm, that doesn't sound right -- it should definitely be handled (meaning, the TXT record is created and deleted) by libdns, which is wrapped by caddy-dns/cloudflare.
Anyway, I'm away from the office for a bit -- but I'm rooting for you!
Thanks. Have a good break.
I'm rn sitting here at 2:30 am 😂
Get some sleep, then come back to this later. Definitely.
Oh
I calculated the correct md5. But I always executed the global caddy software (had it installed with apt) and not the local one.
There's no facepalm big enough for that.
Look at that. It now actually crashes when I add a panic(42)
in the sendDNSQuery
...
At least I've now a story more that makes a good laugh.
Edit: Still the correction of cloudflare repo was correct. I don't know if it would have changed something in the outcome, but it removed the lego dependency.
Your first idea with dnsutil hit bullseye. Changing UDP there to TCP makes for a successful license creation.
See https://github.com/caddyserver/certmagic/pull/81#issue-464921972
Edit: It works now. Thank you a lot
I calculated the correct md5. But I always executed the global caddy software (had it installed with apt) and not the local one.
There's no facepalm big enough for that.
😅 I've done that a few times too. It's an easy mistake to make!
Fixed upstream, thanks for the PR! Nice work. :)
Most helpful comment
Thanks for the issue! And for your willingness to help with finding a fix.
We just barely replaced our underlying ACME implementation (including the DNS challenge solver) so things are a little different on master than they are in 2.1.1.
Our cert management library is CertMagic: https://github.com/caddyserver/certmagic
Which in turn uses ACMEz now: https://github.com/mholt/acmez
Although, I think for this feature, ACMEz is irrelevant, since this doesn't have anything to do with ACME itself. But when we switched from lego we had to bring in-house a bunch of DNS functions, which now live in CertMagic: https://github.com/caddyserver/certmagic/blob/master/dnsutil.go
The solvers.go file in CertMagic does authoritative DNS lookups to wait for propagation (without this, DNS challenges often fail immediately): https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/solvers.go#L307-L336
So, this might actually be a fix that needs to go in CertMagic, and the only change needed in Caddy would be a way to configure DNS lookups using TCP. What do you think?
Edit: actually, I don't really know what's going on yet (I'd need to look into it more closely when I have time), but the code we brought in from lego might already be using TCP if UDP fails? https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/dnsutil.go#L146-L147 (I like the idea of an automatic failover rather than exposing another config parameter)