Caddy: Add feature to make DNS request over TCP instead of UDP for DNS challenge

Created on 5 Aug 2020  ·  32Comments  ·  Source: caddyserver/caddy

1. Environment

1a. Operating system and version

Ubuntu Server 20.04

1b. Caddy version (run caddy version or paste commit SHA)

v2.1.1 h1:X9k1+ehZPYYrSqBvf/ocUgdLSRIuiNiMo7CvyGUQKeA=

1c. Go version (if building Caddy from source; run go version)

go version go1.14.6 linux/amd64a

1d. Minimal Caddyfile

licdb.sonnenwagen.org {
  respond "Welcome to TLS"
  tls {
    dns cloudflare ###YOUR-KEY###
  }
}

2. Description

2a. What happens (briefly explain what is wrong)

I've tried setting up the DNS challenge with Cloudflare (build using xcaddy) as it is mentioned in the wiki.

The problem is that I'm behind a corporate firewall that has it's own DNS server and blocks all inbound UDP traffic.
That means, that requests like dig -t TXT @8.8.8.8 google.com time out (except for the company own DNS server).
I've verified that this request doesn't work on any other PC in the same network, which means its no configuration issue on my side.
Because of that Caddy is unable to check that the DNS entry for the acme challenge exist.

2c. Log output

2020/08/05 19:16:52 [INFO] [licdb.sonnenwagen.org] acme: Waiting for DNS record propagation.
2020/08/05 19:16:54 [INFO] [licdb.sonnenwagen.org] acme: Cleaning DNS-01 challenge
2020/08/05 19:16:56 [INFO] Deactivating auth: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/89380421
2020/08/05 19:16:56 [ERROR] error: one or more domains had a problem:
[licdb.sonnenwagen.org] time limit exceeded: last error: read udp 137.226.33.170:59178->108.162.193.233:53: i/o timeout (challenge=dns-01 remaining=[])

2d. Workaround(s)

The fix is fairly easy. Using TCP instead of UDP for the dig request works perfectly (even for cloudflare DNS server):

dig -t TXT @8.8.8.8 +tcp google.com

It would be nice to have an option in the Caddyfile to enable DNS lookup over TCP instead of UDP

3. Tutorial (minimal steps to reproduce the bug)

I've no idea how to setup your firewall to block that kind of traffic, but if you do that, you'll have this issue.

feature request upstream

Most helpful comment

Thanks for the issue! And for your willingness to help with finding a fix.

We just barely replaced our underlying ACME implementation (including the DNS challenge solver) so things are a little different on master than they are in 2.1.1.

Our cert management library is CertMagic: https://github.com/caddyserver/certmagic

Which in turn uses ACMEz now: https://github.com/mholt/acmez

Although, I think for this feature, ACMEz is irrelevant, since this doesn't have anything to do with ACME itself. But when we switched from lego we had to bring in-house a bunch of DNS functions, which now live in CertMagic: https://github.com/caddyserver/certmagic/blob/master/dnsutil.go

The solvers.go file in CertMagic does authoritative DNS lookups to wait for propagation (without this, DNS challenges often fail immediately): https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/solvers.go#L307-L336

So, this might actually be a fix that needs to go in CertMagic, and the only change needed in Caddy would be a way to configure DNS lookups using TCP. What do you think?

Edit: actually, I don't really know what's going on yet (I'd need to look into it more closely when I have time), but the code we brought in from lego might already be using TCP if UDP fails? https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/dnsutil.go#L146-L147 (I like the idea of an automatic failover rather than exposing another config parameter)

All 32 comments

Update:
I assume that you use DialContext in your code in order to specify a custom DNS server. I've found that replacing UDP with TCP in that context works. If you can just pinpoint me into the right direction, where you make the DNS request, I might be able to change it myself for a short fix on my side.

Sample functioning go file

package main

import (
    "context"
    "net"
    "time"
)

func main() {
    r := &net.Resolver{
        PreferGo: true,
        Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
            d := net.Dialer{
                Timeout: time.Millisecond * time.Duration(10000),
            }
            return d.DialContext(ctx, "tcp", "8.8.8.8:53")
        },
    }
    ip, _ := r.LookupHost(context.Background(), "google.com")

    print(ip[0])
}

Thanks for the issue! And for your willingness to help with finding a fix.

We just barely replaced our underlying ACME implementation (including the DNS challenge solver) so things are a little different on master than they are in 2.1.1.

Our cert management library is CertMagic: https://github.com/caddyserver/certmagic

Which in turn uses ACMEz now: https://github.com/mholt/acmez

Although, I think for this feature, ACMEz is irrelevant, since this doesn't have anything to do with ACME itself. But when we switched from lego we had to bring in-house a bunch of DNS functions, which now live in CertMagic: https://github.com/caddyserver/certmagic/blob/master/dnsutil.go

The solvers.go file in CertMagic does authoritative DNS lookups to wait for propagation (without this, DNS challenges often fail immediately): https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/solvers.go#L307-L336

So, this might actually be a fix that needs to go in CertMagic, and the only change needed in Caddy would be a way to configure DNS lookups using TCP. What do you think?

Edit: actually, I don't really know what's going on yet (I'd need to look into it more closely when I have time), but the code we brought in from lego might already be using TCP if UDP fails? https://github.com/caddyserver/certmagic/blob/c6afa6e7a22e4b3035cc7b86377f83a421f0c3af/dnsutil.go#L146-L147 (I like the idea of an automatic failover rather than exposing another config parameter)

The code you mentioned is a failover but of a different kind. UDP request only support a maximum return length of 512 bytes and is truncated afterwards. The full answer can then be retrieved using TCP.

dig & co all use this mechanism.

I don't know if it would make sense to have an automatic switch to TCP if UDP didn't succeed. My case seems very rare and most times it will double the timeout times if the DNS server really isn't available as it now does two requests instead of one

Edit: Some info
Why we prefer UDP: https://stackoverflow.com/a/40063433/5759814
Why is there a failover code: https://serverfault.com/a/404843

I would even suggest something else: We could just do the complete request over TCP and remove the failover code. When checking for the existence of the entry we are not interested in speed, but reliability and we also don't have issues with overloading the DNS server, because it all happens so rarely. A firewall that blocks TCP but not UDP traffic seems to be very unlikely to me (but what do I know about firewalls)

Just a small question that has nothing to do with this issue: If I want to compile certmagic from my own repo and use it in Caddy, how would go with that? Can I specify the path using xcaddy, do I need to change the dependencies, …?

I don't know if it would make sense to have an automatic switch to TCP if UDP didn't succeed. My case seems very rare and most times it will double the timeout times if the DNS server really isn't available as it now does two requests instead of one

Is this such a bad cost though for the simplicity and functionality wins? If the DNS server really is unavailable, the DNS challenge will fail anyways, so how much will an extra few milliseconds hurt?

I would even suggest something else: We could just do the complete request over TCP and remove the failover code. When checking for the existence of the entry we are not interested in speed, but reliability and we also don't have issues with overloading the DNS server, because it all happens so rarely. A firewall that blocks TCP but not UDP traffic seems to be very unlikely to me (but what do I know about firewalls)

We could, but if UDP will work 90-99% of the time (just guessing), we might as well try it first, and then we can use TCP if it fails? Rather than slowing things down for everyone?

Just a small question that has nothing to do with this issue: If I want to compile certmagic from my own repo and use it in Caddy, how would go with that? Can I specify the path using xcaddy, do I need to change the dependencies, …?

Use a replace directive: https://github.com/golang/go/wiki/Modules#when-should-i-use-the-replace-directive

In your go.mod it would be something like replace github.com/caddyserver/certmagic => /path/to/yours

With xcaddy, it's similar with the --replace flag: --replace github.com/caddyserver/certmagic=/path/to/yours (note we don't use > so as to avoid accidental shell redirection)

If I've your okay I would do the change. I would just add a case in the if statement in dnsutil.go:146 to include the case, where err is a timeout.

Shouldn't be too hard.

Sure! Sounds simple and effective, without exposing any more config surface. Thanks!

Update. I'm unable to debug the code at dnsutil.go:146 because it isn't the one being triggered (even with caddy and certmanager on current commit of master)...
I don't know if it would be a good idea to allow there for every err with err != nil and I've no idea how to know if the err is a timeout or not / how it is structured.

Hmm, so we'll need to figure out where the error you're experiencing comes from, right? The error comes from the Go std lib... might need to add some simple tracing (e.g. return fmt.Errorf("doing this or that: %w", err)) to pinpoint it.

I know where it comes from (I've searched the go modules folder with recursive grep. I ❤️ the shell)

The error messages are generated by the www.github.com/go-acme/lego module in challenges/dns01/dns_challenge.go

@Gansgar Ah, okay. FYI we no longer use lego -- make sure you're using the latest code from the master branch of both caddy and certmagic (we did borrow some of their DNS code though, so you'll find the code in dnsutil.go pretty similar to what's in lego).

Yeah I'm right now a bit stunned. After cleaning the pkg folder lego wasn't downloaded anymore with the relevant code. (go-acme is still in the go.sum though)

I've discovered that you're using a similar phrase in certmagic, but changing this code piece doesn't change the outcome and the printf is also a bit different. I've no idea how that all works rn. I'm even doing an md5 of my caddy file after each build to verify that I actually use the new version

Do a go mod tidy and go-acme/lego should no longer appear in go.sum.

I've discovered that you're using a similar phrase in certmagic, but changing this code piece doesn't change the outcome and the printf is also a bit different.

Hmm, try putting some traces in higher up in the code until they disappear, that's how you can know where the code path diverges from what you expected.

Okay. I've found the one responsible. If I remove the plugin "github.com/caddy-dns/cloudflare" then "lego" disappears in the "go.sum".

It might well be, because caddy v2.0.0 is a requirement in the go.mod inside the cloudflare module:
https://github.com/caddy-dns/cloudflare/blob/master/go.mod#L6

@Gansgar Ah, I guess that needs to be updated. I've just pushed a commit to that repo that should use the latest versions of its two dependencies (although only the caddy one mattered I think).

Got change. Lego isn't used anymore

But just a small question regarding the go replace statement: It doesn't work recursivly. Doesn't it?

So when add a replace statement in caddy/go.mod and cloudflare/go.mod revers back to caddy, it now doesn't use my local caddy/go.mod but the one in the caddy git. Is that correct?

But just a small question regarding the go replace statement: It doesn't work recursivly. Doesn't it?

Correct, it does not. It only makes the replacement for that module directly. I seem to recall having to set up two or three replace statements sometimes... kind of annoying :( (but I can see why it is that way)

Yeah. I'm right now on my trip in recursing through the whole dependency tree and chaning almost everything...

But I'm not giving up ^^

Okay. So the thing is, I've now add the line return d.Err("missing YOU token") inside https://github.com/caddy-dns/cloudflare/blob/master/cloudflare.go#L31 in order to crash the parser. I know that the statement is inside the compiled file through grep, but the parsing still doesn't crash.

But go.sum in caddy doesn't include the cloudflare repo. But it must be imported somewhere.

It's like, if the repo github.com/caddy-dns/cloudflare has nothing to do with the actual cloudflare dns check and is just a stud

(You shouldn't need more than a few replace statements -- if you need more than... 3.... maybe 4... that's probably more than I've ever set up. Just so you know.)

Another great sanity check is to add a line of code like lksajdfl;ksjdflkjsdflkjsdflkjlkj and see what happens during compilation.

Yeah. it says that is invalid during compilation. As expected

The thing is, that nobody (certmagic, libdns/cloudflare nor caddy) has any strings matching the log output. So it must be somewhere else.

I've also tried to include panic(42) everywhere in order to crash the program, but nada.

Current guess: The actuall DNS Challenge is handled by somebody else, but not by caddy code, even though you want it to handle it correctly.

Hmm, that doesn't sound right -- it should definitely be handled (meaning, the TXT record is created and deleted) by libdns, which is wrapped by caddy-dns/cloudflare.

Anyway, I'm away from the office for a bit -- but I'm rooting for you!

Thanks. Have a good break.

I'm rn sitting here at 2:30 am 😂

Get some sleep, then come back to this later. Definitely.

Oh

I calculated the correct md5. But I always executed the global caddy software (had it installed with apt) and not the local one.

There's no facepalm big enough for that.

Look at that. It now actually crashes when I add a panic(42) in the sendDNSQuery...

At least I've now a story more that makes a good laugh.

Edit: Still the correction of cloudflare repo was correct. I don't know if it would have changed something in the outcome, but it removed the lego dependency.

Your first idea with dnsutil hit bullseye. Changing UDP there to TCP makes for a successful license creation.

See https://github.com/caddyserver/certmagic/pull/81#issue-464921972

Edit: It works now. Thank you a lot

I calculated the correct md5. But I always executed the global caddy software (had it installed with apt) and not the local one.

There's no facepalm big enough for that.

😅 I've done that a few times too. It's an easy mistake to make!

Fixed upstream, thanks for the PR! Nice work. :)

Was this page helpful?
0 / 5 - 0 ratings