When consul is configured with a recursor, DNS queries for unrecognized record types are forwarded to the configured recursors. This forwarding also happens for names inside the consul. zone when the query type is anything other than SOA, NS, ANY, A, AAAA, or TXT.
In a particular setup, when dnsmasq is configured to forward consul. to a consul DNS server, and consul is configured to use this dnsmasq server as a recursor, this results in a DNS recursion loop. The dnsmasq configuration is described on hashicorp learn: https://learn.hashicorp.com/consul/security-networking/forwarding#dnsmasq-setup
The documentation states that consul only forwards DNS queries for names outside the consul. zone: https://www.consul.io/docs/agent/options.html#recursors
I would also expect consul to act as an authoritative DNS server for the consul. zone and never forward queries for it to recursors.
A docker setup to reproduce this issue is available in a gist: https://gist.github.com/vierbergenlars/d5877cf8bb076fb5789f47d1ad7039fb
Alternatively:
1. Start consul with -recursor pointing to dnsmasq.
2. Configure dnsmasq to forward the consul. zone to the consul DNS server.
3. Query an unsupported record type for consul.service.consul against either dnsmasq or consul. I used the DNSSEC DS type: dig @127.0.0.1 -p 8600 DS consul.service.consul
4. After dig has finished, DNS requests keep bouncing around between consul and dnsmasq.
Server info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 0
build:
prerelease =
revision = a82e6a7f
version = 1.5.2
consul:
acl = disabled
bootstrap = true
known_datacenters = 1
leader = true
leader_addr = 127.0.0.1:8300
server = true
raft:
applied_index = 71
commit_index = 71
fsm_pending = 0
last_contact = 0
last_log_index = 71
last_log_term = 2
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:a1a793cb-2564-e6ab-8c85-6bc4a3ee35b1 Address:127.0.0.1:8300}]
latest_configuration_index = 1
num_peers = 0
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 2
runtime:
arch = amd64
cpu_count = 4
goroutines = 82
max_procs = 4
os = linux
version = go1.12.1
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 1
event_time = 2
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
From the consul server:
2019/07/20 16:08:42 [ERR] dns: recurse failed: read udp 10.33.10.2:34942->10.33.10.3:53: i/o timeout
2019/07/20 16:08:42 [ERR] dns: all resolvers failed for {consul.service.consul. 43 1} from client 10.33.10.3:571 (udp)
2019/07/20 16:08:42 [DEBUG] dns: request for {consul.service.consul. 43 1} (udp) (2.000994556s) from client 10.33.10.3:571 (udp)
2019/07/20 16:08:42 [ERR] dns: recurse failed: read udp 10.33.10.2:59383->10.33.10.3:53: i/o timeout
2019/07/20 16:08:42 [ERR] dns: all resolvers failed for {consul.service.consul. 43 1} from client 10.33.10.3:46076 (udp)
2019/07/20 16:08:42 [ERR] dns: recurse failed: read udp 10.33.10.2:36895->10.33.10.3:53: i/o timeout
2019/07/20 16:08:42 [ERR] dns: all resolvers failed for {consul.service.consul. 43 1} from client 10.33.10.3:33552 (udp)
2019/07/20 16:08:42 [DEBUG] dns: request for {consul.service.consul. 43 1} (udp) (2.001374786s) from client 10.33.10.3:46076 (udp)
2019/07/20 16:08:42 [DEBUG] dns: request for {consul.service.consul. 43 1} (udp) (2.002385645s) from client 10.33.10.3:33552 (udp)
From dnsmasq:
dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
I believe the problem root to be here: https://github.com/hashicorp/consul/blob/7753b97cc7dd597a7ed3a4fbb14a6906e36ab108/agent/dns.go#L417-L436
We dispatch all query types regardless of whether we should.
I think we could make that default case (and the TypeAXFR) become:
	case dns.TypeA, dns.TypeAAAA, dns.TypeCNAME, dns.TypeSRV, dns.TypeTXT:
		ecsGlobal = d.dispatch(network, resp.RemoteAddr(), req, m)
	default:
		m.SetRcode(req, dns.RcodeNotImplemented)
	}
With that we would handle SOA, NS with the two other cases and then all the rest of the supported types with the d.dispatch call. Any other types will be terminated immediately and return the rcode for not implemented.
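To make the proposed routing concrete, here is a minimal stdlib-only sketch of the decision the switch would make. It is not the Consul code: the route function and the action labels are hypothetical names I made up for illustration, and the type numbers are the standard DNS record type values rather than the constants from the dns package.

```go
package main

import "fmt"

// Standard DNS record type numbers (the dns package exposes these as dns.TypeA, etc.).
const (
	typeA     uint16 = 1
	typeNS    uint16 = 2
	typeCNAME uint16 = 5
	typeSOA   uint16 = 6
	typeTXT   uint16 = 16
	typeAAAA  uint16 = 28
	typeSRV   uint16 = 33
	typeDS    uint16 = 43
)

// action is a hypothetical label for the branch of the switch that would run.
type action string

const (
	answerSOA      action = "soa"
	answerNS       action = "ns"
	dispatch       action = "dispatch"
	notImplemented action = "notimp"
)

// route mirrors the proposed switch: SOA and NS keep their own cases,
// the listed types go to dispatch, and everything else is refused
// immediately instead of being passed on.
func route(qtype uint16) action {
	switch qtype {
	case typeSOA:
		return answerSOA
	case typeNS:
		return answerNS
	case typeA, typeAAAA, typeCNAME, typeSRV, typeTXT:
		return dispatch
	default:
		return notImplemented
	}
}

func main() {
	for _, t := range []uint16{typeA, typeSOA, typeDS} {
		fmt.Printf("qtype %d -> %s\n", t, route(t))
	}
}
```

With this routing, a DS query (type 43) would get the not-implemented rcode right away and would never reach the recursors.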
Unfortunately, it might be a bit more complicated than I initially thought. The issue is only present when querying for DS, not for any other types.
I found this in the upstream package used for the DNS server.
I have no experience in Go, but I think this means that for queries of type DS specifically, the handler for . will always be called if it exists.
https://github.com/miekg/dns/blob/b13675009d59c97f3721247d9efa8914e1866a5b/serve_mux.go#L66-L81
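Here is how I read those lines, as a simplified stdlib-only model rather than the library's actual code. The match function, the zones map and the handler labels are all hypothetical. It assumes consul registers one handler for consul. and, when recursors are configured, another one for the root zone.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typeA  uint16 = 1
	typeDS uint16 = 43
)

// match walks the query name label by label, from the full name towards
// the root. For every type except DS, the first zone that matches wins.
// For DS the walk continues past the match, so a handler registered
// for "." takes precedence if one exists.
func match(zones map[string]string, qname string, qtype uint16) string {
	q := strings.ToLower(qname)
	found := ""
	for {
		if h, ok := zones[q]; ok {
			if qtype != typeDS {
				return h
			}
			found = h // remember it, but keep looking for a parent zone
		}
		i := strings.Index(q, ".")
		if i < 0 || i == len(q)-1 {
			break // no more labels to strip
		}
		q = q[i+1:]
	}
	// Root zone as the last resort (and as the parent for DS queries).
	if h, ok := zones["."]; ok {
		return h
	}
	return found
}

func main() {
	zones := map[string]string{"consul.": "consul", ".": "recursor"}
	fmt.Println("A  ->", match(zones, "consul.service.consul.", typeA))
	fmt.Println("DS ->", match(zones, "consul.service.consul.", typeDS))
}
```

In this model the A query stays with the consul handler, while the DS query for the same name ends up at the recursor handler. That would explain why only DS triggers the loop.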
Hello,
we just faced the very same issue described here.
I found this very good issue description after I discovered thousands of DS queries in our dnsmasq and Consul logs. It just took out our whole HQ DNS system ;).
Could the provided MR be re-prioritized? It would be great to have this fixed.
As a workaround, we set up an iptables rule to reject any DS queries to our dnsmasq servers for the domain consul.
iptables -A INPUT -i eth0 -p udp --dport 53 -m string --hex-string "|06|consul|00002b|" --algo bm -j REJECT -m comment --comment 'Reject consul DS requests'
The rule was tested in our Vagrant box. dig will run into a timeout, but the dnsmasq server will stay healthy.
The hex code 002b was grabbed from a pcap file recorded during our outage and represents the query type DS. Without the additional leading 0 iptables rejects the rule.
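For anyone adapting the rule to another domain or query type, here is a small sketch of how the matched bytes line up with the DNS wire format. The tailHex helper is a hypothetical name of my own. It encodes only the last label of the query name, the terminating root label and the 16-bit query type, which is the part of the packet the hex-string targets.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

const typeDS uint16 = 43 // 0x002b

// tailHex returns, as a hex string, the bytes that end a DNS question
// for a name whose last label is `label`: a length byte, the label,
// the zero-length root label, then the query type in big-endian order.
func tailHex(label string, qtype uint16) string {
	b := []byte{byte(len(label))}
	b = append(b, label...)
	b = append(b, 0x00)                        // root label terminating the name
	b = append(b, byte(qtype>>8), byte(qtype)) // QTYPE, high byte first
	return hex.EncodeToString(b)
}

func main() {
	fmt.Println(tailHex("consul", typeDS)) // 06636f6e73756c00002b
}
```

Read this way, |06| is the length of the consul label, the first 00 after it is the root label that ends the name, and 002b is the DS query type. The same type number shows up as 43 in the consul log lines above.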