Consul: Consul forwards DNS queries for type `DS` inside consul. zone to recursors

Created on 20 Jul 2019 · 4 Comments · Source: hashicorp/consul

Overview of the Issue

When consul is configured with a recursor, DNS queries for unrecognized record types are forwarded to the configured recursors. This forwarding also happens when resolving names in the consul. zone with types other than SOA, NS, ANY, A, AAAA, TXT.

In a particular setup, when dnsmasq is configured to forward consul. to a consul DNS server, and consul is configured to use this dnsmasq server as a recursor, this results in a DNS recursion loop. The dnsmasq configuration is described on hashicorp learn: https://learn.hashicorp.com/consul/security-networking/forwarding#dnsmasq-setup
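The loop can be illustrated with a toy model. This is only a sketch: `resolve`, `consulAnswers`, the server names, and the hop cap are all hypothetical, and real servers exchange UDP packets rather than function calls. It assumes only what this issue reports, namely that DS queries are handed to the recursor and that dnsmasq sends everything under consul. back to Consul.

```go
package main

import (
	"fmt"
	"strings"
)

// Toy model of the two forwarding rules. Nothing here is Consul or dnsmasq
// code; the names and the hop cap are illustrative assumptions.

const maxHops = 10 // artificial cap so the simulation terminates

// consulAnswers reports whether the modelled Consul answers the query itself.
// Per this issue, DS queries are handed to the recursor instead.
func consulAnswers(qtype string) bool {
	return qtype != "DS"
}

// resolve follows a query between the two servers. It returns the number of
// hops taken and whether an answer was produced before the cap was reached.
func resolve(name, qtype, server string) (hops int, answered bool) {
	for hops = 1; hops <= maxHops; hops++ {
		switch server {
		case "consul":
			if consulAnswers(qtype) {
				return hops, true
			}
			server = "dnsmasq" // forwarded to the configured recursor
		case "dnsmasq":
			if !strings.HasSuffix(name, ".consul.") {
				return hops, true // would be sent to an upstream resolver
			}
			server = "consul" // the consul. zone is forwarded back to Consul
		}
	}
	return maxHops, false
}

func main() {
	for _, qtype := range []string{"A", "DS"} {
		hops, ok := resolve("consul.service.consul.", qtype, "dnsmasq")
		fmt.Printf("%-2s answered=%v hops=%d\n", qtype, ok, hops)
	}
}
```

In this model the A query is answered on the second hop, while the DS query only stops because of the artificial cap.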

The documentation states that consul only forwards DNS queries for names outside the consul. zone: https://www.consul.io/docs/agent/options.html#recursors

I would also expect consul to act as an authoritative DNS server for the consul. zone and not forward queries for it to recursors.

Reproduction Steps

A docker setup to reproduce this issue is available in a gist: https://gist.github.com/vierbergenlars/d5877cf8bb076fb5789f47d1ad7039fb

Alternatively:

1. Set up a consul node (or a cluster, it does not matter) with -recursor pointing to dnsmasq.
2. Set up dnsmasq, forwarding the consul. zone to the consul DNS server.
3. Perform a DNS lookup for an unsupported record type for consul.service.consul against either dnsmasq or consul. I used the DNSSEC DS type: `dig @127.0.0.1 -p 8600 DS consul.service.consul`
4. Even long after dig has finished, DNS requests keep bouncing between consul and dnsmasq.

Consul info for Server

Server info

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = a82e6a7f
    version = 1.5.2
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 127.0.0.1:8300
    server = true
raft:
    applied_index = 71
    commit_index = 71
    fsm_pending = 0
    last_contact = 0
    last_log_index = 71
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:a1a793cb-2564-e6ab-8c85-6bc4a3ee35b1 Address:127.0.0.1:8300}]
    latest_configuration_index = 1
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 82
    max_procs = 4
    os = linux
    version = go1.12.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Log Fragments

From the consul server:

    2019/07/20 16:08:42 [ERR] dns: recurse failed: read udp 10.33.10.2:34942->10.33.10.3:53: i/o timeout
    2019/07/20 16:08:42 [ERR] dns: all resolvers failed for {consul.service.consul. 43 1} from client 10.33.10.3:571 (udp)
    2019/07/20 16:08:42 [DEBUG] dns: request for {consul.service.consul. 43 1} (udp) (2.000994556s) from client 10.33.10.3:571 (udp)
    2019/07/20 16:08:42 [ERR] dns: recurse failed: read udp 10.33.10.2:59383->10.33.10.3:53: i/o timeout
    2019/07/20 16:08:42 [ERR] dns: all resolvers failed for {consul.service.consul. 43 1} from client 10.33.10.3:46076 (udp)
    2019/07/20 16:08:42 [ERR] dns: recurse failed: read udp 10.33.10.2:36895->10.33.10.3:53: i/o timeout
    2019/07/20 16:08:42 [ERR] dns: all resolvers failed for {consul.service.consul. 43 1} from client 10.33.10.3:33552 (udp)
    2019/07/20 16:08:42 [DEBUG] dns: request for {consul.service.consul. 43 1} (udp) (2.001374786s) from client 10.33.10.3:46076 (udp)
    2019/07/20 16:08:42 [DEBUG] dns: request for {consul.service.consul. 43 1} (udp) (2.002385645s) from client 10.33.10.3:33552 (udp)
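For readers unfamiliar with the numbers in these log lines: the braces appear to be the default Go formatting of a DNS question (name, type, class), so `43 1` would be type DS in class IN. The sketch below decodes such a string. The helper `describe` and the lookup tables are hypothetical and cover only the record types mentioned in this issue.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IANA-assigned numbers for the record types that come up in this issue.
var qtypeNames = map[uint16]string{
	1: "A", 2: "NS", 5: "CNAME", 6: "SOA", 16: "TXT",
	28: "AAAA", 33: "SRV", 43: "DS", 255: "ANY",
}

var qclassNames = map[uint16]string{1: "IN"}

// describe parses a question as printed in the log lines above,
// e.g. "{consul.service.consul. 43 1}", into "name class type".
func describe(logged string) (string, error) {
	fields := strings.Fields(strings.Trim(logged, "{}"))
	if len(fields) != 3 {
		return "", fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	qtype, err := strconv.ParseUint(fields[1], 10, 16)
	if err != nil {
		return "", err
	}
	qclass, err := strconv.ParseUint(fields[2], 10, 16)
	if err != nil {
		return "", err
	}
	typeName, ok := qtypeNames[uint16(qtype)]
	if !ok {
		typeName = fmt.Sprintf("TYPE%d", qtype) // generic notation for unknown types
	}
	className, ok := qclassNames[uint16(qclass)]
	if !ok {
		className = fmt.Sprintf("CLASS%d", qclass)
	}
	return fmt.Sprintf("%s %s %s", fields[0], className, typeName), nil
}

func main() {
	s, err := describe("{consul.service.consul. 43 1}")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // prints "consul.service.consul. IN DS"
}
```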

From dnsmasq:

    dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
    dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

help-wanted theme/dns type/bug

Source

vierbergenlars

👍3

All 4 comments

I believe the problem root to be here: https://github.com/hashicorp/consul/blob/7753b97cc7dd597a7ed3a4fbb14a6906e36ab108/agent/dns.go#L417-L436

We dispatch all query types regardless of whether we should.

I think we could make that default case (and the TypeAXFR) become:

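    case dns.TypeA, dns.TypeAAAA, dns.TypeCNAME, dns.TypeSRV, dns.TypeTXT:
        // Supported types go through the normal dispatch path.
        ecsGlobal = d.dispatch(network, resp.RemoteAddr(), req, m)
    default:
        // Everything else is answered immediately with NOTIMP.
        m.SetRcode(req, dns.RcodeNotImplemented)
    }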

With that we would handle SOA, NS with the two other cases and then all the rest of the supported types with the d.dispatch call. Any other types will be terminated immediately and return the rcode for not implemented.
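As a self-contained illustration of that allow-list idea, here is a minimal sketch that does not depend on Consul or on github.com/miekg/dns. The function `handle` and its string return values are hypothetical. The constants use the IANA-assigned numbers that the `dns.Type*` and `dns.Rcode*` identifiers stand for.

```go
package main

import "fmt"

// Numeric values as assigned by IANA. In Consul these would come from the
// github.com/miekg/dns package (dns.TypeA, dns.RcodeNotImplemented, ...).
const (
	typeA     uint16 = 1
	typeNS    uint16 = 2
	typeCNAME uint16 = 5
	typeSOA   uint16 = 6
	typeTXT   uint16 = 16
	typeAAAA  uint16 = 28
	typeSRV   uint16 = 33
	typeDS    uint16 = 43

	rcodeSuccess        = 0
	rcodeNotImplemented = 4
)

// handle is a hypothetical stand-in for the switch discussed above. It only
// reports which code path would run and which response code would be set.
func handle(qtype uint16) (path string, rcode int) {
	switch qtype {
	case typeSOA:
		return "soa", rcodeSuccess
	case typeNS:
		return "ns", rcodeSuccess
	case typeA, typeAAAA, typeCNAME, typeSRV, typeTXT:
		return "dispatch", rcodeSuccess
	default:
		// Unsupported types stop here instead of reaching dispatch.
		return "rejected", rcodeNotImplemented
	}
}

func main() {
	for _, qtype := range []uint16{typeA, typeSRV, typeDS} {
		path, rcode := handle(qtype)
		fmt.Printf("type %d -> %s (rcode %d)\n", qtype, path, rcode)
	}
}
```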

mkeeler on 22 Jul 2019

Unfortunately, it might be a bit more complicated than I initially thought. The issue is only present when querying for DS, not for any other type.
I found this in the upstream package used for the DNS server.
I have no experience in Go, but I think this means that, for records of type DS specifically, the handler for . will always be called if it exists.
https://github.com/miekg/dns/blob/b13675009d59c97f3721247d9efa8914e1866a5b/serve_mux.go#L66-L81
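To make that reading concrete, here is a simplified model of the lookup as I understand the linked code. It is not the library's implementation. The function `match`, the handler labels, and the assumption that Consul registers one handler for consul. and, when recursors are configured, one for . are all illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typeA  uint16 = 1
	typeDS uint16 = 43
)

// match is a simplified model of the zone lookup in the linked serve_mux.go.
// zones maps a fully qualified zone name to a label naming its handler.
func match(zones map[string]string, qname string, qtype uint16) string {
	q := strings.ToLower(qname)
	found := ""
	for {
		if h, ok := zones[q]; ok {
			if qtype != typeDS {
				return h // the most specific zone wins for ordinary types
			}
			found = h // for DS, remember it but keep looking at parent zones
		}
		i := strings.Index(q, ".")
		if i < 0 || i == len(q)-1 {
			break // no further labels to strip
		}
		q = q[i+1:] // drop the leftmost label and try the parent
	}
	// Root fallback: for DS this is reached even when a zone matched above.
	if h, ok := zones["."]; ok {
		return h
	}
	return found
}

func main() {
	zones := map[string]string{"consul.": "query", ".": "recurse"}
	fmt.Println("A :", match(zones, "consul.service.consul.", typeA))
	fmt.Println("DS:", match(zones, "consul.service.consul.", typeDS))
}
```

Under these assumptions an A query is routed to the consul. handler, while a DS query for the same name ends up at the root handler, which would explain why only DS is forwarded.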

vierbergenlars on 23 Jul 2019

👍1

Hello,

We just faced the very same issue described here.
I found this very good issue description after discovering thousands of DS queries in our dnsmasq and Consul logs. It just took out our whole HQ DNS system ;).

Could the provided MR be re-prioritized? It would be great to have this fixed.

jardleex on 27 May 2020

👀1

As a workaround, we set up an iptables rule to reject any DS queries to our dnsmasq servers for the consul domain.

iptables -A INPUT -i eth0 -p udp --dport 53 -m string --hex-string "|06|consul|00002b|" --algo bm -j REJECT -m comment --comment 'Reject consul DS requests'

The rule was tested in our Vagrant box. dig will run into a timeout, but the dnsmasq server will stay healthy.
The hex code 002b was grabbed from a pcap file recorded during our outage and represents the query type DS. Without the additional leading 0, iptables rejects the rule.

jardleex on 27 May 2020
