Consul: Bind9 FORMERR on AAAA record lookups when delegating subdomain to Consul

Created on 1 Sep 2017 · 5 Comments · Source: hashicorp/consul

Consul version: v0.7.2

Server information:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease = 
        revision = 'a9afa0c
        version = 0.7.2
consul:
        bootstrap = false
        known_datacenters = 3
        leader = false
        leader_addr = [internal ip]:8300
        server = true
raft:
        applied_index = 57746118
        commit_index = 57746118
        fsm_pending = 0
        last_contact = 7.289316ms
        last_log_index = 57746118
        last_log_term = 3528
        last_snapshot_index = 57744807
        last_snapshot_term = 3528
        latest_configuration = [{Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300} {Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300} {Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300}]
        latest_configuration_index = 1
        num_peers = 2
        protocol_version = 1
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 3528
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 224
        max_procs = 2
        os = linux
        version = go1.7.3
serf_lan:
        encrypted = true
        event_queue = 0
        event_time = 1368
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 31
        members = 6
        query_queue = 0
        query_time = 1
serf_wan:
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 50
        members = 9
        query_queue = 0
        query_time = 1

Description of the Issue (and unexpected/desired result)

We are delegating a subdomain from Bind9 to Consul for service discovery. We have configured the datacenter and domain values properly, and delegation of A records works. However, lookups of AAAA records for valid services fail. The problem is limited to AAAA lookups of valid services: the service has valid A records for IPv4 but no AAAA records, because the backing servers currently do not have IPv6 addresses. Lookups of invalid services succeed, because Consul returns NXDOMAIN for them.

In troubleshooting on the Bind9 side I see Bind is reporting FORMERR. Note the following log snippet is sanitized:

31-Aug-2017 21:52:14.490 resolver: debug 3: resquery 0x7fa7a8229010 (fctx 0x7fa7a8223010(consul.service.dc.domain/AAAA)): response
31-Aug-2017 21:52:14.490 resolver: debug 10: received packet:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  41509
;; flags: qr aa; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;consul.service.dc.domain. IN AAAA

;; AUTHORITY SECTION:
domain.       0       IN      SOA     ns.domain. postmaster.domain. 1504216334 3600 600 86400 0


31-Aug-2017 21:52:14.490 resolver: debug 3: fctx 0x7fa7a8223010(consul.service.dc.domain/AAAA'): noanswer_response
31-Aug-2017 21:52:14.490 resolver: debug 10: log_ns_ttl: fctx 0x7fa7a8223010: noanswer_response: consul.service.dc.domain (in 'dc.domain'?): 1 30
31-Aug-2017 21:52:14.490 resolver: notice: DNS format error from [internal ip]#53 resolving consul.service.dc.domain/AAAA for client 127.0.0.1#56536: invalid response

From reading about similar issues on the Bind9 user list, I believe this is due to the SOA record being incorrect. See the following post, which includes the comment "This one fails to return the CNAME to content.sjc1.site.voxcdn.net when the query type is AAAA so you get a unrelated SOA record.": https://groups.google.com/forum/#!topic/comp.protocols.dns.bind/B-9RPmaJdjQ This matches my error: the empty NOERROR response to the AAAA lookup returns an SOA record with ns.domain as the authoritative nameserver, which is wrong.
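To make the SOA observation concrete, here is a minimal Python sketch of a bailiwick check on a negative response. It is not BIND's actual code. The function names are hypothetical, and the rule itself (the SOA owner must sit at or below the delegated zone and at or above the query name) is an assumption about how a validating resolver might vet a NODATA answer. The names are the sanitized ones from the log above.

```python
def labels(name):
    """Split a DNS name into lowercase labels, ignoring the trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_at_or_below(name, zone):
    """True if `name` equals `zone` or is a subdomain of it."""
    n, z = labels(name), labels(zone)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


def soa_plausible(soa_owner, delegated_zone, qname):
    """Assumed rule: the SOA owner lies inside the delegated zone
    and the query name lies inside the SOA owner's zone."""
    return is_at_or_below(soa_owner, delegated_zone) and is_at_or_below(qname, soa_owner)


if __name__ == "__main__":
    qname = "consul.service.dc.domain."
    # SOA owner as seen in the log, checked against the zone named in the log line.
    print(soa_plausible("domain.", "dc.domain.", qname))
    # SOA owner matching the delegated zone, for comparison.
    print(soa_plausible("dc.domain.", "dc.domain.", qname))
```

Under this assumed rule, an SOA owned by `domain.` does not pass for a delegation of `dc.domain.`, while an SOA owned by `dc.domain.` does. Whether this is the exact check that produces the FORMERR is not confirmed by anything in this thread.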

Looking over the Consul docs, I do not see how to configure the SOA record for the delegated domain in Consul. Based on the docs at https://www.consul.io/docs/agent/options.html#dns_config I am under the impression that ns.domain and postmaster.domain are hardcoded defaults. I see PR #1798 was opened to make this record configurable, but the author closed the PR without it being merged.

This is a nuisance problem: the A record lookup works, but Bind passes SERVFAIL to clients that try the AAAA lookup first, because it rejects the response from Consul and so has no answer of its own to give. The clients retry on SERVFAIL until they time out and fall back to the A record, which adds about 10s to all API requests to services using Consul DNS in our environment.
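As a possible client-side stopgap (not something proposed in this thread), a client that only needs IPv4 can ask its resolver for IPv4 results only, which on typical platforms avoids issuing the AAAA query at all. The sketch below uses the standard library's `socket.getaddrinfo` with `family=socket.AF_INET`; the helper name `resolve_ipv4_only` is hypothetical, and whether the AAAA query is really skipped depends on the platform's resolver.

```python
import socket


def resolve_ipv4_only(host, port):
    """Return the unique IPv4 addresses for `host`, requesting AF_INET results only."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:  # keep resolver order, drop duplicates
            addresses.append(ip)
    return addresses
```

For example, `resolve_ipv4_only("consul.service.dc.domain", 8500)` would look up the sanitized name from the log; the port here is only a placeholder. This does nothing for clients you do not control, so it is a stopgap rather than a fix for the rejected response.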

Labels: theme/dns, type/enhancement

All 5 comments

#1301 may also resolve this: if Consul properly returns records for ns.domain (assuming my understanding is correct), then the SOA in the AAAA response will be valid.

We've made some changes to the SOA and NS responses of Consul in 0.9.1, the gist of which is here: https://github.com/hashicorp/consul/pull/3353#issuecomment-320934137

However, that code has a panic, which is fixed in https://github.com/hashicorp/consul/pull/3408 and is only on master right now. The panic is triggered when you query for the SOA record directly. We should have a 0.9.3 release out soon.

Also, I agree that the SOA fields should be configurable and I'm going to pick this up after the config refactoring I'm working on.

We have the same scenario and behaviour as described by @CVTJNII. We are using Bind, and I'm having trouble finding a workaround.

Do you guys have any ideas?

Consul version: v1.3.0

agent:
        check_monitors = 1
        check_ttls = 0
        checks = 6
        services = 1
build:
        prerelease =
        revision = e8757838
        version = 1.3.0
consul:
        bootstrap = false
        known_datacenters = 2
        leader = false
        leader_addr = 10.94.120.18:8300
        server = true
raft:
        applied_index = 5419372
        commit_index = 5419372
        fsm_pending = 0
        last_contact = 40.821696ms
        last_log_index = 5419372
        last_log_term = 1555
        last_snapshot_index = 5409030
        last_snapshot_term = 1555
        latest_configuration = [{Suffrage:Voter ID:8034b686-0ed8-750e-4b29-da34f35efb44 Address:10.94.120.6:8300} {Suffrage:Voter ID:fb05db25-880b-820a-6824-83bb18642377 Address:10.94.120.18:8300} {Suffrage:Voter ID:27f328ce-fc79-026e-c9f2-23055610d461 Address:10.94.120.17:8300}]
        latest_configuration_index = 3639368
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 1555
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 199
        max_procs = 4
        os = linux
        version = go1.11.1
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 155
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 568
        members = 36
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 253
        members = 6
        query_queue = 0
        query_time = 1

Same issue with Consul 1.6.2 :(

@CVTJNII, @magiconair did you find a workaround?

Hi @olivierHa,

Would you mind providing a bit more detail about your environment, and the exact error you're seeing in the Consul & DNS server logs?

I was able to successfully configure BIND to forward queries to Consul using standard DNS delegation as well as using a static-stub zone type. It seems subdomain delegation & lookups _should_ work. It would be helpful to have more details about your environment to troubleshoot.

Thanks.
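For anyone gathering the details requested above, the following is a rough Python sketch that sends a single query straight to a Consul agent's DNS interface and summarizes the response header, so the result can be compared with what Bind logs. It uses only the standard library. The function names are hypothetical, the parser reads only the 12-byte DNS header (not the records), and the server address and port in the usage note are assumptions that must be adapted to your environment.

```python
import socket
import struct

QTYPE = {"A": 1, "SOA": 6, "AAAA": 28}
RCODE = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query packet (one question, class IN, no recursion)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    return header + qname + struct.pack("!HH", QTYPE[qtype], 1)


def summarize(response):
    """Decode the 12-byte DNS header: rcode, AA flag, and section counts."""
    ident, flags, _qd, an, ns, ar = struct.unpack("!HHHHHH", response[:12])
    code = flags & 0x000F
    return {
        "id": ident,
        "rcode": RCODE.get(code, str(code)),
        "authoritative": bool(flags & 0x0400),
        "answer": an,
        "authority": ns,
        "additional": ar,
    }


def query(server, port, name, qtype, timeout=2.0):
    """Send one UDP query to `server:port` and return the header summary."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, port))
        data, _addr = sock.recvfrom(4096)
    return summarize(data)
```

A call such as `query("127.0.0.1", 8600, "consul.service.dc.domain", "AAAA")` assumes the agent listens on Consul's default DNS port; the original report shows Consul answering on port 53 instead. A result with `rcode` NOERROR, `answer` 0 and `authority` 1 would match the packet in the Bind log from the original report.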
