Consul version: v0.7.2
Server information:
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 1
build:
prerelease =
revision = a9afa0c
version = 0.7.2
consul:
bootstrap = false
known_datacenters = 3
leader = false
leader_addr = [internal ip]:8300
server = true
raft:
applied_index = 57746118
commit_index = 57746118
fsm_pending = 0
last_contact = 7.289316ms
last_log_index = 57746118
last_log_term = 3528
last_snapshot_index = 57744807
last_snapshot_term = 3528
latest_configuration = [{Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300} {Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300} {Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300}]
latest_configuration_index = 1
num_peers = 2
protocol_version = 1
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 3528
runtime:
arch = amd64
cpu_count = 2
goroutines = 224
max_procs = 2
os = linux
version = go1.7.3
serf_lan:
encrypted = true
event_queue = 0
event_time = 1368
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 31
members = 6
query_queue = 0
query_time = 1
serf_wan:
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 50
members = 9
query_queue = 0
query_time = 1
We are delegating a subdomain from Bind9 to Consul for service discovery. We have configured the datacenter and domain values properly, and delegation of A records works. However, lookups of AAAA records for valid services fail. The problem is limited to AAAA lookups of valid services: the service has valid A records for IPv4 but no AAAA records, as the backing servers currently do not have IPv6 addresses. Lookups of invalid services succeed because Consul returns NXDOMAIN for them.
In troubleshooting on the Bind9 side I see Bind is reporting FORMERR. Note the following log snippet is sanitized:
31-Aug-2017 21:52:14.490 resolver: debug 3: resquery 0x7fa7a8229010 (fctx 0x7fa7a8223010(consul.service.dc.domain/AAAA)): response
31-Aug-2017 21:52:14.490 resolver: debug 10: received packet:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41509
;; flags: qr aa; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;consul.service.dc.domain. IN AAAA
;; AUTHORITY SECTION:
domain. 0 IN SOA ns.domain. postmaster.domain. 1504216334 3600 600 86400 0
31-Aug-2017 21:52:14.490 resolver: debug 3: fctx 0x7fa7a8223010(consul.service.dc.domain/AAAA'): noanswer_response
31-Aug-2017 21:52:14.490 resolver: debug 10: log_ns_ttl: fctx 0x7fa7a8223010: noanswer_response: consul.service.dc.domain (in 'dc.domain'?): 1 30
31-Aug-2017 21:52:14.490 resolver: notice: DNS format error from [internal ip]#53 resolving consul.service.dc.domain/AAAA for client 127.0.0.1#56536: invalid response
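To make the difference between the two cases concrete, here is a minimal Python sketch that classifies a dig-style packet dump like the one above as NXDOMAIN or as an empty NOERROR ("NODATA") response. The function name `classify_response` and the `SAMPLE` text are only illustrative, and the parsing is a rough assumption about the text layout, not anything Bind or Consul does internally.

```python
import re

# Sanitized packet text copied from the Bind debug log above.
SAMPLE = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41509
;; flags: qr aa; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;consul.service.dc.domain. IN AAAA
;; AUTHORITY SECTION:
domain. 0 IN SOA ns.domain. postmaster.domain. 1504216334 3600 600 86400 0
"""

STATUS_RE = re.compile(r"status:\s*(\w+)")
COUNT_RE = re.compile(r"ANSWER:\s*(\d+),\s*AUTHORITY:\s*(\d+)")
SOA_RE = re.compile(r"\sIN\s+SOA\s")


def classify_response(packet_text: str) -> str:
    """Classify a dig-style packet dump (hypothetical helper)."""
    status = STATUS_RE.search(packet_text)
    counts = COUNT_RE.search(packet_text)
    if status is None or counts is None:
        return "unparsed"
    rcode = status.group(1)
    answers = int(counts.group(1))
    has_soa = SOA_RE.search(packet_text) is not None
    if rcode == "NXDOMAIN":
        return "nxdomain"  # the name does not exist at all
    if rcode == "NOERROR" and answers == 0 and has_soa:
        return "nodata"    # the name exists, but not for this record type
    if rcode == "NOERROR" and answers > 0:
        return "answer"
    return "other"


if __name__ == "__main__":
    print(classify_response(SAMPLE))
```

The AAAA response in the log falls into the "nodata" case, which is the only case where we see the failure.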
From reading about similar issues on the Bind9 user list, I believe this is due to the SOA record being incorrect. See the following post, which includes the comment "This one fails to return the CNAME to content.sjc1.site.voxcdn.net when the query type is AAAA so you get a unrelated SOA record." https://groups.google.com/forum/#!topic/comp.protocols.dns.bind/B-9RPmaJdjQ This would explain my error: the empty NOERROR response for the AAAA lookup returns an SOA record with ns.domain as the authoritative nameserver, which is wrong.
Looking over the Consul docs, I do not see how to configure the SOA record for the delegated domain in Consul. Based on the docs at https://www.consul.io/docs/agent/options.html#dns_config I am under the impression that ns.domain and postmaster.domain are hardcoded defaults. I see PR #1798 was opened to make this record settable, but the author closed the PR without it being merged.
This is a nuisance problem. The A record lookup works, but Bind passes SERVFAIL to clients that try to look up AAAA records first, because it rejects the response from Consul and so has no response of its own to give. The clients retry on SERVFAIL until they time out and fall back to the A record, adding about 10s to all API requests to services using Consul DNS in our environment.
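As a rough illustration of why the SOA looks wrong to me, here is a small Python sketch of the kind of sanity check I assume a resolver applies to a negative response: the SOA owner name should sit at or below the zone that was delegated, and at or above the name that was queried. This is my simplified reading of the log line `(in 'dc.domain'?)`, not Bind's actual code, and the function names are made up for the example.

```python
def labels(name: str) -> list:
    """Split a DNS name into lowercase labels, ignoring the trailing root dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_at_or_below(name: str, zone: str) -> bool:
    """Return True if `name` equals `zone` or is a subdomain of it."""
    name_labels, zone_labels = labels(name), labels(zone)
    if len(zone_labels) > len(name_labels):
        return False
    # An empty zone label list is the root, which contains every name.
    return len(zone_labels) == 0 or name_labels[-len(zone_labels):] == zone_labels


def soa_owner_plausible(qname: str, soa_owner: str, delegated_zone: str) -> bool:
    """Hypothetical check: SOA owner lies between the delegated zone and the query name."""
    return is_at_or_below(soa_owner, delegated_zone) and is_at_or_below(qname, soa_owner)


if __name__ == "__main__":
    # Values taken from the sanitized log: the SOA owner is "domain.",
    # while the query was resolved in the context of 'dc.domain'.
    print(soa_owner_plausible("consul.service.dc.domain.", "domain.", "dc.domain."))
    # An SOA owned by the delegated zone itself would pass this check.
    print(soa_owner_plausible("consul.service.dc.domain.", "dc.domain.", "dc.domain."))
```

With the names from the log, the first call fails the check because `domain.` is above `dc.domain`, while an SOA owned by `dc.domain.` would pass.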
We've made some changes to the SOA and the NS responses of consul in 0.9.1 of which the gist is in here: https://github.com/hashicorp/consul/pull/3353#issuecomment-320934137
However, there is a panic in the code that is fixed here: https://github.com/hashicorp/consul/pull/3408 which is only on master right now. The panic is triggered when you query for the SOA record directly. We should have a 0.9.3 release out soon.
Also, I agree that the SOA fields should be configurable and I'm going to pick this up after the config refactoring I'm working on.
We have the same scenario and behaviour as described by @CVTJNII. We are using Bind, and I'm having trouble finding a workaround.
Do you guys have any ideas?
Consul version: v1.3.0
agent:
check_monitors = 1
check_ttls = 0
checks = 6
services = 1
build:
prerelease =
revision = e8757838
version = 1.3.0
consul:
bootstrap = false
known_datacenters = 2
leader = false
leader_addr = 10.94.120.18:8300
server = true
raft:
applied_index = 5419372
commit_index = 5419372
fsm_pending = 0
last_contact = 40.821696ms
last_log_index = 5419372
last_log_term = 1555
last_snapshot_index = 5409030
last_snapshot_term = 1555
latest_configuration = [{Suffrage:Voter ID:8034b686-0ed8-750e-4b29-da34f35efb44 Address:10.94.120.6:8300} {Suffrage:Voter ID:fb05db25-880b-820a-6824-83bb18642377 Address:10.94.120.18:8300} {Suffrage:Voter ID:27f328ce-fc79-026e-c9f2-23055610d461 Address:10.94.120.17:8300}]
latest_configuration_index = 3639368
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 1555
runtime:
arch = amd64
cpu_count = 4
goroutines = 199
max_procs = 4
os = linux
version = go1.11.1
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 155
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 568
members = 36
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 253
members = 6
query_queue = 0
query_time = 1
Same issue with Consul 1.6.2 :(
@CVTJNII, @magiconair did you find a workaround?
Hi @olivierHa,
Would you mind providing a bit more detail about your environment, and the exact error you're seeing in the Consul & DNS server logs?
I was able to successfully configure BIND to forward queries to Consul using standard DNS delegation as well as using a static-stub zone type. It seems subdomain delegation & lookups _should_ work. It would be helpful to have more details about your environment to troubleshoot.
Thanks.