When thousands of instances of a service are registered with Consul, the DNS interface stops working: when responding to a DNS query over TCP, the response exceeds the maximum size supported by the dns package that Consul depends on.
Consul then fails with this error message:
[WARN] dns: failed to respond: dns: message too large
Consul has the udp_answer_limit option to limit the number of answers in a DNS response over UDP, but nothing similar for TCP. Queries over TCP are useful when a DNS cache is used _in front_ of Consul: the cache needs access to the full list of service instances to provide caching and load balancing.
The current failure case is pretty bad since there is no workaround. Should an infrastructure based on Consul's DNS interface for service discovery grow beyond the limit that Consul supports, the result would be a major outage, as all name resolutions would stop.
Using Consul's UDP interface has also proven to have too high latency with that many instances of a service, varying between 100 ms and 700 ms, so it is not a viable alternative and a caching layer is required.
From the discussion in #635, it looks like we can't increase the size, but it does seem like we should be able to truncate (and set the truncation bit) and then trim the result.
This doesn't address your use case for the full list, though. Did you find a workaround for that?
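For anyone who wants to check from the client side whether an answer came back with the truncation bit set, the TC flag shows up in the `flags:` line of dig's full (non `+short`) output. The following is only a minimal sketch: `is_truncated` is a made-up helper name, not part of Consul or dig, and the thread does not confirm exactly when Consul sets the bit on TCP answers.

```shell
#!/bin/sh
# Hypothetical helper: reads dig's full output on stdin and succeeds
# (exit status 0) when the header's flags line contains "tc".
# A typical header line looks like:
#   ;; flags: qr aa tc rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
is_truncated() {
    grep -Eq '^;; flags:[^;]* tc[ ;]'
}

# Example usage against a local dev agent (assumed address and port):
#   dig @localhost -p 8600 SRV redis.service.consul +tcp | is_truncated \
#       && echo "answer was truncated"
```

When querying over UDP, dig retries over TCP by default after a truncated answer, so `+ignore` would be needed to see the TC flag on the UDP response itself.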
Truncating the response definitely alters the desired behavior of obtaining the full list... but there's only so much that can be done; we're just hitting the limits of the DNS protocol.
The reason truncating is preferable is that it trades off correctness for availability. With partial results, a service discovery system would still be able to function, albeit with degraded performance. That's much better than ceasing to work entirely, which is the current behavior.
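To make the trade-off concrete, here is a toy sketch of "keep as many records as fit, drop the rest". It works on text lines rather than DNS wire format, so it is only an analogy for what the server would do; `trim_to_budget` and its size accounting (line length plus newline, ASCII input assumed) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copies records (one per line) from stdin to stdout
# until the next record would push the running total past the budget ($1).
# Everything after that point is dropped, so the caller gets a partial
# but usable list instead of an error.
trim_to_budget() {
    awk -v budget="$1" '
        {
            n = length($0) + 1          # record size, counting the newline
            if (used + n > budget) exit # stop at the first record that does not fit
            used += n
            print
        }'
}

# Example usage: keep at most 65535 characters worth of records
#   some_record_source | trim_to_budget 65535
```

A client that receives the trimmed list can still resolve the service, which is the availability side of the trade-off described above.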
@achille-roussel @slackpad This PR https://github.com/hashicorp/consul/pull/3948 fixes this issue.
Test protocol for a large number of nodes:
setup_dns_overflow.sh:
#!/bin/bash
# Registers many instances of a "redis" service in the Consul catalog.
CONSUL_ADDR=${1:-http://localhost:8500}
# create_node NODE ADDRESS SERVICE: register one node with one service instance
create_node()
{
  curl -fs --request PUT -d '{"Node": "'"$1"'", "Address": "'"$2"'", "Service": {"Service": "'"$3"'", "Tags": ["http", "tag_'"$1"'", "start_'"$(date +%Y%m%d)"'"], "Port": 8000}}' "$CONSUL_ADDR/v1/catalog/register" -o /dev/null
}
networks=$(seq 1 20)
num_services=${NUM_SERVICES:-254}
for j in $networks
do
  for i in $(seq 1 "$num_services")
  do
    create_node "host-redis-$j-$i.test.acme.com" "192.168.$j.$i" redis || exit 1
  done
done
$ consul agent -dev
$ ./setup_dns_overflow.sh
$ while true; do http_count=$(curl -fs 'localhost:8500/v1/catalog/service/redis?pretty'|grep '"Node"'|wc -l) ; dns_count=$(dig @localhost -p 8600 SRV redis.service.consul +tcp +short|wc -l); dns_a=$(dig @localhost -p 8600 redis.service.consul +tcp +short|wc -l); echo "HTTP: $http_count ; DNS_SRV: $dns_count ; DNS_A: $dns_a"; sleep 1; done
HTTP: 0 ; DNS_SRV: 0 ; DNS_A: 0
HTTP: 7 ; DNS_SRV: 8 ; DNS_A: 10
HTTP: 54 ; DNS_SRV: 55 ; DNS_A: 56
HTTP: 100 ; DNS_SRV: 101 ; DNS_A: 102
HTTP: 147 ; DNS_SRV: 148 ; DNS_A: 149
[...]
HTTP: 388 ; DNS_SRV: 389 ; DNS_A: 390
HTTP: 434 ; DNS_SRV: 436 ; DNS_A: 437
HTTP: 480 ; DNS_SRV: 445 ; DNS_A: 483
# Max number for SRV is 445
[...]
HTTP: 1728 ; DNS_SRV: 445 ; DNS_A: 1752
HTTP: 1797 ; DNS_SRV: 445 ; DNS_A: 1819
HTTP: 1865 ; DNS_SRV: 445 ; DNS_A: 1819
HTTP: 1935 ; DNS_SRV: 445 ; DNS_A: 1819
HTTP: 2007 ; DNS_SRV: 445 ; DNS_A: 1819
HTTP: 2082 ; DNS_SRV: 445 ; DNS_A: 1819
[...]
# Max number for A is 1819
At the debug log level, the Consul agent now outputs the following message when a response is truncated:
2018/03/08 00:19:22 [DEBUG] dns: TCP answer to [{redis.service.consul. 33 1}] too large truncated recs:=442/4702, size:=65468/696662
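To keep an eye on how much is being dropped, the counts in that debug line can be pulled out with sed. This is a rough sketch that assumes the log format stays exactly as in the line above; `parse_truncation_log` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads agent log lines on stdin and, for each
# truncation message, prints four numbers:
#   kept_records total_records kept_bytes total_bytes
# Lines that do not match the assumed format are skipped.
parse_truncation_log() {
    sed -n 's/.*truncated recs:=\([0-9]*\)\/\([0-9]*\), size:=\([0-9]*\)\/\([0-9]*\).*/\1 \2 \3 \4/p'
}

# Example usage (the log file path is an assumption):
#   parse_truncation_log < consul.log
```

For the sample line, the kept size of 65468 bytes sits just under 65535, the largest DNS message that the 2-byte length prefix used over TCP can describe.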
Nice :)
Thanks for tackling the issue!