Consul: [BUG] Consul DNS interface breaks when responding to SRV queries over TCP with large result sets.

Created on 31 Jan 2018 · 4 Comments · Source: hashicorp/consul

Description of the Issue (and unexpected/desired result)

When thousands of instances of a service are registered with Consul, the DNS interface stops working for that service: when responding to a DNS query over TCP, the response exceeds the maximum message size supported by the dns package that Consul depends on.

Consul then fails with this error message:

[WARN] dns: failed to respond: dns: message too large

Consul has the udp_answer_limit option to limit the number of answers returned for a DNS query over UDP, but nothing similar for TCP. Queries over TCP are useful when a DNS cache is used _in front_ of Consul: the cache needs access to the full list of service instances to provide caching and load balancing.

The current failure mode is severe because there is no workaround. Should an infrastructure that relies on Consul's DNS interface for service discovery grow beyond the limit that Consul supports, the result would be a major outage, as all name resolution would stop.

Using Consul's UDP interface has also proven to have too high a latency with this many instances of a service, varying between 100 ms and 700 ms, so it is not a viable alternative, and a caching layer is required.

Reproduction steps

  1. register 3000+ services under the same name and using different ports
  2. send an SRV DNS query to Consul to resolve the service name
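
The two steps above could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual setup: the helper names `service_payload` and `register_services`, the service name `repro`, and the starting port 10000 are assumptions made for illustration. It registers instances through the local agent's `/v1/agent/service/register` endpoint.

```shell
#!/bin/bash
# Hypothetical repro helper; names and defaults below are assumptions.
CONSUL_ADDR=${CONSUL_ADDR:-http://localhost:8500}

# Print the JSON body for one instance: same Name, unique ID and port.
service_payload() {
  local name=$1 port=$2
  printf '{"ID": "%s-%s", "Name": "%s", "Port": %s}' "$name" "$port" "$name" "$port"
}

# Register COUNT instances of NAME on consecutive ports starting at FIRST_PORT.
register_services() {
  local name=$1 count=$2 first_port=${3:-10000} port
  for ((port = first_port; port < first_port + count; port++)); do
    service_payload "$name" "$port" |
      curl -fs --request PUT --data @- "$CONSUL_ADDR/v1/agent/service/register" || return 1
  done
}

# Usage (against a local dev agent):
#   register_services repro 3000
#   dig @localhost -p 8600 SRV repro.service.consul +tcp
```

Nothing runs until `register_services` is called, so the file can be sourced first and the instance count adjusted to find the point where the response no longer fits.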

All 4 comments

It looks like from the discussion in #635 that we can't increase the size, but it does seem like we should be able to truncate (and set the truncation bit) and then trim the result.

This doesn't address your use case for the full list, though. Did you find a workaround for that?
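
A client or cache could tell a trimmed answer from a complete one by looking for the truncation (`tc`) flag in the response header. The sketch below checks `dig` output for that flag; the helper name `has_tc_flag` is hypothetical, and whether a given Consul version sets the bit on TCP responses is an assumption based on the suggestion above, not something this snippet verifies.

```shell
# Succeeds when dig output read on stdin has the "tc" (truncated) flag set
# in its header line, e.g. ";; flags: qr aa tc rd; QUERY: 1, ...".
has_tc_flag() {
  grep -Eq '^;; flags:[^;]* tc( |;)'
}

# Usage:
#   dig @localhost -p 8600 SRV redis.service.consul +tcp | has_tc_flag && echo "answer was truncated"
```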

Truncating the response definitely alters the desired behavior of obtaining the full list... but there's only so much that can be done; we're just hitting the limits of the DNS protocol.

The reason truncating is preferable is that it trades correctness for availability. With partial results, a service discovery system would still be able to function, although with degraded performance. That is much better than ceasing to work entirely, which is the current behavior.
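
To illustrate the trade-off, here is a minimal sketch of trimming a result list to a byte budget. It is not Consul's implementation, which works on encoded DNS messages; the function name `fit_within_budget` and the one-record-per-line input are assumptions made for the example.

```shell
# Hypothetical helper: print records from stdin (one per line) until the
# cumulative size, counting one newline per record, would exceed MAX_BYTES.
fit_within_budget() {
  LC_ALL=C awk -v max="$1" '{ size += length($0) + 1; if (size > max) exit; print }'
}

# Usage: keep only as many answers as fit in a 65535-byte budget
#   dig @localhost -p 8600 SRV redis.service.consul +tcp +short | fit_within_budget 65535
```

The caller still gets a usable, shorter list instead of an error, which is the availability side of the trade-off described above.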

@achille-roussel @slackpad This PR https://github.com/hashicorp/consul/pull/3948 fixes this issue.

Test protocol for large number of nodes:

  1. Create script setup_dns_overflow.sh:
#!/bin/bash
# Register 20 x NUM_SERVICES nodes, each running one "redis" service instance.
CONSUL_ADDR=${1:-http://localhost:8500}

# create_node NODE ADDRESS SERVICE: register one node and its service in the catalog
create_node()
{
  curl -fs --request PUT -d '{"Node": "'"$1"'", "Address": "'"$2"'", "Service": {"Service": "'"$3"'", "Tags": ["http", "tag_'"$1"'", "start_'"$(date +%Y%m%d)"'"], "Port": 8000}}' "$CONSUL_ADDR/v1/catalog/register" -o /dev/null
}

networks=$(seq 1 20)
num_services=${NUM_SERVICES:-254}
for j in $networks
do
  for i in $(seq 1 "$num_services")
  do
    create_node "host-redis-$j-$i.test.acme.com" "192.168.$j.$i" redis || exit 1
  done
done
  2. Launch agent with patch https://github.com/hashicorp/consul/pull/3948
$ consul agent -dev
  3. Launch script
$ ./setup_dns_overflow.sh
  4. Watch the number of records returned
$ while true; do
    http_count=$(curl -fs 'localhost:8500/v1/catalog/service/redis?pretty' | grep '"Node"' | wc -l)
    dns_count=$(dig @localhost -p 8600 SRV redis.service.consul +tcp +short | wc -l)
    dns_a=$(dig @localhost -p 8600 redis.service.consul +tcp +short | wc -l)
    echo "HTTP: $http_count ; DNS_SRV: $dns_count ; DNS_A: $dns_a"
    sleep 1
  done
HTTP:        0 ; DNS_SRV:        0 ; DNS_A:        0
HTTP:        7 ; DNS_SRV:        8 ; DNS_A:       10
HTTP:       54 ; DNS_SRV:       55 ; DNS_A:       56
HTTP:      100 ; DNS_SRV:      101 ; DNS_A:      102
HTTP:      147 ; DNS_SRV:      148 ; DNS_A:      149
[...]
HTTP:      388 ; DNS_SRV:      389 ; DNS_A:      390
HTTP:      434 ; DNS_SRV:      436 ; DNS_A:      437
HTTP:      480 ; DNS_SRV:      445 ; DNS_A:      483
# Max number for SRV is 445
[...]
HTTP:     1728 ; DNS_SRV:      445 ; DNS_A:     1752
HTTP:     1797 ; DNS_SRV:      445 ; DNS_A:     1819
HTTP:     1865 ; DNS_SRV:      445 ; DNS_A:     1819
HTTP:     1935 ; DNS_SRV:      445 ; DNS_A:     1819
HTTP:     2007 ; DNS_SRV:      445 ; DNS_A:     1819
HTTP:     2082 ; DNS_SRV:      445 ; DNS_A:     1819
[...]
# Max number for A is 1819

With debug logging enabled, the Consul agent now logs the following message when a response is truncated:

2018/03/08 00:19:22 [DEBUG] dns: TCP answer to [{redis.service.consul. 33 1}] too large truncated recs:=442/4702, size:=65468/696662
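
As a rough sanity check on those figures, the arithmetic below works out the average record size and how many such records fit in one TCP message. The only value not taken from the log line is 65535, the largest message that the 16-bit length prefix of DNS over TCP can describe; the variable names are illustrative.

```shell
# Figures copied from the debug line above.
kept_recs=442;   kept_bytes=65468
total_recs=4702; total_bytes=696662
max_tcp_msg=65535   # a DNS message over TCP has a 16-bit length prefix

# Integer average size of one record in each case
avg_kept=$((kept_bytes / kept_recs))
avg_total=$((total_bytes / total_recs))
# How many average-sized records fit in one TCP message
fit=$((max_tcp_msg / avg_total))

echo "avg bytes/record: kept=$avg_kept total=$avg_total; records that fit: $fit"
# prints: avg bytes/record: kept=148 total=148; records that fit: 442
```

At roughly 148 bytes per record, the full answer would be more than ten times larger than a single TCP message allows.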

Nice :)

Thanks for tackling the issue!
