Envoy: STRICT_DNS drops cluster members on lookup failure

Created on 1 Mar 2018 · 2 Comments · Source: envoyproxy/envoy

Title: STRICT_DNS drops cluster members on lookup failure

Description:
We are using Envoy in a Consul environment. We would like to use DNS lookups to configure our clusters. For our particular use case, we need Envoy instances in DCs around the world to locate a set of hosts in one datacenter. To do this, we are using a prepared query. In short, this allows us to do a global lookup of the set of hosts we need and query it using DNS.

However, when network lag is too great, the DNS query occasionally returns NXDOMAIN instead of the set of IPs it normally returns. With a STRICT_DNS cluster this is catastrophic: all hosts are removed from the cluster, causing downtime until the next successful DNS query.

Instead, I would like Envoy to consider the DNS entries as advisory, and keep using the last known set until lookups recover.
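To make the requested behavior concrete, here is a minimal Python sketch of "advisory" DNS semantics. It is not Envoy code: `ResolutionStatus`, `AdvisoryDnsCluster`, and `on_dns_response` are hypothetical names, and the sketch assumes the resolver can report a failed lookup (including the spurious NXDOMAIN described above) separately from a successful one.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ResolutionStatus(Enum):
    """Hypothetical outcome of one DNS lookup."""
    SUCCESS = auto()  # the resolver returned a usable answer
    FAILURE = auto()  # timeout, server error, or a spurious NXDOMAIN


@dataclass
class AdvisoryDnsCluster:
    """Hypothetical cluster that keeps its last known hosts on lookup failure."""
    hosts: frozenset = frozenset()

    def on_dns_response(self, status, addresses):
        # Only a successful lookup may change membership. A failed lookup
        # leaves the last known set in place until lookups recover.
        if status is ResolutionStatus.SUCCESS:
            self.hosts = frozenset(addresses)
        return self.hosts


cluster = AdvisoryDnsCluster()
cluster.on_dns_response(ResolutionStatus.SUCCESS, ["10.0.0.1", "10.0.0.2"])
# A failed lookup carries no addresses, but the cluster is not emptied.
after_failure = cluster.on_dns_response(ResolutionStatus.FAILURE, [])
# The next successful lookup replaces the set as usual.
after_recovery = cluster.on_dns_response(ResolutionStatus.SUCCESS, ["10.0.0.3"])
```

Whether NXDOMAIN should count as a failure or as a legitimately empty answer is a policy choice; this sketch treats it as a failure only because that matches the scenario in this report.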

Workarounds
We are trying out LOGICAL_DNS instead, which seems to have the DNS lookup properties that we want. However, the lookup returns a set of Envoy sidecars, and it would be better if the downstream Envoy could maintain connections to the upstream Envoy instances. From what I can tell, LOGICAL_DNS also does not use HTTP/2?

We are just getting started with Envoy, so maybe there is something obvious I'm missing. But from what I can tell, the behavior of STRICT_DNS is more what we want than LOGICAL_DNS.

Labels: enhancement, help wanted

All 2 comments

@jasonmartens the history here is that when we used to use getaddrinfo_a() there was basically no good way to differentiate an error from an empty response (terrible API). With c-ares there might be. I'm not sure. If there are clear errors that we should be ignoring, we can make the DNS resolver not consider it an empty response. I would have a look at the code.

Your other option is to enable active health checking against the endpoints. This will stabilize the endpoints since Envoy will trust active HC over discovery.
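As a rough illustration of "trust active HC over discovery", the Python sketch below models membership reconciliation in a simplified way. It is an assumption-laden toy, not Envoy's implementation: `reconcile` and its parameters are hypothetical, and real health checking involves thresholds and intervals that are ignored here.

```python
def reconcile(current, discovered, healthy):
    """Return the next host set for a cluster (simplified, hypothetical model).

    current:    hosts in the cluster before this DNS refresh
    discovered: hosts returned by the latest DNS lookup (may be empty)
    healthy:    hosts currently passing active health checks
    """
    # Hosts that vanished from discovery are retained while they still
    # pass health checks; they are dropped only once they are unhealthy.
    retained = {h for h in current if h not in discovered and h in healthy}
    return set(discovered) | retained


# An empty lookup result does not evict hosts that are still healthy.
kept = reconcile({"10.0.0.1", "10.0.0.2"}, set(), {"10.0.0.1", "10.0.0.2"})
# A host that is both undiscovered and unhealthy is removed.
pruned = reconcile({"10.0.0.1", "10.0.0.2"}, set(), {"10.0.0.1"})
```

Under this model, a bad lookup empties the cluster only if the hosts are also failing their health checks at the same time.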

@junr03 fixed this recently.
