Envoy: Envoy stops using dns_resolvers from cluster after network restart

Created on 28 Oct 2020 · 6 comments · Source: envoyproxy/envoy

Description:
We run standalone Envoy on CentOS 8 in a Kubernetes installation. When the network on a node is restarted (`systemctl restart network.service`, for example), Envoy stops querying the resolvers defined in the `dns_resolvers` section and starts using the system resolver (the DNS servers defined in /etc/resolv.conf). From the documentation it looks like the behavior is reset to the default resolver:
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster.proto

> dns_resolvers
>
> If this setting is not specified, the value defaults to the default resolver, which uses /etc/resolv.conf for configuration.

After a network restart, Envoy no longer uses the configured resolvers (until the Envoy instance is restarted), and after pod state changes in k8s (a deploy), the upstreams are no longer available.
Is there a way to prevent this behavior?

*Repro steps*:
1. `systemctl restart network`
2. tcpdump the DNS queries; try to change the upstream count/address.

*Config*:

```yaml
clusters:
- name: somename
  alt_stat_name: somename
  connect_timeout: 1s
  common_http_protocol_options: {idle_timeout: 5s}
  type: STRICT_DNS
  dns_lookup_family: 1
  use_tcp_for_dns_lookups: true
  load_assignment:
    cluster_name: somename
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: somename-headless.namespace.svc.cluster.local.
              port_value: 80
  dns_resolvers:
  - {socket_address: {address: "10.0.0.1", port_value: 53}}
  - {socket_address: {address: "10.0.0.2", port_value: 53}}
  - {socket_address: {address: "10.0.0.3", port_value: 53}}
  dns_refresh_rate: 1s
  health_checks:
  - tcp_health_check: {send: {binary: ""}}
    interval:
      nanos: 300000000
    timeout:
      seconds: 1
    unhealthy_threshold:
      value: 3
    healthy_threshold:
      value: 1
```

*Logs*:

```
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:167] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., refresh rate 1000 ms
[debug][upstream] [source/common/upstream/upstream_impl.cc:286] transport socket match, socket default selected for host with address 10.0.1.1:80
[debug][upstream] [source/common/upstream/upstream_impl.cc:286] transport socket match, socket default selected for host with address 10.0.1.2:80
[debug][upstream] [source/common/upstream/upstream_impl.cc:286] transport socket match, socket default selected for host with address 10.0.1.3:80
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:167] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., refresh rate 1000 ms
[debug][main] [source/server/server.cc:190] flushing stats
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][main] [source/server/server.cc:190] flushing stats
```
Versions:
OS: CentOS 8.2.2004
Envoy: 1.16

area/dns bug

All 6 comments

@junr03 for any comments

@rumanzo I see the bug. The problem is that the c-ares channel only gets pointed at the configured resolvers on construction. However, if there is a subsequent channel destruction (due to a network error, for instance), the channel is not pointed back at the custom resolvers.

I will put up a PR to fix this ASAP.
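
For context, the pattern described above looks roughly like the sketch below. It uses real c-ares calls (`ares_init_options`, `ares_set_servers_ports_csv`, `ares_destroy`), but the `ResolverChannel` class and its method names are hypothetical illustration names, not Envoy's actual code; see the PR referenced below for the real fix.

```cpp
// Minimal sketch of the bug pattern, assuming a c-ares based resolver.
// ResolverChannel and its methods are hypothetical names for illustration,
// error handling is omitted, and ares_library_init(ARES_LIB_INIT_ALL) is
// assumed to have been called once per process.
#include <ares.h>

#include <string>
#include <utility>

class ResolverChannel {
public:
  // `resolvers_csv` is a "host:port,host:port" list such as
  // "10.0.0.1:53,10.0.0.2:53" (mirroring dns_resolvers in the config above).
  explicit ResolverChannel(std::string resolvers_csv)
      : resolvers_csv_(std::move(resolvers_csv)) {
    initChannel();
  }
  ~ResolverChannel() { ares_destroy(channel_); }

  // Called when c-ares reports the channel is unusable, e.g. after the
  // node's network is restarted. The bug: the old code rebuilt the channel
  // without re-applying the configured servers, so c-ares silently fell
  // back to /etc/resolv.conf. Re-running the full init (below) so the
  // custom servers are re-applied is the shape of the fix.
  void recreateChannel() {
    ares_destroy(channel_);
    initChannel();
  }

private:
  void initChannel() {
    ares_options options{}; // defaults; a real resolver also sets timeouts, flags, etc.
    ares_init_options(&channel_, &options, /*optmask=*/0);
    if (!resolvers_csv_.empty()) {
      // This call points the channel at the custom resolvers. If it only
      // ever happens in the constructor, every later re-init loses them.
      ares_set_servers_ports_csv(channel_, resolvers_csv_.c_str());
    }
  }

  ares_channel channel_{nullptr};
  const std::string resolvers_csv_;
};
```

The underlying design point: the custom server list lives outside the c-ares channel, so every code path that destroys and recreates the channel has to re-apply it explicitly; on a bare init, c-ares reads /etc/resolv.conf.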

@rumanzo Fixed in #13820. Give it a try if you want to verify locally; I am confident this is your issue.

(Sorry @yanavlasov, I removed the wrong label by accident; I intended to remove "triage" but ended up removing "bug". This is definitely a bug; updated.)

I checked it, and the fix works. Thank you, that was very fast.
When will the new release be (1.16.x or 1.17.0)?

Nice! We don't do releases for bug fixes like this. Most likely the next release won't happen for a few months given 1.16 came out at the beginning of October, and we do releases every quarter or so. Sorry about that!
