Envoy: Envoy stops using dns_resolvers from cluster after network restart

Created on 28 Oct 2020 · 6 comments · Source: envoyproxy/envoy

Description:
We run standalone Envoy on CentOS 8 in a Kubernetes installation. When the network on a node is restarted (`systemctl restart network.service`, for example), Envoy stops querying the resolvers defined in the `dns_resolvers` section and starts using the system resolver (the DNS servers defined in /etc/resolv.conf). From the documentation it looks like the behavior is reset to the default resolver:
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster.proto

> dns_resolvers
>
> If this setting is not specified, the value defaults to the default resolver, which uses /etc/resolv.conf for configuration.

After a network restart, Envoy no longer uses the configured resolvers (until the Envoy instance is restarted), and after pod state changes in k8s (a deploy), the upstreams are no longer available.
Is there a way to prevent this behavior?

*Repro steps*:
1. `systemctl restart network`
2. tcpdump the DNS queries; try to change the upstream count/address.

*Config*:

```yaml
clusters:
- name: somename
  alt_stat_name: somename
  connect_timeout: 1s
  common_http_protocol_options: {idle_timeout: 5s}
  type: STRICT_DNS
  dns_lookup_family: 1
  use_tcp_for_dns_lookups: true
  load_assignment:
    cluster_name: somename
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: somename-headless.namespace.svc.cluster.local.
              port_value: 80
  dns_resolvers:
  - {socket_address: {address: "10.0.0.1", port_value: 53}}
  - {socket_address: {address: "10.0.0.2", port_value: 53}}
  - {socket_address: {address: "10.0.0.3", port_value: 53}}
  dns_refresh_rate: 1s
  health_checks:
  - tcp_health_check: {send: {binary: ""}}
    interval:
      nanos: 300000000
    timeout:
      seconds: 1
    unhealthy_threshold:
      value: 3
    healthy_threshold:
      value: 1
```

*Logs*:

```
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:167] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., refresh rate 1000 ms
[debug][upstream] [source/common/upstream/upstream_impl.cc:286] transport socket match, socket default selected for host with address 10.0.1.1:80
[debug][upstream] [source/common/upstream/upstream_impl.cc:286] transport socket match, socket default selected for host with address 10.0.1.2:80
[debug][upstream] [source/common/upstream/upstream_impl.cc:286] transport socket match, socket default selected for host with address 10.0.1.3:80
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:167] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., refresh rate 1000 ms
[debug][main] [source/server/server.cc:190] flushing stats
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][upstream] [source/common/upstream/strict_dns_cluster.cc:174] DNS refresh rate reset for somename-headless.namespace.svc.cluster.local., (failure) refresh rate 1000 ms
[debug][main] [source/server/server.cc:190] flushing stats
```
Versions:
OS: CentOS 8.2.2004
Envoy: 1.16

area/dns bug

All 6 comments

@junr03 for any comments

@rumanzo I see the bug. The problem is that the c-ares channel only gets pointed at the configured resolvers on construction. However, if there is a subsequent channel destruction (due to a network error, for instance), the channel is not pointed back at the custom resolvers.

I will put up a PR to fix this ASAP.
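
For context, the pattern described above looks roughly like the sketch below. It uses real c-ares calls (`ares_init_options`, `ares_set_servers_ports_csv`, `ares_destroy`), but the `ResolverChannel` class and its method names are hypothetical illustration names, not Envoy's actual code; see the PR referenced below for the real fix.

```cpp
// Minimal sketch of the bug pattern, assuming a c-ares based resolver.
// ResolverChannel and its methods are hypothetical names for illustration,
// error handling is omitted, and ares_library_init(ARES_LIB_INIT_ALL) is
// assumed to have been called once per process.
#include <ares.h>

#include <string>
#include <utility>

class ResolverChannel {
public:
  // `resolvers_csv` is a "host:port,host:port" list such as
  // "10.0.0.1:53,10.0.0.2:53" (mirroring dns_resolvers in the config above).
  explicit ResolverChannel(std::string resolvers_csv)
      : resolvers_csv_(std::move(resolvers_csv)) {
    initChannel();
  }
  ~ResolverChannel() { ares_destroy(channel_); }

  // Called when c-ares reports the channel is unusable, e.g. after the
  // node's network is restarted. The bug: the old code rebuilt the channel
  // without re-applying the configured servers, so c-ares silently fell
  // back to /etc/resolv.conf. Re-running the full init (below) so the
  // custom servers are re-applied is the shape of the fix.
  void recreateChannel() {
    ares_destroy(channel_);
    initChannel();
  }

private:
  void initChannel() {
    ares_options options{}; // defaults; a real resolver also sets timeouts, flags, etc.
    ares_init_options(&channel_, &options, /*optmask=*/0);
    if (!resolvers_csv_.empty()) {
      // This call points the channel at the custom resolvers. If it only
      // ever happens in the constructor, every later re-init loses them.
      ares_set_servers_ports_csv(channel_, resolvers_csv_.c_str());
    }
  }

  ares_channel channel_{nullptr};
  const std::string resolvers_csv_;
};
```

The underlying design point: the custom server list lives outside the c-ares channel, so every code path that destroys and recreates the channel has to re-apply it explicitly; on a bare init, c-ares reads /etc/resolv.conf.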

@rumanzo Fixed in #13820. Give it a try if you want to verify locally; I am confident this is your issue.

(Sorry @yanavlasov, I removed the wrong label by accident; I intended to remove "triage" but ended up removing "bug". This is definitely a bug; updated.)

I checked it, and the fix works. Thank you, that was very fast.
When will the new release be (1.16.x or 1.17.0)?

Nice! We don't do releases for bug fixes like this. Most likely the next release won't happen for a few months given 1.16 came out at the beginning of October, and we do releases every quarter or so. Sorry about that!
