ADS Server Reconnect Never Occurs After Hard Reboot: When the ADS server is abruptly rebooted with sudo reboot -f, the envoy proxy does not attempt to reconnect to the ADS hosts. However, when using sudo reboot (without -f), envoy properly attempts a reconnect.
Description:
Running sudo reboot -f on the ADS server (Ubuntu 16.04) causes the envoy proxy to stop attempting to reconnect to the cluster hosts for ADS. It seems like some sort of TCP timeout is not properly handled by envoy. I also don't see any errors in the envoy logs. The envoy /clusters endpoint shows the correct ADS cluster configuration, reflects that the ADS host has moved, and indicates that no connection attempts have been made to the new host. The only known workaround is to restart the envoy proxy.
Config:
dynamic_resources:
  cds_config: {ads: {}}
  lds_config: {ads: {}}
  ads_config:
    api_type: GRPC
    grpc_services:
      envoy_grpc:
        cluster_name: {{ cluster_name }}_ads
static_resources:
  clusters:
  - name: {{ cluster_name }}_ads
    connect_timeout: { seconds: 1 }
    dns_refresh_rate: { seconds: 10 }
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST
    health_checks:
      healthy_threshold: 1
      interval: { seconds: 10 }
      timeout: { seconds: 10 }
      tcp_health_check: {}
      unhealthy_threshold: 1
    http2_protocol_options: {}
    hosts:
    - socket_address:
        address: ads.envoy.marathon.slave.mesos
        port_value: 9902
/clusters endpoint:
...
system-test_ads::default_priority::max_connections::1024
system-test_ads::default_priority::max_pending_requests::1024
system-test_ads::default_priority::max_requests::1024
system-test_ads::default_priority::max_retries::3
system-test_ads::high_priority::max_connections::1024
system-test_ads::high_priority::max_pending_requests::1024
system-test_ads::high_priority::max_requests::1024
system-test_ads::high_priority::max_retries::3
system-test_ads::added_via_api::false
system-test_ads::172.27.1.192:9902::cx_active::0
system-test_ads::172.27.1.192:9902::cx_connect_fail::0
system-test_ads::172.27.1.192:9902::cx_total::0
system-test_ads::172.27.1.192:9902::rq_active::0
system-test_ads::172.27.1.192:9902::rq_error::0
system-test_ads::172.27.1.192:9902::rq_success::0
system-test_ads::172.27.1.192:9902::rq_timeout::0
system-test_ads::172.27.1.192:9902::rq_total::0
system-test_ads::172.27.1.192:9902::health_flags::healthy
system-test_ads::172.27.1.192:9902::weight::1
system-test_ads::172.27.1.192:9902::region::
system-test_ads::172.27.1.192:9902::zone::
system-test_ads::172.27.1.192:9902::sub_zone::
system-test_ads::172.27.1.192:9902::canary::false
system-test_ads::172.27.1.192:9902::success_rate::-1
...
Environment:
OS: Ubuntu 16.04
Envoy Version: v1.8.0
The behavior suggests that envoy's cluster connection manager fails to tolerate a TCP connection loss.
Is it possible that a TTL or timeout is missing from the xDS configuration and is therefore never being set?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
@mattklein123 this still seems to be an issue with Envoy Version: v1.9.0. Can you reopen this issue and add the "help wanted" label? Thanks.
This usually happens because of a half-open TCP connection, which Envoy does not detect on its own. ADS is particularly prone to it because the stream can sit idle for long periods and, once it is established, the ADS server is the side that initiates communication by pushing snapshot changes.
We ran into a similar issue and fixed it by enabling TCP keepalive on the ADS cluster.
static_resources:
  clusters:
  - name: {{ cluster_name }}_ads
    connect_timeout: { seconds: 1 }
    dns_refresh_rate: { seconds: 10 }
    type: STRICT_DNS
    ...
    upstream_connection_options:
      tcp_keepalive:
        keepalive_probes: 3
        keepalive_time: 30
        keepalive_interval: 5
This is especially bad for ADS because Envoy will merrily think it's still connected while the connection is half-open, and the connection can never be re-established from the server side: Envoy is the one that connects to the ADS server, not the other way round.
I'll submit a patch to update some of the docs to note this fact.
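For anyone wondering what the three keepalive values above mean in practice, here is a minimal Python sketch of the equivalent socket-level settings. It assumes that keepalive_time, keepalive_interval and keepalive_probes correspond to the Linux TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options; the helper names are made up for illustration and are not part of Envoy. With the values from the config, a dead peer on an idle connection should be noticed after roughly 30 + 3 * 5 = 45 seconds.

```python
import socket


def keepalive_detection_seconds(keepalive_time, keepalive_interval, keepalive_probes):
    """Approximate upper bound, in seconds, before an idle dead peer is detected.

    The kernel waits keepalive_time seconds of idleness, then sends up to
    keepalive_probes probes spaced keepalive_interval seconds apart.
    """
    return keepalive_time + keepalive_interval * keepalive_probes


def enable_keepalive(sock, keepalive_time=30, keepalive_interval=5, keepalive_probes=3):
    """Turn on TCP keepalive for a socket (hypothetical helper, not an Envoy API)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options are platform specific (present on Linux),
    # so only set the ones this platform exposes.
    options = (
        ("TCP_KEEPIDLE", keepalive_time),
        ("TCP_KEEPINTVL", keepalive_interval),
        ("TCP_KEEPCNT", keepalive_probes),
    )
    for name, value in options:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(keepalive_detection_seconds(30, 5, 3))
    s.close()
```

Lower values detect a dead ADS server sooner at the cost of a little more probe traffic on idle connections.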
Thanks @suhailpatel, we'll give this a try and see if it solves our issue.
Note for anyone searching: this issue also applies to non-ADS xDS.
@zsilver @zanes2016 Would be interested to hear if the suggested config solved your issues?
@suhailpatel yup this worked for us, thanks!