Envoy: ADS Server Reconnect Never Occurs After Hard Reboot

Created on 30 Nov 2018 · 9 comments · Source: envoyproxy/envoy

When the ADS server is abruptly rebooted with `sudo reboot -f`, the Envoy proxy does not attempt to reconnect to the ADS hosts. However, with `sudo reboot` (without `-f`), Envoy properly attempts a reconnect.

Description:

Running `sudo reboot -f` on the ADS server (Ubuntu 16.04) causes the Envoy proxy to stop attempting reconnects to the ADS cluster hosts. It appears that some kind of TCP timeout is not handled properly in Envoy, and there is nothing relevant in the Envoy logs. The Envoy `/clusters` admin endpoint shows the correct ADS cluster configuration, including that the ADS host has moved, yet it indicates that no connection attempts have been made. The only known workaround is to restart the Envoy proxy.

Config:

    dynamic_resources:
      cds_config: {ads: {}}
      lds_config: {ads: {}}
      ads_config:
        api_type: GRPC
        grpc_services:
          envoy_grpc:
            cluster_name: {{ cluster_name }}_ads

    static_resources:
      clusters:
      - name: {{ cluster_name }}_ads
        connect_timeout: { seconds: 1 }
        dns_refresh_rate: { seconds: 10 }
        type: STRICT_DNS
        lb_policy: LEAST_REQUEST
        health_checks:
        - healthy_threshold: 1
          interval: { seconds: 10 }
          timeout: { seconds: 10 }
          tcp_health_check: {}
          unhealthy_threshold: 1
        http2_protocol_options: {}
        hosts:
        - socket_address:
            address: ads.envoy.marathon.slave.mesos
            port_value: 9902

Output of the `/clusters` admin endpoint:

...
system-test_ads::default_priority::max_connections::1024
system-test_ads::default_priority::max_pending_requests::1024
system-test_ads::default_priority::max_requests::1024
system-test_ads::default_priority::max_retries::3
system-test_ads::high_priority::max_connections::1024
system-test_ads::high_priority::max_pending_requests::1024
system-test_ads::high_priority::max_requests::1024
system-test_ads::high_priority::max_retries::3
system-test_ads::added_via_api::false
system-test_ads::172.27.1.192:9902::cx_active::0
system-test_ads::172.27.1.192:9902::cx_connect_fail::0
system-test_ads::172.27.1.192:9902::cx_total::0
system-test_ads::172.27.1.192:9902::rq_active::0
system-test_ads::172.27.1.192:9902::rq_error::0
system-test_ads::172.27.1.192:9902::rq_success::0
system-test_ads::172.27.1.192:9902::rq_timeout::0
system-test_ads::172.27.1.192:9902::rq_total::0
system-test_ads::172.27.1.192:9902::health_flags::healthy
system-test_ads::172.27.1.192:9902::weight::1
system-test_ads::172.27.1.192:9902::region::
system-test_ads::172.27.1.192:9902::zone::
system-test_ads::172.27.1.192:9902::sub_zone::
system-test_ads::172.27.1.192:9902::canary::false
system-test_ads::172.27.1.192:9902::success_rate::-1
...
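The `::`-delimited stat lines above follow a `prefix::...::stat_name::value` shape, so checking whether `cx_total` is still stuck at 0 after a reboot can be scripted. A minimal sketch (the parsing helper is my own, not part of Envoy):

```python
def parse_clusters_output(text):
    """Parse Envoy /clusters admin output lines of the form
    prefix::...::stat_name::value into {prefix: {stat_name: value}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split("::")
        if len(parts) < 3:
            continue  # skip lines that are not stat entries
        *prefix, name, value = parts
        stats.setdefault("::".join(prefix), {})[name] = value
    return stats

sample = """system-test_ads::172.27.1.192:9902::cx_total::0
system-test_ads::172.27.1.192:9902::health_flags::healthy"""
parsed = parse_clusters_output(sample)
print(parsed["system-test_ads::172.27.1.192:9902"]["cx_total"])  # → 0
```

In the output above the host reports `health_flags::healthy` yet `cx_total::0`, i.e. Envoy considers it healthy while having made no connection attempts at all.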

Environment:

OS: Ubuntu 16.04
Envoy Version: v1.8.0

Labels: help wanted, question


All 9 comments

The behavior suggests that envoy's cluster connection manager fails to tolerate a TCP connection loss.

Is it possible that a TTL or timeout is missing from the xDS configuration and simply not being set?

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@mattklein123 this still seems to be an issue with Envoy Version: v1.9.0. Can you reopen this issue and add the "help wanted" label? Thanks.

This usually happens because of a TCP half-open connection, and there is no detection of this condition. ADS is particularly prone to it because of long periods of idleness, since communication is initiated by the ADS server pushing snapshot changes.

We ran into a similar issue and we fixed it by enabling a TCP keep alive.

    static_resources:
      clusters:
      - name: {{ cluster_name }}_ads
        connect_timeout: { seconds: 1 }
        dns_refresh_rate: { seconds: 10 }
        type: STRICT_DNS
        ...
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 30
            keepalive_interval: 5

This is especially bad for ADS because Envoy will merrily believe it is still connected over the half-open TCP connection, and there is no way to re-establish it, since Envoy is the one that connects to the ADS server, not the other way round.
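The `tcp_keepalive` settings map onto the kernel's TCP keepalive socket options. As a rough illustration of what ends up set on the upstream socket (a Python sketch using the Linux option names; this is not Envoy's actual code):

```python
import socket

def keepalive_socket(keepalive_time=30, keepalive_interval=5, keepalive_probes=3):
    """Open a TCP socket with keepalive settings analogous to the
    tcp_keepalive config above (Linux-specific socket options)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # keepalive_time: seconds of idleness before the first probe is sent
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepalive_time)
    # keepalive_interval: seconds between unacknowledged probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, keepalive_interval)
    # keepalive_probes: failed probes before the kernel resets the connection
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, keepalive_probes)
    return s
```

With these values, a half-open connection is torn down after roughly `keepalive_time + keepalive_probes * keepalive_interval` seconds of silence (30 + 3 × 5 = 45s here), at which point Envoy sees the disconnect and can re-establish the ADS stream.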

I'll submit a patch to update some of the docs to note this fact.

Thanks @suhailpatel , we'll give this a try and see if it solves our issue.

Note that for searching this issue also applies to non-ADS xDS.

@zsilver @zanes2016 Would be interested to hear if the suggested config solved your issues?


@suhailpatel yup this worked for us, thanks!
