Description:
Envoy does not initiate a connection to LDS during startup.
Repro steps:
Reproduced with 1.6.0 and master.
We deploy Envoy to talk to a management server on 127.0.0.1:8001. If the management server is available before Envoy starts, that is, if Envoy is able to get CDS data during startup rather than entering a retry loop, then Envoy will not initiate an LDS connection.
listener_manager.lds.update_attempt: 0
listener_manager.lds.update_failure: 0
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 0
listener_manager.lds.version: 0
listener_manager.listener_added: 0
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 0
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 0
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
This is confirmed by /stats, which shows that no LDS connections have been attempted. Our management server also reports no LDS connections for {Fetch,Stream}Listeners.
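For completeness, this is roughly how we pull those counters from the admin endpoint. It is a small illustrative helper, not part of our deployment; it assumes the admin listener on 127.0.0.1:9001 from the bootstrap below and simply filters the plain-text /stats output.

```go
// statcheck fetches Envoy's admin /stats endpoint and prints the
// listener_manager.lds counters. Purely illustrative; assumes the admin
// listener is bound to 127.0.0.1:9001 as in the bootstrap below.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:9001/stats")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// /stats returns one "name: value" pair per line.
		if strings.HasPrefix(line, "listener_manager.lds.") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```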
Config:
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
  cds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
static_resources:
  clusters:
  - name: contour
    connect_timeout: 0.250s
    type: STRICT_DNS
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 8001
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    circuit_breakers:
      thresholds:
      - priority: high
        max_connections: 100000
        max_pending_requests: 100000
        max_requests: 60000000
        max_retries: 50
      - priority: default
        max_connections: 100000
        max_pending_requests: 100000
        max_requests: 60000000
        max_retries: 50
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9001
Logs:
https://gist.github.com/davecheney/bc158c760740928d10b048e14b5fdb95
Additionally, if I remove the CDS config from our bootstrap, leaving only LDS, then LDS is probed as expected:
listener_manager.lds.update_attempt: 2
listener_manager.lds.update_failure: 0
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 1
listener_manager.lds.version: 7148434200721666028
listener_manager.listener_added: 2
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 2
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 2
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
Logs:
https://gist.github.com/davecheney/5c13be53694b49d754e75acba8f7ee27
My best layperson's attempt at debugging this so far:
a. If the management server is not available, Envoy moves quickly through static cluster initialisation:
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:682] adding TLS initial cluster contour
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:767] membership update for TLS cluster contour
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:88] cm init: init complete: cluster=contour primary=0 secondary=0
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:60] cm init: adding: cluster=contour primary=0 secondary=0
[2018-06-04 04:38:13.408][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
then waits for the CDS/LDS streams to connect. This is the happy path; it is not what is happening here.
b. What it looks like is that _iff_ the management server is available immediately, CDS is initialised before the CDS/LDS streams are started:
[2018-06-04 07:49:20.648][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:128] cm init: initializing cds
[2018-06-04 07:49:20.648][1][debug][upstream] source/common/config/grpc_mux_impl.cc:103] gRPC mux subscribe for type.googleapis.com/envoy.api.v2.Cluster
[2018-06-04 07:49:20.648][1][debug][upstream] source/common/config/grpc_mux_impl.cc:51] No stream available to sendDiscoveryRequest for type.googleapis.com/envoy.api.v2.Cluster
[2018-06-04 07:49:20.648][1][debug][upstream] source/common/config/grpc_mux_impl.cc:36] Establishing new gRPC bidi stream for rpc StreamClusters(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
and at some point it looks like my management server stalls:
[2018-06-04 07:49:30.686][1][debug][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "kube-system/kubernetes-dashboard"
endpoints {
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.48.0.8"
          port_value: 8443
        }
      }
    }
  }
}
]
[2018-06-04 07:49:30.686][1][debug][router] source/common/router/router.cc:568] [C0][S10440692737244005597] upstream headers complete: end_stream=false
[2018-06-04 07:49:30.686][1][debug][http] source/common/http/async_client_impl.cc:92] async http request response headers (end_stream=false):
':status', '200'
'content-type', 'application/grpc'
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/config/grpc_mux_impl.cc:160] Received gRPC message for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment at version 0
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/upstream/eds.cc:154] EDS hosts or locality weights changed for cluster: kube-system/metrics-server/443 (0) priority 0
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:767] membership update for TLS cluster kube-system/metrics-server/443
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:88] cm init: init complete: cluster=kube-system/metrics-server/443 primary=0 secondary=2
[2018-06-04 07:49:30.686][1][debug][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "kube-system/metrics-server"
endpoints {
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.48.1.7"
          port_value: 443
        }
      }
    }
  }
}
]
^ Note the lack of the
[2018-06-04 07:49:30.686][1][debug][router] source/common/router/router.cc:568] [C0][S10440692737244005597] upstream headers complete: end_stream=false
[2018-06-04 07:49:30.686][1][debug][http] source/common/http/async_client_impl.cc:92] async http request response headers (end_stream=false):
messages after this second response.
@davecheney yeah, that is expected. Envoy does initialization in a bunch of different phases, and CDS initialization is done prior to LDS because LDS might depend on the initial CDS response. If the management server is hanging during CDS, it will never get to LDS and start up.
@davecheney here is a similar issue we faced in Istio: https://github.com/istio/istio/issues/5391
@mandarjog yup, that's it.
After sleeping on the issue, the problem was this:
Normally my implementation of StreamEndpoints does not return any results if there are no endpoints registered. This seemed like it was OK: a cluster would stay in warming until there was a response from EDS, meaning there was at least one endpoint to connect to.
However, because of this behaviour, during initial startup CDS would request an EDS response for each registered cluster and startup would stall, because rather than returning zero endpoints my server was not responding at all.
With that fixed Envoy starts up immediately as expected.
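For anyone hitting the same thing, the shape of the fix is sketched below. It is a minimal illustration, not Contour's actual code: the loadAssignments helper and the known map are hypothetical, and it assumes go-control-plane's envoy/api/v2 generated types. The point is simply that StreamEndpoints should answer every cluster Envoy asks for, sending an empty ClusterLoadAssignment when nothing is registered yet, rather than withholding the response.

```go
// Sketch only: build one ClusterLoadAssignment per cluster Envoy requested.
// Clusters with no registered endpoints get an empty assignment instead of
// no response at all, so Envoy's CDS/warming phase can complete and move on
// to LDS. Assumes go-control-plane's envoy/api/v2 generated types; the
// function name and the known map are hypothetical, not Contour's real code.
package eds

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

func loadAssignments(req *v2.DiscoveryRequest, known map[string]*v2.ClusterLoadAssignment) []*v2.ClusterLoadAssignment {
	out := make([]*v2.ClusterLoadAssignment, 0, len(req.ResourceNames))
	for _, name := range req.ResourceNames {
		if cla, ok := known[name]; ok {
			out = append(out, cla)
			continue
		}
		// Nothing registered for this cluster yet: answer with an empty
		// ClusterLoadAssignment rather than stalling the stream.
		out = append(out, &v2.ClusterLoadAssignment{ClusterName: name})
	}
	return out
}
```

Each of those assignments is then marshalled into the DiscoveryResponse for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment as usual.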
@mattklein123 @mandarjog thanks again for your assistance.
Hello @davecheney, in which version of Envoy is this fixed?
I saw https://github.com/projectcontour/contour/pull/308/files was merged,
but I'm not sure when it was released.
Thanks in advance!