Description:
Envoy does not initiate a connection to LDS during startup.
Repro steps:
Reproduced with 1.6.0 and master.
We deploy Envoy to talk to a management server on 127.0.0.1:8001. If the management server is available before Envoy starts, that is, if Envoy is able to get CDS data during startup rather than entering a retry loop, then Envoy will not initiate an LDS connection.
listener_manager.lds.update_attempt: 0
listener_manager.lds.update_failure: 0
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 0
listener_manager.lds.version: 0
listener_manager.listener_added: 0
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 0
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 0
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
This is confirmed by /stats, which shows that no LDS connections have been attempted. Our management server also reports no LDS connections for {Fetch,Stream}Listeners.
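For completeness, this is roughly how we pull those counters from the admin endpoint. It is a small illustrative helper, not part of our deployment; it assumes the admin listener on 127.0.0.1:9001 from the bootstrap below and simply filters the plain-text /stats output.

```go
// statcheck fetches Envoy's admin /stats endpoint and prints the
// listener_manager.lds counters. Purely illustrative; assumes the admin
// listener is bound to 127.0.0.1:9001 as in the bootstrap below.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:9001/stats")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// /stats returns one "name: value" pair per line.
		if strings.HasPrefix(line, "listener_manager.lds.") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```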
Config:
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
  cds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [contour]
      grpc_services:
      - envoy_grpc:
          cluster_name: contour
static_resources:
  clusters:
  - name: contour
    connect_timeout: 0.250s
    type: STRICT_DNS
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 8001
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    circuit_breakers:
      thresholds:
      - priority: high
        max_connections: 100000
        max_pending_requests: 100000
        max_requests: 60000000
        max_retries: 50
      - priority: default
        max_connections: 100000
        max_pending_requests: 100000
        max_requests: 60000000
        max_retries: 50
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9001
Logs:
https://gist.github.com/davecheney/bc158c760740928d10b048e14b5fdb95
Additionally, if I remove the CDS config from our bootstrap, leaving only LDS, then LDS is probed as expected:
listener_manager.lds.update_attempt: 2
listener_manager.lds.update_failure: 0
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 1
listener_manager.lds.version: 7148434200721666028
listener_manager.listener_added: 2
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 2
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 2
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
Logs:
https://gist.github.com/davecheney/5c13be53694b49d754e75acba8f7ee27
My best layperson's attempt at debugging this so far:
a. If the management server is not available, Envoy moves quickly through static cluster initialisation:
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:682] adding TLS initial cluster contour
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:767] membership update for TLS cluster contour
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:88] cm init: init complete: cluster=contour primary=0 secondary=0
[2018-06-04 04:38:13.408][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:60] cm init: adding: cluster=contour primary=0 secondary=0
[2018-06-04 04:38:13.408][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
then waits for the CDS/LDS streams to connect. This is the happy path; it is not what is happening here.
b. What it looks like is that _iff_ the management server is available immediately, CDS is initialised before the CDS/LDS streams are started:
[2018-06-04 07:49:20.648][1][info][upstream] source/common/upstream/cluster_manager_impl.cc:128] cm init: initializing cds
[2018-06-04 07:49:20.648][1][debug][upstream] source/common/config/grpc_mux_impl.cc:103] gRPC mux subscribe for type.googleapis.com/envoy.api.v2.Cluster
[2018-06-04 07:49:20.648][1][debug][upstream] source/common/config/grpc_mux_impl.cc:51] No stream available to sendDiscoveryRequest for type.googleapis.com/envoy.api.v2.Cluster
[2018-06-04 07:49:20.648][1][debug][upstream] source/common/config/grpc_mux_impl.cc:36] Establishing new gRPC bidi stream for rpc StreamClusters(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
and at some point it looks like my management server stalls:
[2018-06-04 07:49:30.686][1][debug][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "kube-system/kubernetes-dashboard"
endpoints {
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.48.0.8"
          port_value: 8443
        }
      }
    }
  }
}
]
[2018-06-04 07:49:30.686][1][debug][router] source/common/router/router.cc:568] [C0][S10440692737244005597] upstream headers complete: end_stream=false
[2018-06-04 07:49:30.686][1][debug][http] source/common/http/async_client_impl.cc:92] async http request response headers (end_stream=false):
':status', '200'
'content-type', 'application/grpc'
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/config/grpc_mux_impl.cc:160] Received gRPC message for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment at version 0
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/upstream/eds.cc:154] EDS hosts or locality weights changed for cluster: kube-system/metrics-server/443 (0) priority 0
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:767] membership update for TLS cluster kube-system/metrics-server/443
[2018-06-04 07:49:30.686][1][debug][upstream] source/common/upstream/cluster_manager_impl.cc:88] cm init: init complete: cluster=kube-system/metrics-server/443 primary=0 secondary=2
[2018-06-04 07:49:30.686][1][debug][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "kube-system/metrics-server"
endpoints {
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.48.1.7"
          port_value: 443
        }
      }
    }
  }
}
]
^ Note the lack of the
[2018-06-04 07:49:30.686][1][debug][router] source/common/router/router.cc:568] [C0][S10440692737244005597] upstream headers complete: end_stream=false
[2018-06-04 07:49:30.686][1][debug][http] source/common/http/async_client_impl.cc:92] async http request response headers (end_stream=false):
messages after this second response.
@davecheney yeah, that is expected. Envoy does initialization in a bunch of different phases, and CDS initialization is done prior to LDS because LDS might depend on the initial CDS response. If the management server is hanging during CDS, it will never get to LDS and start up.
@davecheney here is a similar issue we faced in Istio: https://github.com/istio/istio/issues/5391
@mandarjog yup, that's it.
After sleeping on the issue, the problem was this:
Normally my implementation of StreamEndpoints does not return any results if there are no endpoints registered. This seemed like it was OK: a cluster would stay in warming until there was a response from EDS, meaning there was at least one endpoint to connect to.
However, because of this behaviour, during initial startup CDS would request an EDS response for each registered cluster and startup would stall, because rather than returning zero endpoints my server was not responding at all.
With that fixed Envoy starts up immediately as expected.
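For anyone hitting the same thing, the shape of the fix is sketched below. It is a minimal illustration, not Contour's actual code: the loadAssignments helper and the known map are hypothetical, and it assumes go-control-plane's envoy/api/v2 generated types. The point is simply that StreamEndpoints should answer every cluster Envoy asks for, sending an empty ClusterLoadAssignment when nothing is registered yet, rather than withholding the response.

```go
// Sketch only: build one ClusterLoadAssignment per cluster Envoy requested.
// Clusters with no registered endpoints get an empty assignment instead of
// no response at all, so Envoy's CDS/warming phase can complete and move on
// to LDS. Assumes go-control-plane's envoy/api/v2 generated types; the
// function name and the known map are hypothetical, not Contour's real code.
package eds

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

func loadAssignments(req *v2.DiscoveryRequest, known map[string]*v2.ClusterLoadAssignment) []*v2.ClusterLoadAssignment {
	out := make([]*v2.ClusterLoadAssignment, 0, len(req.ResourceNames))
	for _, name := range req.ResourceNames {
		if cla, ok := known[name]; ok {
			out = append(out, cla)
			continue
		}
		// Nothing registered for this cluster yet: answer with an empty
		// ClusterLoadAssignment rather than stalling the stream.
		out = append(out, &v2.ClusterLoadAssignment{ClusterName: name})
	}
	return out
}
```

Each of those assignments is then marshalled into the DiscoveryResponse for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment as usual.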
@mattklein123 @mandarjog thanks again for your assistance.
Hello @davecheney, in which version of Envoy is this fixed?
I saw https://github.com/projectcontour/contour/pull/308/files was merged,
but I'm not sure when it was released.
Thanks in advance!