When creating clusters that reference SDS certificates, the warming behavior does not seem correct. My expectation is that, until the secret is sent (or the initial_fetch_timeout expires), the cluster will be marked as "warming" and block the rest of initialization from occurring.
What I actually see is that initialization is blocked, but nothing indicates the clusters are warming.
Using this config:
docker run -v $HOME/kube/local:/config -p 15000:15000 envoyproxy/envoy-dev -c /config/envoy-sds-lds.yaml --log-format-prefix-with-location 0 --reject-unknown-dynamic-fields
with envoy version: 49efb9841a58ebdc43a666f55c445911c8e4181c/1.15.0-dev/Clean/RELEASE/BoringSSL
and config files:
cds.yaml:
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: outbound_cluster_tls
  connect_timeout: 5s
  max_requests_per_connection: 1
  load_assignment:
    cluster_name: xds-grpc
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 8080
  type: STATIC
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
        - name: "default"
          sds_config:
            initial_fetch_timeout: 20s
            api_config_source:
              api_type: GRPC
              grpc_services:
              - envoy_grpc:
                  cluster_name: "sds-grpc"
              refresh_delay: 60s
        combined_validation_context:
          default_validation_context: {}
          validation_context_sds_secret_config:
            name: ROOTCA
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc
envoy-sds-lds.yaml:
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 15000
node:
  id: id
  cluster: sdstest
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        envoy_grpc:
          cluster_name: lds
  cds_config:
    path: /config/cds.yaml
static_resources:
  clusters:
  - name: sds-grpc
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
  - name: lds
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
Basically, what should happen here is that we get a dynamic CDS cluster with SDS config. The SDS fetch fails, as the SDS server is not set up. We have initial_fetch_timeout set, so for 20s everything should be warming.
What we see instead:
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 1
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 1
cluster_manager.cds.update_time: 1588972075968
cluster_manager.cds.version: 17241709254077376921
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 0
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0
We also see cluster_manager.cds.init_fetch_timeout is 0; this does not change after 20s.
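For reference, these stats come from the admin interface; they can be re-checked while Envoy is still inside the 20s window with something like the following (assuming the admin port 15000 from the config above):

# Watch the warming gauge and the init fetch timeout counter during startup.
curl -s http://localhost:15000/stats | grep -E 'warming_clusters|init_fetch_timeout'

Log output from the same run: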
[2020-05-08 21:07:55.967][1][info][upstream] cds: add 1 cluster(s), remove 2 cluster(s)
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][info][upstream] cds: add/update cluster 'outbound_cluster_tls'
[2020-05-08 21:07:55.968][1][info][main] starting main dispatch loop
[2020-05-08 21:07:56.703][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.703][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:56.938][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.938][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.135][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.135][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.682][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.682][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:58.671][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:58.671][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:08.992][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:08:08.992][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:15.967][1][info][upstream] cm init: all clusters initialized
[2020-05-08 21:08:15.967][1][info][main] all clusters initialized. initializing init manager
[2020-05-08 21:08:15.967][1][warning][config] StreamListeners gRPC config stream closed: 14, no healthy upstream
dynamic_active_clusters shows the cluster from cds.yaml; I would expect it to be listed as "warming". The example above is meant to be a simplified reproduction; I originally saw this with a normal deployment using an ADS gRPC server (Istio), not just files.
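For reference, the split between dynamic_active_clusters and dynamic_warming_clusters is visible in the admin config_dump output; a rough way to inspect it (assuming jq is available):

# Show which list the CDS cluster ends up in while the secret is still missing.
curl -s http://localhost:15000/config_dump | jq '.configs[] | select(."@type" | contains("ClustersConfigDump")) | {dynamic_active_clusters, dynamic_warming_clusters}'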
/cc @JimmyCYJ
@Shikugawa would you be able to help investigate this?
@dio Yes, I'll investigate this problem.
I suspect this causes other issues as well. We are seeing that if we do not include SDS config on the XDS connection, SDS eventually becomes permanently broken: clusters that reference SDS secrets are stuck warming forever.
We have two secrets across all of the config: the client cert and the root cert. If we add the client cert to the XDS cluster (forcing SDS to start before XDS), we never see the issue with the client cert. However, the root cert still sometimes gets stuck warming forever.
More info: https://github.com/istio/istio/issues/22443.
I am not sure if it's the same root cause, but it seems related.
@howardjohn Got it. I think both warming_clusters and initial_fetch_timeout are broken: neither the warming gauge nor the init_fetch_timeout counter is ever incremented. Your problem is probably caused by requesting secrets from an inactive SDS cluster. So we have two problems related to this issue. Is this right?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
Your problem is probably caused by requesting secrets from an inactive SDS cluster.
It's not just the SDS cluster being inactive; the SDS cluster could be active but not return any secrets.
@lambdai @JimmyCYJ can you look at this? We do not want Envoy to declare itself ready before it is.
Let me rephrase to see if we are on the same page.
My understanding is that "warming cluster shows ready" is the only error.
IMHO the cluster becomes active early because SDS set a 20s fetch timeout. That is exactly how the 20s fetch timeout is supposed to work.
We may be saying the same thing, not sure. But I think it's more like:
@howardjohn I see the stat is the only liar. I think Envoy is declaring active too early in the stats, not only for clusters but also for listeners.
However, the sequence is working as expected. The initial_fetch_timeout is supposed to unblock initialization by announcing itself "ready" or "active".
@mandarjog I think the solution is to disable init_fetch_timeout in SDS if we cannot tolerate fake readiness.
@lambdai I don't think the stats are the only issue. See https://github.com/istio/istio/issues/22443. I have less of a clear reproducer, but basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response. It seems there are larger problems than just the stat, but maybe it's a completely different issue.
Why is the solution to disable it rather than fix the stat?
basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response
Got it. Reading the issue to see if I can help.
If Envoy does want to block initialization, then until SDS is fixed the only solution is to disable the fetch timeout in SDS.
Then the readiness probe should chime in and kill the Istio container, correct?
Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy one.
One thing I do wonder: does the initial_fetch_timeout handling work differently for bootstrap vs. dynamic? Because what we found is that if we request the secret "default" as part of the XDS cluster, everything is OK. If we do not, but the dynamic response requests "default", it occasionally breaks and never sends the SDS request. I never figured out why that is; I assumed it may be some sequencing issue and that using the secret for the XDS cluster linearizes it, but maybe it's bootstrap vs. dynamic?
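For illustration, a rough sketch of what "requesting the secret as part of the XDS cluster" means here, i.e. giving the static xDS cluster in the bootstrap a transport socket that references the same SDS secret (the cluster name xds-grpc, the endpoint address, and the port are placeholders; the secret and SDS cluster names follow the config above):

static_resources:
  clusters:
  - name: xds-grpc                     # placeholder name for the ADS/xDS cluster
    type: STRICT_DNS
    connect_timeout: 5s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: xds-grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xds.example   # placeholder control plane address
                port_value: 15012      # placeholder port
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: "default"            # same secret name the dynamic clusters reference
            sds_config:
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc

With this in the bootstrap, the SDS subscription for "default" is created before any dynamic cluster needs it, which is presumably why the client cert never gets stuck.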
Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy one.
@lambdai @mandarjog Sorry, I missed these discussions! For now, I think the common ground is to add a disable_init_fetch_timeout option on SDS, or a runtime flag. This won't break SDS's default behavior or API compatibility.
Why do we need disable_init_fetch_timeout? I think setting it to 0s already disables it?
Besides, the fact that we want it disabled is an implementation detail of how we use Envoy; this is still a general issue that will impact others.
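For context, this is what "setting it to 0s" would look like (a sketch assuming the documented ConfigSource behavior, where an initial_fetch_timeout of 0 disables the timeout so the subscription keeps blocking initialization until the secret actually arrives):

sds_config:
  initial_fetch_timeout: 0s   # 0 disables the timeout; init stays blocked until the secret is received
  api_config_source:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: sds-grpc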
@howardjohn Got it. Maybe this is caused by the cluster manager's state management. In general, if we start creating an xDS subscription before all clusters are initialized, this problem will occur. So I think that to resolve this, we should fix the init_manager implementation; this is a relatively deep-rooted problem. We should add prevention logic so that the ClusterManager implementation does not use active_clusters_, shouldn't we? cc. @lambdai @mandarjog
Any update on this? This is an intermittent problem on our Istio deployment that stalls pods.
@howardjohn Hey, I'd like to confirm my understanding; I don't completely grasp what the problem is. Our problem is:
the attached cluster becomes active immediately after its SDS subscription sends a DiscoveryRequest to the SDS cluster, instead of staying warming until its init_fetch_timeout, when the DiscoveryResponse from the SDS cluster does not contain the CA. We expect the attached cluster to remain warming in this situation.
Is this what you said?
Yep. The cluster should be warming until the secret is fetched, but it's active immediately.
We also run into this problem intermittently. In our case we stream file-based certs via SDS using Istio. When we have many clusters and the SDS push is delayed for one of them, that cluster is incorrectly marked as active, and requests to it fail with "OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_UNKNOWN" immediately after initialization. The problem is more prominent when there are many clusters.
Any update on this issue?
@tmshort I consider this issue to cover two problems. For part of it, I crafted a PR which is now in review: https://github.com/envoyproxy/envoy/pull/12783
/cc @incfly
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
@lizan Can we close this?
@Shikugawa no this is not fixed, I'll have a fix soon.