When creating clusters that reference SDS certificates, the warming behavior does not seem correct. My expectation is that, until the secret is sent (or the initial_fetch_timeout expires), the cluster will be marked as "warming" and block the rest of initialization from occurring.
What I actually see is that initialization is blocked, but nothing indicates the clusters are warming.
Using this config:
docker run -v $HOME/kube/local:/config -p 15000:15000 envoyproxy/envoy-dev -c /config/envoy-sds-lds.yaml --log-format-prefix-with-location 0 --reject-unknown-dynamic-fields
with envoy version: 49efb9841a58ebdc43a666f55c445911c8e4181c/1.15.0-dev/Clean/RELEASE/BoringSSL
and config files:
cds.yaml:
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: outbound_cluster_tls
  connect_timeout: 5s
  max_requests_per_connection: 1
  load_assignment:
    cluster_name: xds-grpc
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 8080
  type: STATIC
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
        - name: "default"
          sds_config:
            initial_fetch_timeout: 20s
            api_config_source:
              api_type: GRPC
              grpc_services:
              - envoy_grpc:
                  cluster_name: "sds-grpc"
              refresh_delay: 60s
        combined_validation_context:
          default_validation_context: {}
          validation_context_sds_secret_config:
            name: ROOTCA
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc
envoy-sds-lds.yaml:
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 15000
node:
  id: id
  cluster: sdstest
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        envoy_grpc:
          cluster_name: lds
  cds_config:
    path: /config/cds.yaml
static_resources:
  clusters:
  - name: sds-grpc
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
  - name: lds
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
Basically, what should happen here is that we get a dynamic CDS cluster with SDS config. The SDS fetch fails, as the SDS server is not set up. We have initial_fetch_timeout set, so for 20s everything should be warming.
What we see instead:
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 1
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 1
cluster_manager.cds.update_time: 1588972075968
cluster_manager.cds.version: 17241709254077376921
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 0
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0
We also see cluster_manager.cds.init_fetch_timeout is 0; this does not change after 20s.
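For reference, these stats come from the admin interface; they can be re-checked while Envoy is still inside the 20s window with something like the following (assuming the admin port 15000 from the config above):

# Watch the warming gauge and the init fetch timeout counter during startup.
curl -s http://localhost:15000/stats | grep -E 'warming_clusters|init_fetch_timeout'

Log output from the same run: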
[2020-05-08 21:07:55.967][1][info][upstream] cds: add 1 cluster(s), remove 2 cluster(s)
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][info][upstream] cds: add/update cluster 'outbound_cluster_tls'
[2020-05-08 21:07:55.968][1][info][main] starting main dispatch loop
[2020-05-08 21:07:56.703][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.703][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:56.938][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.938][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.135][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.135][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.682][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.682][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:58.671][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:58.671][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:08.992][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:08:08.992][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:15.967][1][info][upstream] cm init: all clusters initialized
[2020-05-08 21:08:15.967][1][info][main] all clusters initialized. initializing init manager
[2020-05-08 21:08:15.967][1][warning][config] StreamListeners gRPC config stream closed: 14, no healthy upstream
dynamic_active_clusters shows the cluster from cds.yaml; I would expect it to be listed as "warming". The example above is meant to be a simplified reproduction; I originally saw this with a normal deployment using an ADS gRPC server (Istio), not just files.
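For reference, the split between dynamic_active_clusters and dynamic_warming_clusters is visible in the admin config_dump output; a rough way to inspect it (assuming jq is available):

# Show which list the CDS cluster ends up in while the secret is still missing.
curl -s http://localhost:15000/config_dump | jq '.configs[] | select(."@type" | contains("ClustersConfigDump")) | {dynamic_active_clusters, dynamic_warming_clusters}'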
/cc @JimmyCYJ
@Shikugawa would you be able to help investigate this?
@dio Yes, I'll investigate this problem.
I suspect this causes other issues as well. We are seeing that if we do not include SDS config on the XDS connection, SDS eventually becomes permanently broken: clusters that reference SDS secrets are stuck warming forever.
We have two secrets across all of the config: the client cert and the root cert. If we add the client cert to the XDS cluster (forcing SDS to start before XDS), we never see the issue with the client cert. However, the root cert still sometimes gets stuck warming forever.
More info: https://github.com/istio/istio/issues/22443.
I am not sure if it's the same root cause, but it seems related.
@howardjohn Got it. I think both warming_clusters and initial_fetch_timeout are broken: neither the warming gauge nor the init_fetch_timeout counter is ever incremented. Your problem is probably caused by requesting secrets from an inactive SDS cluster. So we have two problems related to this issue. Is this right?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
Your problem is probably caused by requesting secrets from an inactive SDS cluster.
It's not just the SDS cluster being inactive; the SDS cluster could be active but not return any secrets.
@lambdai @JimmyCYJ can you look at this? We do not want Envoy to declare itself ready before it is.
Let me rephrase to see if we are on the same page.
My understanding is that "warming cluster shows ready" is the only error.
IMHO the cluster becomes active early because SDS set a 20s fetch timeout. That is exactly how the 20s fetch timeout is supposed to work.
We may be saying the same thing, not sure. But I think it's more like:
@howardjohn I see the stat is the only liar. I think Envoy is declaring active too early in the stats, not only for clusters but also for listeners.
However, the sequence is working as expected. The initial_fetch_timeout is supposed to unblock initialization by announcing itself "ready" or "active".
@mandarjog I think the solution is to disable init_fetch_timeout in SDS if we cannot tolerate fake readiness.
@lambdai I don't think the stats are the only issue. See https://github.com/istio/istio/issues/22443. I have less of a clear reproducer, but basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response. It seems there are larger problems than just the stat, but maybe it's a completely different issue.
Why is the solution to disable it rather than fix the stat?
basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response
Got it. Reading the issue to see if I can help.
If Envoy does want to block initialization, then until SDS is fixed the only solution is to disable the fetch timeout in SDS.
Then the readiness probe should chime in and kill the Istio container, correct?
Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy one.
One thing I do wonder: does the initial_fetch_timeout handling work differently for bootstrap vs. dynamic? Because what we found is that if we request the secret "default" as part of the XDS cluster, everything is OK. If we do not, but the dynamic response requests "default", it occasionally breaks and never sends the SDS request. I never figured out why that is; I assumed it may be some sequencing issue and that using the secret for the XDS cluster linearizes it, but maybe it's bootstrap vs. dynamic?
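For illustration, a rough sketch of what "requesting the secret as part of the XDS cluster" means here, i.e. giving the static xDS cluster in the bootstrap a transport socket that references the same SDS secret (the cluster name xds-grpc, the endpoint address, and the port are placeholders; the secret and SDS cluster names follow the config above):

static_resources:
  clusters:
  - name: xds-grpc                     # placeholder name for the ADS/xDS cluster
    type: STRICT_DNS
    connect_timeout: 5s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: xds-grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xds.example   # placeholder control plane address
                port_value: 15012      # placeholder port
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: "default"            # same secret name the dynamic clusters reference
            sds_config:
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc

With this in the bootstrap, the SDS subscription for "default" is created before any dynamic cluster needs it, which is presumably why the client cert never gets stuck.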
Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy one.
@lambdai @mandarjog Sorry, I missed these discussions! For now, I think the common ground is to add a disable_init_fetch_timeout option on SDS, or a runtime flag. This won't break SDS's default behavior or API compatibility.
Why do we need disable_init_fetch_timeout? I think setting it to 0s already disables it?
Besides, the fact that we want it disabled is an implementation detail of how we use Envoy; this is still a general issue that will impact others.
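For context, this is what "setting it to 0s" would look like (a sketch assuming the documented ConfigSource behavior, where an initial_fetch_timeout of 0 disables the timeout so the subscription keeps blocking initialization until the secret actually arrives):

sds_config:
  initial_fetch_timeout: 0s   # 0 disables the timeout; init stays blocked until the secret is received
  api_config_source:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: sds-grpc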
@howardjohn Got it. Maybe this is caused by the cluster manager's state management. In general, if we start creating an xDS subscription before all clusters are initialized, this problem will occur. So I think that to resolve this, we should fix the init_manager implementation; this is a relatively deep-rooted problem. We should add prevention logic so that the ClusterManager implementation does not use active_clusters_, shouldn't we? cc. @lambdai @mandarjog
Any update on this? This is an intermittent problem on our Istio deployment that stalls pods.
@howardjohn Hey, I'd like to confirm my understanding; I don't completely grasp what the problem is. Our problem is:
the attached cluster becomes active immediately after its SDS subscription sends a DiscoveryRequest to the SDS cluster, instead of staying warming until its init_fetch_timeout, when the DiscoveryResponse from the SDS cluster does not contain the CA. We expect the attached cluster to remain warming in this situation.
Is this what you said?
Yep. The cluster should be warming until the secret is fetched, but it's active immediately.
We also run into this problem intermittently. In our case we stream file-based certs via SDS using Istio. When we have many clusters and the SDS push is delayed for one of them, that cluster is incorrectly marked as active, and requests to it fail with "OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_UNKNOWN" immediately after initialization. The problem is more prominent when there are many clusters.
Any update on this issue?
@tmshort I consider this issue to cover two problems. For part of it, I crafted a PR which is now in review: https://github.com/envoyproxy/envoy/pull/12783
/cc @incfly
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
@lizan Can we close this?
@Shikugawa no this is not fixed, I'll have a fix soon.