When a cluster is created or updated, Envoy puts it in a warming phase and needs a related ClusterLoadAssignment response to fully initialize it.
During Envoy's startup phase, Envoy sends requests for those resources to the management server, so the management server knows it has to respond with them.
But when updating a Cluster via CDS, no EDS re-request is sent to the management server, so the management server doesn't know it should send a ClusterLoadAssignment for that Cluster, even if the resource hasn't actually changed.
This behavior introduces subtle bugs in management servers. In our case, our resource versioning scheme happened to include the cluster version; yesterday we introduced a change to remove that, and it unexpectedly broke our cluster updates, leaving some clusters without traffic.
This should probably be handled by Envoy. Currently go-control-plane and java-control-plane don't really handle it, since it's hard to induce the behavior of always pushing an EDS update for a Cluster even when the resource hasn't changed; and if not using ADS, Envoy may be connected to multiple management servers, which increases the difficulty.
@htuch suggested that Envoy should probably unsubscribe from EDS for the updated cluster and immediately subscribe again, since this is what's actually happening inside Envoy: a new Cluster is being created and wants to subscribe to the resources, while the old one wants to unsubscribe. Some more thought should be given to this idea and to the consequences of the old cluster being unsubscribed from resources.
Some discussion already happened on Slack; opening this issue to continue it.
This issue also applies to LDS updates with RDS and probably SRDS with RDS.
CC @envoyproxy/api-shepherds
> But when updating a Cluster via CDS, no EDS re-request is sent to the management server, so the management server doesn't know it should send a ClusterLoadAssignment for that Cluster, even if the resource hasn't actually changed.
Why would the management server need to send an update for the EDS resource that has not changed, just because the CDS resource that pointed to it has changed? This seems like a pretty clear Envoy bug: the CDS and EDS resources should be tracked independently from the perspective of the xDS client code, so that when a new Cluster object is created, it can just pick up the existing data for the EDS resource it wants.
In gRPC, the pattern we use here is to separate out the transport protocol logic from the data model logic. The transport protocol logic (which includes not just the wire communication but also the caching logic, since those two things are ultimately closely tied together) is implemented in a single XdsClient object. Its API allows callers to start watches for a given resource of a given type. When the first watch for a given resource is started, the XdsClient will send a request to the xDS server subscribing to that resource, and all updates from the xDS server will be sent to that watcher. It's fine for multiple callers to be watching the same resource: if a watch is started for a resource that is already subscribed to (because some other watch was already started), then the XdsClient will immediately send the already-cached data to the new watcher. When all watchers for a given resource are cancelled, the XdsClient will send a request to the xDS server unsubscribing from that resource.
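For concreteness, here's a minimal C++ sketch of that layering. The class and method names (`XdsClient`, `ResourceWatcher`, `StartWatch`, etc.) are made up for illustration and are not gRPC's or Envoy's actual API: watches are keyed by (type URL, resource name), the first watch triggers a subscribe on the wire, later watchers are served from the cache, and the last cancel triggers an unsubscribe.

```cpp
// Minimal sketch of an XdsClient-style watch API (hypothetical names, not the
// real gRPC or Envoy code). Caching and subscription ref-counting live in the
// transport layer; the data model just starts and cancels watches.
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>

#include "google/protobuf/any.pb.h"

class ResourceWatcher {
 public:
  virtual ~ResourceWatcher() = default;
  virtual void OnResourceUpdate(const google::protobuf::Any& resource) = 0;
  virtual void OnResourceDoesNotExist() = 0;
};

class XdsClient {
 public:
  void StartWatch(const std::string& type_url, const std::string& name,
                  ResourceWatcher* watcher) {
    Entry& entry = cache_[{type_url, name}];
    if (entry.watchers.empty()) SendSubscribe(type_url, name);  // first watcher
    entry.watchers.insert(watcher);
    // A watcher added later gets the cached value immediately, no wire traffic.
    if (entry.resource != nullptr) watcher->OnResourceUpdate(*entry.resource);
  }

  void CancelWatch(const std::string& type_url, const std::string& name,
                   ResourceWatcher* watcher) {
    auto it = cache_.find({type_url, name});
    if (it == cache_.end()) return;
    it->second.watchers.erase(watcher);
    if (it->second.watchers.empty()) {  // last watcher gone
      SendUnsubscribe(type_url, name);
      cache_.erase(it);
    }
  }

  // Called by the transport when the server sends a subscribed resource.
  void OnResourceReceived(const std::string& type_url, const std::string& name,
                          const google::protobuf::Any& resource) {
    Entry& entry = cache_[{type_url, name}];
    entry.resource = std::make_unique<google::protobuf::Any>(resource);
    for (ResourceWatcher* w : entry.watchers) w->OnResourceUpdate(resource);
  }

 private:
  struct Entry {
    std::unique_ptr<google::protobuf::Any> resource;  // last delivered value
    std::set<ResourceWatcher*> watchers;
  };
  // Actual wire I/O elided in this sketch.
  void SendSubscribe(const std::string&, const std::string&) {}
  void SendUnsubscribe(const std::string&, const std::string&) {}
  std::map<std::pair<std::string, std::string>, Entry> cache_;
};
```

With this shape, replacing a Cluster instance is just a CancelWatch from the old instance and a StartWatch from the new one, and the new instance is served immediately from the cache with no server involvement.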
One significant difference between Envoy and gRPC is that when a CDS resource is updated, gRPC can update the individual fields in the cluster instance rather than having to create a whole new cluster instance. But even if we did need to create a whole new cluster instance, it still would not pose a problem, because we'd keep the old instance around while the new instance warmed up. This would mean that when the new instance would start a watch for the EDS resource name, it would get the value that was already cached in the XdsClient, and the management server would not know or care that the cluster instance had been replaced.
> @htuch suggested that Envoy should probably unsubscribe from EDS for the updated cluster and immediately subscribe again, since this is what's actually happening inside Envoy: a new Cluster is being created and wants to subscribe to the resources, while the old one wants to unsubscribe. Some more thought should be given to this idea and to the consequences of the old cluster being unsubscribed from resources.
This seems sub-optimal, both because it will cause unnecessary network traffic and because it will make warming the new cluster instance take longer. I think it should be possible to avoid all of that by separating out the xDS client code into its own layer, as I described above.
That having been said, my concern here is less about Envoy having sub-optimal behavior and more about making sure that the protocol semantics are clear. I think that the protocol itself should not require clients to re-subscribe in this case, because I'd rather not change gRPC, which already handles this case correctly, to do something less optimal. And if Envoy does do the less optimal thing, I am concerned that it may introduce situations where management servers are built with Envoy-specific assumptions that make it hard for them to support gRPC later.
@markdroth I think this was a deliberate choice we made when designing CDS warming. The idea is that you might also be switching the endpoints when changing cluster config, e.g. upgrading from non-TLS to TLS cluster. You can see a pretty detailed discussion in https://github.com/envoyproxy/envoy/issues/5168.
On clarifying protocol semantics, I think we have a PoR to move forward here with https://github.com/envoyproxy/envoy/issues/11299. I'll be putting together a SoW draft for this during the week. This is probably the best way to get a formal and testable statement of intended semantics in these corner cases.
Reading the discussion in #5168, the problem there actually seems like another symptom of the fact that Envoy's implementation does not cleanly separate the transport protocol layer (which by necessity includes caching) from the data model layer (in this case, the cluster instance). IMHO, these should be two different layers with a clear separation between them, so that implementation details in the data model layer (such as completely replacing a cluster instance whenever one field changes) do not affect the behavior of the transport protocol layer (such as caching).
To say that another way, I think that the existence of this issue is an indication that the existing solution to #5168 didn't really solve the problem. I think that if Envoy changed to cleanly separate these two layers, it would not only solve the immediate problem but also prevent future problems caused by this same underlying architectural problem.
In fact, it looks to me like you actually suggested something similar to this in https://github.com/envoyproxy/envoy/issues/5168#issuecomment-459812020, and I don't see any follow-up discussion there saying why this should not work. I see that you said that it seems odd to do this in the gRPC mux impl, which I can understand, but I don't see why we can't wrap a fuller XdsClient abstraction around the gRPC mux impl.
@htuch if one wanted to make a change like switching a cluster to TLS without downtime, we now have transport_socket_matches; also, couldn't something be done with AggregateCluster?
BTW, if you're interested in what the XdsClient API looks like in gRPC, the relevant part is here:
In our implementation, there's a separate watch API for each type of resource, mainly because we can't depend on the protobuf library and didn't want to make upb part of this API. But in Envoy, you could easily have a single watch API used for all resource types, simply by specifying the resource type as a parameter when you start the watch and having the data reported to the watcher be an Any.
I'm with @markdroth here.
I think from a protocol standpoint it's pretty crappy for management servers to have to resend a configuration change when nothing really changed. Furthermore, with Incremental xDS, Envoy is subscribed to resource updates and should only get notified when those actually change.
In the case of updating clusters to TLS, if it's not doable with the current mechanics, maybe we need additional mechanisms for transparently doing this sort of update, but it would seem that transport_socket_matches and AggregateCluster are some of those mechanics.
As in the original CDS work, our constraints are not greenfield, to some extent we need to be backwards compatible with existing management servers. I don't think it would be particularly hard to add caching to Envoy's existing implementation, the question is really what is the correct behavior on the wire, given Envoy's existing behavior, xDS server implementations and our desired goals. We've been running with the "CDS updates implies warming, implies an EDS response" for some time. Can we design something that works for folks who are assuming that and migrate to a better model? Potentially, but I don't think we can sunset Envoy's cluster warming behavior overnight.
If Envoy stops requiring management servers to proactively send an EDS response after a CDS update, then it's not harmful for management servers that continue to do that (the update will effectively be a no-op on the client if nothing has changed), but it means that new management servers don't need to send that unnecessary update.
One thing I'd like to point out, going all the way back to the top of this thread, is that in Envoy today, what is going on, is that a cluster is going away and a new one is arriving, which happens to share the same name. When you delete a resource, in CDS, you logically unsubscribe from the EDS resources. We _could_ special case this to talk about what happens over an update, but keep in mind we're adding in an extra corner case to deal with in implementations.
I'm not opposed to any direction we take here. IMHO, the most logical thing would be for subscriptions to actually reference the subscribing resource instance (e.g. via UUID), so that you could independently unsubscribe and resubscribe. But I don't think it's worth building that out just for this.
> If Envoy stops requiring management servers to proactively send an EDS response after a CDS update, then it's not harmful for management servers that continue to do that (the update will effectively be a no-op on the client if nothing has changed), but it means that new management servers don't need to send that unnecessary update.
It's not as simple as this, since some Envoy users will be expecting the warming behavior and want to defer the cluster warming until endpoint update. See the TLS upgrade example, which is not a hypothetical but a valid use of Envoy's existing mechanisms.
Also worth keeping in mind, this is not some undocumented behavior that people happen to be relying on, we've been guaranteeing this for some time, see https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager.html?highlight=warming#cluster-warming
> One thing I'd like to point out, going all the way back to the top of this thread, is that in Envoy today, what is going on, is that a cluster is going away and a new one is arriving, which happens to share the same name. When you delete a resource, in CDS, you logically unsubscribe from the EDS resources. We could special case this to talk about what happens over an update, but keep in mind we're adding in an extra corner case to deal with in implementations.
I think the right way to model this is to say that the implicit deletion of RDS and EDS resources happens at the transport protocol layer, not at the data model layer -- which makes sense if you think about it, because this deletion is a cache operation, and caching is dictated by the transport protocol, not the data model. In other words, inside of the XdsClient, when we receive a CDS update that removes a CDS resource, we remove the cached EDS resource that was referred to by the removed CDS resource, and we notify any watchers of that EDS resource that the resource no longer exists. It should not be tied to the deletion of the data model object (the cluster instance) at all.
What this means is that when the XdsClient sees an updated CDS resource that changes some field other than the EDS service name, it knows that the EDS resource is still referenced and is therefore not being deleted. The fact that Envoy will create a new cluster instance and destroy the old one at the data model layer should be irrelevant to that decision.
(Also, as we've discussed separately, I think that the implicit deletion of the RDS and EDS resources is a very ugly layering violation, specifically because it does require the transport protocol layer to know something about the relationship between the resources, which is really a data model concern. IMHO we should remove this behavior as part of moving to the incremental protocol variants, which will make the layering cleaner here.)
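To make the rule above concrete, here's a rough sketch of pruning the EDS cache after a state-of-the-world CDS response. The types and helper names (`CachedEds`, `PruneEdsCacheAfterCdsUpdate`) are hypothetical, not Envoy's real code; the point is only that an EDS entry survives iff some CDS resource still references its name, independent of whether the Cluster instance gets recreated.

```cpp
// Rough sketch (hypothetical types and helpers) of implicit EDS deletion
// applied at the transport/cache layer after a SotW CDS response.
#include <functional>
#include <map>
#include <set>
#include <string>

struct CachedEds { /* last delivered ClusterLoadAssignment, watchers, ... */ };

void PruneEdsCacheAfterCdsUpdate(
    const std::set<std::string>& eds_names_referenced_by_new_cds,
    std::map<std::string, CachedEds>& eds_cache,
    const std::function<void(const std::string&)>& notify_does_not_exist) {
  for (auto it = eds_cache.begin(); it != eds_cache.end();) {
    if (eds_names_referenced_by_new_cds.count(it->first) == 0) {
      // No CDS resource references this EDS name anymore: treat it as deleted,
      // drop it from the cache, and tell its watchers it no longer exists.
      notify_does_not_exist(it->first);
      it = eds_cache.erase(it);
    } else {
      // Still referenced: keep the cached value. Replacing the Cluster
      // instance at the data-model layer does not affect this decision.
      ++it;
    }
  }
}
```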
> It's not as simple as this, since some Envoy users will be expecting the warming behavior and want to defer the cluster warming until endpoint update. See the TLS upgrade example, which is not a hypothetical but a valid use of Envoy's existing mechanisms.
> Also worth keeping in mind, this is not some undocumented behavior that people happen to be relying on, we've been guaranteeing this for some time, see https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager.html?highlight=warming#cluster-warming
Interesting. I agree, that will definitely make it harder to get rid of this.
Hmmm... If management servers expect clients to not use the existing EDS data after a CDS update, is it going to cause problems that gRPC will use that existing data? What if a management server is intentionally sending a CDS update that requires drastically different EDS data? For example, if the CDS update switches from plaintext to TLS, and the EDS update points to a new set of endpoints that speak TLS?
This behavior seems like a clear layering violation and is extremely difficult to reason about. I really don't want to implement it in gRPC. I think this is going to cause problems if we don't clean this up.
Note that if we can find a way to get rid of the existing behavior, then in cases where a management server does not want the new cluster instance to use the old EDS data, it can always use a new EDS service name in the updated CDS resource.
I see this less as a layering violation than as an attempt to bolt consistency onto an eventually consistent model. What warming on EDS is currently doing is saying "we may know the old EDS, but we want another EDS that is temporally fresher than this older one" (if you ignore my cluster actor model above and think purely in transport terms). That might not even be true in an eventually consistent environment (i.e. not ADS): your warming EDS might be some delayed EDS update carrying data older than the cluster.
We've discussed trying to explicitly add in a notion of version dependency in the past, this probably isn't the right time to add it, but if we did have that, then it would be a lot clearer what is going on.
I agree it's pretty broken and hard to deal with.
Maybe one path forward here is to add explicit fields to Cluster to describe what warming behavior we want to see (with the default retaining existing semantics).
@htuch @markdroth I am very far from what you are both discussing about separating the data model and the transport model, but from an API perspective it basically says that dependencies between resources are described in terms of names. From the API's perspective, this "warming" thing looks like an implementation detail of Envoy that is simply not possible to describe in terms of what the API provides: the cluster has changed, its name is the same, and the management server's state for the endpoints has not changed, so by default no updates should be sent.
Also, going a bit further: if you get multiple updates to the same cluster in a short period of time, the "warming" would need to happen on the last update, but when the EDS "warming response" arrives, is it possible for Envoy to know which exact update that EDS "warming response" relates to?
@fxposter Envoy just cares that it sees an EDS update after the CDS update, not which cluster it relates to. I don't see this as an Envoy implementation detail, this is clearly specified in the docs which are part of what currently constitutes the API for folks building control planes.
> Maybe one path forward here is to add explicit fields to `Cluster` to describe what warming behavior we want to see (with the default retaining existing semantics).
That might very well be a useful migration path. We could do a phased migration where we initially leave the default alone but provide the field to request the new behavior, then change the default but allow it to be disabled via the field, and then finally remove the field completely.
I just want to make sure that we wind up in a place where this kind of layering violation is not supported. Otherwise, we will never have a clean separation between the transport protocol and the data model.
As discussed with @markdroth offline, this is an area of xDS that is super confusing when we consider the mix of SotW, delta, ADS and non-ADS, different consistency models. From the above thread, it's also apparent that the current behavior is not well understood. I think we should put together a uRFC https://github.com/cncf/udpa/tree/master/proposals to clarify what the existing behavior is and what we want the future behavior to be, without breaking backwards compatibility for existing use cases where the current behavior makes sense.
@htuch I mean a slightly different thing. If this feature is intended to keep Envoy from using "partially constructed" clusters in this particular case, then when there are multiple updates to the same cluster and you receive an update to the corresponding endpoints, you still don't know which cluster state that endpoint update relates to. And yes, "warming" will succeed, but it may succeed with the wrong state.
i.e. if the cluster states (for a single particular cluster) are CS1, CS2, CS3... and the corresponding "good" endpoint states are ES1, ES2, ES3..., then if you receive almost simultaneous updates "-> CS2" and "-> CS3", CDS will require warmup for the CS3 state. If Envoy afterwards receives state ES2 and later ES3, I'd guess the first "fully constructed state" will be (CS3 + ES2), which might be incorrect.
@fxposter yes, unless you linearize on ADS you can get various kinds of races happening. This is what I was referring to above when I said the ADS vs. non-ADS case is also pretty confusing.
Here's my concrete proposal (which should be turned into a uRFC):
Add a field to `Cluster` that says `dont_warm_on_eds_for_updates`. When this is set, Envoy inherits the existing cluster's hosts (I think @snowp mentioned in the original issue that this was plausible). None of this will change the existing behaviors of LDS, CDS, etc., since doing so would potentially break folks who are relying on the existing warming behaviors.
I think that's a fine proposal, but I would also like to see a migration path for fully eliminating the problematic behavior. If we don't define such a path, we'll never be rid of it.
@htuch shouldn't this also consider Listener, since listeners also warm up?
The proposal in https://github.com/envoyproxy/envoy/issues/13009#issuecomment-689140016 SGTM. We can discuss the exact field name, etc. in code review.
I agree overall that this is an incredibly confusing detail that has tripped a lot of people up and we should fix this.
> I think that's a fine proposal, but I would also like to see a migration path for fully eliminating the problematic behavior. If we don't define such a path, we'll never be rid of it.
On the Envoy side we can talk about potentially flipping the default for the field, but we would have to give a lot of warning, runtime revert support, etc. as we have a large number of management servers in the wild that might break.
FWIW, https://github.com/envoyproxy/envoy/pull/6021 implemented 1 on the proposal above
This has always been a surprising thing so I think it would be great to address. Similarly to others I've solved this management server side by making the EDS version dependent on the CDS version.
Without going through all the historical context I recall there being a conceptual point about whether xDS resources are defined just by their name: if we already know the value of EDS for foo, when we get an update for cluster foo that defines an EDS dependency, is the EDS foo still good? I think with the advancements towards UDPA this is pretty clearly yes, but I think it was less obvious back then.
As for solving this I can see two variations of https://github.com/envoyproxy/envoy/issues/13009#issuecomment-689140016:
1) Reuse the old EDS value by copying the Host objects used by the old cluster into the new one
2) Reuse the old EDS value by processing an onConfigUpdate (or something similar) with the old EDS proto value
As @ramaraochavali said, #6021 implemented a variation of (1), and this has some great properties around allowing connection reuse between cluster updates (since the Host owns the connection pools). That said, (2) feels a bit more in line with the separation of the transport and data model layers: upon processing a new CDS update, Envoy simply checks its xDS cache for the EDS resource that the new cluster is asking for and adds in the endpoints immediately. If the EDS resource is missing, cluster warming is deferred until an EDS response is available.
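A rough sketch of variation (2), just to show the control flow. The interfaces are made up: `ClusterLoadAssignment` here is a stand-in for Envoy's EDS proto, and `WarmingCluster`/`StartWarming` are not Envoy's actual ClusterManager/EDS code.

```cpp
// Sketch of variation (2) with hypothetical types: the cached EDS value, if
// present, is replayed through the same onConfigUpdate path a fresh response
// from the wire would take.
#include <string>

struct ClusterLoadAssignment { /* stand-in for the EDS proto */ };

class WarmingCluster {
 public:
  // Same code path a fresh EDS response goes through: rebuild hosts, kick off
  // health checks, etc. (elided here).
  void onConfigUpdate(const ClusterLoadAssignment& /*assignment*/) {}
};

// cached_assignment is nullptr when the xDS cache has no entry for this name.
void StartWarming(WarmingCluster& cluster, const std::string& eds_service_name,
                  const ClusterLoadAssignment* cached_assignment) {
  if (cached_assignment != nullptr) {
    // Cache hit: the replacement cluster warms immediately from the last
    // delivered endpoints.
    cluster.onConfigUpdate(*cached_assignment);
    return;
  }
  // Cache miss: warming is deferred until an EDS response arrives for
  // eds_service_name.
  (void)eds_service_name;
}
```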
@mattklein123
> On the Envoy side we can talk about potentially flipping the default for the field, but we would have to give a lot of warning, runtime revert support, etc. as we have a large number of management servers in the wild that might break.
I understand the need to provide a graceful migration here. I can live with it taking a while as long as we do have a plan to get there and are actively moving toward that goal. That way, when people ask about why gRPC and Envoy behave differently here, I can point to that plan and tell people that gRPC is already doing what Envoy will eventually do, and they should adjust their management servers accordingly.
Regarding the @snowp comment (https://github.com/envoyproxy/envoy/issues/13009#issuecomment-689482632), I would much prefer we do (2) vs. (1). IMO this is more clear and also more akin to what @markdroth is talking about as xDS as a transport/caching layer. I also think that if we do (2), the TLS upgrade case "just works" since the hosts would get re-added, re-health checked, etc.
The one concern around building "caching" into the transport layer is that for endpoints, this could add a fair bit of memory overhead. On the plus side, it's only on the main thread and doesn't need to be multiplied by number of workers, but at O(10k) endpoints this could start to show up.
> The one concern around building "caching" into the transport layer is that for endpoints, this could add a fair bit of memory overhead. On the plus side, it's only on the main thread and doesn't need to be multiplied by number of workers, but at O(10k) endpoints this could start to show up.
Yeah this is fair, though I wonder with some config options we could allow the user to make their own trade-offs.
It's not clear to me how this caching can really be optional. The xDS protocol itself requires the client to keep state. If we don't do that, we wind up violating the protocol, which is why the caching and the transport protocol are so deeply intertwined.
That having been said, there is no reason that the cached data needs to be in the form of the original protos. If we convert the EDS resource into a set of Hosts, then maybe we could just cache those hosts? And if we want to avoid making a separate copy of the hosts in each Cluster instance while we replace one with another, maybe we could do something like make the hosts ref-counted, so that we can hold a ref in both the XdsClient layer and in each Cluster instance that uses them.
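A tiny sketch of that ref-counting idea (hypothetical types; Envoy's real Host is much richer): the cache and each cluster instance hold shared_ptrs to the same hosts, so replacing a cluster copies pointers rather than host state, and the original EDS proto does not need to be retained.

```cpp
// Sketch only: hypothetical Host/CachedEdsResource/ClusterInstance types to
// illustrate sharing ref-counted hosts between the XdsClient cache and
// cluster instances instead of caching the raw EDS proto.
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Host {
  std::string address;
  // In Envoy, connection pools, health status, metadata, etc. live here.
};
using HostSharedPtr = std::shared_ptr<Host>;
using HostVector = std::vector<HostSharedPtr>;

// One reference held by the transport-layer cache.
struct CachedEdsResource {
  HostVector hosts;
};

// A second reference held by each cluster instance that uses the resource;
// during warming the old and new instances can both point at the same hosts.
class ClusterInstance {
 public:
  explicit ClusterInstance(HostVector hosts) : hosts_(std::move(hosts)) {}

 private:
  HostVector hosts_;
};
```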
I think there is some concern for possible bugs now and in the future due to reusing Host objects. Redoing the whole warming process (with a cached response) will ensure that the new hosts come up properly configured for the cluster, so there is some appeal to doing this.
That said, maybe the overhead of adding a caching layer is good enough motivation to attempt the lower-overhead but riskier approach?
> That said, maybe the overhead of adding a caching layer is good enough motivation to attempt the lower-overhead but riskier approach?
I would rather we not prematurely optimize. Per @markdroth I think we could translate to a memory optimized structure if required and throw away the proto. I would rather we do the caching approach since it does seem like the most correct solution and we can optimize it later?
I agree with @htuch on the caching concern.
> though I wonder with some config options we could allow the user to make their own trade-offs.
My vote here would be to not drive this by user config as it is purely an implementation detail and we will have to maintain two implementations.
> I would rather we do the caching approach since it does seem like the most correct solution and we can optimize it later?
I agree. The caching approach is the most correct, less risky, and more future proof (it handles some use cases like the TLS upgrade automatically), but we need to think about a memory-optimized structure for endpoints from the start, because the increased memory for large deployments can throw people off.
So, I think we're agreed on caching as the way forward. Concretely, what this looks like, I think, is that the watch implementation holds onto the last delivered resources as long as at least one watch exists for a type URL and named resource. When Envoy brings up a warming cluster, we effectively have two subscriptions to this resource for a period: the new cluster immediately warms and the old cluster is destroyed.
This does not require a radical rewrite of how the transport works in Envoy IMHO and is a fairly straightforward PR to add. We should do some performance benchmarking before making this the default.
It just occurred to me: aren't we now storing the EDS protos for config_dump? So I think we already have the memory overhead issue? Should we start by just making sure we don't store the data twice before optimizing further?
No, we rebuild EDS from host data, see https://github.com/envoyproxy/envoy/blob/419428a82945a1a2ab0ab48e054e76ee851c5450/source/server/admin/config_dump_handler.cc#L219. It would be higher fidelity to have these around in a cache.
OK, anyway, let's just consider the config_dump aspect when we implement so we don't duplicate data.
Should this new mode only apply to gRPC? I'm not familiar enough with the REST and file based implementations to know if it's relevant there as well, but I'd guess not since REST fetches on intervals, and a file lookup should be dependable/envoy initiated?
Will this end up on the Subscription API, which I believe is shared between all of these implementations? If the setting is on Cluster, that seems the most natural: you indicate whether cached data is okay in start().
And finally I assume this should be implemented for both delta and SotW? Would we tolerate a period where it is implemented for one and not the other?
It's a good question @mpuncel. I think the principle that, as long as there is at least one subscription, we retain the last delivered configuration and make it available to warming clusters can easily be mapped to REST. I.e. if the next fetch is in 10s, but a cluster is warming now and we know its endpoints, it seems reasonable to make them available. As we scale xDS and use CDNs+REST, having this capability seems increasingly useful.
Note that it might make sense to tackle this, #10793, and #2943 at the same time by adding an XdsClient layer that centralizes the stream management and caching behavior in one place.