Envoy: Locality endpoint discovery service (LEDS)

Created on 13 Mar 2020 · 23 Comments · Source: envoyproxy/envoy

A group of us met this morning to discuss how to move the EGDS issue and PR forward:

https://github.com/envoyproxy/envoy/issues/9455
https://github.com/envoyproxy/envoy/pull/10023

The consensus at this point is that we would like to hold on working on and merging EGDS as currently specified and instead develop something that we are tentatively calling IEDS (Individual Endpoint Discovery Service). Note that better name suggestions are appreciated here!

The general idea of IEDS will be to allow a sub-API of EDS to provide individual endpoints within a defined locality/priority. This will avoid quite a bit of complexity around shuffling endpoints within localities/priorities. Additionally, localities/priorities rarely change in a deployment.

The high level flow of the IEDS API will be:
1) Allow the lb_endpoints to either be specified inline as today or instead pointed at a config source that configures the IEDS endpoint https://github.com/envoyproxy/envoy/blob/e1962d76c073bb48c57acd0ff2b57390a22394b7/api/envoy/config/endpoint/v3/endpoint_components.proto#L109
2) Determine how the IEDS resource name will be populated. Some proposals have included adding a namespace field. I'm not as clear on this part so I will defer to @htuch and @markdroth on what they want to do here. From my perspective it would be OK to have the config source specify a namespace and then have that namespace passed as the resource name in the discovery request: https://github.com/envoyproxy/envoy/blob/e1962d76c073bb48c57acd0ff2b57390a22394b7/api/envoy/service/discovery/v3/discovery.proto#L42
3) Modify the EDS code to work properly with the sub-API: it must complete initialization correctly through the init manager and populate/update the hosts in a locality/priority. This should also work correctly with delta/incremental xDS.
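
The flow above can be sketched roughly as follows. This is an illustration only, not Envoy's API: the `LedsConfigSource`, `LocalityEndpoints`, and `build_discovery_request` names, the `namespace` field, and the example values are all hypothetical stand-ins for whatever the final proto defines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LbEndpoint:
    address: str
    port: int


@dataclass
class LedsConfigSource:
    # Hypothetical: which config source to fetch endpoints from, and the
    # namespace identifying this locality's endpoints on the server.
    config_source: str
    namespace: str


@dataclass
class LocalityEndpoints:
    locality: str
    priority: int
    # Step 1: endpoints are either listed inline (as today) ...
    lb_endpoints: List[LbEndpoint] = field(default_factory=list)
    # ... or fetched through the sub-API config source.
    leds: Optional[LedsConfigSource] = None

    def __post_init__(self) -> None:
        if self.lb_endpoints and self.leds is not None:
            raise ValueError("use inline endpoints or a LEDS source, not both")


def build_discovery_request(loc: LocalityEndpoints) -> Optional[dict]:
    """Step 2: pass the namespace as the resource name in the request."""
    if loc.leds is None:
        return None  # inline endpoints need no discovery request
    return {"type": "leds", "resource_names": [loc.leds.namespace]}
```

Step 3 (init manager handling and host updates) lives inside Envoy's EDS code and is not modeled here.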

To avoid confusion, note that we are not tackling merging/patching as mentioned here: https://github.com/envoyproxy/envoy/issues/8400. That is a much larger project that should not block this effort.

cc @envoyproxy/maintainers @htuch @markdroth @snowp @wgallagher @tomwans @gengleilei @hzxuzhonghu @seflerZ @lambdai @howardjohn @ramaraochavali

area/service discovery · area/xds · help wanted

mattklein123

🚀4

All 23 comments

IEDS can be confused with delta/incremental xDS. Naming suggestion: How about Locality-EDS (LEDS)? I don't have full context, but my understanding is this xDS resource represents a set of endpoints within a locality?

tomwans on 13 Mar 2020

I like LEDS as a name option. Let's see if anyone has any other ideas.

mattklein123 on 13 Mar 2020

Another naming suggestion would be Named Endpoint Discovery Service (NEDS).

markdroth on 13 Mar 2020

+1 for LEDS

ramaraochavali on 14 Mar 2020

I am still not quite sure how this new discovery service would achieve incremental updates for endpoints. Does that mean that we can assign a config source to each endpoint and provide a resource name for it? That is, when the EDS handler processes the config source, will it send an individual request for each endpoint? If so, I think the xDS server may be overwhelmed by the flood of requests.

seflerZ on 14 Mar 2020

Does that mean that we can assign a config source to each endpoint and provide a resource name for it?

For each locality/priority. So the LEDS resource is LbEndpoint. The management server could send all of them as SoTW, or it can do incremental updates.

One other thing that occurs to me is that if we implement this correctly, many users would probably end up using static clusters with fixed endpoint.v3.ClusterLoadAssignment load_assignment which would call out to LEDS for each locality.
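
One way to picture the SoTW case: the response carries the complete endpoint set for the locality, so additions and removals can be derived by diffing it against the previously known set. The sketch below is a hypothetical illustration, not Envoy code; the `diff_sotw` helper and the `address:port` string keys are assumptions.

```python
from typing import Set, Tuple


def diff_sotw(current: Set[str], update: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) endpoints given a full state-of-the-world update."""
    added = update - current    # present in the update, unknown before
    removed = current - update  # known before, absent from the update
    return added, removed


current = {"10.0.0.1:80", "10.0.0.2:80"}
update = {"10.0.0.2:80", "10.0.0.3:80"}
added, removed = diff_sotw(current, update)
```

The full set still has to be sent and compared on every change, even when only one endpoint differs.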

mattklein123 on 15 Mar 2020

👍1

There is no attribute marking endpoints as added or removed. How can Envoy determine which endpoints to delete and which to add? Does that mean this implementation must work together with delta xDS to function properly? I’m still worried about the performance of this, because even with EGDS at size=20 we saw quite a lot of EGDS messages within a full push from Istio Pilot.

seflerZ on 16 Mar 2020

@mattklein123 As I understand it, without delta xDS, LEDS cannot reduce the amount of data transmitted between the management server and Envoy. With non-delta xDS, when a single endpoint in the locality changes, the full endpoint set must still be pushed to Envoy, so LEDS would only add complexity.
As far as I know, the management server cannot support delta xDS in the short term, so I don't think LEDS is very practical in the current situation.

gengleilei on 16 Mar 2020

@gengleilei are you talking about Pilot? LEDS will require delta xDS support. I believe Pilot must have some support here already from the VHDS work. CC @brian-avery @costinm

htuch on 16 Mar 2020

I don't think any of the VHDS work has merged yet. But we shouldn't make Envoy decisions due to limitations in Pilot's implementation. I don't think there is any reason we cannot support delta xDS, we just haven't yet. We can have delta XDS just for LEDS and still use SOTW XDS for ADS, right? If so then it won't be too hard to implement I think.

howardjohn on 16 Mar 2020

LEDS will require delta xDS support

Pedantically it won't require delta xDS support (e.g. if we implement it right per above we could use it with static clusters and get mostly the same behavior as today), but it will require delta support for the CPU savings that you are looking for.

mattklein123 on 16 Mar 2020

@howardjohn yeah, I think having delta xDS (on a distinct connection to the ADS one) just for LEDS would be fine.

@mattklein123 the protocol extension for namespace/attribute that is in the proposal above requires that we add something to the discovery request message. Are you suggesting we would do this for SoTW DiscoveryRequest as well as incremental?

htuch on 16 Mar 2020

Are you suggesting we would do this for SoTW DiscoveryRequest as well as incremental?

I would recommend consistency here since I don't see any reason not to have it?

mattklein123 on 16 Mar 2020

@mattklein123 really depends on use case, all things being equal we're pushing more towards putting complexity/advanced stuff in incremental xDS and leaving SoTW simple. It might make sense for LDS in proxyless servers in SoTW though, so probably fine to do it in both places.

htuch on 16 Mar 2020

@mattklein123 Would you mind pasting the IEDS protobuf definition into this issue?

seflerZ on 17 Mar 2020

Harvey and I chatted about this and agreed to go with the namespace approach. I'll let him propose the specific proto changes.

markdroth on 20 Mar 2020

Sorry if I’m missing some larger context.

I think there is some validity to both statements:

1) Control planes with many clusters/endpoints should utilize subsetting to prevent broadcasting irrelevant updates to Envoy instances
2) Adding a config source for endpoints partitioned by locality doesn’t handle the scenario where there is a large number of endpoints with the same locality attributes

I wonder if VHDS could be expanded to also allow fetching Clusters and Endpoints, such that Envoy could operate lazily and only maintain and subscribe to the resources that are actively being requested. I believe currently routes returned by VHDS can only reference known clusters (this might be an incorrect assumption).

gnz00 on 20 May 2020

@gnz00 the second statement "doesn’t handle the scenario where there is a large number of endpoints with the same locality attribute" is true for SotW LEDS xDS, but for delta LEDS xDS each endpoint in a locality would be a first class xDS resource under this proposal, capable of independent delivery.
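
To make the delta case concrete, here is a minimal sketch of applying a delta update when each endpoint is its own named resource. It mirrors the `resources` and `removed_resources` fields of the delta xDS response, but the `apply_delta` helper, the resource names, and the string-valued endpoints are hypothetical simplifications.

```python
from typing import Dict, Iterable


def apply_delta(
    hosts: Dict[str, str],
    resources: Dict[str, str],
    removed_resources: Iterable[str],
) -> Dict[str, str]:
    """Return a new host map with the delta applied; the input is not mutated."""
    updated = dict(hosts)
    for name in removed_resources:
        updated.pop(name, None)  # ignore removals of unknown names
    updated.update(resources)    # add new endpoints or replace changed ones
    return updated


hosts = {"ep-1": "10.0.0.1:80", "ep-2": "10.0.0.2:80"}
result = apply_delta(hosts, {"ep-3": "10.0.0.3:80"}, ["ep-1"])
```

Only the changed endpoints cross the wire, regardless of how many endpoints share the locality.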

On-demand EDS is an interesting thought experiment; what criteria would a fetch be made with? When we have too few available connections? When existing backends are too heavily loaded? For subset LB, when we miss on a subset?

htuch on 20 May 2020

In my mind, it wouldn't be on-demand EDS, but rather it would allow VHDS to return routes with previously unknown clusters, such that new CDS or LEDS resources could be subscribed to.

gnz00 on 20 May 2020

VHDS can return routes with unknown clusters today, but there is no corresponding on-demand CDS implementation. That would be pretty reasonable to add though.

I think it depends somewhat on what dimension of scalability we're talking about. If it's O(20k) endpoints and a single cluster/route, then we need something like LEDS. If it's O(20k) clusters with relatively few hosts, then on-demand CDS is going to be more attractive.

BTW, the point regarding subsetting is well taken. LEDS, on-demand CDS, etc. are for those situations where subsetting doesn't work or the sizes of configs are still too large even after subsetting.

htuch on 21 May 2020

👍1

I'm in the latter group of O(50k) clusters, so I apologize for the hijacking.

I think expanding the VHDS/VCDS would bring Envoy closer to the plug-n-play model of Linkerd 1/2 in the sense that Envoy node conventions and resource hinting conventions required for subsetting are very tightly coupled to control-plane conventions. I also imagine it would be useful in multi-mesh scenarios where global remote cluster availability needs to be cached locally for the egress mesh gateway to properly apply traffic shaping rules.

gnz00 on 21 May 2020

Can we please split the most recent part of this conversation into a new issue? It's not really related to the LEDS proposal.

Building on-demand CDS directly into Envoy is a fine thing to tackle. It's actually possible to build today without any core changes (some people have built FaaS systems this way using Envoy) but no one has done it in OSS.

mattklein123 on 21 May 2020

👍1

The design here should ideally reflect https://github.com/kubernetes/enhancements/pull/2094 (and specifically ideas such as https://github.com/kubernetes/enhancements/pull/2094#issuecomment-747579237) when understanding endpoint scalability requirements.

htuch on 17 Dec 2020
