Linkerd2: Introduce support for multi-cluster/federated control plane

Created on 31 May 2019 · 11 Comments · Source: linkerd/linkerd2

Feature Request

What problem are you trying to solve?

  • Allow control plane to be able to discover services from different clusters
  • Allow control planes to support multi-regional services and allow discovery chain in case of region/zone failures for the service
  • A hybrid cloud setup could also be a long-term goal for this, i.e. running the same set of services on both GCP and AWS, in either an active-active or active-passive setup.

How should the problem be solved?

This is mainly a gateway for migrating from Linkerd1 to 2, and this is how it's set up in 1:
(diagram: mesh_mono_micro)

gRPC is a special case here: it translates gRPC package names to K8s service names using the /$/io.buoyant.http.domainToPathPfx/srv identifier in Namerd, in addition to carrying the K8s service name in a gRPC header, service-name. So a call to <grpc_service_name>.<dtab_namespace_prefix> resolves to grpc_service_name at discovery, matching what the gRPC client sends on the channel.
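For illustration only, a minimal sketch of what such a dtab might look like; the namespace ("default") and port name ("grpc") below are assumptions, not the actual production config:

    # Resolve /srv names through the Kubernetes namer
    # (assumed: "default" namespace, a service port named "grpc").
    /srv => /#/io.l5d.k8s/default/grpc;

    # Map domain-style gRPC names onto /srv using the
    # domainToPathPfx namer referenced above.
    /svc => /$/io.buoyant.http.domainToPathPfx/srv;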

Any alternatives you've considered?

The alternative solution would be running N control planes for K clusters and M regions. That results in a segmented service mesh/graph that requires extra tooling and aggregation on top of it, which makes operation a lot harder and is not natively supported.

That tooling would have to fill the gap between what is set up in the infrastructure and what the service developers would like to see on their end in dashboards, configs, etc.

How would users interact with this feature?

This can start with the same dashboard mechanism, with the service graph being aware of the cluster and/or region names as some sort of metadata (labels?) in every segregated cloud environment.
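As a purely illustrative example, that metadata could be ordinary Kubernetes labels on the workloads; the label keys below are hypothetical:

    # Hypothetical labels identifying which environment a workload runs in.
    metadata:
      labels:
        mesh.example.com/cluster: gke-prod-1
        mesh.example.com/region: us-east1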

Federation beyond an environment is a nice-to-have, but not a deal breaker for migrating from 1 to 2.

rfc


All 11 comments

I was curious if there's been any comment/feedback on this? We're getting very close to needing to move from Linkerd1 to 2.

@mrezaei00 That makes a lot of sense! I have a couple of questions:

  • I don't quite understand your suggested solution. Is the idea that a single control plane lives outside multiple clusters? Or do you need some kind of dtab-like functionality to do name mapping?
  • What is wrong with running a control plane per cluster? If you run a single one, any time there is a network split or downtime, you'll lose everything. From a fault zone perspective, I would assume that you want to keep those separate.

Right now with Linkerd1, we're running a multi-zone Namerd as a separate service (it doesn't really matter where it lives as long as it can watch cluster events and is accessible by the proxies).

The use case is for discovering all services in an environment within one scope, as opposed to multiple discovery scopes (multiple control planes) that then need some sort of a special bridge between them for a global discovery.

Another option would be to allow control planes to shape into a pool for cross-cluster discovery, if that makes more sense for the current architecture.

@grampelberg Is there any idea if this will be pursued? We'd really like this feature.

I'd love to know what problem you're trying to solve @jaredallard. Right now we're encouraging folks to do it the k8s native way, which could definitely use some improvements.

@grampelberg Sure! We're looking to be able to have services from one cluster talk to services in another cluster transparently, without going the federation route or using a CNI provider to do so. We'd also potentially like to be able to do locality-aware load balancing. Does that make sense?

That makes sense! That's pretty easy to set up today with just a couple of pieces of normal k8s stuff. I should probably write up a reference architecture on that one of these days. There are a couple of little pieces of this that a service mesh can help out with, which we're currently trying to figure out. As of right now, it is definitely looking like a 2020 kinda thing unless someone who's not currently working on the project steps up to help out =)

The only gotcha is locality aware load balancing - I'd argue that is a really, really bad idea.

@grampelberg Interesting! Do you have any references? I'd love to learn about that! Also, open to helping work that out if my company decides to use it and I can dedicate time to help.

I'm wondering why you feel that way?

Do you have any references?

Nope, this is why I need to write it up. TLDR, use ingress, external-dns and cert-manager/letsencrypt. It is pretty important to keep your fault domains separate and treat external clusters just like you would an external third party service.
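A rough sketch of that pattern, assuming a made-up hostname (books.east.example.com) and a cert-manager ClusterIssuer named letsencrypt-prod; all names here are illustrative:

    # Expose a service to other clusters through a normal Ingress.
    # external-dns creates the DNS record for the host; cert-manager
    # provisions the TLS certificate via the referenced ClusterIssuer.
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: books
      annotations:
        external-dns.alpha.kubernetes.io/hostname: books.east.example.com
        cert-manager.io/cluster-issuer: letsencrypt-prod
    spec:
      tls:
      - hosts:
        - books.east.example.com
        secretName: books-tls
      rules:
      - host: books.east.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: books
                port:
                  number: 8080

Callers in the other cluster then treat https://books.east.example.com like any external third-party service.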

I'm wondering why you feel that way?

Normally, you'd use locality aware load balancing to "fail over" to a remote cluster if the local one was unable to serve requests. There are two big parts to this happening automatically:

  • As the cluster is remote, latencies are going to spike. If you have timeouts or SLOs configured, it is likely those will be tripped and you'll end up with cascading failures.
  • Depending on how much overhead capacity you've provisioned, a remote cluster suddenly absorbing 2x the load might crash that cluster as well.

There's a whole separate conversation around data locality and expectations of the application developers as well. Doing that kind of thing automatically has a high probability of breaking the rest of the system.

Now, manually going over to another cluster because a local service is dead ... that's definitely useful. You just need to have thought it through, adjusted timeouts and slowly sent load over to the other cluster while monitoring load.

We have a draft doc describing one approach here. Comments appreciated. https://docs.google.com/document/d/1yDdd5nC348oNibvFAbxOwHL1dpFAEucpNuox1zcsOFg/edit#

This is on the roadmap for Linkerd 2.8.
