Jaeger: Multi-tenant Jaeger

Created on 23 Sep 2020 · 12 Comments · Source: jaegertracing/jaeger

Requirement - what kind of business use case are you trying to solve?

The topic of multi-tenancy has been brought up quite a few times in the past. So far, our response has been that Jaeger is lightweight enough to run one instance per tenant, which is easier to manage with a provisioning tool like Helm or the Jaeger Operator.

That said, it would still be advantageous to be able to run one instance of Jaeger for all tenants, or for a set of tenants.

Proposal - what do you suggest to solve the problem or improve the existing situation?

The proposal, after brainstorming with @pavolloffay, @objectiser, @rubenvp8510, @kevinearls and @jkandasa is the following:

  • Change the Agent, so that it can be configured to read a bearer token (flag/file)
  • Use this token for all RPCs to the collector
  • Collector/Query/UI can be configured to read a tenant configuration file. This configuration file can look like the example below.
  • The UI is already able to propagate the token to the backend, so, no changes are required there
  • Both the Collector and the Query will be responsible for authenticating the incoming requests, using a new shared package

This is what the tenant configuration file could look like:

tenants:
# a concrete tenant, highest precedence
- value: globex
  storageType: cassandra
  cassandra:
    servers: xyz
    port: 1234
# a set of tenants matching some regex, also matches "globex" but we have a specific entry for it already, second in precedence
- regex: ^globex[\-\d]*$
  storageType: elasticsearch
  es:
    # %s is replaced with the actual, concrete tenant name
    server-urls: "es.%s.globex.example.com"
# all other tenants, not matching anything that came earlier
default:
  storageType: elasticsearch
  es:
    server-urls: "big-cluster.es.acme.example.com"
    index-prefix: "jaeger-%s"
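
The precedence described in the comments of the example file (concrete tenant first, then regex entries, then the default) could be resolved roughly as follows. This is only a sketch under stated assumptions: the `StorageConfig`, `TenantRule`, and `Resolver` types and the `exampleResolver` helper are hypothetical names invented for illustration, not part of Jaeger, and the file parsing itself is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// StorageConfig is a hypothetical, simplified view of one tenant entry.
type StorageConfig struct {
	StorageType string
	ServerURLs  string // may contain %s, replaced with the tenant name
	IndexPrefix string // may contain %s, replaced with the tenant name
}

// TenantRule matches either a concrete tenant name (Value) or a pattern (Regex).
type TenantRule struct {
	Value  string
	Regex  *regexp.Regexp
	Config StorageConfig
}

// Resolver holds the parsed rules plus the default entry (hypothetical type).
type Resolver struct {
	Rules   []TenantRule
	Default StorageConfig
}

// Resolve applies the precedence from the example file:
// concrete values first, then regex entries in file order, then the default.
func (r Resolver) Resolve(tenant string) StorageConfig {
	for _, rule := range r.Rules {
		if rule.Value != "" && rule.Value == tenant {
			return expand(rule.Config, tenant)
		}
	}
	for _, rule := range r.Rules {
		if rule.Regex != nil && rule.Regex.MatchString(tenant) {
			return expand(rule.Config, tenant)
		}
	}
	return expand(r.Default, tenant)
}

// expand substitutes the tenant name for every %s placeholder.
func expand(c StorageConfig, tenant string) StorageConfig {
	c.ServerURLs = strings.ReplaceAll(c.ServerURLs, "%s", tenant)
	c.IndexPrefix = strings.ReplaceAll(c.IndexPrefix, "%s", tenant)
	return c
}

// exampleResolver mirrors the entries of the example configuration file.
func exampleResolver() Resolver {
	return Resolver{
		Rules: []TenantRule{
			{Value: "globex", Config: StorageConfig{StorageType: "cassandra", ServerURLs: "xyz"}},
			{
				Regex:  regexp.MustCompile(`^globex[\-\d]*$`),
				Config: StorageConfig{StorageType: "elasticsearch", ServerURLs: "es.%s.globex.example.com"},
			},
		},
		Default: StorageConfig{
			StorageType: "elasticsearch",
			ServerURLs:  "big-cluster.es.acme.example.com",
			IndexPrefix: "jaeger-%s",
		},
	}
}

func main() {
	r := exampleResolver()
	for _, tenant := range []string{"globex", "globex-2", "initech"} {
		fmt.Printf("%s -> %+v\n", tenant, r.Resolve(tenant))
	}
}
```

Checking concrete values in a separate pass before any regex keeps the result independent of the order in which entries appear in the file.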

All 12 comments

Before discussing a solution, I would like to clearly define the problem. Multi-tenant has many meanings, as you @jpkrohling outlined in your blog post previously. Let's first agree on the requirements.

@albertteoh @joe-elliott this might be of interest to your companies

Multi-tenant has many meanings, as you @jpkrohling outlined in your blog post previously. Let's first agree on the requirements.

Fair enough :-) The use-cases that we've been hearing are:

  1. departments in the same company managing their own Jaeger instances
  2. departments in the same company where Jaeger is maintained centrally, perhaps "billed" separately (chargeback)
  3. customers of a SaaS company, where customers are charged by usage (Grafana/Logz.io?)
  4. SaaS company providing white-label offering (probably not a tracing vendor)

Of those use cases, the last one is the least common, so I wouldn't worry about it right now. Historically, we've suggested that people in cases 2 and 3 adopt the solution for case 1. Even though Jaeger is quite easy to manage (lightweight, stateless), it's still a concern that people would have to manage hundreds of tenants, where most of the instances would be underutilized.

For the second case, it might be that users belong to multiple tenants (operations might have access to all tenants). So, Jaeger Query would return traces from multiple tenants, based on the user's memberships, present in the token's claims field.

Note that this isn't covering the cases of multi-tenant traces, where different spans in the same trace might belong to different tenants, although it might be a good first step in the direction of supporting it if we see that we have demand for it.

From our side we solve this issue with our centralized account management and our ability to create sub-accounts which roll into the same primary account. This allows us to create specific indices for customers and sub-accounts and manage the backend separately. The current Jaeger setup works fine for our needs unless @albertteoh has something to add here which we might want.

Today we have added token support into the Otel collector as an exporter or the Jaeger collector as a storage backend. It would be good to have this native to Jaeger; it might make things easier.

@jpkrohling Thanks, but these are not quite the requirements. Say multiple depts want to use a shared instance of Jaeger - nothing stops them now. I think requirement R1 would be that the UI experience is walled off to just the traces of that tenant, meaning the list of services, search, and dep graphs are all scoped to a single tenant.

If we assume R1, then the next question is: what defines a "tenant" to the UI? "tenant" in the collection path is simpler to define, but in the UI - is it a dropdown? Is it a role/profile of the logged in user? Can users view data from multiple tenants?

Answers here significantly alter the capabilities that must be provided.

R1 would be the UI experience is walled off to just the traces of that tenant

That's absolutely the case. R2 would be to have different storage backends per tenant, so that they can be billed separately.

what defines a "tenant" to the UI?

An OAuth token typically has the group membership as part of the claims. A tenant would then be each group membership.
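
As a rough illustration of reading tenants from the claims, the sketch below decodes the payload of a JWT-formatted token and returns a group list. It rests on several assumptions not stated in the thread: that the token is a JWT, that the claim is named `groups`, and that signature verification has already happened elsewhere (this code does not verify anything and must not be used for authentication on its own). The function names are hypothetical.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// tenantsFromToken returns the group memberships found in a JWT payload.
// Assumption: the claim is called "groups". The signature is NOT verified here.
func tenantsFromToken(token string) ([]string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token is not a three-part JWT")
	}
	// JWT segments use unpadded base64url encoding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, fmt.Errorf("decoding payload: %w", err)
	}
	var claims struct {
		Groups []string `json:"groups"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return nil, fmt.Errorf("parsing claims: %w", err)
	}
	return claims.Groups, nil
}

// unsignedToken builds a demo token with placeholder header and signature.
func unsignedToken(claimsJSON string) string {
	payload := base64.RawURLEncoding.EncodeToString([]byte(claimsJSON))
	return "header." + payload + ".signature"
}

func main() {
	token := unsignedToken(`{"sub":"alice","groups":["globex","initech"]}`)
	tenants, err := tenantsFromToken(token)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Each group membership is treated as one tenant.
	fmt.Println(tenants)
}
```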

Can users view data from multiple tenants?

Potentially, yes: "For the second case, it might be that users belong to multiple tenants (operations might have access to all tenants)."

It might be the case that this will not perform well enough, as we'd have to iterate over all tenants that a user has access to in order to collect all the traces they can see. But so far, it's a theoretical problem.
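
To make the cost concrete, here is a minimal sketch of such a fan-out: one lookup per tenant the user belongs to, with the results merged. The `Trace` type, the `TraceFinder` signature, and `fakeFinder` are hypothetical stand-ins, not Jaeger's storage API. Running the lookups concurrently is one possible way to soften the per-tenant cost, not something proposed in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Trace is a hypothetical, minimal stand-in for a stored trace.
type Trace struct {
	TraceID string
	Tenant  string
}

// TraceFinder is a hypothetical per-tenant lookup function.
type TraceFinder func(tenant, service string) ([]Trace, error)

// findAcrossTenants queries every tenant concurrently and merges the results.
// It fails as a whole if any single tenant lookup fails.
func findAcrossTenants(tenants []string, service string, find TraceFinder) ([]Trace, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		all      []Trace
		firstErr error
	)
	for _, tenant := range tenants {
		wg.Add(1)
		go func(tenant string) {
			defer wg.Done()
			traces, err := find(tenant, service)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("tenant %s: %w", tenant, err)
				}
				return
			}
			all = append(all, traces...)
		}(tenant)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	// Sort for a deterministic order regardless of goroutine scheduling.
	sort.Slice(all, func(i, j int) bool { return all[i].TraceID < all[j].TraceID })
	return all, nil
}

// fakeFinder serves canned data so the sketch runs without a storage backend.
func fakeFinder(tenant, service string) ([]Trace, error) {
	data := map[string][]Trace{
		"globex":  {{TraceID: "a1", Tenant: "globex"}, {TraceID: "c3", Tenant: "globex"}},
		"initech": {{TraceID: "b2", Tenant: "initech"}},
	}
	traces, ok := data[tenant]
	if !ok {
		return nil, fmt.Errorf("unknown tenant %q", tenant)
	}
	return traces, nil
}

func main() {
	traces, err := findAcrossTenants([]string{"initech", "globex"}, "frontend", fakeFinder)
	fmt.Println(traces, err)
}
```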

@jkowall do you have one Jaeger cluster for all tenants, or do you have one Jaeger per account?

We use one instance per cloud region (we run on two clouds across about 25 regions). We run multiple instances in K8S for redundancy for the UI, if that's what you mean @jpkrohling. Our query engine takes the calls for the backend and passes them to the right Elasticsearch instance. We don't need any more multi-tenancy today. We have our tenant ID in the storage backend for Jaeger collector, and we have the same in the Otel exporter for logz.io. Might be something we could leverage, but by no means something we need before we GA our service offering. Will keep an eye on this thread.

We have our tenant ID in the storage backend for Jaeger collector

You have your own Jaeger Collector fork then, I suppose. Basically, your solution sounds very much like what was proposed here: one Jaeger Collector cluster/instance taking care of multiple tenants, sending data to the "right" ES instance.

I added this topic to be discussed at tomorrow's bi-weekly call.

I have thought about a few other "soft multi-tenancy" options for larger enterprises or closely related companies with shared services:

  • Only show traces involving a service you have access to (trace a->b->c is only visible if you have access to "b")
  • Only show limited details for services you do not have access to (nothing more than start/stop time, operation name, span id, error flag).

A combination would also be possible: you will not see a trace involving only a->d, and you will only see details for span b in a trace a->b->c. If a->e->f has the same trace ID, I'm not sure if it should be displayed or not. The user should probably see children at any level from her own service.

The idea is to expose enough information to locate the service causing errors or slowness, while maintaining user privacy in case spans contain more information than strictly required in logs, URLs etc.

I call it "soft multi-tenancy" because it's not foolproof. A malicious insider with access to a trace id could possibly create a fake span between two services to access more information about the trace.

I think this could be implemented as a stacked storage plugin in the query component, or perhaps built into the query component as a filter that is only active when explicitly enabled.
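
A filter along these lines might look like the sketch below. The `Span` type and `filterTrace` function are hypothetical and much simpler than Jaeger's real span model. Keeping the service name on redacted spans is an assumption, made so that the failing service can still be located.

```go
package main

import "fmt"

// Span is a hypothetical, simplified span; it is not Jaeger's model.
type Span struct {
	SpanID         string
	Service        string
	Operation      string
	StartMicros    int64
	DurationMicros int64
	Error          bool
	Tags           map[string]string // details hidden from users without access
}

// filterTrace hides the trace entirely when the user has access to none of
// its services. Otherwise it returns a copy in which spans of inaccessible
// services keep only timing, operation name, span id and error flag
// (plus the service name, which is an assumption of this sketch).
func filterTrace(spans []Span, allowed map[string]bool) ([]Span, bool) {
	visible := false
	for _, s := range spans {
		if allowed[s.Service] {
			visible = true
			break
		}
	}
	if !visible {
		return nil, false
	}
	out := make([]Span, 0, len(spans))
	for _, s := range spans {
		if !allowed[s.Service] {
			s.Tags = nil // s is a copy, so the caller's span is untouched
		}
		out = append(out, s)
	}
	return out, true
}

func main() {
	trace := []Span{
		{SpanID: "1", Service: "a", Operation: "GET /", Tags: map[string]string{"http.url": "/?user=42"}},
		{SpanID: "2", Service: "b", Operation: "lookup", Tags: map[string]string{"db.statement": "SELECT 1"}},
		{SpanID: "3", Service: "c", Operation: "store", Error: true, Tags: map[string]string{"key": "secret"}},
	}
	filtered, ok := filterTrace(trace, map[string]bool{"b": true})
	fmt.Println(ok)
	for _, s := range filtered {
		fmt.Printf("%s %s error=%v tags=%v\n", s.Service, s.Operation, s.Error, s.Tags)
	}
}
```

This sketch does not address the open question above about whether a->e->f should be displayed, nor the fake-span weakness.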

As we are looking deeper into that (I am from Logz.io as well), a question came up. We have the backend pretty much handled, the way @jkowall explained it.
In our model, users in one account can access data sets on various tenants; this is configurable as we set up the different tracing tenants.

The question is, how should the UI be treated? In this scenario, when searching from an account that has access to several tracing data sets (data sources), users could either select a single data source for each search, or select multiple data sources and receive the results TOGETHER from the different data sources.

For simplicity, and taking into account that different data sources represent different infrastructures, it sounds like the best approach would be to search a single data source at a time.

Are there opinions that support results from multiple data sources for one search?

Supporting results from multiple sources is what I meant here:

For the second case, it might be that users belong to multiple tenants (operations might have access to all tenants). So, Jaeger Query would return traces from multiple tenants, based on the user's memberships, present in the token's claims field.
