Thanos: Support multi-tenancy on object storage using custom prefix

Created on 10 Jul 2019 · 48 comments · Source: thanos-io/thanos

Support multi-tenancy on object storage using custom prefix

Context / Use case

On a multi-cluster Kubernetes setup with multiple Prometheus instances (one per namespace) managed by the Prometheus Operator, each Prometheus instance embeds its own Thanos sidecar, and for performance and scalability purposes data chunks are sharded across multiple S3 buckets.
For each Prometheus instance there is a dedicated Thanos store and a Thanos compactor working on its own bucket. A single Thanos query is set up to map/reduce PromQL queries across the multiple stores and Thanos sidecars.

Problems

  • With this setup there is a proliferation of S3 buckets.
  • The bucket creation lifecycle is not as dynamic as Kubernetes resources (terraform apply versus kubectl apply), so it requires juggling multiple tools.

Question / feature request

Is there a way to share a bucket across multiple thanos instances (keeping the sharding property) ?

Otherwise, a cool feature could be the ability to add a custom prefix on thanos objects (objstore.config). So we could take advantage of s3 bucket multi-tenancy.

  • e.g.: s3://<my-bucket-name>/<custom-prefix>/<thanos-objects>
Labels: receive, sidecar, store, feature request / improvement, help wanted

Most helpful comment

The proposal LGTM :1st_place_medal:

All 48 comments

Design proposal to discuss

Scope :

  • Enrich objstore configuration to support a custom prefix in s3 object storage key name
  • Focus on AWS S3 implementation for a first contribution

Expected result

[Before] Current flat bucket file tree :

  bucket-root
  ├── 01DH3V0E80CBEK8SK00A93SH7Z
  │    ├── chunks
  │    ├── index
  │    └── meta.json
  ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX
  │    ├── chunks
  │    ├── index
  │    └── meta.json
  ├── 01DH3V01JXFHC0C9ABCM7MZGP4
  │    ├── chunks
  │    ├── index
  │    └── meta.json
  └── debug
       └── metas
           ├── 01DH3V0E80CBEK8SK00A93SH7Z.json
           ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX.json
           └── 01DH3V01JXFHC0C9ABCM7MZGP4.json

[After] Expected prefixed bucket file tree :

  bucket-root
  └── my
      └── custom
          └── prefix
              ├── 01DH3V0E80CBEK8SK00A93SH7Z
              │    ├── chunks
              │    ├── index
              │    └── meta.json
              ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX
              │    ├── chunks
              │    ├── index
              │    └── meta.json
              ├── 01DH3V01JXFHC0C9ABCM7MZGP4
              │    ├── chunks
              │    ├── index
              │    └── meta.json
              └── debug
                   └── metas
                       ├── 01DH3V0E80CBEK8SK00A93SH7Z.json
                       ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX.json
                       └── 01DH3V01JXFHC0C9ABCM7MZGP4.json

Implementation steps

  1. Add a new prefix flag in objstore.config for type S3:

     type: S3
     config:
         prefix: "my/custom/prefix/"
         bucket: ""
         endpoint: ""
         region: ""
         access_key: ""
         insecure: false
         [...]
  2. Write some unit tests in pkg/objstore/s3/s3_test.go

  3. Edit pkg/objstore/s3/s3.go to:

    • Extend the configuration data structure type Config struct with Prefix string `yaml:"prefix"`
    • Validate the prefix format:

      func validate(conf Config) error {
          [...]
          if len(conf.Prefix) >= 1024 {
              return errors.New("prefix is too long (Amazon limits object key names to 1024 bytes)")
          }
          return nil
      }
      [...]
      // Normalize so a non-empty prefix always ends with a single "/".
      prefix = strings.TrimSuffix(prefix, "/") + "/"

    • Append the prefix string to the object name for each minio call requiring an object name. For instance:

      // Delete removes the object with the given name.
      func (b *Bucket) Delete(ctx context.Context, name string) error {
          return b.client.RemoveObject(b.name, prefix+name)
      }

    A consolidated sketch of these changes is shown after this list.
      
  4. Update the doc: docs/storage.md

  5. Create a PR
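For discussion, here is a minimal, self-contained sketch of what the struct extension and prefix handling could look like. Field and helper names (normalizePrefix, objectKey, the exact Config fields) are illustrative assumptions, not taken from the actual Thanos code or any PR:

  package s3sketch

  import "strings"

  // Config sketches the proposed extension of the S3 configuration with an
  // optional key prefix; only the fields relevant to this proposal are shown.
  type Config struct {
      Bucket string `yaml:"bucket"`
      Prefix string `yaml:"prefix"` // new: optional key prefix, e.g. "my/custom/prefix/"
      // ... existing fields (endpoint, region, credentials, ...)
  }

  // normalizePrefix ensures a non-empty prefix ends with exactly one "/",
  // so that object keys can be built by plain concatenation.
  func normalizePrefix(p string) string {
      if p == "" {
          return ""
      }
      return strings.TrimSuffix(p, "/") + "/"
  }

  // objectKey maps a Thanos object name (e.g. "<ULID>/meta.json") to the full
  // S3 key under the configured prefix.
  func objectKey(prefix, name string) string {
      return normalizePrefix(prefix) + name
  }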

Forecast of impacted files:

The proposal LGTM :1st_place_medal:

@edevouge is there any progress with this issue?

@Nathan-Okolita-Aimtheory: a pull request ( #1392 ) implementing this feature is work-in-progress by @jaredallard

I'm not against this per se, but we already have a multi-tenancy mechanism approved in the thanos receive component, that is label-set based. If we do something about multi-tenancy in the object store bucket, then that should be label-set based as well, as opposed to path based.

@brancz would your label based implementation look like this in the Bucket.yml:
type: S3
config:
  label: "tenant_label"
  bucket: "bucket"
  endpoint: "s3"
  region: "region"
  access_key: "access_key"
  secret_key: "secret_key"
  insecure: false
  signature_version2: true
Would the Thanos store gateway need to index the entire s3 bucket to return just the label desired?

A tenant is a label-set, so not just a single label. So that specific part would rather be something like:

tenantLabels:
- label1
- label2

@brancz I'm not fully familiar with the label-based tenants, but segregating by bucket path allows you to use one bucket and one set of policies for multiple Thanos deployments that may not want to share Thanos services.

At least in my case, I have different teams with different products. By leveraging path restrictions I can reduce my S3 bucket sprawl. It also means that teams that may want to share a common query endpoint can do so, without exposing themselves to cardinality bombs by bad deployments. If someone does something bad, it only impacts their bucket location.

@edevouge @jaredallard are either of you continuing to work on the existing PR?

These changes would allow lifecycle management inside the multipurpose buckets.
As an example, we use an S3 bucket per cluster, and we maintain multiple clusters in a single account.
It's convenient to use a single bucket, replicate it, delete it with all the data at once.
And the best part is that we can apply a life-cycle policy on the Thanos data by using the prefix of a subdirectory.
And this is crucial for us.

We host all monitoring related configs, backup etc. under single bucket and would like to use that bucket to host thanos data as well. This feature would be super useful.

Eliminating the maintenance of multiple buckets is a definite advantage.
One more requirement here will be to ensure that clients cannot affect each other's performance and stability (e.g. with a query like {"__name__"=~".+"}, at most the setup for the single client/hashring should be affected; it shouldn't be necessary to index and query metrics from all clients, which may be the case for the label-based segregation suggested by @brancz). A solution with a /path prefix and dedicated read access (stores, queriers) for each client seems to address these issues.

I played a little bit with the PR from @DLag (mentioned above) and managed to set up a configuration with read and write access; however, I spotted that compaction is affected when accessing data from s3 subdirectories, failing with a write compaction: chunk 8 not found: segment index 0 out of range error similar to https://github.com/thanos-io/thanos/issues/1300. The setup was tested with Thanos 0.10 plus the additional s3 prefix configuration (the same dataset works fine with Thanos v0.10 and dedicated s3 buckets instead of subdirectories).
That is one more thing that should be verified for this feature.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

Kind of the same use case here: I need to set a prefix because Thanos data will not be the only data I store in the bucket. It seems two PRs have been closed for this issue, but I'm having a hard time understanding what the blocker is.

How can we move forward here ? :)

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

I am still hoping this issue will be picked up. We have an environment with many Prometheus instances many of which are deployed through automation. Having to provision many s3 buckets is not a very appealing prospect.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

I think there is still a lot of interest here. This would make it a lot easier to operate a single Thanos that can aggregate metrics for N teams' prometheuses.

Introducing the ability to configure an optional bucket prefix would also be a nice step forward towards Cortex blocks storage and Thanos bucket store interoperability.

No objections from my side except why we are calling custom string prefix a multi-tenant support (:

Let's bring this to our next (public!) Thanos Community Sync! (: We will send announcement soon (cc @povilasv)

We need to be a bit careful, because there may be different views on what this means. Does Thanos read all directories and treat directories as tenants? Does it just scope all its read requests to the provided prefix? What if we wanted to identify tenants not by a flat ID but with a labelset instead (possibly a future concern, but healthy to think about I think)?

Not against this at all, but I think it's easy to miss other concerns when we just look at our individual immediate need.

@brancz Agree. Just adding bucket prefix support (and nothing else) would cover the single-tenant use case (single prefix in Cortex, single customizable prefix in Thanos). It's not the end of the story, just a starting point.

Yeah I agree. Keeping those concerns separate is definitely the right thing, so just having a prefix seems safe. I'd be ok with that already now.

If it's of interest, our use case for this feature is purely so that we can reduce the number of buckets we create (to postpone running into AWS limits on the number you can have within a single AWS account). By allowing it to use a key within a bucket, we can share the bucket with other (unrelated) resources.
We have no desire to have Thanos look at multiple keys within the same bucket holistically; we just want to bump down where it looks to be a sub-key rather than the "root" of the bucket.

@bwplotka is there any news regarding this? I cannot see this in community meeting agenda?

Re community meeting, I don't think we got to this issue so we rescheduled it for next meeting.

We plan to talk about this today (in 4m) (:

Looks like the decision is that we are ok for simple custom prefix on config level (for all objstorages) in YAML (:

Help wanted! :+1:

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Still help wanted (:

@bwplotka - How involved would this change be? I could really use this feature but not sure if I have the time to learn a new codebase and contribute

The decision we took a couple of Thanos meetings ago is:

Looks like the decision is that we are ok for simple custom prefix on config level (for all objstorages) in YAML (:

To elaborate on this, the idea is to allow configuring (in YAML) the prefix under which all data (blocks) is stored within the bucket. This would remove the requirement of having a bucket dedicated to a single Thanos installation, and you could use a bucket for multiple purposes (or multiple Thanos setups).
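For illustration only, such a config could look roughly like the snippet below. The prefix key name and its exact placement in the YAML are assumptions; they were still open for discussion at this point:

  type: S3
  config:
    bucket: "shared-ops-bucket"
    endpoint: "s3.eu-west-1.amazonaws.com"
  # Hypothetical provider-agnostic key: all blocks would be read and written
  # under this path instead of the bucket root.
  prefix: "thanos/prod"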

Right, I understand the use-case of this feature (and will be using it myself to reduce the number of buckets I have to manage)

I'm trying to figure out the work involved because I'm new to the codebase. I see there's a few PRs linked earlier in this issue (https://github.com/thanos-io/thanos/pull/1392, https://github.com/thanos-io/thanos/pull/1862)

Can we just dust off one of them and expand upon it?

I'm trying to figure out the work involved because I'm new to the codebase. I see there's a few PRs linked earlier in this issue (#1392, #1862)

Can we just dust off one of them and expand upon it?

I'm not sure that's the right approach. We want the feature to support any backend (not just S3), and implementing it in every client would introduce quite a lot of duplication. In Cortex we successfully built this feature by wrapping the objstore client, similar to how objstore.BucketWithMetrics() works. I think a similar wrapping approach could work in Thanos as well.
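To make the wrapping idea concrete, here is a rough sketch under the assumption of a simplified bucket interface. miniBucket, withPrefix and prefixedBucket are made-up names, and the interface is a reduced stand-in for objstore.Bucket, not its real method set:

  package prefixsketch

  import (
      "context"
      "io"
      "strings"
  )

  // miniBucket is a simplified stand-in for the objstore.Bucket interface,
  // reduced to a few representative methods.
  type miniBucket interface {
      Get(ctx context.Context, name string) (io.ReadCloser, error)
      Upload(ctx context.Context, name string, r io.Reader) error
      Delete(ctx context.Context, name string) error
      Iter(ctx context.Context, dir string, f func(string) error) error
  }

  // prefixedBucket wraps any bucket client and scopes every operation under a
  // fixed prefix, in the same spirit as wrapping for metrics.
  type prefixedBucket struct {
      bkt    miniBucket
      prefix string // always ends with "/" when non-empty
  }

  func withPrefix(bkt miniBucket, prefix string) miniBucket {
      if prefix == "" {
          return bkt
      }
      return &prefixedBucket{bkt: bkt, prefix: strings.TrimSuffix(prefix, "/") + "/"}
  }

  func (p *prefixedBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
      return p.bkt.Get(ctx, p.prefix+name)
  }

  func (p *prefixedBucket) Upload(ctx context.Context, name string, r io.Reader) error {
      return p.bkt.Upload(ctx, p.prefix+name, r)
  }

  func (p *prefixedBucket) Delete(ctx context.Context, name string) error {
      return p.bkt.Delete(ctx, p.prefix+name)
  }

  func (p *prefixedBucket) Iter(ctx context.Context, dir string, f func(string) error) error {
      // Strip the prefix back off before handing names to the caller, so the
      // rest of the code keeps seeing prefix-less object names.
      return p.bkt.Iter(ctx, p.prefix+dir, func(name string) error {
          return f(strings.TrimPrefix(name, p.prefix))
      })
  }

The nice property of this decorator approach is that every provider (S3, GCS, Azure, ...) gets the feature for free, because only the generic bucket client needs to know about the prefix.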

So we have multiple Prometheus Operators running on multiple clusters:

Cluster A with Prometheus+ThanosSidecar
Cluster B with Prometheus+ThanosSidecar
Cluster C with Prometheus+ThanosSidecar
Cluster D with Thanos Operator with a query-master and Bucket

Right now, all the data written from Cluster A,B,C go to that one bucket at the root level, with no multi-tenancy. Could this scenario be causing what I'm seeing here: https://github.com/thanos-io/thanos/issues/2958
The Store API just advertises the data it can access, so if it can access all the data... could it be showing all the clusters for each Endpoint instead of the cluster associated with the specific Endpoint?

As mentioned by https://github.com/thanos-io/thanos/issues/1318#issuecomment-660851708 we want this, so help wanted.

@tanelso2 Feel free to try out! We can guide you on what needs to be changed (:

No one is working on it currently, so help wanted (:

I want to take this up because I think this feature is really interesting, and I could use it with thanos receive for some of our client use cases. I was going through PR 1392. As far as I understand, this comment from @bwplotka suggests adding a Prefix string attribute to the BucketConfig struct of https://github.com/thanos-io/thanos/blob/3ce1da85c066c3f248f9e758c8f55d3c3d946472/pkg/objstore/client/factory.go#L38. Right?
But I presume we will also need to make modifications to the code of other Thanos components: store, sidecar and receive?
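As a rough sketch (names and placement are assumptions, not the final implementation), the change discussed there would look something like this, with the prefix living next to the provider-agnostic fields so every backend picks it up:

  package factorysketch

  // ObjProvider mirrors the string-based provider type used in
  // pkg/objstore/client (sketch only).
  type ObjProvider string

  // BucketConfig sketches the discussed extension: a Prefix field next to the
  // existing provider type and provider-specific config.
  type BucketConfig struct {
      Type   ObjProvider `yaml:"type"`
      Config interface{} `yaml:"config"`
      Prefix string      `yaml:"prefix"` // proposed new field
  }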

Just for record, I am working on this.

Thanks to the amazing work of @dsayan154 (https://github.com/thanos-io/thanos/pull/3289) I revisited this issue.

BTW, to answer the very initial question / feature request at the top of this issue:

Question / feature request
Is there a way to share a bucket across multiple thanos instances (keeping the sharding property) ?

Otherwise, a cool feature could be the ability to add a custom prefix on thanos objects (objstore.config). So we could take advantage of s3 bucket multi-tenancy.

e.g.: s3://<my-bucket-name>/<custom-prefix>/<thanos-objects>

YES: Thanos can totally share a bucket across multiple instances, tenants, clusters etc. It's thanks to globally unique block IDs and external labels (for tenancy and producer identification)

Based on the discussions we decided to have this option purely to share Thanos data with something else in the bucket. However, when I think about it, you can totally run Thanos on the bucket and just put everything else under some other directory/path ;p

.. so do we really need that prefix? :hugs: Is there any other use case for this? Anything else blocked on that?

Another potentially interesting option is to scale bucket iteration latency / usage further by spreading groups of tenants across different paths (e.g. time based, for further unlimited retention). But this can be totally transparent to users TBH (:

The implementation is not that complex, so I think we are happy to put this forward, but I am concerned that users may get the feeling this is required for tenancy, which is NOT true. In fact, putting everything on different paths WILL REQUIRE totally separate Store Gateway / Compactor instances by design (worth mentioning in the doc you are adding, @dsayan154, BTW).

Can you elaborate more @dsayan154 on this receiver client use case: https://github.com/thanos-io/thanos/issues/1318#issuecomment-686439745

To elaborate on our use case and why we need the prefix feature (rather than putting all non-metrics items under other or something at the root): we want to store the metrics in a bucket that we're not the primary owner of and need to share with other teams. By having the prefix, we can carve out a namespace like /foo/bar/metrics/ and not have to worry about any collisions between keys Thanos generates and items other teams create.

@bwplotka when you say:

YES: Thanos can totally share a bucket across multiple instances, tenants, clusters etc.

and

.. so do we really need that prefix? 🤗 Is there any other use case for this? Anything else blocked on that?

Can you clarify how to avoid the issue @mei-ling is having without prefixes? If I understand correctly, prefixes would allow multiple store gateways / compactors to work on the same bucket, one per cluster, as desired in that case.

I'm also a little confused by how people are using "tenancy" here, as I feel like that's almost the opposite of this issue. Does "multi-tenant" here mean having one set of infrastructure across an arbitrary number of Prometheuses monitored by one "thanos", with possibly multiple store gateways / compactors? That's the use case I feel prefixes help with. Or does it mean having multiple teams with different sets of infrastructure who want to share one "thanos" for monitoring only what they are responsible for?

Can you elaborate more @dsayan154 on this receiver client use case: #1318 (comment)

@bwplotka we are trying to build a centralized metric ingestion solution with Thanos Receiver and Prometheus. The idea is to have a single top-level Prometheus for each tenant k8s cluster, and those Prometheuses would push metrics to an external Thanos Receiver endpoint along with the tenant header. In case a tenant stops subscribing to this solution, the tenant cluster's data should be deleted from the object store. With this prefix feature, we would be able to store each tenant's data in a different location in the same bucket. This would allow us to avoid the following complexities:

  1. When deleting a tenant's data from the object store, the prefix gives us better visibility of where exactly the concerned data is located.
  2. If we store all tenants' data under the same path of the object store, then the bucket iteration count for the Store would increase, wouldn't it? The alternative would be to store every tenant's data in a separate bucket (AWS limits an account to 1000 s3 buckets). Otherwise, we can use this feature to store each tenant's data under segregated paths in the same bucket.

In fact, putting all on different paths WILL REQUIRE totally separate Store Gateway / Compactor by design

If we are using the Receiver to store the streamed data, and if we don't want all the tenants to share the same configuration (e.g. retention period), then we would require a separate Receiver for every client anyway, wouldn't we?
Talking of our use-case, we are planning to auto-provision the separate Thanos components (the required ones like Receiver, Store and Compact) when a tenant onboards onto this system.

.. so do we really need that prefix? Is there any other use case for this? Anything else blocked on that?

I did try & build the pull request for the following reason:

With the prefix, it is possible to use the excellent thanos block viewer on a cortex backend. Currently Cortex does not really have a nice UI like thanos to visualize this.

Hi @roidelapluie, did it work for the use-case you were talking about? I am curious because I still have to write a few unit tests for this feature.

Yes it did work perfectly

Thanks for the explanations @markmsmith @roidelapluie @dsayan154 I think I am happy with some simple prefix to make sure you can store blocks deeper in bucket (or have it compatible with Cortex).

@underrun:

Can you clarify how to avoid the issue @mei-ling is having without prefixes, which, if i understand, would allow multiple store gateway / compactor to work on the same bucket - one per cluster as desired in that case?

You just point multiple sidecars to the same bucket, same directory. Since Prometheus has unique external labels we can distinguish between blocks. Directory does not matter for Thanos. (:
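For example, two Prometheus servers uploading to the very same bucket path only need distinct external labels; a minimal (made-up) prometheus.yml fragment per cluster would be enough to keep their blocks apart:

  # prometheus.yml in cluster A
  global:
    external_labels:
      cluster: cluster-a
      prometheus_replica: prometheus-0

  # prometheus.yml in cluster B
  global:
    external_labels:
      cluster: cluster-b
      prometheus_replica: prometheus-0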

I'm also a little confused by how people are using "tenancy" here as i feel like that's almost the opposite of this issue. does "multi-tenant" here mean having one set of infrastructure across an arbitrary number of prometheus monitored by one "thanos" with possibly multiple store gateway / compactor? This is the use case I feel like prefixes help with. Or does it mean having multiple teams with different sets of infrastructure who want to share one "thanos" for only monitoring what they are responsible for?

I think both things are the same right? Multiple metric sources (e.g Prometheus + sidecar) from multiple potentially isolated teams, uploading to the same bucket, thus allowing whatever set of compactors/store gateways thanks to sharding.

@dsayan154

In case, a tenant stops subscribing to this solution, the tenant cluster's data should be deleted from the objectore store.

Deletion of data for a defined prefix is a valid issue/request you can put on Thanos, and we can guide/help you write reliable CLI / tooling to make it work (complexity no. 1) (: You can even build on top of that and automate this purely with some per-tenant deletion API. Building on top of a prefix is really not just hacking prefixes for multi-tenancy. Our semantics for multi-tenancy are around labels, so it would be nice if your setup followed the same rules so we can work together on those complexities you mentioned (:

Per complexity number 2: you are right, this is a scalability limitation, but it is further away (you can easily have 100k items for the iter API, and that is quite a large number of blocks (something like 1y of ~100 tenants) for Thanos with healthy compaction). This is also easy to solve by copying the blocks to new dirs if needed, etc. Let's create an issue to discuss potential solutions. What I am saying is that a prefix will not solve this (very potential) problem: what if within a single tenant you have that number of blocks?

@roidelapluie

With the prefix, it is possible to use the excellent Thanos block viewer on a cortex backend. Currently, Cortex does not really have a nice UI like Thanos to visualize this.

YES! cc @Oghenebrume50 @prmsrswt @kunal-kushwaha for motivation to get the block viewer natively into Prometheus (and Cortex as well!)
