Acceptance criteria:
--> https://github.com/improbable-eng/thanos/issues/942
Initial ideas:
Extra option mentioned below: add --max-time and --min-time to store & compactor to "shard" those within time.
CC @claytono @tdabasinskas @xjewer
Another option @antonio and I have discussed is adding --mintime and --maxtime flags to thanos store and compactor. If the flags were given, each component would ignore blocks outside the given time range, allowing you to run multiple thanos store and compactor components against a single bucket, and also to easily repartition just by selecting different time ranges.
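A minimal sketch of the proposed behavior (the names and structure here are hypothetical, not the actual Thanos implementation): each store or compactor instance would simply skip any block whose time span falls entirely outside its configured window.

```python
from typing import List, NamedTuple

class Block(NamedTuple):
    """A bucket block with its time span (Unix millis, as in a block's meta)."""
    id: str
    min_time: int
    max_time: int

def select_blocks(blocks: List[Block], min_time: int, max_time: int) -> List[Block]:
    """Keep only blocks overlapping [min_time, max_time], mimicking what a
    --min-time/--max-time pair would do for a store or compactor instance."""
    return [b for b in blocks if b.max_time >= min_time and b.min_time <= max_time]

blocks = [
    Block("old", 0, 100),
    Block("mid", 100, 200),
    Block("new", 200, 300),
]

# A "recent data" store instance would serve only the newer blocks:
print([b.id for b in select_blocks(blocks, 150, 300)])  # ['mid', 'new']
```

A second instance configured with the complementary window (e.g. 0–150) would pick up the older blocks, which is what makes repartitioning a pure configuration change.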
I think we should add this https://github.com/improbable-eng/thanos/issues/335 to the list.
Is there any update?
I've started work on a patch for the --min-time and --max-time functionality. I've got it working for the store code, and I hope to start on the compactor piece soon.
Help wanted for other stuff.
We also likely fixed: https://github.com/improbable-eng/thanos/issues/335 on master, but tests are pending by @GiedriusS (:
@claytono cool :+1:
@claytono cool, can you submit the store code first? That's what we need...
I have a lot of large buckets, many of the index.cache.json files are ~100MB.
One idea that came to mind was to use FlatBuffers.
@claytono is there any update? It would be nice to solve this in a general way as we discussed here.
I'm hoping to get a PR up for this within the week if time allows. For now, my PR only addresses partitioning on the thanos-store side. It's not clear to me whether similar limiting is really needed on the compactor side. We're planning to do an initial deployment without compactor support for time ranges.
> Another option @antonio and I have discussed is adding a --mintime and --maxtime
We discussed this as well in @povilasv's PR: https://github.com/improbable-eng/thanos/pull/930
I just tried 0.3.2 on Tuesday, it didn't work for my large buckets in s3. I have 37 prometheus clusters (currently), 9TB of data total, largest bucket is around 700GB. I reverted to 0.2.1 and things are back to normal. High latency and query timeouts were the issues I was seeing. I am running prometheus 2.4.3, not sure if that might have been contributing to the issue.
Do you guys think this work will help towards that end? Thanks for the great work 😄
@midnightconman have you read the change log? Most likely you need to increase your index cache size (:
I did 😄
I tried settings of --index-cache-size=20GB and --chunk-pool-size=200GB, no change. Strangely, the disk usage in /data is the same for 0.2.1 and 0.3.2?
This isn't just a matter of slower queries (e.g. 200ms on 0.2.1 vs. 1000ms on 0.3.2); on 0.3.2, queries against the larger buckets never return.
Could we have multiple store gateways divide the load between themselves? Ideally I picture three store gateways pointing at a single bucket, each handling a third of the chunks, divided over the whole time period (e.g. all have some newer and some older chunks). If another gateway were added, they would work out a new way to divide the blocks, and likewise if one disappeared. I think this would be nicer than having the user work out the time ranges to match the chunks, and it would also prevent the gateway with the newest chunks doing most of the work while the ones with older chunks do little.
@baelish That seems ideal. The manual time range partitioning was mostly proposed as something fairly simple to implement and start using quickly. I'd guess the issues with the automatic approach would be coordination between the gateways and the need to publish consistent time ranges. On the latter: stores currently publish just a min time and a max time, so if you want queries routed only to a store that definitely has the blocks, you'd need to ensure each store holds a contiguous range of blocks, or change the way ranges are published so a store can advertise multiple time ranges.
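One way to make the split automatic rather than manual is to hash each block ID modulo the number of gateways. This is only a sketch of @baelish's idea, not how Thanos implements anything; note that it gives each gateway an interleaved, non-contiguous set of blocks, which is exactly the advertising problem described above.

```python
import hashlib

def owner(block_id: str, num_gateways: int) -> int:
    """Deterministically assign a block to one of num_gateways store
    gateways by hashing its ID (hashmod-style sharding)."""
    digest = hashlib.sha256(block_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_gateways

blocks = [f"block-{i:02d}" for i in range(6)]
for gw in range(3):
    assigned = [b for b in blocks if owner(b, 3) == gw]
    print(f"gateway {gw}: {assigned}")

# Adding a fourth gateway changes owner(b, 4), so some blocks move between
# gateways -- that reshuffling, plus each gateway having to advertise a
# non-contiguous set of time ranges, is the coordination cost noted above.
```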
@claytono makes sense, sometimes you need to get things out there quick. Perhaps it could be considered a long term goal.
Thanks, everyone involved! :heart:
We now have time partitioning and sharding of blocks by external labels, as requested in this ticket, so we can close this!
For further improvements and ideas tracking issue please see: https://github.com/thanos-io/thanos/issues/1705
Happy Halloween!