Hey @Bplotka @fabxc
Still pretty new to the Thanos code base. Going through it, one thing I've noticed about the backup behaviour is that it seems to only upload the initial 2-hourly blocks. In my instance I've stood up Thanos against an existing Prometheus server. My data dir looks as follows:
data]# ls -ltr
total 108
drwxr-xr-x 3 root root 4096 Jan 28 19:00 01C4Z28176WR17K7PH37K7FG9V
drwxr-xr-x 3 root root 4096 Jan 29 13:00 01C5101JDEX1TK8CMSC4NQK8KP
drwxr-xr-x 3 root root 4096 Jan 30 07:00 01C52XV3T2ETVSG93K73HVP6D1
drwxr-xr-x 3 root root 4096 Jan 31 01:00 01C54VMN5R94ZM7N7F08J20DDA
drwxr-xr-x 3 root root 4096 Jan 31 19:00 01C56SE59DGWBV587GG9M2W99W
drwxr-xr-x 3 root root 4096 Feb 1 13:00 01C58Q7Q5G4R0DFY5HDGD3XC9Y
drwxr-xr-x 3 root root 4096 Feb 2 07:00 01C5AN1885H7B9J1W11DXB137A
drwxr-xr-x 3 root root 4096 Feb 3 01:00 01C5CJTST135PPXDYDWTKEPYTD
drwxr-xr-x 3 root root 4096 Feb 3 19:00 01C5EGMASFVT63NC0QZTJSABFJ
drwxr-xr-x 3 root root 4096 Feb 4 13:00 01C5GEDW21PTYQ8WNKYRCAVQJX
drwxr-xr-x 3 root root 4096 Feb 5 07:00 01C5JC7DGMKPJVQD2DZH0JT7QJ
drwxr-xr-x 3 root root 4096 Feb 6 01:00 01C5MA0ZC22NSD0H7S8G207JFF
-rw------- 1 root root 6 Feb 6 12:29 lock
drwxr-xr-x 3 root root 4096 Feb 6 19:00 01C5P7TGEMP97FSN779S5E5AYH
drwxr-xr-x 3 root root 4096 Feb 7 13:00 01C5R5M1ANW6RYY81S91VC0F75
drwxr-xr-x 3 root root 4096 Feb 8 07:00 01C5T3DJX67W411JZ9FP745B2Q
drwxr-xr-x 3 root root 4096 Feb 9 01:00 01C5W173WNJA1F01TVJJPT5B93
drwxr-xr-x 3 root root 4096 Feb 9 19:00 01C5XZ0MQ5BSY2W5K9KKAGF5N2
drwxr-xr-x 3 root root 4096 Feb 10 13:00 01C5ZWT693CHSK8KEKBW04SSDX
drwxr-xr-x 3 root root 4096 Feb 11 07:00 01C61TKQF3YXY1XT01JN2X0A3W
drwxr-xr-x 3 root root 4096 Feb 12 01:00 01C63RD8T9Y2Z2C7G98YX9RBXV
drwxr-xr-x 3 root root 4096 Feb 12 07:00 01C64D0D1Y2B9P3QHHH9XCF8NV
drwxr-xr-x 3 root root 4096 Feb 12 09:00 01C64KW317054P6TNH8DCNGRP9
drwxr-xr-x 3 root root 4096 Feb 12 11:00 01C64TQT974JFMQV24CH9060XW
drwxrwxrwx 2 root root 4096 Feb 12 11:56 wal
When standing up Thanos I see:
./thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /opt/prometheus/promv2/data/ --s3.bucket=thanos --s3.endpoint=xxxxxxxx --s3.access-key=xxxxxxx --s3.secret-key=xxxxxx
level=info ts=2018-02-12T12:25:49.329654785Z caller=sidecar.go:293 msg="starting sidecar" peer=01C64ZMYG5728WQFXTVCD0F70V
level=info ts=2018-02-12T12:25:49.652116167Z caller=shipper.go:179 msg="upload new block" id=01C64KW317054P6TNH8DCNGRP9
level=info ts=2018-02-12T12:25:51.747570129Z caller=shipper.go:179 msg="upload new block" id=01C64TQT974JFMQV24CH9060XW
Only the last two blocks are uploaded. From the flags for thanos sidecar I don't see a mechanism for specifying a period for backdating. Perhaps I am doing something wrong? Is this intentional for some reason (compute/performance)? Or am I simply filing a feature request here?
Thanks.
Hey, thanks for trying out Thanos.
You are doing nothing wrong. Early on we hardcoded the sidecar to only upload blocks of compaction level 0, i.e. those that never got compacted.
With the garbage collection behavior the compactor has nowadays, it should be safe to also upload historic data and potentially double-upload some data without lasting consequences. We just didn't get around to changing the behavior yet.
In the meantime, you could just manually upload those blocks to the bucket of course.
Yeah, we could now safely drop the rule of uploading only blocks with compaction level 0. However, I am just curious: is there any use case or reason why anyone would want to do compaction at the local, Prometheus level instead of the bucket level?
By default we recommend setting
--storage.tsdb.max-block-duration=...
--storage.tsdb.min-block-duration=...
to the same value to turn off local compaction entirely.
If one decides to actually do local compaction, I can see a (unlikely, but possible) race condition: when the sidecar is not able to upload for some time for various reasons, Prometheus will have enough time to compact some blocks and delete the level-0 blocks. This way nothing would be uploaded.
More importantly, our rule of uploading only level 0 makes it more difficult to use the Thanos sidecar on already-running Prometheus instances.
I think we should just upload all levels by default.
Cool, cheers guys.
@Bplotka in regards to your question about local tsdb compaction. I'm guessing those who are not using an object store would still want local compaction, to save disk. I've heard of teams out there with a year+ worth of time series.
Thanks for flipping PR 207. I've tried testing this out in my fork, but I am not seeing the desired uploads.
@V3ckt0r let's move this discussion to the https://github.com/improbable-eng/thanos/pull/207 PR then.
I think you cannot see these uploads because Thanos marked these as "uploaded" in thanos.shipper.json (because they were compaction level 2+). The easiest way is to manually remove the blocks with compaction level 2+ which are NOT actually uploaded from the thanos.shipper.json uploaded list.
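A minimal sketch of that manual cleanup, assuming the shipper state file is a JSON object with an `uploaded` list of block ULIDs (the exact layout of `thanos.shipper.json` may differ between versions, so check your file first and keep a backup):

```python
import json

def prune_uploaded(meta_path, block_ids_to_retry):
    """Drop the given block IDs from the shipper's 'uploaded' list so the
    sidecar considers them for upload again."""
    with open(meta_path) as f:
        meta = json.load(f)
    retry = set(block_ids_to_retry)
    meta["uploaded"] = [b for b in meta.get("uploaded", []) if b not in retry]
    with open(meta_path, "w") as f:
        json.dump(meta, f)
```

Run it only while the sidecar is stopped, e.g. `prune_uploaded("thanos.shipper.json", ["01C64KW317054P6TNH8DCNGRP9"])`.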
I know that marking them as uploaded when they were not is a bit weird; maybe initially we should have named it processed.
Seems to work for @V3ckt0r
hm just rethinking... @V3ckt0r
@Bplotka in regards to your question about local tsdb compaction. I'm guessing those who are not using an object store would still want local compaction, to save disk. I've heard of teams out there with a year+ worth of time series.
In that case they can specify to not upload things and this issue will never occur (:
OK new approach emerged:
In v2.2.1 we added a min-block-size compaction delay to Prometheus. This will help in 99% of cases to avoid conflicts between the Thanos hardlink-before-upload step and Prometheus local compaction, if enabled. To be even more sure we should use the Snapshot API, but that will require a special flag (admin endpoint) for Prometheus.
On every upload attempt Thanos should:
1) Check if all sources of compacted blocks are in object storage. Upload all compacted blocks that include missing sources. Obviously, also upload all non-compacted blocks which are not in the object store.
2) If the admin endpoint is enabled, trigger a snapshot to make sure no compaction is running at the same time.
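Step 1 above could be sketched roughly as follows, under the assumption that each block directory carries a `meta.json` whose `compaction.sources` field lists the ULIDs of the raw blocks it was built from (these field names follow TSDB block metadata; treat this as illustrative pseudocode, not the actual shipper implementation):

```python
import json
import os

def blocks_to_upload(tsdb_dir, sources_in_bucket):
    """Return local block dirs whose sources are not all in object storage."""
    to_upload = []
    for block_id in sorted(os.listdir(tsdb_dir)):
        meta_path = os.path.join(tsdb_dir, block_id, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # skip non-block entries such as 'wal' and 'lock'
        with open(meta_path) as f:
            meta = json.load(f)
        # A never-compacted block is its own single source.
        sources = meta.get("compaction", {}).get("sources", [block_id])
        if not set(sources) <= set(sources_in_bucket):
            to_upload.append(block_id)
    return to_upload
```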
@fabxc Do we want to require the admin endpoint if the user configures local compaction, and otherwise just have the sidecar error out? I think so.
The question is what to do for versions prior to v2.2.1
As @TimSimmons mentioned, we should add more info in docs as well, how to configure Prometheus for the best experience.
I'm looking at how to integrate the sidecar with our existing Prometheus instances and I'm wondering whether the sidecar should try to only ship the _most compacted_ blocks, within the limits of the data retention window.
The advantages:
The disadvantages:
Sorry for delay @mattbostock! Regarding mentioned benefits:
Reduces the load on the global compactor in Thanos, thus reducing requests/bandwidth on the object storage.
True, but not sure if req/band of object store is actually an issue.
Prometheus retains the performance gains achieved from local compaction.
Is that really needed if you keep the scraper small (24h retention)?
Reduced configuration required in Prometheus to work with Thanos.
Don't get that, what would be simplified?
I am afraid all the disadvantages you mentioned are true, and they are "winning" over the benefits. There are a couple more problems:
True, but not sure if req/band of object store is actually an issue.
Agree. #294 should help to determine the exact usage in access logs.
Is that really needed if you keep the scraper small (24h retention)?
Reducing Prometheus' data retention means that you're reducing the window for recovery during a disaster recovery scenario.
For example, if you're using an on-premise object store that has a catastrophic failure (e.g. datacentre goes up in flames) and Prometheus' data retention is 30 days, that gives you more time to configure a new object store in a new datacentre and configure Thanos to send the data to the new object store.
The same could apply to cloud storage if/when a provider has a significantly long outage.
There are mitigations for this (the most obvious being to not run a single object store in a single datacentre), but most of these are more complex and more costly than retaining data for longer in Prometheus (at least for on-premise installations). I'm not suggesting one over the other, but highlighting that the retention period is a factor to consider.
Reduced configuration required in Prometheus to work with Thanos.
Don't get that, what would be simplified?
Thanos currently requires that compaction is disabled in Prometheus, which means setting a command-line flag for Prometheus.
We would need to enforce some kind of compaction levels; otherwise users would be able to shoot themselves in the foot by changing to some non-standard levels (2.5h -> 8.5d, etc.) that will conflict with the global compactor.
Good point, this is a significant downside.
Cool, I see the disaster recovery goal for some users, but I'm not sure the Prometheus scraper should be treated as a backup solution for your on-premise/cloud/magic object storage. There must be some dedicated tools for that.
The main important use case I can see for local compaction is when users want to migrate to Thanos and upload all the old blocks already compacted by their vanilla Prometheus servers with long retention.
The easiest way would be to just glue the sidecar to an existing Prometheus server and make it smart enough to detect which "sources" are missing in the object store. This way, we can allow local compaction for longer local storage if you wish, and upload the sources that are not upstreamed yet. This is what I proposed here: https://github.com/improbable-eng/thanos/issues/206#issuecomment-374610307
There are some pain points here that need to be solved, though, like using the snapshot API to avoid that. If you really want to keep longer retention for Prometheus long term, nothing blocks you from that with the above logic, except one more thing mentioned here: https://github.com/improbable-eng/thanos/issues/283
and here: https://github.com/improbable-eng/thanos/issues/82
Maybe it would be reasonable to add some arbitrary min-time flag to the sidecar to expose only fresh metrics?
Ok, migration goal moved to another ticket: https://github.com/improbable-eng/thanos/issues/348
This ticket stays as the ability to run Prometheus with local compaction + sidecar for a long period and use it as-is. However, this does not make that much sense, since longer retention is not good for this reason: https://github.com/improbable-eng/thanos/issues/82
This blocker makes me think that this (long retention + local compaction for longer usage) might not be in our scope.
A bit late perhaps, but without knowing the internals of the tsdb I think the suggestion from @mattbostock makes sense.
Compaction would be the limiting factor in how much data can be stored in a bucket. Not only does it have to download and upload all the data that gets stored in the bucket multiple times; unlike the sidecars uploading in parallel, compaction is supposed to run as a singleton. This can be solved by splitting the data into more buckets, but that means more complicated setups where users have to manage which servers go to which bucket.
To me it seems like most issues with uploading compacted blocks also apply to uploading raw blocks.
If all blocks are compacted by prometheus instead of thanos, wouldn't issues like #82 and #283 be reduced/simpler?
Performance issue prometheus/prometheus#3601 relates to the query returning gigabytes of data, is there any reason to think thanos store would handle that better?
Compaction levels are already enforced (no compaction) so enforcing that isn't something new.
Is global compaction still needed? The sidecars could downsample the data before uploading, I don't know about 2w blocks.
It's also worth considering that compacting before uploading would solve some issues too, like #377 and #271 (long time-range queries across all Prometheus sources while the compactor is running are very unreliable, as some blocks are almost bound to have been removed).
Yeah, local-only compaction would solve some issues, nice idea, but unfortunately:
Is global compaction still needed? The sidecars could downsample the data before uploading, I don't know about 2w blocks.
Yes.
Reasons:
I totally see that we want to allow local compaction with Thanos for some use cases, but we need to invest some time to solve it, and I think WITH the global compactor (:
BTW:
Performance issue prometheus/prometheus#3601 relates to the query returning gigabytes of data, is there any reason to think thanos store would handle that better?
We (and most likely lots of other users) have ingestion at a level where even 9d retention is too long, not to mention the additional downsampling procedure that takes lots of memory (though that might be optimized). That being said, not all compaction levels are able to accumulate enough data to be compacted (like 2w).
Yup, one drawback, of course, is that it means Prometheus needs longer retention times, which may not work for all deployments.
FWIW, I am experimenting with this by adding a flag to the sidecar that specifies what compaction level the sidecar uploads. I plan to upload only compaction level 5, where a block has about a week's worth of data.
I then won't run the global compactor, but will run "thanos downsample" or try to patch the sidecar to downsample before upload (which would have the drawback of only one level of downsampling...). I understand this is not the direction the project wants to go for several reasons, but I think it's a worthwhile experiment that could provide some useful data, as the alternative for us is to split the data into several buckets so compaction can keep up (and I really like the simplicity of Thanos and want to keep the deployment simple too :)
Just an FYI.
We had issues with compaction performance: starting with a 1.5-week backlog, it took almost 6 weeks for compaction to catch up. This was partly due to crashes and restarts caused by occasional timeouts, as well as downtime as the disk filled up at times due to the compactor not always cleaning up.
It then started the downsampling process which I estimated would take another 3 weeks to complete before the cycle would start over.
I then aborted the whole process, let Prometheus start compacting the data, got a new bucket, and added a small patch to the sidecar to upload only blocks where compaction level == *flagShipperLevel. With a 3-week backlog, the whole compaction and upload process now took less than 4 hours.
While one solution is to have multiple buckets to distribute the compaction load that way, would it be possible to add a flag to the sidecar to only upload blocks at a certain compaction level, for those that have large enough Prometheus servers?
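The level check such a patch boils down to can be sketched like this, assuming the standard TSDB `meta.json` with a `compaction.level` field; `wanted_level` stands in for the hypothetical `*flagShipperLevel` flag mentioned above (the real patch would live in the Go shipper code, this is just an illustration):

```python
import json

def matches_shipper_level(meta_path, wanted_level):
    """True if the block's compaction level equals the configured level."""
    with open(meta_path) as f:
        meta = json.load(f)
    return meta.get("compaction", {}).get("level") == wanted_level
```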
Totally missed this, sorry.
While one solution is to have multiple buckets to distribute the compaction load that way, would it be possible to add a flag to the sidecar to only upload blocks at a certain compaction level, for those that have large enough Prometheus servers?
Yes, definitely a valid use case, probably for a separate issue. And useful as a one-time job. We are actually working on it as part of: https://github.com/observatorium/thanos-replicate/issues/7
We also added sharding for the compactor, so you can deploy many compactors that operate on different blocks.
@bwplotka would it be better to add a compactor label?
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
It would be amazing to get back to this. @daixiang0 no as it's Prometheus compaction, not Compactor really.
I think the solution is vertical compaction, which is quite stable as long as data is 1:1.
/reopen