Hey @Bplotka @fabxc
Still pretty new to the Thanos code base. Going through it, one thing I've noticed about the backup behaviour is that it seems to only upload the initial 2-hourly blocks. In my instance I've stood up Thanos against an existing Prometheus server. My data dir looks as follows:
data]# ls -ltr
total 108
drwxr-xr-x 3 root root 4096 Jan 28 19:00 01C4Z28176WR17K7PH37K7FG9V
drwxr-xr-x 3 root root 4096 Jan 29 13:00 01C5101JDEX1TK8CMSC4NQK8KP
drwxr-xr-x 3 root root 4096 Jan 30 07:00 01C52XV3T2ETVSG93K73HVP6D1
drwxr-xr-x 3 root root 4096 Jan 31 01:00 01C54VMN5R94ZM7N7F08J20DDA
drwxr-xr-x 3 root root 4096 Jan 31 19:00 01C56SE59DGWBV587GG9M2W99W
drwxr-xr-x 3 root root 4096 Feb 1 13:00 01C58Q7Q5G4R0DFY5HDGD3XC9Y
drwxr-xr-x 3 root root 4096 Feb 2 07:00 01C5AN1885H7B9J1W11DXB137A
drwxr-xr-x 3 root root 4096 Feb 3 01:00 01C5CJTST135PPXDYDWTKEPYTD
drwxr-xr-x 3 root root 4096 Feb 3 19:00 01C5EGMASFVT63NC0QZTJSABFJ
drwxr-xr-x 3 root root 4096 Feb 4 13:00 01C5GEDW21PTYQ8WNKYRCAVQJX
drwxr-xr-x 3 root root 4096 Feb 5 07:00 01C5JC7DGMKPJVQD2DZH0JT7QJ
drwxr-xr-x 3 root root 4096 Feb 6 01:00 01C5MA0ZC22NSD0H7S8G207JFF
-rw------- 1 root root 6 Feb 6 12:29 lock
drwxr-xr-x 3 root root 4096 Feb 6 19:00 01C5P7TGEMP97FSN779S5E5AYH
drwxr-xr-x 3 root root 4096 Feb 7 13:00 01C5R5M1ANW6RYY81S91VC0F75
drwxr-xr-x 3 root root 4096 Feb 8 07:00 01C5T3DJX67W411JZ9FP745B2Q
drwxr-xr-x 3 root root 4096 Feb 9 01:00 01C5W173WNJA1F01TVJJPT5B93
drwxr-xr-x 3 root root 4096 Feb 9 19:00 01C5XZ0MQ5BSY2W5K9KKAGF5N2
drwxr-xr-x 3 root root 4096 Feb 10 13:00 01C5ZWT693CHSK8KEKBW04SSDX
drwxr-xr-x 3 root root 4096 Feb 11 07:00 01C61TKQF3YXY1XT01JN2X0A3W
drwxr-xr-x 3 root root 4096 Feb 12 01:00 01C63RD8T9Y2Z2C7G98YX9RBXV
drwxr-xr-x 3 root root 4096 Feb 12 07:00 01C64D0D1Y2B9P3QHHH9XCF8NV
drwxr-xr-x 3 root root 4096 Feb 12 09:00 01C64KW317054P6TNH8DCNGRP9
drwxr-xr-x 3 root root 4096 Feb 12 11:00 01C64TQT974JFMQV24CH9060XW
drwxrwxrwx 2 root root 4096 Feb 12 11:56 wal
When standing up Thanos I see:
./thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /opt/prometheus/promv2/data/ --s3.bucket=thanos --s3.endpoint=xxxxxxxx --s3.access-key=xxxxxxx --s3.secret-key=xxxxxx
level=info ts=2018-02-12T12:25:49.329654785Z caller=sidecar.go:293 msg="starting sidecar" peer=01C64ZMYG5728WQFXTVCD0F70V
level=info ts=2018-02-12T12:25:49.652116167Z caller=shipper.go:179 msg="upload new block" id=01C64KW317054P6TNH8DCNGRP9
level=info ts=2018-02-12T12:25:51.747570129Z caller=shipper.go:179 msg="upload new block" id=01C64TQT974JFMQV24CH9060XW
Only the last two blocks are uploaded. From the flags for thanos sidecar I don't see a mechanism for specifying a period for backdating. Perhaps I am doing something wrong? Is this intentional for some reason (compute/performance)? Or am I simply filing a feature request here?
Thanks.
Hey, thanks for trying out Thanos.
You are doing nothing wrong. Early on we hardcoded the sidecar to only upload blocks of compaction level 0, i.e. those that never got compacted.
With the garbage collection behavior the compactor has nowadays, it should be safe to also upload historic data and potentially double-upload some data without lasting consequences. We just didn't get around to changing the behavior yet.
In the meantime, you could just manually upload those blocks to the bucket of course.
Yeah, we could now safely drop the rule of uploading only blocks with compaction level 0. However, I am just curious: is there any use case or reason why anyone would want to do compaction at the local, Prometheus level instead of the bucket level?
By default we recommend setting
--storage.tsdb.max-block-duration=...
--storage.tsdb.min-block-duration=...
to the same value to turn off local compaction entirely.
If one decides to actually do local compaction, I can see a (unlikely, but possible) race condition: when the sidecar is not able to upload for some time for various reasons, Prometheus will have enough time to compact some blocks and delete the level-0 blocks. This way nothing would be uploaded.
More importantly, our rule of uploading only level 0 makes it more difficult to use the Thanos sidecar on already-running Prometheus instances.
I think we should just upload all levels by default.
Cool, cheers guys.
@Bplotka in regards to your question about local tsdb compaction. I'm guessing those who are not using an object store would still want local compaction, to save disk. I've heard of teams out there with a year+ worth of time series.
Thanks for flipping PR 207. I've tried testing this out in my fork, but I am not seeing the desired uploads.
@V3ckt0r let's move this discussion to the https://github.com/improbable-eng/thanos/pull/207 PR then.
I think you cannot see these uploads because Thanos marked these as "uploaded" in thanos.shipper.json (because they were compaction level 2+). The easiest way is to manually remove the blocks with compaction level 2+ which are NOT actually uploaded from the thanos.shipper.json uploaded list.
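A minimal sketch of that manual cleanup, assuming the shipper state file is a JSON object with an `uploaded` list of block ULIDs (the exact layout of `thanos.shipper.json` may differ between versions, so check your file first and keep a backup):

```python
import json

def prune_uploaded(meta_path, block_ids_to_retry):
    """Drop the given block IDs from the shipper's 'uploaded' list so the
    sidecar considers them for upload again."""
    with open(meta_path) as f:
        meta = json.load(f)
    retry = set(block_ids_to_retry)
    meta["uploaded"] = [b for b in meta.get("uploaded", []) if b not in retry]
    with open(meta_path, "w") as f:
        json.dump(meta, f)
```

Run it only while the sidecar is stopped, e.g. `prune_uploaded("thanos.shipper.json", ["01C64KW317054P6TNH8DCNGRP9"])`.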
I know that marking them as uploaded when they were not is a bit weird; maybe initially we should have named it processed.
Seems to work for @V3ckt0r
hm just rethinking... @V3ckt0r
@Bplotka in regards to your question about local tsdb compaction. I'm guessing those who are not using an object store would still want local compaction, to save disk. I've heard of teams out there with a year+ worth of time series.
In that case they can specify to not upload things and this issue will never occur (:
OK new approach emerged:
In v2.2.1 we added a min-block-size compaction delay to Prometheus. This will help in 99% of cases to avoid conflicts between the Thanos hardlink-before-upload step and Prometheus local compaction, if enabled. To be even more sure we should use the Snapshot API, but that will require a special flag (admin endpoint) for Prometheus.
On every upload attempt Thanos should:
1) Check if all sources of compacted blocks are in object storage. Upload all compacted blocks that include missing sources. Obviously, also upload all non-compacted blocks which are not in the object store.
2) If the admin endpoint is enabled, trigger a snapshot to make sure no compaction is running at the same time.
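Step 1 above could be sketched roughly as follows, under the assumption that each block directory carries a `meta.json` whose `compaction.sources` field lists the ULIDs of the raw blocks it was built from (these field names follow TSDB block metadata; treat this as illustrative pseudocode, not the actual shipper implementation):

```python
import json
import os

def blocks_to_upload(tsdb_dir, sources_in_bucket):
    """Return local block dirs whose sources are not all in object storage."""
    to_upload = []
    for block_id in sorted(os.listdir(tsdb_dir)):
        meta_path = os.path.join(tsdb_dir, block_id, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # skip non-block entries such as 'wal' and 'lock'
        with open(meta_path) as f:
            meta = json.load(f)
        # A never-compacted block is its own single source.
        sources = meta.get("compaction", {}).get("sources", [block_id])
        if not set(sources) <= set(sources_in_bucket):
            to_upload.append(block_id)
    return to_upload
```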
@fabxc Do we want to require the admin endpoint if the user configures local compaction, and otherwise just have the sidecar error out? I think so.
The question is what to do for versions prior to v2.2.1
As @TimSimmons mentioned, we should add more info in docs as well, how to configure Prometheus for the best experience.
I'm looking at how to integrate the sidecar with our existing Prometheus instances and I'm wondering whether the sidecar should try to only ship the _most compacted_ blocks, within the limits of the data retention window.
The advantages:
The disadvantages:
Sorry for delay @mattbostock! Regarding mentioned benefits:
Reduces the load on the global compactor in Thanos, thus reducing requests/bandwidth on the object storage.
True, but not sure if req/band of object store is actually an issue.
Prometheus retains the performance gains achieved from local compaction.
Is that really needed if you keep the scraper small (24h retention)?
Reduced configuration required in Prometheus to work with Thanos.
Don't get that, what would be simplified?
I am afraid all the disadvantages you mentioned are true, and they are "winning" over the benefits. There are a couple more problems:
True, but not sure if req/band of object store is actually an issue.
Agree. #294 should help to determine the exact usage in access logs.
Is that really needed if you keep the scraper small (24h retention)?
Reducing Prometheus' data retention means that you're reducing the window for recovery during a disaster recovery scenario.
For example, if you're using an on-premise object store that has a catastrophic failure (e.g. datacentre goes up in flames) and Prometheus' data retention is 30 days, that gives you more time to configure a new object store in a new datacentre and configure Thanos to send the data to the new object store.
The same could apply to cloud storage if/when a provider has a significantly long outage.
There are mitigations for this (the most obvious being to not run a single object store in a single datacentre), but most of these are more complex and more costly than retaining data for longer in Prometheus (at least for on-premise installations). I'm not suggesting one over the other, but highlighting that the retention period is a factor to consider.
Reduced configuration required in Prometheus to work with Thanos.
Don't get that, what would be simplified?
Thanos currently requires that compaction is disabled in Prometheus, which means setting a command-line flag for Prometheus.
We would need to enforce some kind of compaction levels; otherwise users would be able to shoot themselves in the foot by changing to some non-standard levels (2.5h -> 8.5d, etc.) that will conflict with the global compactor.
Good point, this is a significant downside.
Cool, I see the disaster recovery goal for some users, but I'm not sure the Prometheus scraper should be treated as a backup solution for your on-premise/cloud/magic object storage. There must be some dedicated tools for that.
The main important use case I can see for local compaction is when users want to migrate to Thanos and upload all the old blocks already compacted by their vanilla Prometheus servers with long retention.
The easiest way would be to just glue the sidecar to an existing Prometheus server and make it smart enough to detect which "sources" are missing in the object store. This way, we can allow local compaction for longer local storage if you wish, and upload the sources that are not upstreamed yet. This is what I proposed here: https://github.com/improbable-eng/thanos/issues/206#issuecomment-374610307
There are some pain points here that need to be solved, though, like using the snapshot API to avoid that. If you really want to keep longer retention for Prometheus long term, nothing blocks you from that with the above logic, except one more thing mentioned here: https://github.com/improbable-eng/thanos/issues/283
and here: https://github.com/improbable-eng/thanos/issues/82
Maybe it would be reasonable to add some arbitrary min-time flag to the sidecar to expose only fresh metrics?
Ok, migration goal moved to another ticket: https://github.com/improbable-eng/thanos/issues/348
This ticket stays as the ability to run Prometheus with local compaction + sidecar for a long period and use it as-is. However, this does not make that much sense, since longer retention is not good for this reason: https://github.com/improbable-eng/thanos/issues/82
This blocker makes me think that this (long retention + local compaction for longer usage) might not be in our scope.
A bit late perhaps, but without knowing the internals of the tsdb I think the suggestion from @mattbostock makes sense.
Compaction would be the limiting factor in how much data can be stored in a bucket. Not only does it have to download and upload all the data that gets stored in the bucket multiple times; unlike the sidecars uploading in parallel, compaction is supposed to run as a singleton. This can be solved by splitting the data into more buckets, but that means more complicated setups where users have to manage which servers go to which bucket.
To me it seems like most issues with uploading compacted blocks also apply to uploading raw blocks.
If all blocks are compacted by prometheus instead of thanos, wouldn't issues like #82 and #283 be reduced/simpler?
Performance issue prometheus/prometheus#3601 relates to the query returning gigabytes of data, is there any reason to think thanos store would handle that better?
Compaction levels are already enforced (no compaction) so enforcing that isn't something new.
Is global compaction still needed? The sidecars could downsample the data before uploading, I don't know about 2w blocks.
It's also worth considering that compacting before uploading would solve some issues too, like #377 and #271 (long time-range queries across all Prometheus sources while the compactor is running are very unreliable, as some blocks are almost bound to have been removed).
Yeah, local-only compaction would solve some issues, nice idea, but unfortunately:
Is global compaction still needed? The sidecars could downsample the data before uploading, I don't know about 2w blocks.
Yes.
Reasons:
I totally see that we want to allow local compaction with Thanos for some use cases, but we need to invest some time to solve it, and I think WITH the global compactor (:
BTW:
Performance issue prometheus/prometheus#3601 relates to the query returning gigabytes of data, is there any reason to think thanos store would handle that better?
We (and most likely lots of other users) have ingestion at a level where even 9d retention is too long, not to mention the additional downsampling procedure that takes lots of memory (though that might be optimized). That being said, not all compaction levels are able to accumulate enough data to be compacted (like 2w).
Yup, one drawback, of course, is that it means Prometheus needs longer retention times, which may not work for all deployments.
FWIW, I am experimenting with this by adding a flag to the sidecar that specifies what compaction level the sidecar uploads. I plan to upload only compaction level 5, where a block has about a week's worth of data.
I then won't run the global compactor, but will run "thanos downsample" or try to patch the sidecar to downsample before upload (which would have the drawback of only one level of downsampling...). I understand this is not the direction the project wants to go for several reasons, but I think it's a worthwhile experiment that could provide some useful data, as the alternative for us is to split the data into several buckets so compaction can keep up (and I really like the simplicity of Thanos and want to keep the deployment simple too :)
Just an FYI.
We had issues with compaction performance: starting with a 1.5-week backlog, it took almost 6 weeks for compaction to catch up. This was partly due to crashes and restarts caused by occasional timeouts, as well as downtime as the disk filled up at times due to the compactor not always cleaning up.
It then started the downsampling process which I estimated would take another 3 weeks to complete before the cycle would start over.
I then aborted the whole process, let Prometheus start compacting the data, got a new bucket, and added a small patch to the sidecar to upload only blocks where compaction level == *flagShipperLevel. With a 3-week backlog, the whole compaction and upload process now took less than 4 hours.
While one solution is to have multiple buckets to distribute the compaction load that way, would it be possible to add a flag to the sidecar to only upload blocks at a certain compaction level, for those that have large enough Prometheus servers?
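The level check such a patch boils down to can be sketched like this, assuming the standard TSDB `meta.json` with a `compaction.level` field; `wanted_level` stands in for the hypothetical `*flagShipperLevel` flag mentioned above (the real patch would live in the Go shipper code, this is just an illustration):

```python
import json

def matches_shipper_level(meta_path, wanted_level):
    """True if the block's compaction level equals the configured level."""
    with open(meta_path) as f:
        meta = json.load(f)
    return meta.get("compaction", {}).get("level") == wanted_level
```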
Totally missed this, sorry.
While one solution is to have multiple buckets to distribute the compaction load that way, would it be possible to add a flag to the sidecar to only upload blocks at a certain compaction level, for those that have large enough Prometheus servers?
Yes, definitely a valid use case, probably for a separate issue. And useful as a one-time job. We are actually working on it as part of: https://github.com/observatorium/thanos-replicate/issues/7
We also added sharding for the compactor, so you can deploy many compactors that operate on different blocks.
@bwplotka would it be better to add a compactor label?
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
It would be amazing to get back to this. @daixiang0 no as it's Prometheus compaction, not Compactor really.
I think the solution is vertical compaction, which is quite stable as long as data is 1:1.
/reopen