Hi there!
I am using the quay.io/thanos/thanos:v0.7.0 container and I am experiencing problems with the store component.
The store is missing metadata for its blocks inside its local storage, but the metadata exists in the S3 bucket.
Store log:
level=warn ts=2019-10-07T07:15:12.145791006Z caller=bucket.go:325 msg="error parsing block range" block=01DPJ5368THP909JKH2DW72JJM err="read meta.json: open /thanos-store-data/01DPJ5368THP909JKH2DW72JJM/meta.json: no such file or directory"
S3 bucket ls:
2019-10-07 03:47 536864293 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000001
2019-10-07 03:48 536857676 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000002
2019-10-07 03:48 536860520 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000003
2019-10-07 03:48 536864881 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000004
2019-10-07 03:48 536863844 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000005
2019-10-07 03:48 536865851 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000006
2019-10-07 03:49 536771867 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000007
2019-10-07 03:49 536685857 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000008
2019-10-07 03:49 536868988 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000009
2019-10-07 03:49 536868010 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000010
2019-10-07 03:49 536868033 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000011
2019-10-07 03:50 536869076 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000012
2019-10-07 03:50 536870302 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000013
2019-10-07 03:50 516198435 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000014
2019-10-07 03:50 519912306 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/index
2019-10-07 03:50 13922915 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/index.cache.json
2019-10-07 03:50 1997 s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/meta.json
The metadata and index files are actually missing when I look into the data directory of the store component for that block. In the web UI of the querier the store looks healthy and also has the correct min and max time ranges. When I restart the store, it comes back up healthy and all the metadata from the previously faulty blocks is there and queryable.
But eventually the data goes missing again and holes show up in the graphs rendered by the queriers.
A restart always fixes it. This only started recently, after updating to version v0.7.0.
What might be important to note here is that I run a daily bucket verify job on the bucket while the compactor is still running.
The bucket verify is always configured without the repair flag.
Restarting a store and then running the verifier does not cause holes.
I cannot reproduce the problem manually; it only happens eventually, after some time. I'd be very thankful for any help.
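For completeness, the daily verify job is roughly the following command (the bucket config path here is just an example from my setup; note that the repair flag is deliberately not passed):

```
thanos bucket verify \
  --objstore.config-file=/etc/thanos/bucket.yml
```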
Thanks for this. Do you have persistent volume? It looks really like the issue we fixed recently with this, which will be released soon: https://github.com/thanos-io/thanos/blob/master/CHANGELOG.md#fixed
Can you try running master? E.g. master-2019-10-06-bb1ac398
Next release is this week (:
Hi, I wonder if this is related to this issue: https://github.com/thanos-io/thanos/issues/1504 ?
It's interesting that it gets fixed after a restart. Do you have persistent storage on that store? In my case it persisted after a restart, so I added a check to erase malformed blocks. It got merged after v0.7.0 was released IIRC; could you try recent master?
Hah, Bartek was faster :)
I still wonder how those malformed blocks come to be.
It's quite straightforward. Check https://github.com/thanos-io/thanos/pull/1505#pullrequestreview-287558671 for an explanation.
Thanks for the quick response!
I do not use a persistent volume; the data is saved into an emptyDir.
Should I rather add a persistent volume to the store? I figured it is unnecessary, since I have persistence in S3. I have roughly 4 TB of metrics in total in S3. Keeping data inside the store across restarts didn't seem worthwhile, since pods rarely restart in my cluster.
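For context, this is roughly how the store's data directory is mounted right now, and what a persistent alternative would look like (a minimal sketch of my StatefulSet; volume names are hypothetical):

```
# Current setup: data dir backed by an emptyDir, wiped on pod restart.
volumes:
  - name: thanos-store-data
    emptyDir: {}

# A persistent alternative would use a volumeClaimTemplate instead, e.g.:
# volumeClaimTemplates:
#   - metadata:
#       name: thanos-store-data
#     spec:
#       accessModes: ["ReadWriteOnce"]
#       resources:
#         requests:
#           storage: 20Gi
```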
I will try master and come back to you guys if it happens again.
Thanks so much :+1:
Just for readers who have run into this problem: since using version v0.8.1 I have not experienced it again.