Thanos: Thanos-Store empty blocks in local storage

Created on 7 Oct 2019 · 7 comments · Source: thanos-io/thanos

Hi there!
I am using the quay.io/thanos/thanos:v0.7.0 container and I am experiencing problems with the store component.
The store is missing metadata for its blocks in its local storage, even though the metadata exists in the S3 bucket.
Store log:

 level=warn ts=2019-10-07T07:15:12.145791006Z caller=bucket.go:325 msg="error parsing block range" block=01DPJ5368THP909JKH2DW72JJM err="read meta.json: open /thanos-store-data/01DPJ5368THP909JKH2DW72JJM/meta.json: no such file or directory"

S3 bucket ls:

2019-10-07 03:47 536864293   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000001
2019-10-07 03:48 536857676   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000002
2019-10-07 03:48 536860520   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000003
2019-10-07 03:48 536864881   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000004
2019-10-07 03:48 536863844   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000005
2019-10-07 03:48 536865851   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000006
2019-10-07 03:49 536771867   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000007
2019-10-07 03:49 536685857   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000008
2019-10-07 03:49 536868988   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000009
2019-10-07 03:49 536868010   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000010
2019-10-07 03:49 536868033   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000011
2019-10-07 03:50 536869076   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000012
2019-10-07 03:50 536870302   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000013
2019-10-07 03:50 516198435   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000014
2019-10-07 03:50 519912306   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/index
2019-10-07 03:50  13922915   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/index.cache.json
2019-10-07 03:50      1997   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/meta.json

The meta.json and index files are actually missing when I look into the store component's data directory for that block. In the querier's web UI the store looks healthy and also reports the correct min and max time ranges. When I restart the store, it comes back up healthy, and all the metadata for the previously faulty blocks is there again and queryable.
But eventually the data goes missing again and gaps show up in the graphs rendered by the queriers.
A restart always fixes it. This only started recently, after updating to version v0.7.0.
What might be important to note here is that I run a daily bucket verify job on the bucket while the compactor is still running. The bucket verify is always configured without the repair flag, and restarting a store and then running the verifier does not cause gaps.
I cannot reproduce the problem manually; it only happens eventually, after some time. I'd be very thankful for any help.
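
For anyone hitting the same symptom, here is a minimal Go sketch (hypothetical, not part of Thanos) that lists block directories under the store's local data dir that are missing their meta.json, i.e. the same condition the warn log above reports. The /thanos-store-data path is taken from that log.

```go
// check_meta.go - hypothetical diagnostic, not part of Thanos.
// Lists block directories under the store's --data-dir that are
// missing meta.json, matching the "error parsing block range" warning.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dataDir := "/thanos-store-data" // path from the log above

	entries, err := os.ReadDir(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue // skip non-block files in the cache dir
		}
		meta := filepath.Join(dataDir, e.Name(), "meta.json")
		if _, err := os.Stat(meta); os.IsNotExist(err) {
			fmt.Println("missing meta.json:", e.Name())
		}
	}
}
```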

All 7 comments

Thanks for this. Do you have a persistent volume? It looks very much like the issue we fixed recently, which will be released soon: https://github.com/thanos-io/thanos/blob/master/CHANGELOG.md#fixed

Can you try running master? E.g. master-2019-10-06-bb1ac398

Next release is this week (:

Hi, I wonder if this is related to this issue: https://github.com/thanos-io/thanos/issues/1504?

It's interesting that it gets fixed after a restart. Do you have persistent storage on that store? In my case it persisted after a restart, so I added a check to erase malformed blocks. It got merged after v0.7.0 was released, IIRC. Could you try a recent master?

Hah, Bartek was faster :)

I still wonder how those malformed blocks come to be.

It's quite straightforward. Check https://github.com/thanos-io/thanos/pull/1505#pullrequestreview-287558671 for an explanation.
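
To spell it out for readers, the check boils down to this idea (a sketch under my reading of the PR discussion, not the actual merged code): on startup, delete any local block directory whose meta.json is missing or unreadable, so the store re-syncs it from object storage.

```go
// Hypothetical sketch of the repair idea, not the merged Thanos code:
// drop local block cache dirs with a missing/unreadable meta.json so
// the next sync re-downloads them from the bucket.
package main

import (
	"os"
	"path/filepath"
)

func removeMalformedBlocks(dataDir string) error {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		blockDir := filepath.Join(dataDir, e.Name())
		if _, err := os.Stat(filepath.Join(blockDir, "meta.json")); err != nil {
			// meta.json missing or unreadable: the cached block is
			// unusable, remove it and let the next sync re-fetch it.
			if err := os.RemoveAll(blockDir); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	// Run only against a store's local cache dir, never the bucket itself.
	if err := removeMalformedBlocks("/thanos-store-data"); err != nil {
		panic(err)
	}
}
```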

Thanks for the quick response!
I do not use a persistent volume; the data is stored in an emptyDir.
Should I rather add a persistent volume to the store? I figured it is unnecessary, since I have persistence in S3. I have roughly 4 TB of metrics in S3 in total, and keeping data inside the store across restarts didn't seem worthwhile, since pods rarely restart in my cluster.
I will try master and come back to you guys if it happens again.
Thanks so much :+1:

Just for readers who have run into this problem: since upgrading to v0.8.1 I have not experienced it again.
