Thanos: compactor is stuck in an infinite loop when a block is broken

Created on 8 Nov 2018 · 7 Comments · Source: thanos-io/thanos

Thanos, Prometheus and Golang version used

thanos, version 0.1.0 (branch: master, revision: 3050831bec12684398ce6deb613788714b7924d9)
  build user:       circleci@a8c441c7e82a
  build date:       20181026-11:11:12
  go version:       go1.10.4

What happened
I tried to reproduce the situation where the compactor is shut down in the middle of uploading newly compacted blocks. The problem is that if a block is only partly uploaded and the compactor is started again, it gets stuck in the process of syncing metas. I think it should be possible to add a check: if a block is corrupted, just remove it from the list of queryable blocks, or in this particular case simply skip it.
What you expected to happen
I expected broken blocks to be skipped, with an error or warning printed saying that one of the blocks is broken.
How to reproduce it (as minimally and precisely as possible):
Start the compactor, wait until it compacts and starts to upload, and kill the compactor in the middle of the upload.
Full logs to relevant components

Logs

level=debug ts=2018-11-08T12:36:52.954662741Z caller=compact.go:174 msg="download meta" block=01CVSHT550PVJBPVKW7905KTAP
level=debug ts=2018-11-08T12:36:53.07216329Z caller=compact.go:174 msg="download meta" block=01CVSJ9TQ6XFP4VBDST5HS47NJ
level=debug ts=2018-11-08T12:36:53.181458976Z caller=compact.go:174 msg="download meta" block=01CVSJYK2V0EQNPYTS5A88WF81
level=debug ts=2018-11-08T12:36:53.295023922Z caller=compact.go:174 msg="download meta" block=01CVSKPHPSQ5YEGDX0NNWY710H
level=debug ts=2018-11-08T12:36:53.430717664Z caller=compact.go:174 msg="download meta" block=01CVSMG4JS1KXZPBA976FD4ZZT
level=error ts=2018-11-08T12:36:53.524348802Z caller=compact.go:207 msg="retriable error" err="compaction failed: sync: retrieve bucket block metas: downloading meta.json for 01CVSMG4JS1KXZPBA976FD4ZZT: meta.json bkt get for 01CVSMG4JS1KXZPBA976FD4ZZT: The specified key does not exist."

Anything else we need to know

Labels: bug, hard, help wanted

All 7 comments

Yup, valid issue. But there was something related regarding partial blocks: https://github.com/improbable-eng/thanos/issues/377

Hello, I read #377. @bwplotka, what if we upload the file to S3 with a *.tmp prefix and rename it only once it has uploaded successfully? Does that make sense?

Hm, it looks like moving or renaming is not possible in S3. I think it would work if all components skipped partially uploaded blocks, but I'm not sure.

@bwplotka ,
I'm planning to work on a PR with a fix for this issue. Before doing that, I'd like your opinion on what solution would be acceptable here. I think the simplest way to go is to delete a corrupted block (a block that is missing its metadata) that was previously created by the compactor, and then let the compactor re-create it. The main trick is to identify such a block as having been created by the compactor. I currently see 3 options:

  1. Use the debug/metas information. It has information about all the blocks that were written to a bucket. Theoretically this metadata would even be enough to recover a block if the only thing missing is meta.json. But I'm not sure how reliable that would be, since it's a debugging feature and I suspect it might be disabled in the future.
  2. Write temporary metadata for each block before uploading. It would be another copy of the same metadata as the block's meta.json or debug/metas, used to identify the block as created by the compactor. After a block is successfully uploaded, this temporary metadata would be deleted.
  3. Use storage-specific tags to mark objects (files) as created by the compactor. This approach would require an implementation for each specific cloud provider.
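Option 2 above could be sketched as follows. This is only an illustration with an in-memory store and invented names (`uploadBlock`, `cleanupPartial`, a `markers/` prefix), not Thanos's actual upload code: a marker object is written before the block, meta.json is written last, and a leftover marker without a meta.json identifies a partial compactor upload that is safe to delete.

```go
package main

import (
	"fmt"
	"strings"
)

// store is a minimal in-memory stand-in for an object store.
type store map[string][]byte

func (s store) Upload(name string, data []byte) { s[name] = data }
func (s store) Delete(name string)              { delete(s, name) }
func (s store) Exists(name string) bool         { _, ok := s[name]; return ok }

// uploadBlock writes a temporary marker first, then the block files, then
// meta.json last, and finally removes the marker. If the process dies midway,
// the leftover marker identifies the block as compactor-created and unfinished.
func uploadBlock(s store, id string, files map[string][]byte, crashBeforeMeta bool) {
	s.Upload("markers/"+id, []byte("compactor-upload-in-progress"))
	for name, data := range files {
		s.Upload(id+"/"+name, data)
	}
	if crashBeforeMeta {
		return // simulate a crash: meta.json is never written, marker stays
	}
	s.Upload(id+"/meta.json", []byte(`{"version":1}`))
	s.Delete("markers/" + id)
}

// cleanupPartial deletes any block that has a leftover marker but no
// meta.json: it was created by the compactor and never finished uploading,
// so the next compaction run can safely re-create it.
func cleanupPartial(s store, ids []string) {
	for _, id := range ids {
		if s.Exists("markers/"+id) && !s.Exists(id+"/meta.json") {
			fmt.Println("deleting partial compactor block", id)
			for name := range s {
				if strings.HasPrefix(name, id+"/") {
					s.Delete(name)
				}
			}
			s.Delete("markers/" + id)
		}
	}
}

func main() {
	s := store{}
	uploadBlock(s, "01CVSMG4JS1KXZPBA976FD4ZZT", map[string][]byte{"chunks/000001": {1, 2, 3}}, true)
	cleanupPartial(s, []string{"01CVSMG4JS1KXZPBA976FD4ZZT"})
	fmt.Println("objects left:", len(s)) // prints "objects left: 0"
}
```

Because the marker is written before any block data and removed only after meta.json lands, a crash at any point leaves the bucket in a state the cleanup pass can recognize.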

Maybe you already have a plan for how to fix it. Please let me know.

Hey, wow, quite a long time since the initial response, sorry for the delay.

The way we want to solve this is specified here: https://github.com/improbable-eng/thanos/blob/master/docs/proposals/approved/201901-read-write-operations-bucket.md

...there is no timeline on the above one, though, so we need some faster fix for partial blocks...

The root cause of this issue is that the compactor crashed or was restarted in the middle of an upload and did not have time to finish it. We need to handle this case.
