Thanos, Prometheus and Golang version used
thanos, version 0.1.0 (branch: master, revision: 3050831bec12684398ce6deb613788714b7924d9)
build user: circleci@a8c441c7e82a
build date: 20181026-11:11:12
go version: go1.10.4
What happened
I tried to reproduce the situation where the compactor is shut down in the middle of uploading newly compacted blocks. The problem is that if a block is only partly uploaded, the compactor gets stuck syncing metas when it is started again. I think it should be possible to add a check: if a block is corrupted, remove it from the list of queryable blocks, or in this particular case just skip it.
What you expected to happen
I expected broken blocks to be skipped, with an error or warning logged saying that one of the blocks is broken.
How to reproduce it (as minimally and precisely as possible):
Start the compactor, wait until it compacts and starts uploading, then kill it in the middle of the upload.
Full logs to relevant components
level=debug ts=2018-11-08T12:36:52.954662741Z caller=compact.go:174 msg="download meta" block=01CVSHT550PVJBPVKW7905KTAP
level=debug ts=2018-11-08T12:36:53.07216329Z caller=compact.go:174 msg="download meta" block=01CVSJ9TQ6XFP4VBDST5HS47NJ
level=debug ts=2018-11-08T12:36:53.181458976Z caller=compact.go:174 msg="download meta" block=01CVSJYK2V0EQNPYTS5A88WF81
level=debug ts=2018-11-08T12:36:53.295023922Z caller=compact.go:174 msg="download meta" block=01CVSKPHPSQ5YEGDX0NNWY710H
level=debug ts=2018-11-08T12:36:53.430717664Z caller=compact.go:174 msg="download meta" block=01CVSMG4JS1KXZPBA976FD4ZZT
level=error ts=2018-11-08T12:36:53.524348802Z caller=compact.go:207 msg="retriable error" err="compaction failed: sync: retrieve bucket block metas: downloading meta.json for 01CVSMG4JS1KXZPBA976FD4ZZT: meta.json bkt get for 01CVSMG4JS1KXZPBA976FD4ZZT: The specified key does not exist."
Anything else we need to know
Yup, valid issue. There is already a related issue about partial blocks: https://github.com/improbable-eng/thanos/issues/377
Hello, I read #377. @bwplotka, what if we upload the file to S3 under a temporary *.tmp name and rename it only once the upload has succeeded? Does that make sense?
Hm, it looks like moving or renaming is not possible in S3. I think it would work if all components skipped partially uploaded blocks, but I'm not sure.
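To make the "skip partially uploaded blocks" idea concrete, here is a rough Go sketch. It is not the Thanos code: `Bucket`, `memBucket`, `ErrNotFound` and `syncMetas` are hypothetical names made up for illustration, and it assumes a block counts as partial when its `meta.json` is missing from the bucket.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel for "The specified key does not exist".
var ErrNotFound = errors.New("object not found")

// Bucket is a hypothetical minimal object-store interface, not the Thanos one.
type Bucket interface {
	Get(key string) ([]byte, error)
}

// memBucket is an in-memory stand-in for S3, used only for illustration.
type memBucket map[string][]byte

func (b memBucket) Get(key string) ([]byte, error) {
	data, ok := b[key]
	if !ok {
		return nil, ErrNotFound
	}
	return data, nil
}

// syncMetas splits block IDs into complete blocks (meta.json present) and
// partial blocks (meta.json missing). A missing meta.json is not treated as
// a failure; any other error still aborts the sync.
func syncMetas(bkt Bucket, blockIDs []string) (complete, partial []string, err error) {
	for _, id := range blockIDs {
		_, getErr := bkt.Get(id + "/meta.json")
		switch {
		case getErr == nil:
			complete = append(complete, id)
		case errors.Is(getErr, ErrNotFound):
			// Partially uploaded block: skip it; the caller can log a warning.
			partial = append(partial, id)
		default:
			return nil, nil, fmt.Errorf("downloading meta.json for %s: %w", id, getErr)
		}
	}
	return complete, partial, nil
}
```

With something like this, the compactor could log a warning for every ID in `partial` and continue with the blocks in `complete`, instead of failing the whole sync with a retriable error.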
@bwplotka,
I'm planning to work on a PR with a fix for this issue. Before doing that, I'd like to know which solution you would find acceptable. I think the simplest approach is to delete the corrupted block (a block that is missing its metadata) if it was previously created by the compactor, so that the compactor re-creates it. The main trick is to identify such a block as one created by the compactor. I currently see 3 options:
Maybe you already have a plan for how to fix it. Please let me know.
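A rough Go sketch of the "delete the block that is missing metadata" approach, just to show its shape. Everything here is hypothetical (`Bucket`, `memBucket`, `deletePartialBlock`), and the hard part, deciding whether the compactor created the block, is deliberately left as a caller-supplied `createdByCompactor` predicate rather than guessed at.

```go
package main

import "strings"

// Bucket is a hypothetical minimal object-store interface, not the Thanos API.
type Bucket interface {
	Exists(key string) bool
	List(prefix string) []string
	Delete(key string)
}

// memBucket is an in-memory stand-in for S3, used only for illustration.
type memBucket map[string][]byte

func (b memBucket) Exists(key string) bool {
	_, ok := b[key]
	return ok
}

func (b memBucket) List(prefix string) []string {
	var keys []string
	for k := range b {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	return keys
}

func (b memBucket) Delete(key string) { delete(b, key) }

// deletePartialBlock removes all objects of block id when its meta.json is
// missing and the hypothetical createdByCompactor predicate accepts the block.
// It reports whether the block was deleted.
func deletePartialBlock(bkt Bucket, id string, createdByCompactor func(id string) bool) bool {
	if bkt.Exists(id+"/meta.json") || !createdByCompactor(id) {
		return false
	}
	for _, key := range bkt.List(id + "/") {
		bkt.Delete(key)
	}
	return true
}
```

A block that is still being uploaded by another component could look the same as a broken one, which is why the predicate matters and is not hard-coded here.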
Hey, wow, quite a long time since the initial response, sorry for the delay.
The way we want to solve this is specified here: https://github.com/improbable-eng/thanos/blob/master/docs/proposals/approved/201901-read-write-operations-bucket.md
... there is no timeline for the above, so we need a faster fix for partial blocks...
The root cause of this issue is that the compactor crashed/restarted in the middle of an upload and did not have time to finish it. We need to handle this case.