Thanos: Potential race condition between compactor applying retention or compaction and store gateway syncing metas.

Created on 10 Oct 2018 · 9 comments · Source: thanos-io/thanos

With the compaction and retention logic we have one "writer" (the compactor) that creates new blocks and deletes the blocks that were their sources.

The problem on the reader side (store) is that syncing happens periodically, every X seconds. So it can happen that we query the store right after the compactor has removed a block, but before the store has synced. There is no watch logic in the Bucket API.
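
For illustration, here is a minimal sketch (in Go, not the actual Thanos code) of such a periodic sync loop; `BlockSyncer` and `SyncMetas` are hypothetical names. Any block the compactor deletes right after a tick stays in the store's view until the next tick, which is exactly the race window:

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// BlockSyncer stands in for the store gateway's bucket-scanning logic.
type BlockSyncer interface {
	// SyncMetas re-lists the bucket and reloads block meta files.
	SyncMetas(ctx context.Context) error
}

// RunSyncLoop re-syncs every interval. A block deleted by the compactor
// right after a tick remains in the store's index (and its chunk reads
// will fail) until the next tick: that window is the race in this issue.
func RunSyncLoop(ctx context.Context, s BlockSyncer, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := s.SyncMetas(ctx); err != nil {
				log.Printf("meta sync failed: %v", err)
			}
		}
	}
}
```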

The simplest solution is to defer the deletion for some time into the future, to account for the eventual consistency of the Store Gateway's internal state (and potentially of the bucket itself).
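
Sketched in Go, the deferred-deletion idea could look like the following; the `Bucket` interface, the `deletion-mark.json` object name, and the helper names are assumptions for illustration, not the actual Thanos implementation:

```go
package sketch

import (
	"context"
	"fmt"
	"strconv"
	"time"
)

// Bucket is a hypothetical minimal object-store client, not the real
// Thanos objstore API.
type Bucket interface {
	Upload(ctx context.Context, name string, data []byte) error
	Get(ctx context.Context, name string) ([]byte, error)
	Delete(ctx context.Context, name string) error
}

// MarkForDeletion writes a small marker object instead of deleting the
// block right away, so slow readers can keep serving it in the meantime.
func MarkForDeletion(ctx context.Context, bkt Bucket, blockID string) error {
	ts := strconv.FormatInt(time.Now().Unix(), 10)
	return bkt.Upload(ctx, blockID+"/deletion-mark.json", []byte(ts))
}

// DeleteIfExpired removes the block only once deletionDelay has passed
// since it was marked, giving every store gateway at least one full sync
// cycle to drop the block from its view first.
func DeleteIfExpired(ctx context.Context, bkt Bucket, blockID string, deletionDelay time.Duration) error {
	raw, err := bkt.Get(ctx, blockID+"/deletion-mark.json")
	if err != nil {
		return fmt.Errorf("read deletion mark: %w", err)
	}
	markedAt, err := strconv.ParseInt(string(raw), 10, 64)
	if err != nil {
		return fmt.Errorf("parse deletion mark: %w", err)
	}
	if time.Since(time.Unix(markedAt, 0)) < deletionDelay {
		return nil // Marked too recently; keep the block for now.
	}
	// A real bucket needs a recursive delete of every object under the
	// block's prefix; a single Delete stands in for that here.
	return bkt.Delete(ctx, blockID)
}
```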

Acceptance criteria:

  • Store Gateway will have loaded the new block before the old compaction source blocks are deleted.
  • Store Gateway handles a partially deleted, empty, or non-existent TSDB block gracefully, despite having a cached meta-index file.

This unfortunately requires heavy modification of the compactor's Plan logic to handle edge cases like:

  • Compactor creates a new block, defers the source block deletions, and then suddenly stops working.

    • After a restart, the compactor needs to detect that a block overlap is due to a deferred deletion (see the sketch after this list).

    • What if the bucket shows the newly created block as partial (eventual consistency)? What if the compactor crashed before the upload completed? How do we differentiate those two cases?

    • What if the deferred deletion fails partway through?
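
Under the same deletion-marker assumption as the sketch above, a restarted compactor could classify an overlap like this; `ObjectExists` and `OverlapIsBenign` are hypothetical names, not Thanos APIs:

```go
package sketch

import "context"

// ObjectExists is a hypothetical existence check against the bucket.
type ObjectExists func(ctx context.Context, name string) (bool, error)

// OverlapIsBenign reports whether every source block that overlaps the
// compacted block carries a deletion mark, i.e. the overlap only exists
// because deletion was deferred and the compactor restarted in between.
func OverlapIsBenign(ctx context.Context, exists ObjectExists, overlappingSources []string) (bool, error) {
	for _, id := range overlappingSources {
		ok, err := exists(ctx, id+"/deletion-mark.json")
		if err != nil {
			return false, err
		}
		if !ok {
			// No marker: this overlap is not explained by deferred
			// deletion and must be treated as a real problem.
			return false, nil
		}
	}
	return true, nil
}
```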

This has to be done and is planned to be done by EOY.

Labels: bug, compact, store, hard


All 9 comments

This will be fixed by https://github.com/thanos-io/thanos/issues/1528. Help wanted (:

Why don't we trigger a resync with the remote in the reader component (thanos store) if a block (deleted by the compactor) isn't found?

@Reamer

We can, but a sync still takes time, so ideally we do it in a smarter way, as defined by https://github.com/thanos-io/thanos/issues/1528.

Also, a trigger would involve coordination, which we don't want.
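
For concreteness, the resync-on-miss suggestion could be sketched like this in Go (`Store`, `dropBlock`, `syncMetas`, and `ErrBlockNotFound` are all hypothetical); as the reply above notes, the full re-sync still takes time, and the project preferred the approach in https://github.com/thanos-io/thanos/issues/1528:

```go
package sketch

import (
	"context"
	"errors"
	"log"
)

// ErrBlockNotFound is the hypothetical error returned when a read hits a
// block that no longer exists in the bucket.
var ErrBlockNotFound = errors.New("block not found in bucket")

// Store is a hypothetical reader with a local view of bucket blocks.
type Store struct {
	readChunk func(ctx context.Context, blockID string) ([]byte, error)
	dropBlock func(blockID string)
	syncMetas func(ctx context.Context) error
}

// ReadChunk serves a read and, on a missing block, repairs the local view
// instead of waiting for the next periodic sync.
func (s *Store) ReadChunk(ctx context.Context, blockID string) ([]byte, error) {
	data, err := s.readChunk(ctx, blockID)
	if errors.Is(err, ErrBlockNotFound) {
		s.dropBlock(blockID) // Stop advertising the deleted block.
		go func() {
			// Re-sync in the background; the full sync still takes time,
			// which is why this alone does not close the race.
			if serr := s.syncMetas(context.Background()); serr != nil {
				log.Printf("triggered re-sync failed: %v", serr)
			}
		}()
	}
	return data, err
}
```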

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@khyatisoneji is on it (:

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

We are super close to merging the fix! But it's not merged yet.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
