Thanos: Potential race condition between compactor applying retention or compaction and store gateway syncing metas.

Created on 10 Oct 2018 · 9 comments · Source: thanos-io/thanos

With the compaction and retention logic we have one "writer" (the compactor) that creates new blocks and deletes the blocks that were their sources.

The problem on the reader side (store) is that syncing happens periodically, every X seconds. So it can happen that we query the store right after the compactor has removed a block, but before the store has synced. There is no watch logic in the Bucket API.
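
For illustration, here is a minimal sketch (in Go, not the actual Thanos code) of such a periodic sync loop; `BlockSyncer` and `SyncMetas` are hypothetical names. Any block the compactor deletes right after a tick stays in the store's view until the next tick, which is exactly the race window:

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// BlockSyncer stands in for the store gateway's bucket-scanning logic.
type BlockSyncer interface {
	// SyncMetas re-lists the bucket and reloads block meta files.
	SyncMetas(ctx context.Context) error
}

// RunSyncLoop re-syncs every interval. A block deleted by the compactor
// right after a tick remains in the store's index (and its chunk reads
// will fail) until the next tick: that window is the race in this issue.
func RunSyncLoop(ctx context.Context, s BlockSyncer, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := s.SyncMetas(ctx); err != nil {
				log.Printf("meta sync failed: %v", err)
			}
		}
	}
}
```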

The simplest solution is to defer the deletion for some time into the future, to account for the eventual consistency of the Store Gateway's internal state (and potentially of the bucket itself).
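
Sketched in Go, the deferred-deletion idea could look like the following; the `Bucket` interface, the `deletion-mark.json` object name, and the helper names are assumptions for illustration, not the actual Thanos implementation:

```go
package sketch

import (
	"context"
	"fmt"
	"strconv"
	"time"
)

// Bucket is a hypothetical minimal object-store client, not the real
// Thanos objstore API.
type Bucket interface {
	Upload(ctx context.Context, name string, data []byte) error
	Get(ctx context.Context, name string) ([]byte, error)
	Delete(ctx context.Context, name string) error
}

// MarkForDeletion writes a small marker object instead of deleting the
// block right away, so slow readers can keep serving it in the meantime.
func MarkForDeletion(ctx context.Context, bkt Bucket, blockID string) error {
	ts := strconv.FormatInt(time.Now().Unix(), 10)
	return bkt.Upload(ctx, blockID+"/deletion-mark.json", []byte(ts))
}

// DeleteIfExpired removes the block only once deletionDelay has passed
// since it was marked, giving every store gateway at least one full sync
// cycle to drop the block from its view first.
func DeleteIfExpired(ctx context.Context, bkt Bucket, blockID string, deletionDelay time.Duration) error {
	raw, err := bkt.Get(ctx, blockID+"/deletion-mark.json")
	if err != nil {
		return fmt.Errorf("read deletion mark: %w", err)
	}
	markedAt, err := strconv.ParseInt(string(raw), 10, 64)
	if err != nil {
		return fmt.Errorf("parse deletion mark: %w", err)
	}
	if time.Since(time.Unix(markedAt, 0)) < deletionDelay {
		return nil // Marked too recently; keep the block for now.
	}
	// A real bucket needs a recursive delete of every object under the
	// block's prefix; a single Delete stands in for that here.
	return bkt.Delete(ctx, blockID)
}
```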

Acceptance criteria:

  • Store Gateway will have loaded the new block before the old compaction source blocks are deleted.
  • Store Gateway handles a partially deleted, empty, or non-existent TSDB block gracefully, despite having a cached meta-index file.

This unfortunately requires heavy modification of the compactor's Plan logic to handle edge cases like:

  • Compactor creates a new block, defers the source block deletions, and then suddenly stops working.

    • After a restart, the compactor needs to detect that a block overlap is due to a deferred deletion (see the sketch after this list).

    • What if the bucket shows the newly created block as partial (eventual consistency)? What if the compactor crashed before the upload completed? How do we differentiate those two cases?

    • What if the deferred deletion fails partway through?
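
Under the same deletion-marker assumption as the sketch above, a restarted compactor could classify an overlap like this; `ObjectExists` and `OverlapIsBenign` are hypothetical names, not Thanos APIs:

```go
package sketch

import "context"

// ObjectExists is a hypothetical existence check against the bucket.
type ObjectExists func(ctx context.Context, name string) (bool, error)

// OverlapIsBenign reports whether every source block that overlaps the
// compacted block carries a deletion mark, i.e. the overlap only exists
// because deletion was deferred and the compactor restarted in between.
func OverlapIsBenign(ctx context.Context, exists ObjectExists, overlappingSources []string) (bool, error) {
	for _, id := range overlappingSources {
		ok, err := exists(ctx, id+"/deletion-mark.json")
		if err != nil {
			return false, err
		}
		if !ok {
			// No marker: this overlap is not explained by deferred
			// deletion and must be treated as a real problem.
			return false, nil
		}
	}
	return true, nil
}
```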

This has to be done and is planned to be done by EOY.

Labels: bug, compact, store, hard


All 9 comments

This will be fixed by https://github.com/thanos-io/thanos/issues/1528. Help wanted (:

Why don't we trigger a resync with the remote in the reader component (thanos store) if a block (deleted by the compactor) isn't found?

@Reamer

We can, but a sync still takes time, so ideally we do it in a smarter way, as defined by https://github.com/thanos-io/thanos/issues/1528.

Also, a trigger would involve coordination, which we don't want.
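
For concreteness, the resync-on-miss suggestion could be sketched like this in Go (`Store`, `dropBlock`, `syncMetas`, and `ErrBlockNotFound` are all hypothetical); as the reply above notes, the full re-sync still takes time, and the project preferred the approach in https://github.com/thanos-io/thanos/issues/1528:

```go
package sketch

import (
	"context"
	"errors"
	"log"
)

// ErrBlockNotFound is the hypothetical error returned when a read hits a
// block that no longer exists in the bucket.
var ErrBlockNotFound = errors.New("block not found in bucket")

// Store is a hypothetical reader with a local view of bucket blocks.
type Store struct {
	readChunk func(ctx context.Context, blockID string) ([]byte, error)
	dropBlock func(blockID string)
	syncMetas func(ctx context.Context) error
}

// ReadChunk serves a read and, on a missing block, repairs the local view
// instead of waiting for the next periodic sync.
func (s *Store) ReadChunk(ctx context.Context, blockID string) ([]byte, error) {
	data, err := s.readChunk(ctx, blockID)
	if errors.Is(err, ErrBlockNotFound) {
		s.dropBlock(blockID) // Stop advertising the deleted block.
		go func() {
			// Re-sync in the background; the full sync still takes time,
			// which is why this alone does not close the race.
			if serr := s.syncMetas(context.Background()); serr != nil {
				log.Printf("triggered re-sync failed: %v", serr)
			}
		}()
	}
	return data, err
}
```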

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@khyatisoneji is on it (:

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

We are super close to merging the fix! But it's not merged yet.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
