level=debug name=thanos-compactor ts=2018-03-14T16:08:22.330077587Z caller=compact.go:571 compactionGroup=11a8f9199ddc09ad msg="downloaded blocks" blocks="[/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]" duration=2.2325282s
level=error name=thanos-compactor ts=2018-03-14T16:08:22.365408779Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]: open /compactor-data/groups/11a8f9199ddc09ad/tmp/01C6Z7A5MKP0MF6HR863FDSG28/chunks: no such file or directory"
level=debug name=thanos-compactor ts=2018-03-14T16:08:25.488596021Z caller=compact.go:571 compactionGroup=aef416e3a2758b0f msg="downloaded blocks" blocks="[/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]" duration=3.122579475s
level=info name=thanos-compactor ts=2018-03-14T16:08:25.501648107Z caller=compact.go:387 msg="compact blocks" count=4 mint=1520985600000 maxt=1521014400000
level=error name=thanos-compactor ts=2018-03-14T16:08:25.555418327Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]: write compaction: get series 5735: invalid checksum"
level=debug name=thanos-compactor ts=2018-03-14T16:13:23.012490052Z caller=compact.go:571 compactionGroup=11a8f9199ddc09ad msg="downloaded blocks" blocks="[/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]" duration=2.896049324s
level=error name=thanos-compactor ts=2018-03-14T16:13:23.051682944Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]: open /compactor-data/groups/11a8f9199ddc09ad/tmp/01C6Z7A5MKP0MF6HR863FDSG28/chunks: no such file or directory"
level=debug name=thanos-compactor ts=2018-03-14T16:13:26.54207661Z caller=compact.go:571 compactionGroup=aef416e3a2758b0f msg="downloaded blocks" blocks="[/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]" duration=3.48974054s
level=info name=thanos-compactor ts=2018-03-14T16:13:26.550168484Z caller=compact.go:387 msg="compact blocks" count=4 mint=1520985600000 maxt=1521014400000
level=error name=thanos-compactor ts=2018-03-14T16:13:26.61308472Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]: write compaction: get series 5735: invalid checksum"
I'm also seeing issues with the compactor, though not sure if they're related at this point (likely an issue with unclean shutdown on the service I'm using to write the TSDB blocks).
level=error ts=2018-03-15T02:11:00.862765788Z caller=main.go:147 msg="running command failed" err="critical error detected: invalid plan block /app/data/compact/01C8HH6P59EHX22BH95QD03XWY: open index file: try lock file: open /app/data/compact/01C8HH6P59EHX22BH95QD03XWY/index: no such file or directory"
Hm, yea. I need to run it once again with current master. There was a bug in master TSDB and we did not pin the tsdb vendor until yesterday's PR (https://github.com/improbable-eng/thanos/pull/247). The bug had different symptoms, but let's see.
I'm getting the same error trying to compact on master
msg="compaction failed" err="invalid plan block /tmp/thanos/01C5775HEWYQGHT555K0TQ55HN: open index file: try lock file: open /tmp/thanos/01C5775HEWYQGHT555K0TQ55HN/index: no such file or directory"
Thanks all for input!
This PR added the section that seems to cause the problems: https://github.com/improbable-eng/thanos/pull/194/files#r176140589
@fabxc any quick ideas?
Will have more time tomorrow for this.
If it's always about the index file not being there in the first place it's hard to tell? Maybe an aborted upload or download?
Is it always about the same block(s) or about many different ones?
Not quite sure where the "try lock file" comes from tbh.
For me it happens with every bucket, even if the index is there. As Bplotka says, it's just that the index file doesn't get downloaded before block.VerifyIndex gets called. It does get called again on L648 after things get downloaded.
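For illustration only, a minimal sketch of the ordering that works, using made-up helper names (not the real pkg/compact or pkg/block functions): download the block directory first, then verify its index.

package example

import (
	"context"
	"fmt"
	"path/filepath"
)

// downloadBlockDir and verifyIndexFile are assumed helpers standing in for the
// real download/verify code; they are not the actual Thanos functions.
type helpers struct {
	downloadBlockDir func(ctx context.Context, id, dir string) error
	verifyIndexFile  func(path string) error
}

// prepareBlock shows the order that works: verification can only succeed once
// the block (including its index file) exists on local disk.
func (h helpers) prepareBlock(ctx context.Context, id, dir string) error {
	if err := h.downloadBlockDir(ctx, id, dir); err != nil {
		return fmt.Errorf("download block %s: %w", id, err)
	}
	if err := h.verifyIndexFile(filepath.Join(dir, "index")); err != nil {
		return fmt.Errorf("verify index of block %s: %w", id, err)
	}
	return nil
}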
Oh, maybe just a bad rebase then – there was a lot of that in the chain of downsampling PRs.
Removed it in #250, just in case y'all think that's it. Deleting it allowed my compaction to continue, although I've run into other seemingly unrelated errors down the line.
Please let us know about all of them. The compactor was altered really heavily with the downsampling additions, and regressions are quite possible.
OK, I think I fixed the verifyIndex bug (the tests for the compactor were kind of broken -> they did not ensure a proper input state, which is why we missed this) and found the proper place for that verify: https://github.com/improbable-eng/thanos/pull/251
I cannot find any other bug there, will deploy new compactor on my side.
All works for me. I got some unrelated issue (a tsdb issue or my own malformed thing).
All fixed from what I can tell. Let me know if you think otherwise.
This did allow compaction to continue, but I'm unsure if what I'm seeing is correct. I'm wondering if there might be an issue in the s3 code for iterating a bucket or something. I've run against two different buckets.
Here are some logs:
https://gist.github.com/TimSimmons/f552e2439321e13029daf2dcbf2b8066
I invoked the command like:
S3_SECRET_KEY="xxx" ./thanos compact \
--s3.bucket <bucket> \
--s3.endpoint <endpoint> \
--s3.access-key xxxx \
--log.level debug \
--data-dir /tmp/thanos > ~/thanos-log2.txt 2>&1
Some things I noticed (these are still running as of typing this):
(can't find meta.json, etc.)
Am I doing something wrong? Am I too impatient? I'm not sure what to expect, so I thought y'all might be able to help :)
No - thanks for noting it here and being careful!
~14GB, and 3-4k objects.
This sounds quite bad - you should have fewer objects and only slightly more total size than without compaction, not that big a difference!
The rest of the things you noticed seem ok.
@TimSimmons Can you show the logs of the compactor? I am especially interested in whether the delete log lines are there, and can you check whether the things we logged as deleted actually got deleted from S3?
I remember the same issue with the GCS object store in the very beginning. We thought the delete code was deleting, but actually the error was ignored. Maybe we have a similar case here?
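For illustration, a minimal sketch (not the actual Thanos code) of that failure mode - an ignored error on delete makes it look like blocks were removed when they were not. The bucket interface here is an assumption; the real objstore.Bucket is richer.

package example

import (
	"context"
	"errors"
	"log"
)

// bucket is a minimal stand-in for an object store client.
type bucket interface {
	Delete(ctx context.Context, name string) error
}

// failingBucket always fails deletes, simulating an object store error.
type failingBucket struct{}

func (failingBucket) Delete(context.Context, string) error {
	return errors.New("permission denied")
}

// deleteBlockBroken is the failure mode described above: the error from the
// object store is dropped, so we log "deleted" even though nothing was removed.
func deleteBlockBroken(ctx context.Context, bkt bucket, name string) {
	_ = bkt.Delete(ctx, name) // error silently ignored
	log.Printf("deleted block %s", name)
}

// deleteBlock surfaces the error so a failed deletion is visible to the caller.
func deleteBlock(ctx context.Context, bkt bucket, name string) error {
	if err := bkt.Delete(ctx, name); err != nil {
		return err
	}
	log.Printf("deleted block %s", name)
	return nil
}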
Yea, sounds like deletion not quite working. GCS object listing is consistent – AFAIK S3 is not, but the time between compactions should be relatively large and I wouldn't have expected it to be an issue.
@Bplotka logs here: https://gist.github.com/TimSimmons/f552e2439321e13029daf2dcbf2b8066
The blocks that have been deleted appear to be actually deleted. But its decisions on what to delete seem a bit off. It seems to delete a couple of the blocks from early on, and then delete a bunch of new blocks it creates.
Thx for the logs, will take a look later unless @fabxc takes it sooner. No idea so far.
Some uploads fail from time to time:
level=error ts=2018-03-22T21:00:06.960068782Z caller=compact.go:157 msg="compaction failed" err="upload block: upload file /tmp/thanos2/01C97R5SDNRHWRV7QE4SMVR1YH/chunks/000001 as 01C97R5SDNRHWRV7QE4SMVR1YH/chunks/000001: upload s3 object: Put https://prod-int-spaces01.nyc3.internal.digitalocean.com/pandora-lts-tor1-slm/01C97R5SDNRHWRV7QE4SMVR1YH/chunks/000001?partNumber=1&uploadId=2~yC6-JaiXnWm4xj83X7_PGVaWzEKj01F: net/http: HTTP/1.x transport connection broken: write tcp 10.124.60.65:44114->10.74.44.30:443: write: broken pipe"
and reads:
level=error ts=2018-03-23T01:50:51.313383223Z caller=compact.go:157 msg="compaction failed" err="download block 01C8TEER62BGV785DSCC599WY0: copy object to file: Get https://prod-int-spaces01.nyc3.internal.digitalocean.com/pandora-lts-tor1-slm/01C8TEER62BGV785DSCC599WY0/index: net/http: TLS handshake timeout"
We should retry on these, I guess.
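A hedged sketch of what such a retry could look like: a plain retry loop with doubling backoff around an arbitrary upload/download operation. Nothing here is the actual Thanos code, and the decision about which errors are truly transient (broken pipe, TLS handshake timeout, ...) is left to the caller.

package example

import (
	"context"
	"fmt"
	"time"
)

// retry runs op up to attempts times, sleeping with a doubling backoff between
// tries, and gives up early if the context is cancelled.
func retry(ctx context.Context, attempts int, initialBackoff time.Duration, op func() error) error {
	backoff := initialBackoff
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

It would then wrap the existing upload/download calls, e.g. retry(ctx, 3, 2*time.Second, func() error { return uploadBlock(...) }), where uploadBlock stands for whatever the compactor already calls today.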
There is nothing obvious that would explain such increase in size and number of blocks.
The oddest thing for me is that, for example, this line:
msg="downloaded blocks" blocks="[/tmp/thanos2/01C8RGN5CR3AYD3TA0MPB9VYH4 /tmp/thanos2/01C8RQGWMRWTCNJADBZGD7SQMP]"
appears 711 times in log2.txt. I can't imagine why it'd need to do that.
Can you post the configuration for your compactor?
It's not AWS only - I think I can repro it ):
Ah good to hear. Well not _good_ but you know.
This was how I invoked it:
S3_SECRET_KEY="xxx" ./thanos compact \
--s3.bucket <bucket> \
--s3.endpoint <endpoint> \
--s3.access-key xxxx \
--log.level debug \
--data-dir /tmp/thanos > ~/thanos-log2.txt 2>&1
No other config afaik. It was compiled off of my latest branch. The sidecar I'm running is from early January.
We need some metrics to make sure we can catch these kinds of things earlier.
@TimSimmons stop your compactor now, we can debug the logs to see what happened. Seems like the compactor produces too many things.. If that's true, we need to find the root cause & write a tool/compactor step to repair what it broke ^^
Sounds like fun!
Cool, I stopped them. I wouldn't worry about fixing it unless you've got stuff busted that you need fixed. I'm not fussed about my stuff.
Well - I care about mine xD
I spent some more time debugging this. Update:
Checked my bucket for overlapping blocks thanks to this: https://github.com/improbable-eng/thanos/pull/257
All good. So the compactor logic seems somehow weird (it compacts the same blocks into new ones ~200 times), but at the end all extra blocks are removed, so the state is fine. @TimSimmons you can run this command against your bucket as well to check for overlaps, but I think it's fine.
make build && ./thanos bucket <s3 flags to configure your bucket> verify -i overlapped_blocks
Regarding the issue itself:
After further investigation I spotted that the same blocks are compacted in every compaction iteration within a certain time range. For example, certain blocks are compacted from
ts=2018-03-23T06:44:15.826366538Z
until around:
ts=2018-03-23T08:44:33.519088648Z
So what happened at 8:44:33? Well, the block that was created 2h earlier appeared in syncMetas. Why so late?
Because of:
// ULIDs contain a millisecond timestamp. We do not consider blocks
// that have been created too recently to avoid races when a block
// is only partially uploaded.
if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond) {
	return nil
}
We found the root cause.
I will now investigate a possible fix -> We want our compactor to be kind of stateless, so it's not enough to just preserve (cache) the newly compacted, uploaded block locally to have it immediately visible to the syncer. Any ideas @fabxc?
Maybe we should delete the blocks used by a compaction immediately after the compacted block is uploaded?
(That is the logic for local TSDB compaction, BTW.)
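As a rough sketch of that option (hypothetical names; the real compactor group code differs): delete the source blocks from the bucket as soon as the compacted block is uploaded, so the next sync cannot pick them up again while the new block is still hidden by the sync delay.

package example

import (
	"context"
	"fmt"
)

// finishCompaction sketches the "delete sources right away" option.
// uploadCompacted and deleteBlock stand in for the existing upload/delete code
// (deleteBlock is assumed to remove every object under the block's prefix);
// they are not real Thanos functions.
func finishCompaction(
	ctx context.Context,
	sources []string,
	uploadCompacted func(ctx context.Context) error,
	deleteBlock func(ctx context.Context, id string) error,
) error {
	if err := uploadCompacted(ctx); err != nil {
		return fmt.Errorf("upload compacted block: %w", err)
	}
	// Source blocks are removed immediately instead of waiting for a later
	// garbage-collection pass, so repeated re-compaction of the same inputs
	// cannot happen.
	for _, id := range sources {
		if err := deleteBlock(ctx, id); err != nil {
			return fmt.Errorf("delete source block %s: %w", id, err)
		}
	}
	return nil
}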
OK, after fixing our bucket verify I can see all the duplicates ):
This fixes the root cause https://github.com/improbable-eng/thanos/pull/258
But most likely we need to kill duplicated blocks manually.
Fixes merged, but now I am implementing a command to safely remove all duplicates (move them to a backup bucket with TTL).
Oops, bad button.
With master, it seems to get through the initial compaction mostly ok (# of objects/total size decreased dramatically), but things eventually go south with:
level=debug ts=2018-04-04T22:36:04.178630034Z caller=compact.go:172 msg="download meta" block=01CA9B1Y8RRVWPF8E6MMDTT20B
level=warn ts=2018-04-04T22:36:04.182021863Z caller=compact.go:176 msg="downloading meta.json failed" err="decode meta.json for block 01CA9B1Y8RRVWPF8E6MMDTT20B: The specified key does not exist."
level=info ts=2018-04-04T22:36:04.230062115Z caller=compact.go:177 msg="start first pass of downsampling"
level=error ts=2018-04-04T22:36:04.60553534Z caller=main.go:147 msg="running command failed" err="first pass of downsampling failed: retrieve bucket block metas: decode meta: The specified key does not exist."
If I grep the log, it seems I got a failed upload for that block:
$ grep "01CA9B1Y8RRVWPF8E6MMDTT20B" thanos-log.txt
level=error ts=2018-04-04T22:07:01.765908943Z caller=compact.go:157 msg="compaction failed" err="upload block: upload file /tmp/thanos/01CA9B1Y8RRVWPF8E6MMDTT20B/index as 01CA9B1Y8RRVWPF8E6MMDTT20B/index: upload s3 object: This multipart completion is already in progress"
level=debug ts=2018-04-04T22:35:09.779591747Z caller=compact.go:172 msg="download meta" block=01CA9B1Y8RRVWPF8E6MMDTT20B
level=warn ts=2018-04-04T22:35:09.85736662Z caller=compact.go:176 msg="downloading meta.json failed" err="decode meta.json for block 01CA9B1Y8RRVWPF8E6MMDTT20B: The specified key does not exist."
level=debug ts=2018-04-04T22:36:04.178630034Z caller=compact.go:172 msg="download meta" block=01CA9B1Y8RRVWPF8E6MMDTT20B
level=warn ts=2018-04-04T22:36:04.182021863Z caller=compact.go:176 msg="downloading meta.json failed" err="decode meta.json for block 01CA9B1Y8RRVWPF8E6MMDTT20B: The specified key does not exist."
So it probably needs to verify/retry or something like that? Am I interpreting these results correctly?
Yea I think you are totally right!
Any ideas what
err="upload block: upload file /tmp/thanos/01CA9B1Y8RRVWPF8E6MMDTT20B/index as 01CA9B1Y8RRVWPF8E6MMDTT20B/index: upload s3 object: This multipart completion is already in progress"
error means? Are we uploading that index file somewhere else, or what? If it's a transient error -> yea, a retry should be correct. Otherwise, a metric + alert, I think.
Actually, ignore the above. If UploadDir behaved correctly, all should be good.
The correct behaviour is: uploadDir returns an error when the dir is NOT fully uploaded and cleans up the partial upload; then the compactor will do the same compaction again and thus retry the upload.
The current flow indicates that a failed uploadDir did not clean up the partially uploaded dir. Can you look into whether we have that logic for S3?
I think 01CA9B1Y8RRVWPF8E6MMDTT20B is just a rubbish, partially uploaded block that is ok to remove. Most likely another block was produced in its place (because that compaction errored).
I don't think that logic lives in the s3 plugin; from what I understand, the objstore plugins just implement "upload this file".
I don't see any logic to clean up a partially uploaded dir when things fail, if I'm reading it correctly.
Compacting every block happens here:
https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/compact.go#L149-L168
The upload happens here:
https://github.com/improbable-eng/thanos/blob/master/pkg/compact/compact.go#L571-L574
If it fails, the error is returned upwards, but no cleanup is ever attempted:
https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/compact.go#L149-L168
UploadDir just ends up calling the bkt.Upload function:
https://github.com/improbable-eng/thanos/blob/master/pkg/objstore/objstore.go#L44-L63
Idk, maybe I'm wrong, but I can't see that "failed uploadDir cleans up the partially uploaded dir" logic anywhere. Not in the GCS bits either.
Note: after I deleted that block, compaction finished successfully.
You are totally right, thanks for checking.
I double checked and yea, the answer is simple:
a failed uploadDir does not clean up the partially uploaded dir.
That flow just does not exist. (:
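A minimal sketch of what that missing flow could look like: if the directory upload fails partway, remove whatever objects already landed under the block's prefix before returning the error. The bucket interface below is an assumption, not the exact objstore API (in particular, Iter is assumed to visit every object under the prefix).

package example

import (
	"context"
	"fmt"
)

// objBucket is a minimal stand-in for the object store client.
type objBucket interface {
	Iter(ctx context.Context, prefix string, f func(name string) error) error
	Delete(ctx context.Context, name string) error
}

// uploadDirOrCleanup runs the provided directory upload and, if it fails,
// deletes whatever objects already landed under dstPrefix, so no partial block
// (e.g. chunks without a meta.json) is left behind to confuse later syncs.
// The compactor can then simply redo the compaction and retry the upload.
func uploadDirOrCleanup(ctx context.Context, bkt objBucket, dstPrefix string, uploadDir func() error) error {
	uploadErr := uploadDir()
	if uploadErr == nil {
		return nil
	}
	if cleanupErr := bkt.Iter(ctx, dstPrefix, func(name string) error {
		return bkt.Delete(ctx, name)
	}); cleanupErr != nil {
		return fmt.Errorf("upload %s failed (%v); cleanup of partial upload also failed: %w", dstPrefix, uploadErr, cleanupErr)
	}
	return fmt.Errorf("upload %s (partial objects cleaned up): %w", dstPrefix, uploadErr)
}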
PS: I managed to fully repair my storage after the issues from this ticket (:
Let's merge the repair command bits for future use (hopefully they won't be needed):
https://github.com/improbable-eng/thanos/pull/269
https://github.com/improbable-eng/thanos/pull/268
With this being said, I am going to close this issue and set up a new one (https://github.com/improbable-eng/thanos/issues/273) for the issues you mentioned. Thanks for the help!