level=debug name=thanos-compactor ts=2018-03-14T16:08:22.330077587Z caller=compact.go:571 compactionGroup=11a8f9199ddc09ad msg="downloaded blocks" blocks="[/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]" duration=2.2325282s
level=error name=thanos-compactor ts=2018-03-14T16:08:22.365408779Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]: open /compactor-data/groups/11a8f9199ddc09ad/tmp/01C6Z7A5MKP0MF6HR863FDSG28/chunks: no such file or directory"
level=debug name=thanos-compactor ts=2018-03-14T16:08:25.488596021Z caller=compact.go:571 compactionGroup=aef416e3a2758b0f msg="downloaded blocks" blocks="[/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]" duration=3.122579475s
level=info name=thanos-compactor ts=2018-03-14T16:08:25.501648107Z caller=compact.go:387 msg="compact blocks" count=4 mint=1520985600000 maxt=1521014400000
level=error name=thanos-compactor ts=2018-03-14T16:08:25.555418327Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]: write compaction: get series 5735: invalid checksum"
level=debug name=thanos-compactor ts=2018-03-14T16:13:23.012490052Z caller=compact.go:571 compactionGroup=11a8f9199ddc09ad msg="downloaded blocks" blocks="[/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]" duration=2.896049324s
level=error name=thanos-compactor ts=2018-03-14T16:13:23.051682944Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/11a8f9199ddc09ad/01C6Z70BV8QHHFQ5X1DQDE225A /compactor-data/groups/11a8f9199ddc09ad/01C6Z70JPDPJ07Q2FQG7X533YB /compactor-data/groups/11a8f9199ddc09ad/01C6Z7A5MKP0MF6HR863FDSG28]: open /compactor-data/groups/11a8f9199ddc09ad/tmp/01C6Z7A5MKP0MF6HR863FDSG28/chunks: no such file or directory"
level=debug name=thanos-compactor ts=2018-03-14T16:13:26.54207661Z caller=compact.go:571 compactionGroup=aef416e3a2758b0f msg="downloaded blocks" blocks="[/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]" duration=3.48974054s
level=info name=thanos-compactor ts=2018-03-14T16:13:26.550168484Z caller=compact.go:387 msg="compact blocks" count=4 mint=1520985600000 maxt=1521014400000
level=error name=thanos-compactor ts=2018-03-14T16:13:26.61308472Z caller=compact.go:152 msg="compaction failed" err="compact blocks [/compactor-data/groups/aef416e3a2758b0f/01C8H76F0Y88QB6J30B3NRQE99 /compactor-data/groups/aef416e3a2758b0f/01C8HE268ZQWMYN7J2KEH90DYG /compactor-data/groups/aef416e3a2758b0f/01C8HMXXGZBM311QQKG24YRGXQ /compactor-data/groups/aef416e3a2758b0f/01C8HVSMRYE20VZTJ2WSXE33D6]: write compaction: get series 5735: invalid checksum"
I'm also seeing issues with the compactor, though not sure if they're related at this point (likely an issue with unclean shutdown on the service I'm using to write the TSDB blocks).
level=error ts=2018-03-15T02:11:00.862765788Z caller=main.go:147 msg="running command failed" err="critical error detected: invalid plan block /app/data/compact/01C8HH6P59EHX22BH95QD03XWY: open index file: try lock file: open /app/data/compact/01C8HH6P59EHX22BH95QD03XWY/index: no such file or directory"
Hm, yea. I need to run it once again with current master. There was a bug in master TSDB and we did not pin the tsdb vendor until yesterday's PR (https://github.com/improbable-eng/thanos/pull/247). The bug had different symptoms, but let's see.
I'm getting the same error trying to compact on master
msg="compaction failed" err="invalid plan block /tmp/thanos/01C5775HEWYQGHT555K0TQ55HN: open index file: try lock file: open /tmp/thanos/01C5775HEWYQGHT555K0TQ55HN/index: no such file or directory"
Thanks all for input!
This PR added the section that seems to cause the problems: https://github.com/improbable-eng/thanos/pull/194/files#r176140589
@fabxc any quick ideas?
Will have more time tomorrow for this.
If it's always about the index file not being there in the first place it's hard to tell? Maybe an aborted upload or download?
Is it always about the same block(s) or about many different ones?
Not quite sure where the "try lock file" comes from tbh.
For me it happens with every bucket, even if the index is there. As Bplotka says, it's just that the index file doesn't get downloaded before block.VerifyIndex gets called. It does get called again on L648 after things get downloaded.
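For illustration only, a minimal sketch of the ordering that works, using made-up helper names (not the real pkg/compact or pkg/block functions): download the block directory first, then verify its index.

package example

import (
	"context"
	"fmt"
	"path/filepath"
)

// downloadBlockDir and verifyIndexFile are assumed helpers standing in for the
// real download/verify code; they are not the actual Thanos functions.
type helpers struct {
	downloadBlockDir func(ctx context.Context, id, dir string) error
	verifyIndexFile  func(path string) error
}

// prepareBlock shows the order that works: verification can only succeed once
// the block (including its index file) exists on local disk.
func (h helpers) prepareBlock(ctx context.Context, id, dir string) error {
	if err := h.downloadBlockDir(ctx, id, dir); err != nil {
		return fmt.Errorf("download block %s: %w", id, err)
	}
	if err := h.verifyIndexFile(filepath.Join(dir, "index")); err != nil {
		return fmt.Errorf("verify index of block %s: %w", id, err)
	}
	return nil
}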
Oh, maybe just a bad rebase then – there was a lot of that in the chain of downsampling PRs.
Removed it in #250, just in case y'all think that's it. Deleting it allowed my compaction to continue, although I've run into other seemingly unrelated errors down the line.
Please let us know about all of them. The compactor was altered really heavily with the downsampling additions, and regressions are quite possible.
OK, I think I fixed the verifyIndex bug (the tests for the compactor were kind of broken -> they did not ensure a proper input state, which is why we missed this) and found the proper place for that verify: https://github.com/improbable-eng/thanos/pull/251
I cannot find any other bug there, will deploy new compactor on my side.
All works for me. I got some unrelated issue (a tsdb issue or my own malformed thing).
All fixed from what I can tell. Let me know if you think otherwise.
This did allow compaction to continue, but I'm unsure if what I'm seeing is correct. I'm wondering if there might be an issue in the s3 code for iterating a bucket or something. I've run against two different buckets.
Here are some logs:
https://gist.github.com/TimSimmons/f552e2439321e13029daf2dcbf2b8066
I invoked the command like:
S3_SECRET_KEY="xxx" ./thanos compact \
--s3.bucket <bucket> \
--s3.endpoint <endpoint> \
--s3.access-key xxxx \
--log.level debug \
--data-dir /tmp/thanos > ~/thanos-log2.txt 2>&1
Some things I noticed (these are still running as of typing this):
(can't find meta.json, etc.)
Am I doing something wrong? Am I too impatient? I'm not sure what to expect, so I thought y'all might be able to help :)
No - thanks for noting it here and being careful!
~14GB, and 3-4k objects.
This sounds quite bad - you should have fewer objects and only slightly more total size than without compaction, not that big a difference!
The rest of the things you noticed seem ok.
@TimSimmons Can you show the logs of the compactor? I am especially interested in whether the delete log lines are there, and can you check whether the things we logged as deleted actually got deleted from S3?
I remember the same issue with the GCS object store in the very beginning. We thought the delete code was deleting, but actually the error was ignored. Maybe we have a similar case here?
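For illustration, a minimal sketch (not the actual Thanos code) of that failure mode - an ignored error on delete makes it look like blocks were removed when they were not. The bucket interface here is an assumption; the real objstore.Bucket is richer.

package example

import (
	"context"
	"errors"
	"log"
)

// bucket is a minimal stand-in for an object store client.
type bucket interface {
	Delete(ctx context.Context, name string) error
}

// failingBucket always fails deletes, simulating an object store error.
type failingBucket struct{}

func (failingBucket) Delete(context.Context, string) error {
	return errors.New("permission denied")
}

// deleteBlockBroken is the failure mode described above: the error from the
// object store is dropped, so we log "deleted" even though nothing was removed.
func deleteBlockBroken(ctx context.Context, bkt bucket, name string) {
	_ = bkt.Delete(ctx, name) // error silently ignored
	log.Printf("deleted block %s", name)
}

// deleteBlock surfaces the error so a failed deletion is visible to the caller.
func deleteBlock(ctx context.Context, bkt bucket, name string) error {
	if err := bkt.Delete(ctx, name); err != nil {
		return err
	}
	log.Printf("deleted block %s", name)
	return nil
}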
Yea, sounds like deletion not quite working. GCS object listing is consistent – AFAIK S3 is not, but the time between compactions should be relatively large and I wouldn't have expected it to be an issue.
@Bplotka logs here: https://gist.github.com/TimSimmons/f552e2439321e13029daf2dcbf2b8066
The blocks that have been deleted appear to be actually deleted. But its decisions on what to delete seem a bit off. It seems to delete a couple of the blocks from early on, and then delete a bunch of new blocks it creates.
Thx for the logs, will take a look later unless @fabxc takes it sooner. No idea so far.
Some uploads fail from time to time:
level=error ts=2018-03-22T21:00:06.960068782Z caller=compact.go:157 msg="compaction failed" err="upload block: upload file /tmp/thanos2/01C97R5SDNRHWRV7QE4SMVR1YH/chunks/000001 as 01C97R5SDNRHWRV7QE4SMVR1YH/chunks/000001: upload s3 object: Put https://prod-int-spaces01.nyc3.internal.digitalocean.com/pandora-lts-tor1-slm/01C97R5SDNRHWRV7QE4SMVR1YH/chunks/000001?partNumber=1&uploadId=2~yC6-JaiXnWm4xj83X7_PGVaWzEKj01F: net/http: HTTP/1.x transport connection broken: write tcp 10.124.60.65:44114->10.74.44.30:443: write: broken pipe"
and reads:
level=error ts=2018-03-23T01:50:51.313383223Z caller=compact.go:157 msg="compaction failed" err="download block 01C8TEER62BGV785DSCC599WY0: copy object to file: Get https://prod-int-spaces01.nyc3.internal.digitalocean.com/pandora-lts-tor1-slm/01C8TEER62BGV785DSCC599WY0/index: net/http: TLS handshake timeout"
We should retry on these, I guess.
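A hedged sketch of what such a retry could look like: a plain retry loop with doubling backoff around an arbitrary upload/download operation. Nothing here is the actual Thanos code, and the decision about which errors are truly transient (broken pipe, TLS handshake timeout, ...) is left to the caller.

package example

import (
	"context"
	"fmt"
	"time"
)

// retry runs op up to attempts times, sleeping with a doubling backoff between
// tries, and gives up early if the context is cancelled.
func retry(ctx context.Context, attempts int, initialBackoff time.Duration, op func() error) error {
	backoff := initialBackoff
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

It would then wrap the existing upload/download calls, e.g. retry(ctx, 3, 2*time.Second, func() error { return uploadBlock(...) }), where uploadBlock stands for whatever the compactor already calls today.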
There is nothing obvious that would explain such increase in size and number of blocks.
The oddest thing for me is that, for example, this line:
msg="downloaded blocks" blocks="[/tmp/thanos2/01C8RGN5CR3AYD3TA0MPB9VYH4 /tmp/thanos2/01C8RQGWMRWTCNJADBZGD7SQMP]"
appears 711 times in log2.txt. I can't imagine why it'd need to do that.
Can you post the configuration for your compactor?
It's not AWS only - I think I can repro it ):
Ah good to hear. Well not _good_ but you know.
This was how I invoked it:
S3_SECRET_KEY="xxx" ./thanos compact \
--s3.bucket <bucket> \
--s3.endpoint <endpoint> \
--s3.access-key xxxx \
--log.level debug \
--data-dir /tmp/thanos > ~/thanos-log2.txt 2>&1
No other config afaik. It was compiled off of my latest branch. The sidecar I'm running is from early January.
We need some metrics to make sure we can catch these kinds of things earlier.
@TimSimmons stop your compactor now, we can debug the logs to see what happened. Seems like the compactor produces too many things.. If that's true, we need to find the root cause & write a tool/compactor step to repair what it broke ^^
Sounds like fun!
Cool, I stopped them. I wouldn't worry about fixing it unless you've got stuff busted that you need fixed. I'm not fussed about my stuff.
Well - I care about mine xD
I spent some more time debugging this. Update:
Checked my bucket for overlapping blocks thanks to this: https://github.com/improbable-eng/thanos/pull/257
All good. So the compactor logic seems somehow weird (it compacts the same blocks into new ones ~200 times), but at the end all extra blocks are removed, so the state is fine. @TimSimmons you can run this command against your bucket as well to check for overlaps, but I think it's fine.
make build && ./thanos bucket <s3 flags to configure your bucket> verify -i overlapped_blocks
Regarding the issue itself:
After further investigation I spotted that the same blocks are compacted in every compaction iteration within a certain time range. For example, certain blocks are compacted from
ts=2018-03-23T06:44:15.826366538Z
until around:
ts=2018-03-23T08:44:33.519088648Z
So what happened at 8:44:33? Well, the block that was created 2h earlier appeared in syncMetas. Why so late?
Because of:
// ULIDs contain a millisecond timestamp. We do not consider blocks
// that have been created too recently to avoid races when a block
// is only partially uploaded.
if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond) {
	return nil
}
We found the root cause.
I will now investigate a possible fix -> We want our compactor to be kind of stateless, so it's not enough to just preserve (cache) the newly compacted, uploaded block locally to have it immediately visible to the syncer. Any ideas @fabxc?
Maybe we should delete the blocks used by a compaction immediately after the compacted block is uploaded?
(That is the logic for local TSDB compaction, BTW.)
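As a rough sketch of that option (hypothetical names; the real compactor group code differs): delete the source blocks from the bucket as soon as the compacted block is uploaded, so the next sync cannot pick them up again while the new block is still hidden by the sync delay.

package example

import (
	"context"
	"fmt"
)

// finishCompaction sketches the "delete sources right away" option.
// uploadCompacted and deleteBlock stand in for the existing upload/delete code
// (deleteBlock is assumed to remove every object under the block's prefix);
// they are not real Thanos functions.
func finishCompaction(
	ctx context.Context,
	sources []string,
	uploadCompacted func(ctx context.Context) error,
	deleteBlock func(ctx context.Context, id string) error,
) error {
	if err := uploadCompacted(ctx); err != nil {
		return fmt.Errorf("upload compacted block: %w", err)
	}
	// Source blocks are removed immediately instead of waiting for a later
	// garbage-collection pass, so repeated re-compaction of the same inputs
	// cannot happen.
	for _, id := range sources {
		if err := deleteBlock(ctx, id); err != nil {
			return fmt.Errorf("delete source block %s: %w", id, err)
		}
	}
	return nil
}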
OK, after fixing our bucket verify I can see all the duplicates ):
This fixes the root cause https://github.com/improbable-eng/thanos/pull/258
But most likely we need to kill duplicated blocks manually.
Fixes merged, but now I am implementing a command to safely remove all duplicates (move them to a backup bucket with TTL).
Oops, bad button.
With master, it seems to get through the initial compaction mostly ok (# of objects/total size decreased dramatically), but things eventually go south with:
level=debug ts=2018-04-04T22:36:04.178630034Z caller=compact.go:172 msg="download meta" block=01CA9B1Y8RRVWPF8E6MMDTT20B
level=warn ts=2018-04-04T22:36:04.182021863Z caller=compact.go:176 msg="downloading meta.json failed" err="decode meta.json for block 01CA9B1Y8RRVWPF8E6MMDTT20B: The specified key does not exist."
level=info ts=2018-04-04T22:36:04.230062115Z caller=compact.go:177 msg="start first pass of downsampling"
level=error ts=2018-04-04T22:36:04.60553534Z caller=main.go:147 msg="running command failed" err="first pass of downsampling failed: retrieve bucket block metas: decode meta: The specified key does not exist."
If I grep the log, it seems I got a failed upload for that block:
$ grep "01CA9B1Y8RRVWPF8E6MMDTT20B" thanos-log.txt
level=error ts=2018-04-04T22:07:01.765908943Z caller=compact.go:157 msg="compaction failed" err="upload block: upload file /tmp/thanos/01CA9B1Y8RRVWPF8E6MMDTT20B/index as 01CA9B1Y8RRVWPF8E6MMDTT20B/index: upload s3 object: This multipart completion is already in progress"
level=debug ts=2018-04-04T22:35:09.779591747Z caller=compact.go:172 msg="download meta" block=01CA9B1Y8RRVWPF8E6MMDTT20B
level=warn ts=2018-04-04T22:35:09.85736662Z caller=compact.go:176 msg="downloading meta.json failed" err="decode meta.json for block 01CA9B1Y8RRVWPF8E6MMDTT20B: The specified key does not exist."
level=debug ts=2018-04-04T22:36:04.178630034Z caller=compact.go:172 msg="download meta" block=01CA9B1Y8RRVWPF8E6MMDTT20B
level=warn ts=2018-04-04T22:36:04.182021863Z caller=compact.go:176 msg="downloading meta.json failed" err="decode meta.json for block 01CA9B1Y8RRVWPF8E6MMDTT20B: The specified key does not exist."
So it probably needs to verify/retry or something like that? Am I interpreting these results correctly?
Yea I think you are totally right!
Any ideas what
err="upload block: upload file /tmp/thanos/01CA9B1Y8RRVWPF8E6MMDTT20B/index as 01CA9B1Y8RRVWPF8E6MMDTT20B/index: upload s3 object: This multipart completion is already in progress"
error means? Are we uploading that index file somewhere else, or what? If it's a transient error -> yea, a retry should be correct. Otherwise, a metric + alert, I think.
Actually, ignore the above. If UploadDir behaved correctly, all should be good.
The correct behaviour is: uploadDir returns an error when the dir is NOT fully uploaded and cleans up the partial upload; then the compactor will do the same compaction again and thus retry the upload.
The current flow indicates that a failed uploadDir did not clean up the partially uploaded dir. Can you look into whether we have that logic for S3?
I think 01CA9B1Y8RRVWPF8E6MMDTT20B is just a rubbish, partially uploaded block that is ok to remove. Most likely another block was produced in its place (because that compaction errored).
I don't think that logic lives in the s3 plugin; from what I understand, the objstore plugins just implement "upload this file".
I don't see any logic to clean up a partially uploaded dir when things fail, if I'm reading it correctly.
Compacting every block happens here:
https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/compact.go#L149-L168
The upload happens here:
https://github.com/improbable-eng/thanos/blob/master/pkg/compact/compact.go#L571-L574
If it fails, the error is returned upwards, but no cleanup is ever attempted:
https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/compact.go#L149-L168
UploadDir just ends up calling the bkt.Upload function:
https://github.com/improbable-eng/thanos/blob/master/pkg/objstore/objstore.go#L44-L63
Idk, maybe I'm wrong, but I can't see that "failed uploadDir cleans up the partially uploaded dir" logic anywhere. Not in the GCS bits either.
Note: after I deleted that block, compaction finished successfully.
You are totally right, thanks for checking.
I double checked and yea, the answer is simple:
a failed uploadDir does not clean up the partially uploaded dir.
That flow just does not exist. (:
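A minimal sketch of what that missing flow could look like: if the directory upload fails partway, remove whatever objects already landed under the block's prefix before returning the error. The bucket interface below is an assumption, not the exact objstore API (in particular, Iter is assumed to visit every object under the prefix).

package example

import (
	"context"
	"fmt"
)

// objBucket is a minimal stand-in for the object store client.
type objBucket interface {
	Iter(ctx context.Context, prefix string, f func(name string) error) error
	Delete(ctx context.Context, name string) error
}

// uploadDirOrCleanup runs the provided directory upload and, if it fails,
// deletes whatever objects already landed under dstPrefix, so no partial block
// (e.g. chunks without a meta.json) is left behind to confuse later syncs.
// The compactor can then simply redo the compaction and retry the upload.
func uploadDirOrCleanup(ctx context.Context, bkt objBucket, dstPrefix string, uploadDir func() error) error {
	uploadErr := uploadDir()
	if uploadErr == nil {
		return nil
	}
	if cleanupErr := bkt.Iter(ctx, dstPrefix, func(name string) error {
		return bkt.Delete(ctx, name)
	}); cleanupErr != nil {
		return fmt.Errorf("upload %s failed (%v); cleanup of partial upload also failed: %w", dstPrefix, uploadErr, cleanupErr)
	}
	return fmt.Errorf("upload %s (partial objects cleaned up): %w", dstPrefix, uploadErr)
}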
PS: I managed to fully repair my storage after the issues from this ticket (:
Let's merge the repair command bits for future use (hopefully they won't be needed):
https://github.com/improbable-eng/thanos/pull/269
https://github.com/improbable-eng/thanos/pull/268
With this being said, I am going to close this issue and set up a new one (https://github.com/improbable-eng/thanos/issues/273) for the issues you mentioned. Thanks for the help!