The global compactor is halting due to the following error:
invalid result block /app/data/compact/01CAAX3A54YPTXPMGMQTJT9T7F: 119381/1087562 series have an average of 1.340 out-of-order chunks
I haven't dug into the issue yet, but I wonder if it's related to @gouthamve's comment that out-of-order appends are not prevented in the TSDB library when out-of-order samples are appended to the same appender:
https://github.com/prometheus/tsdb/pull/258#issuecomment-378153699
Hm... the question is why the OOO (out-of-order) chunk was actually appended in the wrong order.
Your TSDB PR makes sense, but here we have OOO chunks, not samples. I have never seen that problem before.
Can you track 01CAAX3A54YPTXPMGMQTJT9T7F down? Is that block a compacted one? The first compacted one? The 10th? Can we grep your logs for this ULID?
@mattbostock While we don't error out, we drop the sample. OOO samples can never make it inside :) This looks like a Thanos issue.
That sounds like the bug we couldn't track down but for which we added the halting – so we can detect it early on. In the original case we observed, sequences of 1-3 chunks were simply repeated anywhere from a few to hundreds of times, i.e. no data loss (and not even visible during querying), but concerning nonetheless, of course.
Could you make the index file available for debugging?
In general, the thanos bucket check command may let you repair that issue. But getting the broken data for forensics first would be great.
Good point, thanks @gouthamve.
@Bplotka: 01CAAX3A54YPTXPMGMQTJT9T7F isn't persisted so can't be inspected using thanos bucket ls, but here are the debug logs from the compactor:
2018-04-05T12:39:49.000Z level=error ts=2018-04-05T12:39:49.942886616Z caller=compact.go:158 msg="compaction failed" err="invalid result block /app/data/compact/01CAAX3A54YPTXPMGMQTJT9T7F: 119381/1087562 series have an average of 1.340 out-of-order chunks" thanos-testing.compact
2018-04-05T12:39:49.000Z level=error ts=2018-04-05T12:39:49.942960438Z caller=compact.go:163 msg="critical error detected; halting" thanos-testing.compact
2018-04-05T12:39:48.000Z level=info ts=2018-04-05T12:39:48.607935828Z caller=compact.go:347 msg="compact blocks" count=7 mint=1522925400000 maxt=1522929600000 ulid=01CAAX3A54YPTXPMGMQTJT9T7F sources="[01CAAQP1TRAZTTE7VCN02TQ9FQ 01CAAR8KHJ5CF5M48BYC8FANRK 01CAARV0YR1AH46N01BVTE1KF0 01CAASD3MEYPK2825BK864SD0T 01CAASZMX64D8T32SSD0RCVCMR 01CAATHYQYCWN6GTK6RHW9WB9P 01CAAV48R8300SFWDYP8K0JJ7Q]" thanos-testing.compact
2018-04-05T12:39:48.000Z level=debug ts=2018-04-05T12:39:48.64333371Z caller=compact.go:547 compactionGroup="0@{__kafka2thanos_partition=\"0\"}" msg="compacted blocks" blocks="[/app/data/compact/01CAAQP1TRAZTTE7VCN02TQ9FQ /app/data/compact/01CAAR8KHJ5CF5M48BYC8FANRK /app/data/compact/01CAARV0YR1AH46N01BVTE1KF0 /app/data/compact/01CAASD3MEYPK2825BK864SD0T /app/data/compact/01CAASZMX64D8T32SSD0RCVCMR /app/data/compact/01CAATHYQYCWN6GTK6RHW9WB9P /app/data/compact/01CAAV48R8300SFWDYP8K0JJ7Q]" duration=20.927944306s thanos-testing.compact
2018-04-05T12:39:27.000Z level=debug ts=2018-04-05T12:39:27.715311043Z caller=compact.go:538 compactionGroup="0@{__kafka2thanos_partition=\"0\"}" msg="downloaded blocks" blocks="[/app/data/compact/01CAAQP1TRAZTTE7VCN02TQ9FQ /app/data/compact/01CAAR8KHJ5CF5M48BYC8FANRK /app/data/compact/01CAARV0YR1AH46N01BVTE1KF0 /app/data/compact/01CAASD3MEYPK2825BK864SD0T /app/data/compact/01CAASZMX64D8T32SSD0RCVCMR /app/data/compact/01CAATHYQYCWN6GTK6RHW9WB9P /app/data/compact/01CAAV48R8300SFWDYP8K0JJ7Q]" duration=15.581562228s
So 01CAAX3A54YPTXPMGMQTJT9T7F is a compacted block and is the first compaction attempted.
Here's the metadata for the source blocks used for compaction, which I queried using thanos bucket ls:
{
"version": 1,
"ulid": "01CAAQP1TRAZTTE7VCN02TQ9FQ",
"minTime": 1522925400000,
"maxTime": 1522926000000,
"stats": {
"numSamples": 803141,
"numSeries": 666332,
"numChunks": 669066
},
"compaction": {
"level": 1,
"sources": [
"01CAAQP1TRAZTTE7VCN02TQ9FQ"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAAR8KHJ5CF5M48BYC8FANRK",
"minTime": 1522926000000,
"maxTime": 1522926600000,
"stats": {
"numSamples": 10699530,
"numSeries": 1062388,
"numChunks": 1088379
},
"compaction": {
"level": 1,
"sources": [
"01CAAR8KHJ5CF5M48BYC8FANRK"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAARV0YR1AH46N01BVTE1KF0",
"minTime": 1522926600000,
"maxTime": 1522927200000,
"stats": {
"numSamples": 10720091,
"numSeries": 1062867,
"numChunks": 1092616
},
"compaction": {
"level": 1,
"sources": [
"01CAARV0YR1AH46N01BVTE1KF0"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAASD3MEYPK2825BK864SD0T",
"minTime": 1522927200000,
"maxTime": 1522927800000,
"stats": {
"numSamples": 10696100,
"numSeries": 1062900,
"numChunks": 1089044
},
"compaction": {
"level": 1,
"sources": [
"01CAASD3MEYPK2825BK864SD0T"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAASZMX64D8T32SSD0RCVCMR",
"minTime": 1522927800000,
"maxTime": 1522928400000,
"stats": {
"numSamples": 10696364,
"numSeries": 1063668,
"numChunks": 1089091
},
"compaction": {
"level": 1,
"sources": [
"01CAASZMX64D8T32SSD0RCVCMR"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAATHYQYCWN6GTK6RHW9WB9P",
"minTime": 1522928400000,
"maxTime": 1522929000000,
"stats": {
"numSamples": 10711089,
"numSeries": 1063985,
"numChunks": 1090410
},
"compaction": {
"level": 1,
"sources": [
"01CAATHYQYCWN6GTK6RHW9WB9P"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAAV48R8300SFWDYP8K0JJ7Q",
"minTime": 1522929000000,
"maxTime": 1522929600000,
"stats": {
"numSamples": 10719638,
"numSeries": 1064534,
"numChunks": 1092185
},
"compaction": {
"level": 1,
"sources": [
"01CAAV48R8300SFWDYP8K0JJ7Q"
]
},
"thanos": {
"labels": {
"__kafka2thanos_partition": "0"
},
"downsample": {
"resolution": 0
}
}
}
Min/max times for the source blocks (same as above, but summarised to make the times easier to read):
matt➜github.com/improbable-eng/thanos(master✗)» cat compacted.txt| jq -r -s '.[] | select(.ulid | match("^01CAAQP1TRAZTTE7VCN02TQ9FQ|01CAAR8KHJ5CF5M48BYC8FANRK|01CAARV0YR1AH46N01BVTE1KF0|01CAASD3MEYPK2825BK864SD0T|01CAASZMX64D8T32SSD0RCVCMR|01CAATHYQYCWN6GTK6RHW9WB9P|01CAAV48R8300SFWDYP8K0JJ7Q$"; "")) | [.minTime, .maxTime] | @tsv'
1522925400000 1522926000000
1522926000000 1522926600000
1522926600000 1522927200000
1522927200000 1522927800000
1522927800000 1522928400000
1522928400000 1522929000000
1522929000000 1522929600000
matt➜github.com/improbable-eng/thanos(master✗)»
Thanks! Looks like they are all non-overlapping, which means the compactor is facing the same case as in Prometheus – that excludes a fair amount of possible issues.
But I've absolutely no idea what it is, unfortunately (: @mattbostock Do you think sending all those input blocks our way is at all possible?
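For reference, the "non-overlapping" observation can be checked mechanically: block ranges are half-open [minTime, maxTime), so after sorting by minTime, each block must start no earlier than the latest maxTime seen so far. A minimal sketch in Go (the blockMeta struct here is illustrative, not the actual Thanos/TSDB meta types):

```go
package main

import (
	"fmt"
	"sort"
)

// blockMeta holds just the time range from a block's meta.json.
// Illustrative only - not the real Thanos/TSDB struct.
type blockMeta struct {
	ULID             string
	MinTime, MaxTime int64 // milliseconds, half-open [MinTime, MaxTime)
}

// findOverlaps sorts blocks by MinTime and reports pairs whose ranges
// intersect. Adjacent ranges (prev.MaxTime == next.MinTime) do NOT
// overlap because ranges are half-open.
func findOverlaps(blocks []blockMeta) [][2]string {
	sorted := append([]blockMeta(nil), blocks...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MinTime < sorted[j].MinTime })

	var overlaps [][2]string
	if len(sorted) == 0 {
		return overlaps
	}
	prev := sorted[0] // block with the greatest MaxTime seen so far
	for _, b := range sorted[1:] {
		if b.MinTime < prev.MaxTime {
			overlaps = append(overlaps, [2]string{prev.ULID, b.ULID})
		}
		if b.MaxTime > prev.MaxTime {
			prev = b
		}
	}
	return overlaps
}

func main() {
	// Three of the source blocks from above: each maxTime equals the next minTime.
	blocks := []blockMeta{
		{"01CAAQP1TRAZTTE7VCN02TQ9FQ", 1522925400000, 1522926000000},
		{"01CAAR8KHJ5CF5M48BYC8FANRK", 1522926000000, 1522926600000},
		{"01CAARV0YR1AH46N01BVTE1KF0", 1522926600000, 1522927200000},
	}
	fmt.Println(findOverlaps(blocks)) // no overlaps -> []
}
```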
01CAAX3A54YPTXPMGMQTJT9T7F isn't persisted
Ah, actually, you should still be able to find that output block on the halted compactor. If not, we should definitely make it so – without that, halting is kind of pointless, I guess :)
which means the compactor is facing the same case as in Prometheus
So the bug you mentioned before was observed in Prometheus? To be sure, do you suspect the issue is in the TSDB library?
Sharing the data directly is probably not going to be feasible. If you could send me some pointers, I'm happy to debug and can maybe share some more specific metadata.
So the bug you mentioned before was observed in Prometheus?
No, I just meant that we are using the vanilla TSDB compactor and the input we give it is entirely non-overlapping, just like in Prometheus.
To be sure, do you suspect the issue is in the TSDB library?
That's the question. Seems like there shouldn't be anything Thanos specific in that path, so a TSDB issue is possible.
That's what we have to find out now, I guess :)
If you can please send me some pointers, I'm happy to debug and maybe can share some more specific metadata.
You are probably very familiar with the TSDB library, so that helps :) I think your best bet would be pulling the blocks in question onto your machine and writing a standalone program that compacts them.
Then run Thanos' block verification (the check that makes it halt) on the result, but a bit more verbosely, to see how the chunks are actually out of order.
Inspecting the same chunks in the input blocks may reveal something. If everything looks normal, some simple debug printing in the compactor may reveal something.
You can probably just focus on a single affected series to avoid noise – it will likely be the same issue for all of them.
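The verification step mentioned here boils down to checking, per series, that chunks are sorted by min time and non-overlapping. A self-contained sketch of that check (chunkMeta is illustrative, not the real tsdb.ChunkMeta; the exact boundary semantics of the real check may differ):

```go
package main

import "fmt"

// chunkMeta is just a chunk's time range. Illustrative only.
type chunkMeta struct{ MinTime, MaxTime int64 }

// countOutOfOrder counts chunks that start before the previous chunk
// ended - roughly what the halting check asserts per series.
func countOutOfOrder(chunks []chunkMeta) int {
	n := 0
	for i := 1; i < len(chunks); i++ {
		if chunks[i].MinTime < chunks[i-1].MaxTime {
			n++
		}
	}
	return n
}

func main() {
	// A duplicated chunk, as later found in the broken blocks: the same
	// time range appears twice, so the second copy is "out of order".
	fmt.Println(countOutOfOrder([]chunkMeta{
		{100, 200}, {100, 200}, {210, 300},
	})) // 1 out-of-order chunk
}
```

The "N/M series have an average of X out-of-order chunks" error message would then just be this count aggregated over all series in the block.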
Hey,
I fixed the other compactor issues I was having, successfully performed the repair process, and reran the compactor. After a couple of hours I can repro this too.
Looking into it now.
My logs:
level=debug name=thanos-compactor ts=2018-04-07T13:43:36.438324184Z caller=compact.go:172 msg="download meta" block=01CAFYKHE98Q4Q569EM39HE2QT
level=debug name=thanos-compactor ts=2018-04-07T13:43:36.528939959Z caller=compact.go:172 msg="download meta" block=01CAFYM8XBB0DR677N1R24G9KB
level=debug name=thanos-compactor ts=2018-04-07T13:43:36.605182796Z caller=compact.go:172 msg="download meta" block=01CAFYNA7DACHGYB8RA0D76CPR
level=debug name=thanos-compactor ts=2018-04-07T13:43:42.703684027Z caller=compact.go:538 compactionGroup="0@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor_namespace=\"default\",replica=\"thanos-mon-0\"}" msg="downloaded blocks" blocks="[/compactor-data/compact-work-dir/01CAFYKHE98Q4Q569EM39HE2QT /compactor-data/compact-work-dir/01CAFW35QQWW9A8GRXK44GACAV]" duration=5.977901379s
level=info name=thanos-compactor ts=2018-04-07T13:43:57.609519251Z caller=compact.go:347 msg="compact blocks" count=2 mint=1521504000000 maxt=1521648000000 ulid=01CAG5JCBTJKQ4QRRS6YCXJQE8 sources="[01CAFYKHE98Q4Q569EM39HE2QT 01CAFW35QQWW9A8GRXK44GACAV]"
level=debug name=thanos-compactor ts=2018-04-07T13:43:57.646041674Z caller=compact.go:547 compactionGroup="0@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor_namespace=\"default\",replica=\"thanos-mon-0\"}" msg="compacted blocks" blocks="[/compactor-data/compact-work-dir/01CAFYKHE98Q4Q569EM39HE2QT /compactor-data/compact-work-dir/01CAFW35QQWW9A8GRXK44GACAV]" duration=14.942279547s
level=error name=thanos-compactor ts=2018-04-07T13:43:58.109889158Z caller=compact.go:175 msg="compaction failed" err="invalid result block /compactor-data/compact-work-dir/01CAG5JCBTJKQ4QRRS6YCXJQE8: 60472/102031 series have an average of 1.000 out-of-order chunks"
level=error name=thanos-compactor ts=2018-04-07T13:43:58.10996179Z caller=compact.go:180 msg="critical error detected; halting"
I found that my compactor restarted 18 times before halting ): The logs are weird. The status was "Terminated 0", so no error :scream:
logs
level=info name=thanos-compactor ts=2018-04-07T13:38:24.533743579Z caller=compact.go:240 msg="starting compact node"
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.687025887Z caller=compact.go:172 msg="download meta" block=01C6XZ1256S7VFNQP9D36XJ4F4
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.729746263Z caller=compact.go:172 msg="download meta" block=01C827TCWSYTFY55SRHW4QG5JT
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.751442425Z caller=compact.go:172 msg="download meta" block=01C87BZWG8S9HYKRWG1Q1T65ME
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.798424237Z caller=compact.go:172 msg="download meta" block=01C8CH2HEYQX6RTF75KGAE7BZM
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.825951983Z caller=compact.go:172 msg="download meta" block=01C8HNVWXFHPH9G2C9AY57HJ6P
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.845298132Z caller=compact.go:172 msg="download meta" block=01C8J2QQ4DMD2BPMEY2T9X3PB6
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.911706949Z caller=compact.go:172 msg="download meta" block=01C8KSK64GGRV6MDKPC08GQ7QD
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.931304839Z caller=compact.go:172 msg="download meta" block=01C8KSK6AADH50X1DB67WXB7VA
level=debug name=thanos-compactor ts=2018-04-07T13:38:24.980848163Z caller=compact.go:172 msg="download meta" block=01C8M0EXCH6AA166KEBEJ3BVCT
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.000510258Z caller=compact.go:172 msg="download meta" block=01C8M0EXGFV69KVZ92SW1KMEQZ
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.018310688Z caller=compact.go:172 msg="download meta" block=01C8M7AMMH833HBPVN9D0CNEW5
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.092030143Z caller=compact.go:172 msg="download meta" block=01C8M7AMT9BR7KHE8F6PK73QAM
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.109934872Z caller=compact.go:172 msg="download meta" block=01C8ME6BWJ49JFF8QFWP66Z50N
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.137005619Z caller=compact.go:172 msg="download meta" block=01C8ME6C2982KE0T5213JMEPZR
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.157069351Z caller=compact.go:172 msg="download meta" block=01C95DR60PDXPF5XYW3WX8HYW7
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.188664319Z caller=compact.go:172 msg="download meta" block=01C972CS6AYR8G12S8S5J343FS
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.231254334Z caller=compact.go:172 msg="download meta" block=01CAFPVWS4ADYAEKCX8P87VTXY
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.24801435Z caller=compact.go:172 msg="download meta" block=01CAFPX4ABMG8V9PVK2HH8EQTH
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.286700136Z caller=compact.go:172 msg="download meta" block=01CAFPYCDKX90ZH8F2BHJHVW7D
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.303544151Z caller=compact.go:172 msg="download meta" block=01CAFPZJBQMF0J8Y5SKC9Y32A1
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.320454597Z caller=compact.go:172 msg="download meta" block=01CAFW35QQWW9A8GRXK44GACAV
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.339247885Z caller=compact.go:172 msg="download meta" block=01CAFW413DT9KTVXFTV1ZPC4V0
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.357512783Z caller=compact.go:172 msg="download meta" block=01CAFW6KXXT40VF8TR1Q10KZF8
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.387581554Z caller=compact.go:172 msg="download meta" block=01CAFW6M0J566GRBCPA49N5ESM
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.413345789Z caller=compact.go:172 msg="download meta" block=01CAFXJF3XXS222KZECMJXDCH6
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.430856828Z caller=compact.go:172 msg="download meta" block=01CAFXMMAD9413KZ7TCV01Z3NB
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.449041277Z caller=compact.go:172 msg="download meta" block=01CAFXPZRJ5A5H07RNJM8KGP7Q
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.467764776Z caller=compact.go:172 msg="download meta" block=01CAFXRP0ET165VGR5WE5MWVSN
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.483201615Z caller=compact.go:172 msg="download meta" block=01CAFXTTC4B2HN0GWAKJPT3YAW
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.501041255Z caller=compact.go:172 msg="download meta" block=01CAFXX29AX68BPFJ3ZE2BRSXF
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.51721711Z caller=compact.go:172 msg="download meta" block=01CAFXZ44MW3V7RVD2RGXDZ16B
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.548254169Z caller=compact.go:172 msg="download meta" block=01CAFY172370BKMZTY9A6BN8YE
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.565518138Z caller=compact.go:172 msg="download meta" block=01CAFY3D8N65MV3EVE2W1VJRNA
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.58694948Z caller=compact.go:172 msg="download meta" block=01CAFY5B4NYT6K49S0CSBVW5E1
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.60592112Z caller=compact.go:172 msg="download meta" block=01CAFY7A8D2T2TYV0C7VT1PACJ
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.624782476Z caller=compact.go:172 msg="download meta" block=01CAFY9FCJG3GRSC7C6QNNKWAW
level=debug name=thanos-compactor ts=2018-04-07T13:38:25.640065511Z caller=compact.go:172 msg="download meta" block=01CAFYBE95M3VB913MZ607QK7N
level=info name=thanos-compactor ts=2018-04-07T13:38:25.762175764Z caller=compact.go:195 msg="start first pass of downsampling"
level=info name=thanos-compactor ts=2018-04-07T13:38:27.693713395Z caller=compact.go:201 msg="start second pass of downsampling"
level=info name=thanos-compactor ts=2018-04-07T13:38:29.649157594Z caller=main.go:150 msg=exiting
thanos-compactor-0 1/1 Running 18 2h
Not yet sure if related.
Ignore the above – I just had wait=false set.
The two blocks that are compacted into a broken result:
{
"version": 1,
"ulid": "01CAFYKHE98Q4Q569EM39HE2QT",
"minTime": 1521504000000,
"maxTime": 1521676800000,
"stats": {
"numSamples": 337673850,
"numSeries": 97437,
"numChunks": 2826731
},
"compaction": {
"level": 4,
"sources": [
"01C90NJRXGXXXD15M3N9S3SAN8",
"01C90WEG5JD2RQTJ1WKW0BYAHW",
"01C913A7GM7G4ZPYMW897503YJ",
"01C91A5YNK91VTS7QC0KVMVNAC",
"01C91H1NXH9CQHHNPVZBW24ASG",
"01C91QXD5GM3E3S406KVMVFPS5",
"01C91YS4DHF7XPGGXHFB48JZMF",
"01C925MVNJQWGBPN5FJQAMYK2B",
"01C94YX9XHY34JGD42MMP8H56J",
"01C955S15HE2G5N19MPB5W4W26",
"01C95CMRDHCZ06VZNM1W47G528",
"01C95KGFRCW4ETRFYB1B06D2WV"
]
},
"thanos": {
"labels": {
...,
"environment": "alpha",
"monitor_namespace": "default",
"replica": "thanos-mon-0"
},
"downsample": {
"resolution": 0
}
}
}
{
"version": 1,
"ulid": "01CAFW35QQWW9A8GRXK44GACAV",
"minTime": 1521561600000,
"maxTime": 1521648000000,
"stats": {
"numSamples": 338076640,
"numSeries": 65800,
"numChunks": 2829924
},
"compaction": {
"level": 3,
"sources": [
"01C92CGJXGMQMW5ZT4149PCSE1",
"01C92KCA5JXJ1TYZ5RK1G4ZKBQ",
"01C92T81DGFZQ08TTD4KX6YBXF",
"01C9313RNGK95T3D09S0W48841",
"01C937ZFXJ2D2V0KHWA3V27MCY",
"01C93EV75HT83WFDXMMD034Y6Z",
"01C93NPYDHGJER21JMTDXGFM4K",
"01C93WJNNH71EVGFR2T9A8PKCE",
"01C943ECXHEZSF98CA8YKGGJVQ",
"01C94AA45GSXPPD65511GNQVXD",
"01C94H5VDHA4234BJGBT4H4PND",
"01C94R1JNHSBKQTBR16F9PY14Y"
]
},
"thanos": {
"labels": {
...,
"environment": "alpha",
"monitor_namespace": "default",
"replica": "thanos-mon-0"
},
"downsample": {
"resolution": 0
}
}
}
These blocks look like they overlap ):
Yea.. more questions than answers:
level=info ts=2018-04-07T15:33:25.820501237Z caller=overlapped_blocks.go:20 msg="started verifying issue" with-repair=false issue=overlapped_blocks
level=warn ts=2018-04-07T15:33:43.71434158Z caller=overlapped_blocks.go:33 msg="found overlapped blocks" group="0@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor=\"thanos-coremon\",replica=\"thanos-coremon-1\"}" overlap="[mint: 1519660800000, maxt: 1519833600000, range: 48h0m0s, blocks: 2]: <ulid: 01CAG5EMFF64X73P7XMKYKXV1J, mint: 1519257600000, maxt: 1520467200000, range: 336h0m0s>, <ulid: 01CAG35P3G12F2HSM1NR5C03CA, mint: 1519660800000, maxt: 1519833600000, range: 48h0m0s>\n[mint: 1519833600000, maxt: 1520294400000, range: 128h0m0s, blocks: 2]: <ulid: 01CAG5EMFF64X73P7XMKYKXV1J, mint: 1519257600000, maxt: 1520467200000, range: 336h0m0s>, <ulid: 01CAG4GK26THDHENSJE8GMXJMY, mint: 1519833600000, maxt: 1520294400000, range: 128h0m0s>"
level=warn ts=2018-04-07T15:33:43.714914447Z caller=overlapped_blocks.go:33 msg="found overlapped blocks" group="0@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor_namespace=\"default\",replica=\"thanos-mon-0\"}" overlap="[mint: 1521561600000, maxt: 1521648000000, range: 24h0m0s, blocks: 2]: <ulid: 01CAFYKHE98Q4Q569EM39HE2QT, mint: 1521504000000, maxt: 1521676800000, range: 48h0m0s>, <ulid: 01CAFW35QQWW9A8GRXK44GACAV, mint: 1521561600000, maxt: 1521648000000, range: 24h0m0s>"
level=warn ts=2018-04-07T15:33:43.715017117Z caller=overlapped_blocks.go:33 msg="found overlapped blocks" group="0@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor_namespace=\"default\",replica=\"thanos-mon-1\"}" overlap="[mint: 1521561600000, maxt: 1521648000000, range: 24h0m0s, blocks: 2]: <ulid: 01CAFYM8XBB0DR677N1R24G9KB, mint: 1521504000000, maxt: 1521676800000, range: 48h0m0s>, <ulid: 01CAFW413DT9KTVXFTV1ZPC4V0, mint: 1521561600000, maxt: 1521648000000, range: 24h0m0s>"
level=warn ts=2018-04-07T15:33:43.715097146Z caller=overlapped_blocks.go:33 msg="found overlapped blocks" group="300000@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor=\"thanos-coremon\",replica=\"thanos-coremon-1\"}" overlap="[mint: 1519948800000, maxt: 1520121600000, range: 48h0m0s, blocks: 2]: <ulid: 01CAG4SVRYR3130MGTFQXHMWWE, mint: 1519833600000, maxt: 1520294400000, range: 128h0m0s>, <ulid: 01CAG0QR8SM3CFYA0AMBHNHB72, mint: 1519948800000, maxt: 1520121600000, range: 48h0m0s>\n[mint: 1520121600000, maxt: 1520294400000, range: 48h0m0s, blocks: 2]: <ulid: 01CAG4SVRYR3130MGTFQXHMWWE, mint: 1519833600000, maxt: 1520294400000, range: 128h0m0s>, <ulid: 01CAG0S9ZN754TZRAVNEPMNG8H, mint: 1520121600000, maxt: 1520294400000, range: 48h0m0s>"
level=info ts=2018-04-07T15:33:43.715189398Z caller=verify.go:67 msg="verify completed" issues=1 repair=false
level=info ts=2018-04-07T15:33:43.715887108Z caller=main.go:150 msg=exiting
I will definitely focus on finding the reason for the overlaps first, before the OOO investigation. The OOO might be a result of this.
I feel like my overlap is related to the previous issue. I repaired my state 98% of the way and assumed the other overlaps would be garbage collected. With https://github.com/improbable-eng/thanos/pull/274 we will never have this weird thing again – we will need to resolve "non-garbage-collectable" overlaps before continuing compaction.
Now I will write yet another repair tool to merge my 2 blocks together. It seems like one block has a 48h time range with a 24h gap, and the second block is exactly that missing 24h gap. I cannot tell, though, whether the reason this happened is some bug or my previously broken state. https://github.com/improbable-eng/thanos/pull/274 will help to deduce that in future runs.
I found potential cause for my overlaps. Fix: https://github.com/improbable-eng/thanos/pull/278
OK, I finally cleaned up all the overlaps, found the root causes, fixed them, and no longer have overlaps in my system (: After lots of compaction passes I still have OOO, so looking into that now.
OK, so after a long investigation, the out-of-order chunks in compacted blocks seem to be caused by the tsdb compaction merger: https://github.com/prometheus/tsdb/blob/master/compact.go#L796
Basically, my input blocks (the two that are merged) include series for the same time range. This is technically impossible in normal cases, so TSDB is not really at fault; the only missing bit is detecting this and returning a proper error – I can fix that upstream.
The question now is why I have non-overlapping blocks with overlapping series time ranges O.o
Basically:
BlockA:
"minTime": 1522224000000,
"maxTime": 1522252800000,
Block B:
"minTime": 1522252800000,
"maxTime": 1522281600000,
And both of them have series for range: {Min:1522252800000 Max:1522253534999} O.o
This is bizarre, but at the same time we had a really nasty bug with potential group mix-ups. (A group is a unique set of blocks with the same external labels and resolution; compaction should only be done within a group.) The bug fixed in https://github.com/improbable-eng/thanos/pull/278 was causing potential mixes: the directory cleanup was not working, so the compactor had access to ALL blocks (from all groups).
// Compare two chunks (chA, chB) covering the same time range t in
// different blocks, to check whether they are exact duplicates.
notDuplicate := false
fmt.Printf("DUPLICATE across blocks for range: %+v\n", t)

// First compare the raw chunk bytes.
bsA := chA.Chunk.Bytes()
bsB := chB.Chunk.Bytes()
if len(bsA) != len(bsB) {
	notDuplicate = true
	fmt.Printf("Mismatch in chunk byte length: %v vs %v\n", len(bsA), len(bsB))
} else {
	for i, b := range bsA {
		if b != bsB[i] {
			notDuplicate = true
			fmt.Printf("Mismatch in chunk bytes: %v vs %v [%d]\n", b, bsB[i], i)
			break
		}
	}
}

// Then compare sample by sample via the chunk iterators.
iA := chA.Chunk.Iterator()
iB := chB.Chunk.Iterator()
for {
	aok := iA.Next()
	bok := iB.Next()
	if !aok {
		if bok {
			notDuplicate = true
			fmt.Printf("A ends, B: %v\n", bok)
		}
		break
	}
	if !bok {
		notDuplicate = true
		fmt.Printf("B ends, A: %v\n", aok)
		break
	}
	t1, v1 := iA.At()
	t2, v2 := iB.At()
	if t1 != t2 {
		notDuplicate = true
		fmt.Printf("Timestamp mismatch: %v vs %v\n", t1, t2)
		break
	}
	// floatEquals: approximate float64 comparison, defined elsewhere.
	if !floatEquals(v1, v2) {
		notDuplicate = true
		fmt.Printf("Value mismatch: %v vs %v\n", v1, v2)
		break
	}
}
if notDuplicate {
	panic("STOP")
}
I checked all these weird series and they are 1:1 duplicates, so this seems to be either unrelated to https://github.com/improbable-eng/thanos/pull/278, or two different replicas gathered the same samples with the same timestamps, which is unlikely (or is it?).
OK, I repaired all these blocks and am running the compactor again.
With https://github.com/improbable-eng/thanos/pull/278 I can no longer repro this issue. @mattbostock, could you rerun the compactor from master and confirm this?
Preferably on clean storage, or at least a repaired one (I can guide you if you really don't want to remove your blocks, but it won't be easy – all sorts of weirdness could happen).
OK all good except this: https://github.com/improbable-eng/thanos/issues/281
Got yet another overlap. ):
Investigated, and I think we need to kill the sync-delay that introduces these (at least the sync delay for compacted blocks). Fix: https://github.com/improbable-eng/thanos/pull/282
Thanks @Bplotka, I'm still looking into this from my side and testing with 61e63b7536 from master.
For my issue, I'm starting to suspect it's due to a lack of bounds checking in the TSDB library for the maximum timestamp in a chunk, such that the block's metadata reports a lower maxTime than the one actually recorded in the last chunk in the block. I'll write a test case to confirm.
Yea, I see your point, but I think it is all good with the compaction on that side. The out-of-order chunks can easily happen when the compactor compacts overlapping blocks: it just takes the min time of the first block and the max time of the last one, so if the first block has a greater max time than the last, OOO is produced.
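A sketch of that failure mode, assuming the result block's range is simply taken from the first input's min time and the last input's max time (illustrative types, not the tsdb code). The overlapping pair from earlier in this thread triggers it – note the "compact blocks" log above reports maxt=1521648000000 for exactly these two inputs:

```go
package main

import "fmt"

type timeRange struct{ MinTime, MaxTime int64 }

// compactedRange mimics taking the min time of the first input and the
// max time of the last input - fine for sorted, non-overlapping blocks,
// wrong if an earlier block reaches past a later one.
func compactedRange(inputs []timeRange) timeRange {
	return timeRange{inputs[0].MinTime, inputs[len(inputs)-1].MaxTime}
}

func main() {
	// 01CAFYKHE98Q4Q569EM39HE2QT spans 48h; 01CAFW35QQWW9A8GRXK44GACAV sits inside it.
	inputs := []timeRange{
		{1521504000000, 1521676800000}, // first block's maxTime...
		{1521561600000, 1521648000000}, // ...is greater than the last block's
	}
	r := compactedRange(inputs)
	fmt.Println(r.MaxTime < inputs[0].MaxTime) // true: result claims an earlier maxTime than its own data
}
```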
Yea, I see your point, but I think it is all good with the compaction on that side. The out-of-order chunks can easily happen when the compactor compacts overlapping blocks.
Yes - I think the issue is happening for me because the chunk max time (as stored in the index) is higher than the max time on the block that owns the chunk, which causes the blocks to overlap.
It looks to be an issue in the TSDB library but I think we should keep this issue open in the meantime, since it impacts Thanos.
Cannot repro this issue anymore with current master.
So basically we have added an overlap check now, so these OOO blocks should never happen anymore. There are ways to repair that kind of block, or at least to remove only the broken chunks rather than the whole block. I can show you if you want.
I would rerun Thanos from master and double-check again. I am not aware of any issue in TSDB, or even any wrong result that would indicate one, but maybe I am just lucky. Can we close this issue and reopen it when we can repro something using Thanos with our fixes?
No repro so far, feel free to reopen it if needed.
@bwplotka @mattbostock We ran into this today with v.0.13.0, here's the log:
level=warn ts=2020-07-31T07:19:41.042728864Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01EEHF36CQTKBVRMB0319CBPGS to window 300000: found overlapping chunks within series 2981093. Chunks expected to be ordered by min time and non-overlapping, got: [{4420628494 <nil> 1596102736772 1596103156772} {4420628524 <nil> 1596102736772 1596103156772} {4420628554 <nil> 1596103216772 1596110356772} {4420628646 <nil> 1596110416772 1596117556772} {4420628723 <nil> 1596117616772 1596124756772} {4420628807 <nil> 1596124816772 1596131956772} {4420628875 <nil> 1596132016772 1596139156772} {4420628954 <nil> 1596139216772 1596146356772} {4420629017 <nil> 1596146416772 1596153556772}]"
level=info ts=2020-07-31T07:19:41.042848207Z caller=http.go:81 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01EEHF36CQTKBVRMB0319CBPGS to window 300000: found overlapping chunks within series 2981093. Chunks expected to be ordered by min time and non-overlapping, got: [{4420628494 <nil> 1596102736772 1596103156772} {4420628524 <nil> 1596102736772 1596103156772} {4420628554 <nil> 1596103216772 1596110356772} {4420628646 <nil> 1596110416772 1596117556772} {4420628723 <nil> 1596117616772 1596124756772} {4420628807 <nil> 1596124816772 1596131956772} {4420628875 <nil> 1596132016772 1596139156772} {4420628954 <nil> 1596139216772 1596146356772} {4420629017 <nil> 1596146416772 1596153556772}]"
level=info ts=2020-07-31T07:19:41.042870222Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01EEHF36CQTKBVRMB0319CBPGS to window 300000: found overlapping chunks within series 2981093. Chunks expected to be ordered by min time and non-overlapping, got: [{4420628494 <nil> 1596102736772 1596103156772} {4420628524 <nil> 1596102736772 1596103156772} {4420628554 <nil> 1596103216772 1596110356772} {4420628646 <nil> 1596110416772 1596117556772} {4420628723 <nil> 1596117616772 1596124756772} {4420628807 <nil> 1596124816772 1596131956772} {4420628875 <nil> 1596132016772 1596139156772} {4420628954 <nil> 1596139216772 1596146356772} {4420629017 <nil> 1596146416772 1596153556772}]"
level=error ts=2020-07-31T07:19:41.043089238Z caller=main.go:211 err="found overlapping chunks within series 2981093. Chunks expected to be ordered by min time and non-overlapping, got: [{4420628494 <nil> 1596102736772 1596103156772} {4420628524 <nil> 1596102736772 1596103156772} {4420628554 <nil> 1596103216772 1596110356772} {4420628646 <nil> 1596110416772 1596117556772} {4420628723 <nil> 1596117616772 1596124756772} {4420628807 <nil> 1596124816772 1596131956772} {4420628875 <nil> 1596132016772 1596139156772} {4420628954 <nil> 1596139216772 1596146356772} {4420629017 <nil> 1596146416772 1596153556772}]\ndownsample block 01EEHF36CQTKBVRMB0319CBPGS to window 300000\nmain.processDownsampling\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/downsample.go:274\nmain.downsampleBucket\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/downsample.go:210\nmain.runCompact.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:312\nmain.runCompact.func6.1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:365\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:72\nmain.runCompact.func6\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:364\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\ndownsampling to 5 min\nmain.downsampleBucket\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/downsample.go:212\nmain.runCompact.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:312\nmain.runCompact.func6.1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:365\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:72\nmain.runCompact.func6\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:364\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nfirst pass of downsampling failed\nmain.runCompact.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:313\nmain.runCompact.func6.1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:365\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:72\nmain.runCompact.func6\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:364\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nerror executing compaction\nmain.runCompact.func6.1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:392\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:72\nmain.runCompact.func6\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/compact.go:364\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\ncompact command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:211\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"
Could you please suggest a fix or workaround?
Thanks!
We've since upgraded to v0.14.0, will let you know if it reproduces again..
It still reproduces even on v0.14 (same issue as ^^^). Is there any diagnostic info I can provide to help with investigation?