Thanos, Prometheus and Golang version used
thanos, version 0.5.0 (branch: HEAD, revision: 72820b3f41794140403fd04d6da82299f2c16447)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5
What happened
thanos compactor crashes with ""write compaction: chunk 8 not found: reference sequence 0 out of range"
What you expected to happen
Should work fine :-)
How to reproduce it (as minimally and precisely as possible):
Not sure :-/
Full logs to relevant components
Out of the list of objects dumped along with the error message, I found one without any chunks:
$ gsutil ls -r gs://REDACTED/01DBZNNTM2557YW8T35RBM676P
gs://REDACTED/01DBZNNTM2557YW8T35RBM676P/:
gs://REDACTED/01DBZNNTM2557YW8T35RBM676P/index
gs://REDACTED/01DBZNNTM2557YW8T35RBM676P/meta.json
meta.json contents:
{
"version": 1,
"ulid": "01DBZNNTM2557YW8T35RBM676P",
"minTime": 1557993600000,
"maxTime": 1558022400000,
"stats": {
"numSamples": 35591778,
"numSeries": 39167,
"numChunks": 299446
},
"compaction": {
"level": 2,
"sources": [
"01DB04S079GCEBKMTWZBH8HQA3",
"01DB0BMQG3W7M12M8DE3V9QW5C",
"01DB0JGEQD5RCZ50JAS2NENHQ6",
"01DB0SC6071QW08JWQVG000AKF"
],
"parents": [
{
"ulid": "01DB04S079GCEBKMTWZBH8HQA3",
"minTime": 1557993600000,
"maxTime": 1558000800000
},
{
"ulid": "01DB0BMQG3W7M12M8DE3V9QW5C",
"minTime": 1558000800000,
"maxTime": 1558008000000
},
{
"ulid": "01DB0JGEQD5RCZ50JAS2NENHQ6",
"minTime": 1558008000000,
"maxTime": 1558015200000
},
{
"ulid": "01DB0SC6071QW08JWQVG000AKF",
"minTime": 1558015200000,
"maxTime": 1558022400000
}
]
},
"thanos": {
"labels": {
"environment": "devint",
"instance_number": "1",
"location": "REDACTED",
"prometheus": "monitoring/prometheus-operator-prometheus",
"prometheus_replica": "prometheus-prometheus-operator-prometheus-1",
"stack": "data"
},
"downsample": {
"resolution": 0
},
"source": "compactor"
}
}
Apparently this is an object created by the compactor.
We've been running the compactor for some time, but after a while (due to lack of local disk storage) it was crashing constantly. After an extended period of crashing, storage was added and the compactor was able to make progress until it encountered the problem described here.
I guess the important part is how such an object ended up in the bucket, although I wonder whether it would be possible for Thanos to ignore such objects and keep processing the rest of the data (exposing information about bad objects via metrics)?
I'd guess it was somehow created during the constant crashes we had earlier, but I have nothing to support that.
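Regarding detecting such blocks yourself: a minimal, hypothetical sketch (not Thanos code) of a pre-flight check that scans a local mirror of the bucket and flags blocks whose meta.json declares chunks but whose chunks/ directory is missing or empty:

```python
# Hypothetical helper, not part of Thanos: flag blocks that claim chunks
# in meta.json but have no chunk files on disk (e.g. in a local copy of
# the bucket made with `gsutil -m rsync -r`).
import json
import os

def find_bad_blocks(root):
    """Return ULIDs of block directories under `root` whose meta.json
    declares numChunks > 0 but whose chunks/ directory is absent or empty."""
    bad = []
    for ulid in sorted(os.listdir(root)):
        block_dir = os.path.join(root, ulid)
        meta_path = os.path.join(block_dir, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # not a block directory
        with open(meta_path) as f:
            meta = json.load(f)
        declared = meta.get("stats", {}).get("numChunks", 0)
        chunks_dir = os.path.join(block_dir, "chunks")
        on_disk = os.listdir(chunks_dir) if os.path.isdir(chunks_dir) else []
        if declared > 0 and not on_disk:
            bad.append(ulid)
    return bad
```

Run over a synced copy of the bucket, this would surface blocks like 01DBZNNTM2557YW8T35RBM676P before the compactor trips over them.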
Anything else we need to know
We've been running 0.5, and 0.4 before that. I'm not sure, but it is possible that 0.3.2 (compactor) was used at the beginning.
Sounds like Thanos Compact crashed midway through the upload and you got inconsistent data. Nowadays Thanos Compact removes old, malformed data that is older than the maximum of the consistency delay and the minimum age for removal (30 minutes). Does that not work for you for some reason, @bjakubski?
Also, Thanos Compact now has metrics about failed compactions (in 0.6.0 which is soon to be released) so you can use that :P
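For reference, a sketch of alerting rules on those metrics; the metric names (`thanos_compact_group_compactions_failures_total`, `thanos_compact_halted`) are taken from recent Thanos versions and may differ in yours, so verify against your deployment's /metrics endpoint:

```yaml
groups:
  - name: thanos-compact
    rules:
      - alert: ThanosCompactFailures
        # Counter of failed compactions per compaction group.
        expr: rate(thanos_compact_group_compactions_failures_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
      - alert: ThanosCompactHalted
        # Gauge set to 1 when the compactor halts on an unrecoverable error.
        expr: thanos_compact_halted == 1
        labels:
          severity: critical
```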
I've just encountered the same problem on another object in the bucket, but this time it was created by the sidecar. I do not see any message in the logs about the sidecar ever uploading it; the block itself is older (> a week old), yet according to GCP storage it was created just yesterday. I see these messages in the Prometheus logs:
2019-06-29 "write block"
2019-07-02 "found healthy block" (probably after prom restart)
2019-07-10 21:31 "found healthy block) (after prom restart)
The object in the bucket seems to have been created at 2019-07-10 21:25, which is just after one of the Thanos sidecars successfully connected to Prometheus (after a pod restart). In the logs I can see it uploading various objects, but nothing about this particular ULID.
Now, I'm running an unusual setup with thanos-sidecar: Prometheus compaction is disabled, but retention is quite long (15d IIRC). It generally seems to work fine, although I wonder if it might cause problems like the one I have here.
What is the consistency delay? Should I have expected the compactor to ignore/remove this object from the bucket?
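As a side note, Thanos ships a bucket verification subcommand that can spot index/chunk inconsistencies like this; the example below is a sketch, since the exact subcommand name, issue names, and flags vary between Thanos versions:

```
$ thanos bucket verify \
    --objstore.config-file=bucket.yml \
    --issues index_issue \
    --issues overlapped_blocks
```

In newer releases it lives under `thanos tools bucket verify` and can optionally attempt repairs; check `--help` for your version before relying on it.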
I'm also hitting the same issue; the error log is write compaction: chunk 8 not found: reference sequence 0 out of range. The compactor pod keeps crashing and restarting. Below is the pod information. I don't know how to fix this; does anyone know what the issue is?
prometheus-operator-57fc4686dc-gw8st 1/1 Running 2 17d
thanos-compactor-0 1/1 Running 162 24h
thanos-query-596fd87fcc-2tmx8 1/1 Running 0 24h
thanos-store-0 1/1 Running 0 24h
I'm seeing the same error for blocks uploaded with thanos-compactor 0.6.0 (and then processed by 0.6.0) myself. Backend storage is Ceph cluster via Swift API.
thanos, version 0.6.0 (branch: HEAD, revision: c70b80eb83e52f5013ed5ffab72464989c132883)
build user: root@fd15c8276e5c
build date: 20190718-11:11:58
go version: go1.12.5
thanos-compactor has uploaded a compacted block yesterday:
Jul 29 14:03:47 thanos01 thanos[32452]: level=info ts=2019-07-29T14:03:47.456113025Z caller=compact.go:444 msg="compact blocks" count=4 mint=1561622400000 maxt=1561651200000 ulid=01DGZ0PDS2MFX25P9CKG1C1TA3 sources="[01DEC9F6B0BVQHFG7BZDS75N0V 01DECGAXK0CM9HC2QPZ7MMH8M0 01DECQ6MV0K0FR00HHY9906178 01DECY2C2ZX4F6N9GKHKM2PC61]" duration=9.524377767s
Jul 29 14:03:47 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:47.752332697Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: file already closed"
Jul 29 14:03:55 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:55.115560765Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:56.472928348Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:56.571187127Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index.cache.json: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/index.cache.json: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=warn ts=2019-07-29T14:03:56.72380147Z caller=runutil.go:108 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="detected close error" err="close file /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: close /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json: file already closed"
Jul 29 14:03:56 thanos01 thanos[32452]: level=debug ts=2019-07-29T14:03:56.723861375Z caller=compact.go:906 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="uploaded block" result_block=01DGZ0PDS2MFX25P9CKG1C1TA3 duration=8.994729984s
and now it's choking on that block:
Jul 30 11:25:38 thanos01 thanos[22465]: level=debug ts=2019-07-30T11:25:38.310116816Z caller=compact.go:252 msg="download meta" block=01DGZ0PDS2MFX25P9CKG1C1TA3
Jul 30 11:25:47 thanos01 thanos[22465]: level=debug ts=2019-07-30T11:25:47.039193509Z caller=compact.go:840 compactionGroup="0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}" msg="downloaded and verified blocks" blocks="[/var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DEBE09AZXVCNWKYCWP43A7G4 /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3]" duration=7.277184353s
Jul 30 11:25:47 thanos01 thanos[22465]: level=error ts=2019-07-30T11:25:47.214649027Z caller=main.go:199 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}: compact blocks [/var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DEBE09AZXVCNWKYCWP43A7G4 /var/lib/thanos/compactor/compact/0@{datacenter=\"p24\",environment=\"internal\",replica=\"02\"}/01DGZ0PDS2MFX25P9CKG1C1TA3]: write compaction: chunk 8 not found: reference sequence 0 out of range"
First, I'd expect it to survive broken blocks, but what's more concerning is that the block has been uploaded successfully before (unless those warning messages are not there just for show, and there is indeed something wrong).
What's uploaded:
$ swift list prometheus-storage | grep 01DGZ0PDS2MFX25P9CKG1C1TA3
01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001
01DGZ0PDS2MFX25P9CKG1C1TA3/index
01DGZ0PDS2MFX25P9CKG1C1TA3/index.cache.json
01DGZ0PDS2MFX25P9CKG1C1TA3/meta.json
debug/metas/01DGZ0PDS2MFX25P9CKG1C1TA3.json
and meta.json:
{
"ulid": "01DGZ0PDS2MFX25P9CKG1C1TA3",
"minTime": 1561622400000,
"maxTime": 1561651200000,
"stats": {
"numSamples": 271799367,
"numSeries": 141915,
"numChunks": 2265057
},
"compaction": {
"level": 2,
"sources": [
"01DEC9F6B0BVQHFG7BZDS75N0V",
"01DECGAXK0CM9HC2QPZ7MMH8M0",
"01DECQ6MV0K0FR00HHY9906178",
"01DECY2C2ZX4F6N9GKHKM2PC61"
],
"parents": [
{
"ulid": "01DEC9F6B0BVQHFG7BZDS75N0V",
"minTime": 1561622400000,
"maxTime": 1561629600000
},
{
"ulid": "01DECGAXK0CM9HC2QPZ7MMH8M0",
"minTime": 1561629600000,
"maxTime": 1561636800000
},
{
"ulid": "01DECQ6MV0K0FR00HHY9906178",
"minTime": 1561636800000,
"maxTime": 1561644000000
},
{
"ulid": "01DECY2C2ZX4F6N9GKHKM2PC61",
"minTime": 1561644000000,
"maxTime": 1561651200000
}
]
},
"version": 1,
"thanos": {
"labels": {
"datacenter": "p24",
"environment": "internal",
"replica": "02"
},
"downsample": {
"resolution": 0
},
"source": "compactor"
}
}
This also breaks when running with --wait.
I was expecting it to fail, but not quit... but it did crashloop with that error.
After some time I deleted the pod and it seems to be running fine now... maybe due to that 30m delay mentioned above?
OK that was quick, just crashlooped again
Are you hitting this maybe? https://github.com/thanos-io/thanos/issues/1331
Essentially partial upload and empty or missing chunk files in <block ulid>/chunks folder?
You have 01DGZ0PDS2MFX25P9CKG1C1TA3/chunks/000001, but it might be partially uploaded. I think we should focus on that issue then.
Thanks all for reporting; let's investigate this upload path with the minio client. Essentially this might come down to the assumption discussed here: https://github.com/thanos-io/thanos/issues/1331#issuecomment-519855917
Let's continue on one of the issues. This looks like a duplicate of https://github.com/thanos-io/thanos/issues/1331
EDIT: Actually this one was first but I think it does not matter much - let's continue on one (: Hope this is ok with you guys!
Are you hitting this maybe? #1331
not sure, I'm using GCS... that issue seems to be specific to S3...
hm, maybe we are wrong and GCS is affected as well. What about the other question, about the chunks folder?
sorry, didn't see that question...
So, the log says:
level=error ts=2019-08-09T14:09:55.107857264Z caller=main.go:199 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{server=\"central\"}: compact blocks [/var/thanos/compact/compact/0@{server=\"central\"}/01DB9JXXFE1HE1CYQM171PRMFK /var/thanos/compact/compact/0@{server=\"central\"}/01DBFCAMBRRMRT0SJAT516AVF4 /var/thanos/compact/compact/0@{server=\"central\"}/01DBN5QC3G1ZJXQZJ6MXZNPMV2 /var/thanos/compact/compact/0@{server=\"central\"}/01DBTZ3WZY9VR0T3MDZRTP5Z95 /var/thanos/compact/compact/0@{server=\"central\"}/01DC0RGG6EK3BMK96YVRBDQT6E]: write compaction: chunk 8 not found: reference sequence 0 out of range"
gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/:
217104068 2019-06-24T17:25:14Z gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/index
1650 2019-06-24T17:25:14Z gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/meta.json
gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/chunks/:
536861830 2019-06-24T17:24:35Z gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/chunks/000001
536862360 2019-06-24T17:24:47Z gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/chunks/000002
536867046 2019-06-24T17:25:00Z gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/chunks/000003
404281529 2019-06-24T17:25:09Z gs://REDACTED/01DB9JXXFE1HE1CYQM171PRMFK/chunks/000004
TOTAL: 6 objects, 2231978483 bytes (2.08 GiB)
gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/:
239464138 2019-06-24T17:26:09Z gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/index
1650 2019-06-24T17:26:09Z gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/meta.json
gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/chunks/:
536868373 2019-06-24T17:25:26Z gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/chunks/000001
536863290 2019-06-24T17:25:39Z gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/chunks/000002
536870423 2019-06-24T17:25:52Z gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/chunks/000003
498277142 2019-06-24T17:26:03Z gs://REDACTED/01DBFCAMBRRMRT0SJAT516AVF4/chunks/000004
TOTAL: 6 objects, 2348345016 bytes (2.19 GiB)
gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/:
264888644 2019-06-24T17:27:05Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/index
1650 2019-06-24T17:27:06Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/meta.json
gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/chunks/:
536870054 2019-06-24T17:26:21Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/chunks/000001
536866656 2019-06-24T17:26:34Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/chunks/000002
536869259 2019-06-24T17:26:47Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/chunks/000003
536843539 2019-06-24T17:26:59Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/chunks/000004
23772607 2019-06-24T17:27:00Z gs://REDACTED/01DBN5QC3G1ZJXQZJ6MXZNPMV2/chunks/000005
TOTAL: 7 objects, 2436112409 bytes (2.27 GiB)
gs://REDACTED/01DBTZ3WZY9VR0T3MDZRTP5Z95/:
232606362 2019-06-24T17:27:59Z gs://REDACTED/01DBTZ3WZY9VR0T3MDZRTP5Z95/index
1650 2019-06-24T17:27:59Z gs://REDACTED/01DBTZ3WZY9VR0T3MDZRTP5Z95/meta.json
TOTAL: 2 objects, 232608012 bytes (221.83 MiB)
gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/:
235607510 2019-06-24T17:29:12Z gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/index
1650 2019-06-24T17:29:15Z gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/meta.json
gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/chunks/:
536856293 2019-06-24T17:28:15Z gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/chunks/000001
536849008 2019-06-24T17:28:30Z gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/chunks/000002
536865860 2019-06-24T17:28:43Z gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/chunks/000003
532574433 2019-06-24T17:28:55Z gs://REDACTED/01DC0RGG6EK3BMK96YVRBDQT6E/chunks/000004
TOTAL: 6 objects, 2378754754 bytes (2.22 GiB)
It seems the problem is 01DBTZ3WZY9VR0T3MDZRTP5Z95, which has only an index and meta.json but no chunks? Should I just delete the entire folder?
Is there anything I can do as a workaround so compact works again?
You should delete the blocks which have duplicated data and only leave one copy. It's up to you to decide which one it is (: It sounds like you need to delete the one you've mentioned but please double check.
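For GCS, removing a block means deleting its whole directory prefix; double-check the ULID first, since deletion is irreversible (the example uses the block suspected above):

```
$ gsutil ls gs://REDACTED/01DBTZ3WZY9VR0T3MDZRTP5Z95   # inspect before deleting
$ gsutil -m rm -r gs://REDACTED/01DBTZ3WZY9VR0T3MDZRTP5Z95
```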
yeah, did that and it seems to be running now
Hi, I am having this issue with Thanos 0.8.1. I have tried moving the directories for the blocks it complains about out of the bucket, but then every time I run the compactor it just finds some more to be sad about :-(
This is crashing the compactor with 0.8.1, even with the --wait flag.
Any help to further debug this would be appreciated! (@bwplotka maybe?)
Hi, we started some doc where we discussed another potential cause of this, which is the false assumption of partial block upload: https://docs.google.com/document/d/1QvHt9NXRvmdzy51s4_G21UW00QwfL9BfqBDegcqNvg0/edit#
Reopening, as this still occurs. The doc mentioned above describes a potential (though quite unlikely) issue of the compactor removing a block while it is still being uploaded for some reason (long upload / old block upload). We are working on a fix for that particular case, but more information on why the chunk file is missing from the block would be useful.
To complement the last comment:
thanosio/thanos:v0.9.0
Compactor logs:
level=warn ts=2020-01-03T10:30:29.887625771Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="error executing compaction: compaction failed: compaction failed for group 0@3883330991078906199: gather index issues for block /thanos-compact-data/compact/0@3883330991078906199/01DXKWN08YHZQMCHHJ3VQVVDSP: open index file: try lock file: open /thanos-compact-data/compact/0@3883330991078906199/01DXKWN08YHZQMCHHJ3VQVVDSP/index: no such file or directory"
level=info ts=2020-01-03T10:30:29.888132722Z caller=http.go:78 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: compaction failed: compaction failed for group 0@3883330991078906199: gather index issues for block /thanos-compact-data/compact/0@3883330991078906199/01DXKWN08YHZQMCHHJ3VQVVDSP: open index file: try lock file: open /thanos-compact-data/compact/0@3883330991078906199/01DXKWN08YHZQMCHHJ3VQVVDSP/index: no such file or directory"
Meanwhile in the bucket (meta.json is indeed the only file for this block):
{
"ulid": "01DXKWN08YHZQMCHHJ3VQVVDSP",
"minTime": 1577952000000,
"maxTime": 1577980800000,
"stats": {
"numSamples": 954058569,
"numSeries": 906289,
"numChunks": 8144375
},
"compaction": {
"level": 2,
"sources": [
"01DXJYK0WNNX8WKK3KXSK2JR6S",
"01DXK5ER4N46ST170D3D8JX1FS",
"01DXKCAFCYHS2Z94XP5QYBFJHA",
"01DXKK66MG06PDAQAQD1YNS8AX"
],
"parents": [
{
"ulid": "01DXJYK0WNNX8WKK3KXSK2JR6S",
"minTime": 1577952000000,
"maxTime": 1577959200000
},
{
"ulid": "01DXK5ER4N46ST170D3D8JX1FS",
"minTime": 1577959200000,
"maxTime": 1577966400000
},
{
"ulid": "01DXKCAFCYHS2Z94XP5QYBFJHA",
"minTime": 1577966400000,
"maxTime": 1577973600000
},
{
"ulid": "01DXKK66MG06PDAQAQD1YNS8AX",
"minTime": 1577973600000,
"maxTime": 1577980800000
}
]
},
"version": 1,
"thanos": {
"labels": {
....
},
"downsample": {
"resolution": 0
},
"source": "compactor"
}
}
So it seems the compactor itself failed to finish uploading this block, rather than crashing on a block produced elsewhere.
We merged a fix for https://github.com/thanos-io/thanos/issues/1919, which could potentially cause such a chunk file to be missing.