Thanos: store (s3, gcs): invalid memory address or nil pointer dereference

Created on 14 May 2018  ·  61 Comments  ·  Source: thanos-io/thanos

Since I migrated the data from a 3-node Prometheus cluster to S3 buckets with the Thanos sidecar and ran the Thanos store node against that data, a strange issue has occurred:
panic: runtime error: invalid memory address or nil pointer dereference
after a query for a long period of time.

Since Prometheus was initially using the default min/max-block-duration options, the data I migrated was already compacted (mostly to level 5), so in order to migrate it I manually changed the meta.json files to:

    "compaction": {
        "level": 1,

This migrated the data successfully, but may be part of the issue.

When I executed this query:

sum(node_memory_MemTotal{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_MemFree{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Buffers{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"}) - sum(node_memory_Cached{job="node-exporter", failure_domain_beta_kubernetes_io_zone="eu-west-1a", kops_k8s_io_instancegroup=~".*"})

For a period of 8w, I got this error and the thanos store pod got restarted:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb6e912]

goroutine 921 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc42e90e040, 0xc45b2f00c0)
<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c900, 0xc4816f2d70)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc47d64c960, 0xc45b2f00c0)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf137a0, 0xc47d64c960, 0x0, 0x4, 0x4, 0xf137a0, 0xc47d64c960)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc42002ed90, 0xf12ce0, 0xc425b6a300, 0x9cb7c84a295c6301, 0xe320b7d6feb78d94, 0xc4203c9440, 0xc44b3faa10, 0xc42008c5a0, 0xc425b6a280, 0x4, ...)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:482 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x432e88)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:667 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42008d200, 0xc44b3faa80, 0xc4237c6850)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

After a second execution, however, the same query returned a satisfying result without any further panics.

I managed to find this pattern (however strange it may seem).
When I start with a graph over a small period of time and use the "+" (Grow the time range) button to rapidly extend the time period, I get the nil pointer even for periods as small as 1d.
(screenshot from 2018-05-14 13:47 omitted)

If I run the query directly for a large period of time (12w or more), I receive a prompt reply without an error.

Labels: bug, hard, help wanted

Most helpful comment

It happens here every hour or two, but no one wants to upload a core dump that is around 200 GB (at least in our case, since we have a bucket which is >1 TB). And I haven't had enough time to dig in and research the warts of Go's memory-handling code myself.
FWIW, sysctl vm.overcommit_memory=2, which disables overcommit, immediately leads to Thanos Store getting killed by the OOM killer here, since there is no memory left available. The default value is 0, where the Linux kernel uses a heuristic to check whether there is enough memory available to fulfill an allocation request. And because the nil pointers appear in places where they shouldn't, it leads me to think that the allocations fail because the kernel decided there wasn't enough RAM.
Also, if you haven't noticed, Go's specification unfortunately doesn't say anything precise about what happens when make() or append() fails to allocate a new object on the heap... so who knows? :( Plus, in the Thanos code we only check for errors which are represented by the error interface.
Lastly, https://github.com/golang/go/issues/16843 has been open for almost 3 years by now and, funnily enough, even some Prometheus developers have weighed in on it. Which just leads me to think that the language itself doesn't provide a way to control or check this.
And this is why https://github.com/improbable-eng/thanos/pull/798 was born, since in our case random users sometimes sent queries which overloaded the servers.

All 61 comments

A new development in my case is the following:
I ran the compactor on the data in S3 and after many hours of processing it started to fail with:

level=info ts=2018-05-14T16:27:14.402143577Z caller=downsample.go:227 msg="downloaded block" id=01CAPH4101J6BBTD4XRZYWBGQ3 duration=1m22.822874423s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x7b8ba0]

goroutine 82 [running]:
github.com/improbable-eng/thanos/pkg/block.GatherIndexIssueStats(0xc420730550, 0x41, 0x1628a173700, 0x162acda2400, 0xa1ffd, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/block/index.go:264 +0x530
github.com/improbable-eng/thanos/pkg/block.VerifyIndex(0xc420730550, 0x41, 0x1628a173700, 0x162acda2400, 0x41, 0x0)
    /go/src/github.com/improbable-eng/thanos/pkg/block/index.go:155 +0x67
main.processDownsampling(0xf12ce0, 0xc420041c80, 0xf06800, 0xc4202ebe60, 0xf187a0, 0xc4200419c0, 0xc42023f560, 0xc42016e8c0, 0x20, 0x493e0, ...)
    /go/src/github.com/improbable-eng/thanos/cmd/thanos/downsample.go:229 +0x3ed
main.downsampleBucket(0xf12ce0, 0xc420041c80, 0xf06800, 0xc4202ebe60, 0xf187a0, 0xc4200419c0, 0xc42016e8c0, 0x20, 0x0, 0x0)
    /go/src/github.com/improbable-eng/thanos/cmd/thanos/downsample.go:190 +0x658
main.runCompact.func1(0x0, 0x0)
    /go/src/github.com/improbable-eng/thanos/cmd/thanos/compact.go:190 +0x66f
main.runCompact.func2(0x0, 0x0)
    /go/src/github.com/improbable-eng/thanos/cmd/thanos/compact.go:206 +0x184
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42008dd40, 0xc42008d680, 0xc42032a980)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I deleted the block in question from S3, and the process continued and eventually finished successfully.

However, after the compactor finished, I tried a particularly heavy dashboard again and it failed with the same error.

Finally, I was able to run the dashboard after 4 restarts (with the same error) and 4 refreshes. I guess there is some kind of cache that helped me display it.

Thanks for these details! This sounds like we are hitting S3 not being strongly consistent. We (store or compactor) think that the block is uploaded, but actually it is only partially visible in the bucket (e.g. a partial index file). That's my theory so far, not confirmed yet. I did not experience any of this on GCS.

Hey @Bplotka. I work with @mihailgmihaylov and I'm in the loop on our Thanos experiments. I just wanted to mention the following regarding S3 consistency, as per the official AWS documentation:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.

Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions.

Does the Thanos sidecar do overwrites, or does it always create new objects? How about the compactor?

So yea, I was assessing it some time ago: https://github.com/improbable-eng/thanos/issues/298

Basically, all our objects are 100% immutable. Even the compactor, when compacting/downsampling etc., adds new blocks and removes old ones.

Oops, I just found out the shipper (the sidecar logic for uploading blocks) actually does a check (Stat) on the meta file (i.e. <block ulid>/meta.json)

    // Check against bucket if the meta file for this block exists.
    ok, err := s.bucket.Exists(ctx, path.Join(meta.ULID.String(), block.MetaFilename))
    if err != nil {
        return errors.Wrap(err, "check exists")
    }
    if ok {
        return nil
    }

...before upload. And that puts us into the caveat territory, I guess :/

This might mean that the meta.json upload is not consistent. It does not immediately explain the index reader nil pointers, but we might indeed be in a weird situation. Let me think about how to mitigate this and/or whether there is anything else.

We can replace this exists statement with the iter logic easily.

But my question related to caveat is following:
Let's say you GET (existing or not) path /a/b.
Now we know that uploading to /a/b is eventually consistent (the caveat).

But what about /a/c/? Is it strongly consistent? (We have kind of hit the /a/ prefix.) There are no details about it, but my guess (and hope) is that it is strongly consistent.

My guess would be that it's consistent, as the full path (actually the object's key) is different: "/a/c" != "/a/b". S3 is object storage, which is basically a distributed key-value store. Keys are what matter. The prefix logic is, I believe, just a convenience for bulk operations on prefixes (list, delete, etc.). Listing the prefix might show stale data (as the docs mention), but reading a new key for the first time should return the result once you've made a successful PUT.

Ok, will try to fix this finding for now.

Another store segfault, reported on Slack by @clemence

level=debug ts=2018-05-28T10:40:04.199789228Z caller=cluster.go:117 msg="resolved peers to following addresses" peers=10.52.11.11:10900,10.52.11.12:10900,10.52.6.14:10900,10.52.7.11:10900
level=debug ts=2018-05-28T10:40:04.207077257Z caller=store.go:128 msg="initializing bucket store"
level=debug ts=2018-05-28T10:40:19.432682781Z caller=store.go:132 msg="bucket store ready" init_duration=15.225599703s
level=info ts=2018-05-28T10:40:19.433013963Z caller=store.go:204 msg="starting store node"
level=debug ts=2018-05-28T10:40:19.438015463Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CEK5AS3EKA0G3G67EME2D8W0 addr=10.52.9.9:10900
level=debug ts=2018-05-28T10:40:19.475459658Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE8G45S8EAN2A0F0YEVNKJAW addr=10.52.11.11:10900
level=debug ts=2018-05-28T10:40:19.475561051Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE90D11C6YBECS8008FKEE9T addr=10.52.6.14:10900
level=debug ts=2018-05-28T10:40:19.475594962Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE92HQGXR52K4BGN0F64RW7E addr=10.52.7.11:10900
level=debug ts=2018-05-28T10:40:19.475621043Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CE94GKGBZ5FAGG6HTQW1DAQ9 addr=10.52.11.12:10900
level=debug ts=2018-05-28T10:40:19.482025711Z caller=cluster.go:201 component=cluster msg="joined cluster" peers=5
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6ea82]

goroutine 423 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc4979b0280, 0xa34b12, 0xc47dd98560)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc4676fe000, 0xa34b12, 0x1)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc4676fe000, 0xa340b5, 0xc47fc0c500)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:322 +0x6a
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc4676fe030, 0xa340b5, 0xc42d65d460)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc4676fe030, 0xc49c683f40)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf13840, 0xc4676fe030, 0x0, 0x3, 0x3, 0xf13840, 0xc4676fe030)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200c6ee0, 0xf12d80, 0xc49974da80, 0xecbc90e98ba56301, 0xfbfcecbde10d4632, 0xc420470c90, 0xc420161810, 0xc491fdb740, 0xc42058de30, 0x3, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:482 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0xc4778bf7b0, 0x432e88)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:667 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc491fdb800, 0xc420161880, 0xc4902df800)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

Looks really similar.

Additional info: The above exception happened on GCS.

Hello,

I've just hit the same issue (I think), my env is:

  • S3 (DO Spaces)
  • image: improbable/thanos:master-2018-06-15-5b66b72
  • This currently keeps reoccurring after each store node restart: the node hangs for a while, Grafana attempts to query it, and then it crashes. The short-term metrics (the 24h retention that Prometheus provides) seem unaffected.
  • If you need any further details to find the cause, please advise and I'll try to provide them.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb65d02]
goroutine 13638 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc443c9eb00, 0x5b159, 0xc427aef7c0)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc44ed9a090, 0x5b159, 0x5b159)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc44ed9a150, 0x5b159, 0x5b101)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*removedPostings).Seek(0xc44ed9a180, 0x5b159, 0x1)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:481 +0x82
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc43b2e9c20, 0x5b159, 0x5b0fa)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc43b2e9c50, 0x5b159, 0xc42b8bba10)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc43b2e9c50, 0xc4462edb00)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf13d60, 0xc43b2e9c50, 0x0, 0x4, 0x4, 0xf13d60, 0xc43b2e9c50)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200b4880, 0xf132a0, 0xc43315cc80, 0xe7c6c5ca9b3c6401, 0x7f9007d1d4d97ad9, 0xc4223baa20, 0xc433171260, 0xc43312d680, 0xc43315c7c0, 0x4, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:713 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc43312dec0, 0xc4331712d0, 0xc4250714f0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

This always seems to be preceded by messages complaining about missing blocks:

level=warn ts=2018-06-28T10:37:05.317699323Z caller=bucket.go:240 msg="loading block failed" id=01CH0W4A7XW4FTQZYCMN59GYXE err="new bucket block: load meta: download meta.json: get file: The specified key does not exist."
level=warn ts=2018-06-28T10:37:06.251591622Z caller=bucket.go:240 msg="loading block failed" id=01CGW52BEK1X6Q0KS4KN7EAT4M err="new bucket block: load index cache: download index file: get file: The specified key does not exist."
level=warn ts=2018-06-28T10:37:06.283778119Z caller=bucket.go:240 msg="loading block failed" id=01CGXE8QTDMC1BYWBC35PVBJM7 err="new bucket block: load index cache: download index file: get file: The specified key does not exist."

Interesting. This might be really connected to the issue fixed here: https://github.com/improbable-eng/thanos/pull/389

We forgot to properly check errors on upload of the chunks and index file, so blocks were sometimes corrupted without notice (for example on a network blip). This might be it. ):

We can follow up on the store's handling of such blocks, but we definitely need better handling of corrupted, partial uploads. Related issue: https://github.com/improbable-eng/thanos/issues/377

IMO, to be unblocked here, please remove the broken blocks with missing bits and run the newest code on the sidecars. I am looking now at how to ensure the store gateway handles these gracefully.

That would explain why it started failing a while after I upgraded from rc1 to thanos:master-2018-06-15-5b66b72.

I will attempt to repair it later today. Thanks for the hint.

Why? The issue was there all the time, until today's fix ):

It seems to be resolved for now, but I had to delete a few blocks from the store. I've upgraded all deployments to the newest master tag as you suggested. If it happens again I'll let you know. (This project is great, btw; it works much better than using the remote storage adapter.)

awesome! Closing this for now, feel free to ping me here if you can repro

How can one identify these "broken blocks", and what signs indicate that a block is broken and should be deleted?

We saw this today and based on the previous messages, our search turned up this:

  textPayload: "level=warn ts=2018-07-02T13:00:31.658807035Z caller=bucket.go:240 msg="loading block failed" id=01CHDH85H26TSTR7M0R260TWB8 err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"

I assume 01CHDH85H26TSTR7M0R260TWB8 is the block, but since it seems like the meta.json file can't be found, what do I delete (or do to resolve the issue)?


update: apparently that meta.json does exist. Not sure why Thanos is unable to find it, since GCS has strong consistency guarantees?


Delete the whole 01CHDH85H26TSTR7M0R260TWB8 directory. And use the newest master-<date>-<sha>.
I am going to release a new RC this week.

update: apparently that meta.json does exist. Not sure why Thanos is unable to find it, since GCS has strong consistency guarantees?

Good question. Not sure. In fact, I can repro this panic only at the very beginning, right after a restart of the store gateway, when I query immediately from thanos-query. I don't think it is because of a partial upload, so I'm reopening for further investigation. After a restart it works perfectly for the same query.

I think there is some race condition between store gateway being ready and thanos query using this store.

To follow up on @tonglil's comment: we deployed master-2018-07-03-3c8546c

and now the store pod is being killed on every query with this trace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6c322]

goroutine 493927 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc48e1b2e40, 0x10b8, 0xc432857b80)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc43e6d4b70, 0x10b8, 0xc44e5016d0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc43e6d4b70, 0xc44e6dda30)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf20160, 0xc43e6d4b70, 0x0, 0x2, 0x2, 0xf20160, 0xc43e6d4b70)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200ad000, 0xf1f6a0, 0xc44f586900, 0x977149f410526401, 0x56fa74fc5c7c241c, 0xc42f2c90b0, 0xc44f591dc0, 0xc44f594c60, 0xc46c64c1a0, 0x2, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x771df016f201eab0, 0x49a8e8177625d113)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc44f652e40, 0xc44f591e30, 0xc491061810)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

We did not delete the directory yet, but instead ran the bucket tool with the repair flag. Hard to tell from the output whether it actually did anything.

I'm going to remove the "bad" directory today and see if that helps.

To give you a bit more context, we're running our Prometheus with 24h retention and most of our Grafana queries hit the store app.

~Removing~ Renaming the directory that we suspected contained corrupted data did not change the behavior.

I finally removed the directory, and it helped: the store app stopped crashing. My bad that I renamed it instead of deleting it in the first place; I did not want to lose the data in case it turned out that it was not the problem, or that another block was at fault.

Thanks for providing a very useful tool!

I've been hitting this issue time and time again. I can reliably reproduce it with a new instance and S3 (DO Spaces) as the storage backend. It takes a few days for the issue to appear, but this time there is no error message about a broken block. It just crashes when a query for a time frame outside of Prometheus's 24h retention hits the store.

Exception:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6c322]

goroutine 8964 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc42224aac0, 0x24eaa5, 0xc4214137c0)
        <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc4220936b0, 0x24eaa5, 0x24eaa5)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc422093710, 0x24eaa5, 0x24ea01)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*removedPostings).Seek(0xc422093740, 0x24eaa5, 0xc42c9d6f01)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:481 +0x82
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc4220937a0, 0x24eaa5, 0x24e83e)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc4220937d0, 0x24eaa5, 0xc4276e9720)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc4220937d0, 0xc42224ae00)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf21340, 0xc4220937d0, 0x0, 0x4, 0x4, 0xf21340, 0xc4220937d0)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200a6780, 0xf20880, 0xc423abfb00, 0xc6a31ae1c76f6501, 0xdd7bea72a48cbf49, 0xc420365290, 0xc426f81340, 0xc4238feb40, 0xc423abf5c0, 0x4, ...)
        /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x0)
        /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc4238ff2c0, 0xc426f813b0, 0xc42311c950)
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
        /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

Are you running the latest version?

The latest tagged release, v0.1.0-rc.2

@davidhiendl you might want to update to the latest master. There were fixes pushed to it since RC2 was released.

If this is a dup of #499, then this happened with a recent build from master. details in #475

Yes, it is still ongoing (that's why it is not closed). But it seems that this is related to too small a chunk-pool size (or index-cache size).

Workaround:

  • increase the sizes of those

We need to handle running out of memory gracefully though; I will look into it.

Hello, I had the same problem when I queried the last 7 days in one go.
After that I queried 24 hours with no error, then 25, 26, 27, ... up to 168h, still with no error.
I then queried the last hour and then the last 7 days, and there was no problem.

If that can help.

/usr/local/bin/thanos --version
thanos, version 0.1.0-rc.2 (branch: master, revision: 82a62118dd0c5ae629aa67659fe19f25fb2e37dd)
  build user:       root@ip-172-31-13-160
  build date:       20180906-16:52:37
  go version:       go1.10.1
prometheus --version
prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e29e815795e9dd14742ee7725682fa14b7b)
  build user:       root@5258e0bd9cc1
  build date:       20180712-14:02:52
  go version:       go1.10.3

Cloud storage : s3

prometheus configuration

--storage.tsdb.min-block-duration=1h \
  --storage.tsdb.max-block-duration=1h \
  --storage.tsdb.retention=24h

@Labbs this is wrong:

--storage.tsdb.min-block-duration=1h \ --storage.tsdb.max-block-duration=1h \

You should never change those two; they should stay at 2h.

FYI: this seems to happen when the store does not have enough memory to answer (the pools are full). This is obviously wrong: we should just return an error, not crash, so a fix needs to be made.

Workarounds:

  • add more memory to the store and set the pools accordingly
  • add query safeguards (we are working on this); do not allow querying raw (non-downsampled) data over very long time ranges with lots of series
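One defensive pattern along those lines can be sketched as follows (this is illustrative only, not what Thanos currently does; safeSeries and the handler signature are hypothetical): recover panics at the top of the per-request goroutine and surface them as ordinary errors, so a bad query fails that one request instead of crashing the store. Note that this cannot catch a hard out-of-memory abort from the Go runtime, only ordinary panics such as a nil pointer dereference.

```go
package main

import "fmt"

// safeSeries wraps a request handler so that a runtime panic (e.g. a nil
// pointer hit while expanding postings) is returned as an error instead of
// taking the whole process down.
func safeSeries(handler func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("series request panicked: %v", r)
		}
	}()
	return handler()
}

func main() {
	err := safeSeries(func() error {
		var p *int
		_ = *p // simulated nil-pointer dereference inside a query
		return nil
	})
	fmt.Println(err != nil) // the panic surfaces as an error, not a crash
}
```

In a gRPC server such an error could then be mapped to an Internal status for that single Series call.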

happened to me.

thanos:

$ docker run --rm improbable/thanos@sha256:d6b1cf91db2f438996328d3c434baf8a235deb8d0f0ad62acf5c25e9b804f8ee --version
thanos, version 0.1.0-rc.2 (branch: master, revision: a4128358e9df68170a4df7f0d4c7bef2bab2c3ca)
  build user:       circleci@36ffc8669a55
  build date:       20180718-06:16:19
  go version:       go1.10.3

s3 bucket: radosGW + ceph

first store

level=debug ts=2018-09-19T22:26:28.602245119Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=5 mint=1534801800000 maxt=1537398000000 lset="{prometheus=\"a\"}" spans="Range: 1534269600000-1536638400000 Resolution: 0"
level=debug ts=2018-09-19T22:26:28.602387009Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=163 mint=1534801800000 maxt=1537398000000 lset="{prometheus=\"b\"}" spans="Range: 1536213600000-1537387200000 Resolution: 0"
level=debug ts=2018-09-19T22:26:28.602964169Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=28 mint=1534801800000 maxt=1537398000000 lset="{prometheus=\"c\"}" spans="Range: 1536343200000-1537387200000 Resolution: 0"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb97fd2]
goroutine 82299 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc4213a6080, 0xc4f47a40a0)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc468fba1e0, 0xc4201e4290)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0xf5d940, 0xc468fba1e0, 0x0, 0x2, 0x2, 0xf5d940, 0xc468fba1e0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4202a6280, 0xf5cec0, 0xc454e22840, 0x3f2ea0857aeb6501, 0xea9e259077b31e5f, 0xc55497ced0, 0xc42030fd50, 0xc4b68482a0, 0xc6bbddce20, 0x2, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0xc93c3edb00, 0xc7039a7080)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc4313a3380, 0xc42030fdc0, 0xc4c17b6d20)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

second store, different time (10 minutes earlier) and different case:

panic: runtime error: slice bounds out of range
goroutine 131494 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc49856ecc0, 0xf5cec0, 0xc61d5cab40, 0xc42dbc5148, 0x519, 0x43ae, 0x5, 0xb3f9d0f0b2012e9, 0x0, 0x0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1508 +0x6bf
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0xc5cfac37c0, 0xf5cec0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1479 +0xc4
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc6beb5f500, 0xc586e643c0, 0xc9af12dc30)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

We got hit by this issue as well. What is the recommended way to tackle it before the v0.2.0 roll-out?

We are also running into this issue: a query over a 7-day range fails, while the same query over 2 days works perfectly.
"Cmd": [
"store",
"--log.level=debug",
"--data-dir=./data",
"--grpc-address=0.0.0.0:10901",
"--http-address=0.0.0.0:10902",
"--s3.bucket=thanos",
"--s3.endpoint=archive-ceph-prod-a.[redacted]",
"--index-cache-size=20GB",
"--chunk-pool-size=50GB"
],
"Image": "improbable/thanos:v0.1.0",

level=debug ts=2018-11-06T14:20:58.339627196Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=67 mint=1540908959000 maxt=1541514059000 lset="{monitor=\"infrastructure\",prom_node=\"dep-prom00\",region=\"eu-west\",replica=\"A\"}" spans="Range: 1539820800000-1541505600000 Resolution: 0"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xc0bb22]

goroutine 606 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc459269c80, 0x1b4cf6, 0xc423835fe0)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc450d05560, 0x1b4cf6, 0x1aec01)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc450d05560, 0x1b4cf6, 0x1aec80)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:322 +0x6a
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc450d05590, 0x1b4cf6, 0xc443554830)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc450d05590, 0xc450d32e00)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x10070a0, 0xc450d05590, 0x0, 0x4, 0x4, 0x10070a0, 0xc450d05590)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4202e2280, 0x1006620, 0xc47c5b3180, 0xa6c6b400cfde6601, 0xc44507250502d424, 0xc4205bb290, 0xc4202c0310, 0xc4656a3b60, 0xc444abbbc0, 0x4, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x432e88, 0xf7c050)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc47f144960, 0xc4202c0380, 0xc42ac3b180)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I too have this problem with
image: improbable/thanos:master-2018-10-24-7d9a3ad

level=debug ts=2018-11-08T07:35:18.918602842Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=26 mint=1541575980000 maxt=1541662560000 lset="{monitor=\"prod2\"}" spans="Range: 1541548800000-1541660400000 Resolution: 0"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xc3c922]

goroutine 367093 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc4301caa80, 0x12e9, 0x1)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc426836e10, 0x12e9, 0xc432809220)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc426836e40, 0x12e9, 0xc43095c670)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc426836e40, 0xc432809220)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x101bfa0, 0xc426836e40, 0x0, 0x3, 0x3, 0x101bfa0, 0xc426836e40)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc42024ac00, 0x101b520, 0xc4304c1900, 0xa283288c7f06601, 0xdb3495f8f01717da, 0xc428d4e4b0, 0xc4238afce0, 0xc42f91d680, 0xc42d0aec60, 0x3, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x432e88)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc4327bc0c0, 0xc4238afd50, 0xc428fbbfd0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

Same issue here with the latest version. It seems to occur when I run a "big" query against thanos-store right after restarting it:

level=info ts=2018-11-06T16:30:03.796433603Z caller=flags.go:83 msg="StoreAPI address that will be propagated through gossip" address=172.17.0.2:10901
level=info ts=2018-11-06T16:30:03.806818153Z caller=factory.go:34 msg="loading bucket configuration"
level=info ts=2018-11-06T16:30:33.756304687Z caller=main.go:245 msg="disabled TLS, key and cert must be set to enable"
level=info ts=2018-11-06T16:30:33.756577907Z caller=store.go:191 msg="starting store node"
level=info ts=2018-11-06T16:30:33.756766907Z caller=main.go:297 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-11-06T16:30:33.756836776Z caller=store.go:160 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xc71da2]

goroutine 4706 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc0094b2740, 0x1746, 0x5)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Seek(0xc0006bf290, 0x1746, 0xc006ddb6c0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:402 +0x59
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc0006bf2c0, 0x1746, 0xc0094b2880)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc0006bf2c0, 0xc0007a82d0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x61
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc0006bf320, 0xc0094b2880)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:312 +0x33
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x1094540, 0xc0006bf320, 0x0, 0x4, 0x5, 0x1094540, 0xc0006bf320)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x56
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc0002f4580, 0x1093ac0, 0xc007d2f4c0, 0x4a93cbde3cdf6601, 0x455c782baa981795, 0xc000976090, 0xc000114bd0, 0xc00090fd40, 0xc0009965f0, 0x4, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x4345e9, 0xfb6cc0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc00a180c60, 0xc000114c40, 0xc0007c3280)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xbe

Same here. I have set the sizes to more than enough for a large amount of data and I am still having this issue.
"--index-cache-size=100GB",
"--chunk-pool-size=200GB"

Thanks for the feedback! It's our high priority now.

But it's quite tricky as we cannot repro it on our side.


I finally have some time to dive into it and spent some time on it.

So far I cannot repro it on my end which means that:

  • This might be racy behaviour
  • It might depend on a special state of the TSDB blocks that I cannot reproduce
  • It might depend on a special type of query

Since there is nothing obvious in the code flow, I tried tests for different failure cases, like:

  • Removing a block while a store Series gRPC request is in flight
  • Extensive unit/integration tests with -race detection
  • Intensive load with small chunk-pool or index-cache sizes

I think there are places for memory optimizations (more streaming!), but I cannot reproduce the actual panic.

I added more safeguard code around the potentially broken logic of partitioning the postings fetches: https://github.com/improbable-eng/thanos/pull/627. That is the place where the source of the panic (lazyPostings) is allocated. Where it would previously panic, the code should now error out with actually useful information about what could be wrong (e.g. an edge case in the partitioning).

There is a Docker image built specifically with this additional logic here: https://hub.docker.com/r/improbable/thanos/tags/, tag v0.2.0-rc.0-lazy-postings. If you are coming from v0.1.0, you may need to switch to the new --objstore.config and --data-dir flags. Please use it to help us track this down and let us know here.

@bwplotka I was getting panics pretty consistently with v0.1.0 when querying Thanos over a 90-day period, so I tried v0.2.0-rc.0-lazy-postings to see if I could get more info to debug the issue.

But whatever you did, it looks like it improved things greatly: I haven't been able to reproduce the issue with v0.2.0-rc.0-lazy-postings! I'm curious to see whether others in this thread get similar results, or if I just got lucky.

I've got another error:

panic: runtime error: slice bounds out of range
goroutine 17661 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).loadPostings(0xc531894540, 0x111c940, 0xc4591fa800, 0xc6ad9b9400, 0x25, 0x80, 0x2da2aa80, 0x2dcb61f8, 0x0, 0x0)
    /home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1258 +0x421
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).preloadPostings.func3(0xc484989e60, 0x0)
    /home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1209 +0xbe
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc455cc70e0, 0xc6be5bc870, 0xc6a7848d90)
    /home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
thanos, version 0.1.0 (branch: store-profile, revision: ca11572099cb4f64c916c13c7b411b97cccff24a)
  build user:       [email protected]
  build date:       20181111-01:16:32
  go version:       go1.10

@xjewer hmm, your issue is in a different place, which might help us debug this... What object storage are you using?

@BenoitKnecht can you check the logs? The requests should now fail with sufficient information about what went wrong for the queries that were triggering the panic.

@xjewer can you use v0.2.0-rc.0-lazy-postings so we have some idea of when this happens?

revision: ca11572099cb4f64c916c13c7b411b97cccff24a is exactly the v0.2.0-rc.0-lazy-postings

https://github.com/improbable-eng/thanos/pull/627/commits/ca11572099cb4f64c916c13c7b411b97cccff24a

@bwplotka I did check the logs, but that's the thing: it's not even failing anymore, the request succeeds, so there's nothing in the logs.

Incidentally, Thanos store is now using way less RAM than on v0.1.0 (around 8% of 32GiB, when it used to be around 80%); did anything else change between v0.1.0 and v0.2.0-rc.0-lazy-postings that would explain that?

I'm going to try and downgrade to v0.1.0 again, just to confirm that I can still reproduce the issue reliably on that version.

No, nothing changed, so it must be lower traffic on your side (:

OK, so I'm not able to reproduce the issue on v0.1.0 anymore, so something else must have changed on my side. I'm going to investigate further and I'll keep you posted...

Me as well; I don't see the initial lazyPostings error so far.

I can confirm:
Switched to v0.2.0-rc.0-lazy-postings => No error.
Switched back to v0.1.0 => old error is back, same bucket.
It seems it is not memory-related; these parameters did not change.

That seems to have resolved the issue for me as well. It's been up for 3 days so far with no crashes.

@bwplotka can those changes be merged into master, so I can build against the latest version? It seems to have fixed the issue for now. We can reopen this if it occurs again.

We were encountering this error and as a result, did an upgrade from improbable/thanos:v0.1.0 to improbable/thanos:v0.2.0.

The error is still occurring (it only started re-occurring recently):

$ kubectl logs thanos-store-0 -f
level=info ts=2018-12-11T20:20:11.547099119Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.52.0.172:10901
level=debug ts=2018-12-11T20:20:11.576437636Z caller=cluster.go:158 component=cluster msg="resolved peers to following addresses" peers=10.52.0.166:10900,10.52.5.238:10900,10.52.6.236:10900,10.52.6.237:10900
level=info ts=2018-12-11T20:20:11.576723214Z caller=factory.go:39 msg="loading bucket configuration"
level=debug ts=2018-12-11T20:20:11.577427482Z caller=store.go:122 msg="initializing bucket store"
level=debug ts=2018-12-11T20:20:22.07387774Z caller=store.go:126 msg="bucket store ready" init_duration=10.496446262s
level=info ts=2018-12-11T20:20:22.074176321Z caller=main.go:256 msg="disabled TLS, key and cert must be set to enable"
level=info ts=2018-12-11T20:20:22.074253301Z caller=store.go:191 msg="starting store node"
level=info ts=2018-12-11T20:20:22.074359914Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-12-11T20:20:22.074869481Z caller=store.go:160 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-12-11T20:20:22.079037314Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFETKNRY5VECXZ7DKBB7QQM addr=10.52.0.172:10900
level=debug ts=2018-12-11T20:20:22.099357831Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.6.237:10900"
level=debug ts=2018-12-11T20:20:22.102966902Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFD1MAQ4316WDVM184N5C3T addr=10.52.5.238:10900
level=debug ts=2018-12-11T20:20:22.10305351Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFAAZMNJ9WPZWXZ833W2D5R addr=10.52.6.237:10900
level=debug ts=2018-12-11T20:20:22.10308381Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFA10NQJ6QN6ND3DZ7NPCPZ addr=10.52.6.236:10900
level=debug ts=2018-12-11T20:20:22.103117775Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CYFA1D2CJ4T0KQ56FS75X4YY addr=10.52.0.166:10900
level=debug ts=2018-12-11T20:20:22.105910684Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.6.236:10900"
level=debug ts=2018-12-11T20:20:22.114819832Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.5.238:10900"
level=debug ts=2018-12-11T20:20:22.120939189Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.0.172:10900"
level=debug ts=2018-12-11T20:20:22.121030954Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Stream connection from=10.52.0.172:41034"
level=debug ts=2018-12-11T20:20:22.127303135Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":22 [DEBUG] memberlist: Initiating push/pull sync with: 10.52.0.166:10900"
level=debug ts=2018-12-11T20:20:22.13720523Z caller=cluster.go:233 component=cluster msg="joined cluster" peerType=store knownPeers=thanos-peers.monitoring.svc.cluster.local:10900
level=debug ts=2018-12-11T20:20:31.095351301Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":31 [DEBUG] memberlist: Stream connection from=10.52.5.238:37588"
level=debug ts=2018-12-11T20:20:57.594496184Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:20" msg=":57 [DEBUG] memberlist: Stream connection from=10.52.6.236:45614"
level=debug ts=2018-12-11T20:21:01.104371606Z caller=stdlib.go:89 component=cluster caller="peers2018/12/11 20:21" msg=":01 [DEBUG] memberlist: Stream connection from=10.52.5.238:38484"
level=debug ts=2018-12-11T20:21:04.196987471Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=174 mint=1542140164190 maxt=1544559664190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:04.19776128Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=171 mint=1542140164190 maxt=1544559664190 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:06.288557741Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=174 mint=1542140166190 maxt=1544559666190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:06.28967007Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=171 mint=1542140166190 maxt=1544559666190 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:07.196411862Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=174 mint=1542140167190 maxt=1544559667190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
level=debug ts=2018-12-11T20:21:07.197107644Z caller=bucket.go:672 msg="Blocks source resolutions" blocks=171 mint=1542140167190 maxt=1544559667190 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1541030400000-1544551200000 Resolution: 0"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xd08992]

goroutine 4465 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Seek(0xc4446835c0, 0x551c1, 0x2)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc43c8fe480, 0x551c1, 0xc449978920)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:319 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc43c8fe4b0, 0x551c1, 0xc434cf7020)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:294 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc43c8fe4b0, 0xc438f399e0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:315 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x112b100, 0xc43c8fe4b0, 0x0, 0x3, 0x3, 0x112b100, 0xc43c8fe4b0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:221 +0x57
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4200b0f80, 0x112a680, 0xc42b0abe40, 0x6cb4d758b4646701, 0x64f1a7e3fb2ed087, 0xc4203835f0, 0xc43c6a78f0, 0xc43c6b01e0, 0xc44375d4a0, 0x3, ...)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:489 +0x152
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0xc441fdf2c0, 0x0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:715 +0xe3
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc43c86d5c0, 0xc43c6a7960, 0xc42c17deb0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I do suspect it is a memory-related issue; however, observing the memory utilization, Store has not consumed much memory at all:

    spec:
      containers:
      - args:
        - store
        - --log.level=debug
        - --data-dir=/var/thanos/store
        - --cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900
        - --objstore.config-file=/etc/thanos/bucket.yml
        - --index-cache-size=10GB
        - --chunk-pool-size=10GB
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /creds/gcs-credentials.json
        image: improbable/thanos:v0.2.0
        name: thanos-store
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10900
          name: cluster
          protocol: TCP
        resources:
          requests:
            memory: 10Gi
        volumeMounts:
        - mountPath: /creds/
          name: gcs-credentials
          readOnly: true
        - mountPath: /var/thanos/store
          name: data
        - mountPath: /etc/thanos
          name: config-thanos

Actual memory usage after upgrading is <2GB:
[memory usage graph]

The memory usage over the last 30 days has steadily increased:
[memory usage graph]

I also notice that these messages show up on start-up; should I be deleting or fixing these blocks?

level=warn ts=2018-12-11T20:16:32.439508243Z caller=bucket.go:240 msg="loading block failed" id=01CXGDF4W931PVM6R3DDB0BXZM err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"
level=warn ts=2018-12-11T20:16:32.483353387Z caller=bucket.go:240 msg="loading block failed" id=01CXGMAW4AKN0JG26YRSPF99J5 err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"
level=warn ts=2018-12-11T20:16:32.520181068Z caller=bucket.go:240 msg="loading block failed" id=01CXGV6KC995HR2WTVFE5PWCH3 err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"
level=warn ts=2018-12-11T20:16:32.559296797Z caller=bucket.go:240 msg="loading block failed" id=01CXH22AMFX6JX6CX6ZM95PRDG err="new bucket block: load meta: download meta.json: get file: storage: object doesn't exist"

Following up on this: I have been running the image improbable/thanos:master-2018-12-17-ec30cf2 since Dec 17, 2018, 1:39:01 PM, and have gotten 3 restarts since then.

However, the logs are not showing the messages that you added in #627.

A different panic is now appearing:

level=debug ts=2018-12-19T03:58:21.088695219Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=6 mint=1542772400384 maxt=1545191900384 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:13.389926255Z caller=stdlib.go:89 component=cluster caller="peers2018/12/19 03:58" msg=":13 [WARN] memberlist: Got ping for unexpected node '01CYZ1QCMHVF7J8JC0A3QEYS83' from=10.52.5.239:10900"
level=info ts=2018-12-19T03:58:12.796319948Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-12-19T03:58:12.795805719Z caller=store.go:160 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2018-12-19T03:58:12.795656417Z caller=store.go:191 msg="starting store node"
level=info ts=2018-12-19T03:58:12.795529539Z caller=main.go:256 msg="disabled TLS, key and cert must be set to enable"
level=info ts=2018-12-19T03:58:10.973564518Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2018-12-19T03:58:10.967076525Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.52.0.180:10901
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45a9a7]

goroutine 23416168 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc42e8eb2c0, 0x1135a60, 0xc43736ea00, 0xc42a993ac0, 0x30, 0x50, 0x3, 0x11abd62511ab62e7, 0x0, 0x0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1524 +0x56f
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0xc426066000, 0x0)
    /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1506 +0xc4
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc42b70aea0, 0xc438c2ed70, 0xc42b03c3b0)
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
level=debug ts=2018-12-19T03:58:07.668501945Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:5 postingsTouched:15 postingsTouchedSizeSum:2280 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:5 seriesTouchedSizeSum:1563 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:74 chunksTouchedSizeSum:17313 chunksFetched:74 chunksFetchedSizeSum:96513 chunksFetchCount:5 chunksFetchDurationSum:660954990 getAllDuration:199683288 mergedSeriesCount:2 mergedChunksCount:74 mergeDuration:87141}"
level=debug ts=2018-12-19T03:58:07.418663352Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=11 mint=1542772687000 maxt=1545191887000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.363549417Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:32 postingsTouchedSizeSum:2016 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:24 seriesTouchedSizeSum:5196 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:156 chunksTouchedSizeSum:36527 chunksFetched:156 chunksFetchedSizeSum:988205 chunksFetchCount:4 chunksFetchDurationSum:622909497 getAllDuration:265704911 mergedSeriesCount:12 mergedChunksCount:156 mergeDuration:202548}"
level=debug ts=2018-12-19T03:58:07.355788081Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=3 mint=1545145080000 maxt=1545191940000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545120000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.188290034Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=3 mint=1545141480000 maxt=1545191940000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545120000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.134090584Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169980000 maxt=1545191910000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545148800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:07.125512366Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=11 mint=1542772687000 maxt=1545191887000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:58:02.195285633Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=11 mint=1542772382190 maxt=1545191882190 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1542240000000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.50377173Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169950000 maxt=1545191880000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545148800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.472821359Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:12 postingsTouchedSizeSum:7560 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:8 seriesTouchedSizeSum:988 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:24 chunksTouchedSizeSum:9831 chunksFetched:24 chunksFetchedSizeSum:74526 chunksFetchCount:4 chunksFetchDurationSum:206441229 getAllDuration:73051321 mergedSeriesCount:4 mergedChunksCount:24 mergeDuration:63435}"
level=debug ts=2018-12-19T03:57:57.452496915Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:24 postingsTouchedSizeSum:8612 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:31 seriesTouchedSizeSum:4722 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:128 chunksTouchedSizeSum:31302 chunksFetched:128 chunksFetchedSizeSum:1107642 chunksFetchCount:4 chunksFetchDurationSum:187772808 getAllDuration:61025590 mergedSeriesCount:16 mergedChunksCount:128 mergeDuration:463253}"
level=debug ts=2018-12-19T03:57:57.289427688Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169920000 maxt=1545191880000 lset="{monitor=\"prometheus\",replica=\"prometheus-0\"}" spans="Range: 1545004800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.216719426Z caller=bucket.go:675 msg="Blocks source resolutions" blocks=2 mint=1545169950000 maxt=1545191880000 lset="{monitor=\"prometheus\",replica=\"prometheus-1\"}" spans="Range: 1545148800000-1545184800000 Resolution: 0"
level=debug ts=2018-12-19T03:57:57.191289412Z caller=bucket.go:808 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:12 postingsTouchedSizeSum:7324 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:4 seriesTouchedSizeSum:187 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:557206 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:2417}"

It happens here every hour or two, but nobody wants to upload a core dump that is around 200GB (at least in our case, since we have a bucket which is >1TB). And I haven't had enough time to dig into it and research the warts of Go's memory handling code myself.
FWIW, setting sysctl vm.overcommit_memory=2 (which disables overcommit) immediately leads to Thanos Store getting killed by the OOM killer here, since there is no memory left available. The default value is 0, where the Linux kernel uses a heuristic to decide whether there is enough memory available to fulfill an allocation request. And because the nil pointers come from places where they shouldn't, it leads me to think that the allocations fail because the kernel decided there wasn't enough RAM.
Also, in case you haven't noticed, Go's specification unfortunately doesn't say anything exact about what happens when make() or append() fails to create a new object on the heap... so who knows? :( Plus, in the Thanos code we only check for errors which are represented by the error interface.
Lastly, https://github.com/golang/go/issues/16843 has been open for almost 3 years by now, and funnily enough even some Prometheus developers weighed in on it. Which leads me to think that the language itself doesn't provide a way to control or check this.
And this is why https://github.com/improbable-eng/thanos/pull/798 was born, since in our case random users were sometimes sending queries which overloaded the servers.
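For what it's worth, a nil pointer dereference in Go is an ordinary runtime panic and is recoverable, unlike a genuine runtime out-of-memory, which is a fatal throw that no recover() can catch. A minimal sketch of the recoverable case (the `postings` type here is hypothetical, just mimicking the iterator shape from the trace, not actual Thanos code):

```go
package main

import "fmt"

// hypothetical stand-in for an index postings iterator
type postings struct{ cur uint64 }

func (p *postings) Next() bool {
	p.cur++ // touching a field through a nil receiver panics
	return p.cur > 0
}

// safeNext converts the nil-deref panic into an ordinary error value,
// which is the kind of panic a recover() can actually catch.
func safeNext(p *postings) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	p.Next()
	return nil
}

func main() {
	var p *postings // nil, e.g. because a fetch silently failed upstream
	fmt.Println(safeNext(p))
}
```

A real allocation failure inside the runtime, by contrast, aborts the whole process, which matches the OOM-killer theory better than any error path in the Thanos code.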

Yeah... so as we suspected previously, it has to do with OOM. I wonder what we can do to improve this case.

I think the acceptance criteria for this issue is then to add a means of controlling memory consumption, so that the query request fails rather than the box getting killed. This seems like a first solution: https://github.com/improbable-eng/thanos/pull/798, i.e. control flags similar to what Prometheus has.
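The idea is the same as Prometheus's query limits: account for what a request has touched and abort that one request instead of letting the whole process die. A rough sketch of such a per-request budget (names and shape are mine, not the actual PR code):

```go
package main

import "fmt"

// sampleLimiter is a hypothetical per-request budget, in the spirit of
// Prometheus's --query.max-samples flag: the query fails, the process lives.
type sampleLimiter struct {
	limit int // 0 means unlimited
	count int
}

// Add records n more samples touched by the request and errors out
// once the running total exceeds the configured limit.
func (l *sampleLimiter) Add(n int) error {
	l.count += n
	if l.limit > 0 && l.count > l.limit {
		return fmt.Errorf("query would load %d samples, exceeding the limit of %d", l.count, l.limit)
	}
	return nil
}

func main() {
	l := &sampleLimiter{limit: 100}
	fmt.Println(l.Add(60)) // within budget: nil
	fmt.Println(l.Add(60)) // over budget: error
}
```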

Okay, we got a bit more RAM and it is still reproducible, even though there is a lot of free RAM :( The original PR that I made is still very valid so that random queries can't kill Thanos Store, but the problem lies somewhere else. I will be able to load up a debugger now since there is lots of RAM; the core dump files are huge and, unfortunately, it's impossible to upload such huge files.
However, one thing I have noticed, both in the comments above and in our own deployment, is that the segfault comes when a lot of goroutines are spawned; usually far fewer of them are running. This implies to me that Thanos Store starts downloading some uncompacted data, since in such cases a lot of different requests are needed to get all of the data, which is sparsely laid out. It also probably means that this only happens when there is a huge amount of data in the TSDB blocks, i.e. series with a lot of labels.
I have mainly been eyeing the gap-based partitioner code and its type conversions; perhaps some kind of overflow happens there? That part looks suspicious to me. I guess I will find out soon once I have the time to explore the dump(s) with delve.
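For context, a gap-based partitioner merges nearby byte ranges into one object-store request to cut down on round trips; the overflow worry is about unsigned arithmetic on the offsets. A simplified sketch of the technique (not the actual Thanos code):

```go
package main

import "fmt"

type part struct{ start, end uint64 }

// partition merges sorted, non-overlapping byte ranges whose gap is at most
// maxGap into a single fetch. Note the subtraction r.start - cur.end: with
// unsigned offsets, unsorted or overlapping input would silently wrap around
// instead of going negative, which is the kind of overflow worth auditing.
func partition(ranges []part, maxGap uint64) []part {
	if len(ranges) == 0 {
		return nil
	}
	out := []part{ranges[0]}
	for _, r := range ranges[1:] {
		cur := &out[len(out)-1]
		if r.start-cur.end <= maxGap {
			cur.end = r.end // gap is small: extend the current fetch
		} else {
			out = append(out, r) // gap too large: start a new fetch
		}
	}
	return out
}

func main() {
	got := partition([]part{{0, 10}, {15, 20}, {100, 110}}, 10)
	fmt.Println(got) // [{0 20} {100 110}]
}
```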

Yes, we checked Go's memory management and the behaviour you mentioned (nil on lack of memory) should not happen.

We recently found a race condition on this: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1627 Now it is inside the lock, but before it was outside the lock, and we fixed that recently

here: https://github.com/improbable-eng/thanos/commit/1b6f6dae946fb023710dbbd9e154630aadf623b2#diff-a75f50a9f5bf5b21a862e4e7c6bd1576
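The class of bug in question: shared state mutated on one side of a lock and read on the other. A minimal sketch of the fixed shape (types hypothetical, not the actual bucket.go code):

```go
package main

import (
	"fmt"
	"sync"
)

// chunkIndex is a hypothetical shared map, standing in for the state that
// was being mutated outside the lock before the fix.
type chunkIndex struct {
	mtx    sync.Mutex
	chunks map[int][]byte
}

// add takes the lock before touching the map; performing the map write
// before Lock() is exactly the kind of race that was fixed.
func (c *chunkIndex) add(id int, b []byte) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if c.chunks == nil {
		c.chunks = map[int][]byte{}
	}
	c.chunks[id] = b
}

func main() {
	c := &chunkIndex{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.add(i, []byte{byte(i)})
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.chunks)) // 100
}
```

Running this kind of code with `go run -race` is how such a race shows up before it corrupts anything; without the lock it would be flagged immediately.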

Can we check whether this is still reproducible on master? It is highly plausible that we fixed this.

Will do and update this ticket. Thank you a lot! 👍 Awesome work

I recently updated our cluster components to Thanos 0.3.2. While testing data access for old data (only available via store/bucket access) I started to get similar panics:

Mar 22 12:11:42 thanosstore01 thanos[10659]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 22 12:11:42 thanosstore01 thanos[10659]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x58f0d9]
Mar 22 12:11:42 thanosstore01 thanos[10659]: goroutine 32153 [running]:
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc01a120360, 0x58ee95)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:341 +0x29
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x129ac00, 0xc01a120360, 0x4, 0x129ac00, 0xc01a120360, 0xc0503a0f80, 0x2)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:250 +0x56
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).ExpandedPostings(0xc024a11310, 0xc03a09c660, 0x3, 0x3, 0x11d465, 0x279436851c800546, 0x32d734e732aa07e0, 0xd00641e635b035f9, 0x48d0f05ab4bfe1bf)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1202 +0x2ac
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc000328360, 0x129a140, 0xc00f453440, 0x4a279fb132a56901, 0xccdf002dd53b8032, 0xc021330510, 0xc024a11310, 0xc03a096d20, 0xc03a09c660, 0x3, ...)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:484 +0x8c
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x4346e9, 0x1190d28)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:706 +0xe6
Mar 22 12:11:42 thanosstore01 thanos[10659]: github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc03a097620, 0xc024a24070, 0xc03a09a360)
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
Mar 22 12:11:42 thanosstore01 thanos[10659]: created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
Mar 22 12:11:42 thanosstore01 thanos[10659]:         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xbe

@rsommer it seems like you are running into https://github.com/improbable-eng/thanos/issues/874. I tested the master version on our systems and this particular segfault has been fixed. Closing. If you still run into this then please feel free to comment again :smile:
