Thanos: Thanos store OOM

Created on 17 Nov 2019  ·  11 Comments  ·  Source: thanos-io/thanos

Thanos, Prometheus and Golang version used:
Thanos: 0.6.0
Prometheus: 2.10.0

Object Storage Provider: S3

What happened:
Thanos Store won't start. It runs for about 2 minutes and then crashes with an OOM. I increased the memory to 64 GB and it still fails. The compactor is running and generating index.cache.json files.

Bucket size: 202.4 GiB. Total Objects: 4952
Biggest index.cache.json: 3.2 GiB

Store flags:

  • --index-cache-size=20GB
  • --chunk-pool-size=40GB
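For context, a sketch of how these sizing flags fit into a full `thanos store` invocation; the objstore config path and listen addresses below are assumptions for illustration, not taken from this report:

```shell
# Sketch of a thanos store invocation with the reported sizing flags;
# the objstore config path and addresses are hypothetical.
cmd='thanos store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --index-cache-size=20GB \
  --chunk-pool-size=40GB \
  --http-address=0.0.0.0:19191 \
  --grpc-address=0.0.0.0:19090'
echo "$cmd"
```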


All 11 comments

Hi, any reason you are running such an old version? Please try the master and/or 0.8.1 and see if it is still reproducible.

Same issue with v0.8.1.

Using the same version here.

# thanos --version
thanos, version 0.8.1 (branch: HEAD, revision: bd8278859b2321aaaa7514edde764816cc039d34)
  build user:       root@2227d9a2fdb1
  build date:       20191014-12:03:55
  go version:       go1.13.1

Total objects is a little over 8000.

Running on a VM, I had to go from 2GB --> 4GB --> 8GB --> 16GB of memory before the OOM-killer was not an issue anymore!
Now thanos store is using 12.3GB of RAM.

Same issue with 0.9.0. I wonder what the relationship between bucket size, total objects, --index-cache-size and --chunk-pool-size is, so as to come up with a formula indicating the proper memory requirements, even if it's only an estimate.
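There is no documented formula as of this thread; as a hedged back-of-envelope sketch using the numbers reported above (index cache + chunk pool + headroom for loading the largest index.cache.json), one might budget something like the following. The 2x parsing headroom is an assumption, not a measured constant:

```shell
# Back-of-envelope memory budget for thanos store (not an official formula).
index_cache_gb=20      # --index-cache-size
chunk_pool_gb=40       # --chunk-pool-size
largest_index_gb=4     # biggest index.cache.json, rounded up from 3.2 GiB
# Loading index.cache.json can transiently need a multiple of the on-disk
# size; the 2x factor here is an assumption.
headroom_gb=$(( largest_index_gb * 2 ))
total_gb=$(( index_cache_gb + chunk_pool_gb + headroom_gb ))
echo "Suggested minimum memory: ${total_gb}GB"
```

With the figures from the original report this suggests budgeting well above the sum of the two flags alone.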

We're having the same issue; it would be really useful if the documentation contained some ballpark values to set.

Things seem to have gotten a bit worse with v0.9.0.
I upgraded at 14:13CET and this is what Grafana shows:

[Screenshot taken 2019-12-24 at 15:32:58: Grafana memory-usage graph]

Thanos Store behaves a bit like "The Very Hungry Caterpillar" when it comes to memory usage...

On the positive side, I see it's being worked on: https://github.com/thanos-io/thanos/issues/1471 👍

Just started testing thanos store, thanos-0.10.0.linux-amd64
@bwplotka Can you please explain this?

Instead of an OOM kill, it is now restarting pretty often with fatal error: runtime: out of memory (20 GB RAM).

Jan 21 15:35:28 thanos0-grq thanos[10904]: created by net.(*netFD).connect
Jan 21 15:35:28 thanos0-grq thanos[10904]:         /usr/local/go/src/net/fd_unix.go:128 +0x275
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Failed with result 'exit-code'.
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Service RestartSec=100ms expired, scheduling restart.
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Scheduled restart job, restart counter is at 3.
Jan 21 15:35:29 thanos0-grq systemd[1]: Stopped Thanos Store Gateway.
Jan 21 15:35:29 thanos0-grq systemd[1]: Started Thanos Store Gateway.
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.707034623Z caller=main.go:149 msg="Tracing will be disabled"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.711552838Z caller=factory.go:43 msg="loading bucket configuration"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.817545669Z caller=inmemory.go:167 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.818622141Z caller=options.go:20 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.818923211Z caller=store.go:288 msg="starting store node"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.819046301Z caller=store.go:243 msg="initializing bucket store"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.819459371Z caller=prober.go:127 msg="changing probe status" status=healthy
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.819557861Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:19191
Jan 21 15:35:39 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:39.306737083Z caller=fetcher.go:361 component=block.MetaFetcher msg="successfully fetched block metadata" duration=9.487646873s cached=11563 returned=11563 partial=0
Jan 21 15:39:11 thanos0-grq thanos[11516]: fatal error: runtime: out of memory

[Screenshot: Thanos-store]
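Given the tight 100 ms restart loop visible in the log above, one mitigation (a sketch, not from the thread) is a systemd drop-in that caps the service's memory and slows the crash loop. The drop-in is written to /tmp here so the sketch is runnable without root; the real location would be /etc/systemd/system/thanos-store.service.d/ followed by `systemctl daemon-reload`:

```shell
# Sketch: systemd drop-in capping Store memory and slowing the restart loop.
# Written to /tmp for illustration only.
mkdir -p /tmp/thanos-store.service.d
cat > /tmp/thanos-store.service.d/override.conf <<'EOF'
[Service]
MemoryMax=20G
Restart=on-failure
RestartSec=30s
EOF
cat /tmp/thanos-store.service.d/override.conf
```

MemoryMax turns an unbounded crash into a predictable cgroup limit, and a longer RestartSec avoids hammering the object store with repeated startup metadata fetches.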

The PR that closed this is not in 0.10.0. Please try out the master version with the experimental flags turned on.

It's on master indeed. It's still experimental, but you can enable it via https://github.com/thanos-io/thanos/blob/master/cmd/thanos/store.go#L78 (--experimental.enable-index-header).

We are still working on various benchmarks, especially around query resource usage, but functionally it should work! (: Please try it out on dev/testing/staging environments and give us feedback! :heart:
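Enabling it is just a matter of adding the flag to the store invocation; a sketch, where the objstore config path is a hypothetical placeholder:

```shell
# Sketch: a store invocation with the experimental index-header enabled.
# The objstore config path is an assumption, not from the thread.
args=(store
      --objstore.config-file=/etc/thanos/bucket.yml
      --experimental.enable-index-header)
echo "thanos ${args[*]}"
```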

@bwplotka the result is amazing after applying this feature flag. Are there any side effects on performance?

Nothing we have seen or any user has reported. Theoretically, queries might be insignificantly slower, and slightly more CPU cycles and disk lookups are needed, but not much more.

Label APIs might be a bit slower, and we need this to fix that:
https://github.com/thanos-io/thanos/issues/1811

