Thanos: receive: high cpu when upgrading from 0.12.2 with old data

Created on 23 Jun 2020 · 18 comments · Source: thanos-io/thanos

Thanos, Prometheus and Golang version used:
thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8)
build user: circleci@b8cd18f8b553
build date: 20200622-10:04:50
go version: go1.14.2

Object Storage Provider:
GCP

What happened:
We're using 3 thanos-receive v0.12.2 pods running with --receive.replication-factor=3.
At 3:30 I restarted one pod (green line) with v0.13.0, and its CPU usage doubled:
[screenshot: container CPU usage graph; the green line doubles after the upgrade]
Memory usage is 5-10% higher, which is fine.
Here is another graph, from the node where the pod has been running:
[screenshot: node-level CPU usage graph]

What you expected to happen:
A negligible resource usage change between v0.12.2 and v0.13.0 for receive, as with other Thanos components.

How to reproduce it (as minimally and precisely as possible):
We run receive with these args:

          args:
          - receive
          - |
            --objstore.config=type: GCS
            config:
              bucket: jb-thanos
          - --tsdb.path=/data
          - --tsdb.retention=1d
          - --tsdb.min-block-duration=30m  # must be >2x the prometheus-side 15m; see https://github.com/thanos-io/thanos/issues/2114
          - --tsdb.max-block-duration=30m
          - --label=replica="$(NAME)"
          - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
          - --receive.hashrings-file=/cfg/hashrings.json
          - --receive.replication-factor=3
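
For reference, the file passed via --receive.hashrings-file for a 3-replica StatefulSet like this one would look roughly as follows. This is an illustrative sketch: the endpoint names are inferred from the --receive.local-endpoint template above and the thanos-receive-N pod names seen later in this issue, not copied from the actual /cfg/hashrings.json:

    [
      {
        "hashring": "default",
        "endpoints": [
          "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
          "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
          "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
        ]
      }
    ]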

Full logs to relevant components:
These are new events that were not present in v0.12.2. They are also written every ~15s, compared to ~15min in Prometheus v2.19.0 with almost the same TSDB settings:

Logs

level=info ts=2020-06-23T00:37:15.498358444Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=308.339725ms
level=info ts=2020-06-23T00:37:22.528425096Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=287.226708ms
level=info ts=2020-06-23T00:37:25.321693858Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=51 last=52 duration=2.793211656s
level=info ts=2020-06-23T00:37:34.117011112Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=305.063766ms
level=info ts=2020-06-23T00:37:42.249223736Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=353.091759ms
level=info ts=2020-06-23T00:37:45.317240183Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=53 last=54 duration=3.067957396s

Anything else we need to know:

Labels: bug, help wanted, stale

Most helpful comment

We can repro this on our side, looking into this more (:

All 18 comments

Could you take a CPU profile of this?

Please find 30s CPU profiles attached; they were taken at almost the same time:
Archive.zip

Thanks, looks like GC is the current bottleneck:

DIFF 0.12.2 vs 0.13.0:
profile001.pdf

We can repro this on our side, looking into this more (:

JFYI,
I've backported https://github.com/thanos-io/thanos/pull/2621 and https://github.com/thanos-io/thanos/pull/2679 to v0.12.2, and it does not lead to a CPU usage increase.

It's not specific to v0.13.0, but this code certainly doesn't help: https://github.com/prometheus/prometheus/blob/b788986717e1597452ca25e5219510bb787165c7/tsdb/db.go#L812

The TSDB we use forces a head GC multiple times. That might make sense for a single TSDB, but not really for multiple; a minimal sketch of the effect follows.
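
The sketch below is illustrative Go, not the actual Thanos or Prometheus source: the assumption it encodes is that the multiTSDB layer opens one TSDB per directory it finds, and each open TSDB runs its own head GC on every compaction cycle, so N leftover directories mean N GC passes per cycle instead of one. The tenant names are the block-dir ULIDs from the metrics quoted later in this issue:

    package main

    import (
    	"fmt"
    	"time"
    )

    // tenantDB stands in for a per-tenant *tsdb.DB instance.
    type tenantDB struct{ name string }

    // compactHead mimics the per-TSDB compaction step that ends in a head GC.
    func (db *tenantDB) compactHead() {
    	start := time.Now()
    	// ... truncate head, rebuild postings, write WAL checkpoint ...
    	fmt.Printf("tenant=%s msg=%q duration=%s\n", db.name, "Head GC completed", time.Since(start))
    }

    func main() {
    	// Each leftover v0.12.2 block directory is opened as its own "tenant",
    	// so GC work scales with the number of directories, not with the data.
    	tenants := []*tenantDB{
    		{name: "01EC1KGAFNHYWZ6Z8CPMZ8TQQ6"},
    		{name: "01EC1HSC4SJ9KSZRYENXX9PS18"},
    		{name: "01EC1G2E84CNBJ3NFY7P4PE5RW"},
    	}
    	for _, db := range tenants {
    		db.compactHead()
    	}
    }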

they are written every ~15s, compared to ~15min in Prometheus v2.19.0 with almost the same TSDB settings:

Turns out this high CPU usage happens when I start thanos-receive v0.13.0 on a non-empty TSDB folder from v0.12.2 (which has many subfolders like 01EC0YX4APXB1MGA39DWFTS96C, etc.).
This leads to each subfolder being converted to a separate TSDB:

...
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1KGAFNHYWZ6Z8CPMZ8TQQ6"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1HSC4SJ9KSZRYENXX9PS18"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1G2E84CNBJ3NFY7P4PE5RW"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1EBGTBP4HRW2WGG9KEXN7P"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1CMKR2521YKPS1283WGBKJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1AXMY36Z98AQ64QQRD10MP"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC196PXDRFDA39FW476KR7SJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC17FSYJY8AWTYXCP3QW9J9W"}
...

And then the log is full of component=tsdb msg="Head GC completed" events.
If I wipe the TSDB folder before starting v0.13.0, it consumes the same CPU as v0.12.2.
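
A quick way to tell which layout a data dir is in (the paths and ULIDs below are illustrative, taken from this issue; the tenant name matches the default-tenant seen in the logs above):

    # v0.12.2 single-TSDB layout: block ULID dirs live directly under --tsdb.path
    $ ls /data
    01EC0YX4APXB1MGA39DWFTS96C  01EC17FSYJY8AWTYXCP3QW9J9W  wal

    # v0.13.0 multiTSDB layout: one subdirectory per tenant, blocks nested inside
    $ ls /data
    default-tenant
    $ ls /data/default-tenant
    01EC1CMKR2521YKPS1283WGBKJ  wal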

This looks like something we experienced as well, but now that I recall it, we did indeed wipe the dir.

It's a migration issue from non-multiTSDB to multiTSDB.

This can easily be reproduced in a unit test by porting old receive data. Let's fix it; it might be annoying for Receive users who want to upgrade to 0.13+.

I think for now you can run with the higher CPU long enough for the WAL from 0.12 to be gone, then wipe those dirs, e.g. as sketched below. In the meantime we will look into a fix. Still help wanted (:
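
Concretely, that wipe could look something like this sketch. It assumes the old blocks have already been uploaded to object storage and the 0.12 WAL has been fully flushed; the paths and glob are illustrative, so double-check before deleting anything:

    $ cd /data
    # remove leftover top-level ULID block dirs from the v0.12.2 layout,
    # so they are no longer opened as separate tenant TSDBs on restart
    $ rm -rf 01E*
    # remove the old single-TSDB WAL as well
    $ rm -rf wal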

That looks like the migration didn't work correctly; the dirs look like block directories. Just to make sure though, you are not running a multi-tenant setup, right?

Correct, we're not running multi-tenancy, and these are just block dirs.

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

Closing for now as promised, let us know if you need this to be reopened! 🤗

Still the migration path has to be fixed in theory.

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

Closing for now as promised, let us know if you need this to be reopened! 🤗
