Thanos, Prometheus and Golang version used:
Thanos: self-built from Thanos master: a09a4b97c243b7652446685c73f6b80bb9417fe2
Golang: go version go1.12.7 darwin/amd64
Prometheus: 2.13.0
thanos build command: GOOS=linux GOARCH=amd64 go build -o thanos ./cmd/thanos
What happened:
I'm trying out Thanos Receive. It ran as expected for hours until I restarted it;
after the restart, Thanos Receive tries to load the WAL data again and again.
What you expected to happen:
Receive should load the WAL data once, then listen for and accept new data.
How to reproduce it (as minimally and precisely as possible):
I'm not sure whether there is a logic issue during Receive startup. Thanos 0.7 restarts successfully, but a build from master does not.
Full logs of relevant components:
level=info ts=2019-10-10T13:11:00.909450843Z caller=main.go:170 msg="Tracing will be disabled"
level=warn ts=2019-10-10T13:11:00.909646509Z caller=receive.go:145 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2019-10-10T13:11:00.910759805Z caller=factory.go:39 component=receive msg="loading bucket configuration"
level=info ts=2019-10-10T13:11:00.913326316Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-10T13:11:00.913810978Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-10T13:11:00.913638216Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-10T13:11:00.913837063Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-10T13:11:00.91392518Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-10T13:11:00.914250195Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570680000000 maxt=1570687200000 ulid=01DPT7F6K1TJJC4XAJJ9WVBZY5
level=info ts=2019-10-10T13:11:00.91435138Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570687200000 maxt=1570694400000 ulid=01DPTEAZDMJJDYSZ5XVQ9Z76XQ
level=info ts=2019-10-10T13:11:00.914334174Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-10T13:11:00.914388157Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570694400000 maxt=1570701600000 ulid=01DPTN6PKK3HJA99PH74XZYWPX
level=info ts=2019-10-10T13:11:00.914437502Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570701600000 maxt=1570708800000 ulid=01DPTW2DQ6DQT1YZ9BF40QJS47
level=info ts=2019-10-10T13:11:01.480532567Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-10T13:11:54.313611621Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-10T13:11:58.976905831Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2034 maxSegment=2167
level=info ts=2019-10-10T13:12:03.674326957Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2035 maxSegment=2167
level=info ts=2019-10-10T13:12:08.825274021Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2036 maxSegment=2167
level=info ts=2019-10-10T13:12:13.699480427Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2037 maxSegment=2167
level=info ts=2019-10-10T13:12:18.64123707Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2038 maxSegment=2167
level=info ts=2019-10-10T13:12:23.570922965Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2039 maxSegment=2167
.....
level=info ts=2019-10-10T13:22:40.613369573Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2165 maxSegment=2167
level=warn ts=2019-10-10T13:22:41.261691466Z caller=head.go:492 component=receive component=tsdb component=tsdb msg="unknown series references" count=104
level=info ts=2019-10-10T13:22:41.261809977Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2166 maxSegment=2167
level=info ts=2019-10-10T13:22:41.26233365Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2167 maxSegment=2167
level=info ts=2019-10-10T13:23:18.541986663Z caller=head.go:509 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-10T13:24:06.787360411Z caller=head.go:533 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-10T13:24:11.174920806Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2034 maxSegment=2167
level=info ts=2019-10-10T13:24:15.48524744Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2035 maxSegment=2167
level=info ts=2019-10-10T13:24:20.140672496Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2036 maxSegment=2167
level=info ts=2019-10-10T13:24:24.467852131Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2037 maxSegment=2167
level=info ts=2019-10-10T13:24:28.944796843Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2038 maxSegment=2167
....
level=info ts=2019-10-10T13:33:51.954823544Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2136 maxSegment=2167
level=warn ts=2019-10-10T13:33:57.647949531Z caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=6973
level=info ts=2019-10-10T13:33:57.648056531Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2137 maxSegment=2167
level=warn ts=2019-10-10T13:34:03.485428154Z caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=7034
level=info ts=2019-10-10T13:34:03.485536561Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2138 maxSegment=2167
level=info ts=2019-10-10T13:34:22.347698132Z caller=main.go:170 msg="Tracing will be disabled"
level=warn ts=2019-10-10T13:34:22.348067504Z caller=receive.go:145 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2019-10-10T13:34:22.34921212Z caller=factory.go:39 component=receive msg="loading bucket configuration"
level=info ts=2019-10-10T13:34:22.353678479Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-10T13:34:22.353887745Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-10T13:34:22.354027884Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-10T13:34:22.354002076Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-10T13:34:22.354402573Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-10T13:34:22.35507048Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-10T13:34:22.355474967Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570680000000 maxt=1570687200000 ulid=01DPT7F6K1TJJC4XAJJ9WVBZY5
level=info ts=2019-10-10T13:34:22.355602831Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570687200000 maxt=1570694400000 ulid=01DPTEAZDMJJDYSZ5XVQ9Z76XQ
level=info ts=2019-10-10T13:34:22.355642604Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570694400000 maxt=1570701600000 ulid=01DPTN6PKK3HJA99PH74XZYWPX
level=info ts=2019-10-10T13:34:22.355704854Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570701600000 maxt=1570708800000 ulid=01DPTW2DQ6DQT1YZ9BF40QJS47
level=info ts=2019-10-10T13:34:22.840486871Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
Environment:
uname -a: 4.15.0

On master, receive ensures on startup that it starts with a clean state, so it loads the WAL, flushes it to a block, and then uploads it. That's why you're seeing it be loaded multiple times.
cc @squat
@GuyCheung the WAL gets replayed and written to disk in the following cases:
Are you seeing an error where the WAL data is not successfully getting loaded?
@brancz @squat sorry for the late reply. I didn't see any critical errors, except some "unknown series references" warnings like `caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=6973`.
If it loads multiple times, is there an exact number of how many times? It seems like an infinite loop that keeps loading.
@GuyCheung the WAL will be reloaded twice on startup (only one of the two actually plays data back) and twice every two hours:
During startup:
... then every two hours:
Are you seeing it more often than that?
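One way to answer that question is to count the replay log lines. A minimal sketch, assuming the receive logs have been captured to a file (the sample lines below are stand-ins, not real output):

```shell
# Generate a tiny stand-in log file, then count WAL replays in it.
# In practice, point grep at your captured thanos-receive logs instead.
printf '%s\n' \
  'caller=head.go:509 msg="replaying WAL, this may take awhile"' \
  'caller=head.go:557 msg="WAL segment loaded" segment=1 maxSegment=2' \
  'caller=head.go:509 msg="replaying WAL, this may take awhile"' \
  > receive-sample.log
grep -c 'replaying WAL' receive-sample.log   # prints 2 for this two-replay sample
```

If the count keeps growing well beyond the expected two-per-start plus two-per-two-hours, something else (e.g. a crash loop) is restarting the replay.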
@squat thanks for the information! I think I know why it keeps loading: the process was OOM-killed. I have some questions based on this; could you help me?
Before my questions, an update: I have upgraded Thanos to v0.8.1.
I assumed memory usage would not be much larger than the data folder on disk, but the process used about 47 GB of memory (out of 48 GB total) while the WAL folder on disk was only about 22 GB.
The process was using about 47 GB of memory when it was killed:
[7140727.569918] Out of memory: Kill process 89696 (thanos) score 936 or sacrifice child
[7140727.573522] Killed process 89696 (thanos) total-vm:52478600kB, anon-rss:47156872kB, file-rss:0kB, shmem-rss:0kB
[7140730.168611] oom_reaper: reaped process 89696 (thanos), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
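For reference, the kernel reports RSS in kB, so the anon-rss figure can be converted to GiB directly. A small sketch using the OOM line quoted above:

```shell
# Extract anon-rss from the kernel OOM line and convert kB -> GiB.
line='Killed process 89696 (thanos) total-vm:52478600kB, anon-rss:47156872kB, file-rss:0kB, shmem-rss:0kB'
kb=$(printf '%s\n' "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')
echo "anon-rss: $((kb / 1024 / 1024)) GiB"   # 47156872 kB is roughly 44 GiB
```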
The receiver's data folder size is as follows:

I tried to find an answer online, but unfortunately didn't find anything useful; I did see a claim that Prometheus 2 cannot limit its memory usage, though I'm not sure about that.
I think the WAL was replayed three times during startup, and each replay took about 10 minutes. Is this expected? Is there something we can do to reduce the replay time?
The key parts of my logs are below:
level=info ts=2019-10-15T11:30:13.41402523Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-15T11:30:13.414148657Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-15T11:30:13.414159228Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-15T11:30:13.414214989Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-15T11:30:13.414218471Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-15T11:30:13.414273506Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-15T11:30:13.414441585Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571104800000 maxt=1571112000000 ulid=01DQ6WK24Q2YT0FF2P4X9C24QQ
level=info ts=2019-10-15T11:30:13.414492525Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571112000000 maxt=1571119200000 ulid=01DQ73ESJB1ADEC1AJ53449AXY
level=info ts=2019-10-15T11:30:13.414524465Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571119200000 maxt=1571126400000 ulid=01DQ7AAGNR3YW4KE0JZ16E7VSJ
level=info ts=2019-10-15T11:30:13.4145516Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571126400000 maxt=1571133600000 ulid=01DQ7JCVJMGZKY88TTVF6Z9WX8
level=info ts=2019-10-15T11:30:13.728236906Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-15T11:30:47.562607851Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-15T11:30:53.447669188Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2493 maxSegment=2588
level=info ts=2019-10-15T11:30:58.981511175Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2494 maxSegment=2588
level=info ts=2019-10-15T11:31:04.501141802Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2495 maxSegment=2588
level=info ts=2019-10-15T11:31:10.169544723Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2496 maxSegment=2588
......
level=info ts=2019-10-15T11:40:25.17463596Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2584 maxSegment=2588
level=info ts=2019-10-15T11:40:31.55063322Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2585 maxSegment=2588
level=info ts=2019-10-15T11:40:32.134839757Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2586 maxSegment=2588
level=info ts=2019-10-15T11:40:32.13568359Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2587 maxSegment=2588
level=info ts=2019-10-15T11:40:32.136520863Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2588 maxSegment=2588
level=info ts=2019-10-15T11:40:48.050314426Z caller=head.go:509 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-15T11:41:19.363551827Z caller=head.go:533 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-15T11:41:26.103639804Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2493 maxSegment=2588
level=info ts=2019-10-15T11:41:31.860346412Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2494 maxSegment=2588
level=info ts=2019-10-15T11:41:37.503849905Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2495 maxSegment=2588
level=info ts=2019-10-15T11:41:43.180104605Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2496 maxSegment=2588
.......
level=info ts=2019-10-15T11:51:48.263074931Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2586 maxSegment=2588
level=info ts=2019-10-15T11:51:48.263503714Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2587 maxSegment=2588
level=info ts=2019-10-15T11:51:48.263827295Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2588 maxSegment=2588
level=info ts=2019-10-15T11:55:32.977551161Z caller=compact.go:495 component=receive component=tsdb msg="write block" mint=1571133600003 maxt=1571137701384 ulid=01DQ7M5MPT3ZVP0YGNVWZZBRX3 duration=3m26.870889547s
level=info ts=2019-10-15T11:55:32.977767127Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571104800000 maxt=1571112000000 ulid=01DQ6WK24Q2YT0FF2P4X9C24QQ
level=info ts=2019-10-15T11:55:32.977804379Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571112000000 maxt=1571119200000 ulid=01DQ73ESJB1ADEC1AJ53449AXY
level=info ts=2019-10-15T11:55:32.977831925Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571119200000 maxt=1571126400000 ulid=01DQ7AAGNR3YW4KE0JZ16E7VSJ
level=info ts=2019-10-15T11:55:32.977858102Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571126400000 maxt=1571133600000 ulid=01DQ7JCVJMGZKY88TTVF6Z9WX8
level=info ts=2019-10-15T11:55:32.977885957Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571133600003 maxt=1571137701384 ulid=01DQ7M5MPT3ZVP0YGNVWZZBRX3
level=info ts=2019-10-15T11:55:33.274419251Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-15T11:55:43.440394647Z caller=shipper.go:349 component=receive msg="upload new block" id=01DQ7M5MPT3ZVP0YGNVWZZBRX3
level=info ts=2019-10-15T11:56:06.821192263Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-15T11:56:13.313346626Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2493 maxSegment=2589
level=info ts=2019-10-15T11:56:19.591774203Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2494 maxSegment=2589
level=info ts=2019-10-15T11:56:25.923561156Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2495 maxSegment=2589
level=info ts=2019-10-15T11:56:32.700377317Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2496 maxSegment=2589
level=info ts=2019-10-15T11:56:39.431858684Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2497 maxSegment=2589
..........
level=info ts=2019-10-15T12:06:15.766904275Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2586 maxSegment=2589
level=info ts=2019-10-15T12:06:15.767436341Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2587 maxSegment=2589
level=info ts=2019-10-15T12:06:15.767896116Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2588 maxSegment=2589
level=info ts=2019-10-15T12:06:15.768469372Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2589 maxSegment=2589
level=info ts=2019-10-15T12:06:37.956202426Z caller=receive.go:239 component=receive msg="tsdb started"
level=info ts=2019-10-15T12:06:37.9562544Z caller=prober.go:114 component=receive msg="changing probe status" status=ready
level=info ts=2019-10-15T12:06:37.95626502Z caller=receive.go:243 component=receive msg="server is ready to receive web requests."
level=info ts=2019-10-15T12:06:37.956699189Z caller=receive.go:347 component=receive msg="listening for StoreAPI gRPC" address=0.0.0.0:19219
@GuyCheung, this huge growth in the WAL looks to me to be the same issue that #1654 fixes. Every time the receive component triggers a flush, the WAL is replayed and written to a block, but the WAL is never deleted. Only new samples are actually written to the block, yet ALL samples are processed, which means the receive component does far more work than necessary. If the receive component shuts down correctly and cleans the WAL, there should be zero replay cost when starting back up.
@GuyCheung now that #1654 has merged, could you test again using the image built from the commit?
@squat I tried the new version built from commit 48a8fb6e2f6a476bcffa508d6609a19847c695ef,
but it got OOM-killed on hosts with 128 GB of memory, even though the data folder is only about 6 GB. Could you help look into this?

I shared the pprof/heap file to you: https://drive.google.com/file/d/1iKqfMD9brOXbt7mLqJCX689AhRPuzJ_N/view?usp=sharing
@squat I also tested the receive component with this patch. Same story, but with a small amount of data: ~80 MB of generated WAL (a single segment) cannot be replayed without hitting the 4 GB memory limit on the container.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Does this continue to occur for you?
We have a similar problem running Thanos in a Kubernetes cluster: when a pod gets restarted, the receiver is OOM-killed. The StatefulSet looks similar to the following example: https://github.com/thanos-io/kube-thanos/blob/master/examples/all/manifests/thanos-receive-statefulSet.yaml
The problem occurs with Thanos v0.10.0 and all earlier versions. We also tested with up to 16 GB of memory (when the pods are running normally, they use around 1 GB). These are the logs until the pod gets OOM-killed:
level=info ts=2020-01-17T07:40:38.874473819Z caller=main.go:149 msg="Tracing will be disabled"
level=warn ts=2020-01-17T07:40:38.874629267Z caller=receive.go:170 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2020-01-17T07:40:38.874722296Z caller=options.go:20 component=receive protocol=HTTP msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-01-17T07:40:38.902765617Z caller=options.go:80 component=receive msg="TLS client using system certificate pool"
level=info ts=2020-01-17T07:40:38.904426534Z caller=factory.go:43 component=receive msg="loading bucket configuration"
level=info ts=2020-01-17T07:40:39.133165452Z caller=receive.go:476 component=receive msg="starting receiver"
level=info ts=2020-01-17T07:40:39.133391365Z caller=prober.go:127 component=receive msg="changing probe status" status=healthy
level=info ts=2020-01-17T07:40:39.133430517Z caller=http.go:53 component=receive service=http/server component=receive msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-01-17T07:40:39.133494173Z caller=options.go:20 component=receive protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-01-17T07:40:39.133543844Z caller=receive.go:322 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2020-01-17T07:40:39.13335757Z caller=handler.go:171 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19291
level=info ts=2020-01-17T07:40:39.394096288Z caller=head.go:583 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-01-17T07:40:41.731857492Z caller=head.go:607 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-01-17T07:40:44.053388267Z caller=head.go:631 component=receive component=tsdb msg="WAL segment loaded" segment=15 maxSegment=22
level=info ts=2020-01-17T07:40:45.417218717Z caller=head.go:631 component=receive component=tsdb msg="WAL segment loaded" segment=16 maxSegment=22
level=info ts=2020-01-17T07:40:47.796282924Z caller=head.go:631 component=receive component=tsdb msg="WAL segment loaded" segment=17 maxSegment=22
This entire code path is simply leveraging Prometheus TSDB packages. Do we see similar spikes in memory when Prometheus restarts and replays a large WAL?
We are seeing similar memory spikes when Prometheus restarts and replays a large WAL, but the Prometheus spikes are not as extreme as the Thanos ones.
Thanos gets OOM-killed with 16 GB; after that we decreased the limit to 4 GB:

Prometheus:

We also see thanos-receive v0.10.1 spike in memory usage on start and get into an OOM loop.
We had to reduce `--tsdb.block-duration` to `15m` to get a smaller WAL; startup memory usage dropped dramatically.
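For anyone who wants to try the same mitigation, a sketch of where the flag goes on the receive command line (paths and the other flag values are placeholders, not a complete or recommended invocation):

```shell
# Smaller blocks mean the head is cut and flushed more often,
# so there is less WAL to replay (and hold in memory) on restart.
thanos receive \
  --tsdb.path=/var/thanos/receive \
  --tsdb.block-duration=15m \
  --remote-write.address=0.0.0.0:19291 \
  --objstore.config-file=bucket.yml
```

Note the trade-off: shorter block durations produce more, smaller blocks, which shifts work to compaction and uploads.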
I am also seeing huge spikes in memory after a restart: 9 GB becomes 93 GB. I have more details in https://github.com/thanos-io/thanos/issues/2107 (which I closed as a duplicate after finding I had missed this issue), but it sounds pretty similar to what everyone else is seeing.
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
This still happens with Thanos v0.11.0
This can largely be attributed to the WAL replay cost of the Prometheus TSDB, which thanos-receive makes use of. This is already being improved in Prometheus itself; once merged there, it will trickle down here.
@brancz - can you point us to some existing Prometheus PR?
One example: https://github.com/prometheus/prometheus/pull/6679. There are more ideas in the pipeline once that work is completed.
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks!
If there is no activity for the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Still valid for v0.12.1
Not necessarily a general solution (it just pushes the problem out further), but the latest version of the Prometheus TSDB should at least cut this problem in half: https://github.com/prometheus/prometheus/pull/7098
Closing for now as promised, let us know if you need this to be reopened!
Same problem for v0.12.2: restart receive, then OOM.
We are experiencing a similar problem with our setup. We use Thanos Receive to collect metrics from 8 clusters, running the pods on VMs with 64 GB of memory; the pods get OOM-killed at least once a day and we lose at least 2 hours of data each time. Is there a way to circumvent this?