Thanos: store: extremely slow and CPU pegged at 100% (lru.RemoveOldest)

Created on 21 Mar 2019 · 5 Comments · Source: thanos-io/thanos

Thanos, Prometheus and Golang version used

  • thanos 0.3.2
    image: improbable/thanos:v0.3.2

Only reproducible in 0.3.2 and not 0.3.1.

What happened

  • store does not respond to any queries; clients like Grafana all time out.
  • CPU is pegged at 100% usage.
  • profiling the CPU usage using:

curl http://xxx:10902/debug/pprof/profile -O
go-torch -f "flame.svg" thanos profile
go tool pprof thanos -svg profile

shows all the CPU time being spent on 'RemoveOldest':
https://github.com/GiedriusS/thanos/blob/9679a193f433353287ea3052320dbc9e46bc3e9e/pkg/store/cache.go#L131

(screenshots: pprof profile and flame graph)

What you expected to happen

  • CPU not to be pegged at 100%

How to reproduce it (as minimally and precisely as possible):

  • I don't know how to reproduce this, but it happens only on our largest Prometheus instances with 1.5+ million head time series. Restarting store and making a few queries leads to the CPU being pegged at 100% again.
  • Edit: This problem does not occur in thanos-store 0.3.1.

Maybe relevant:
https://github.com/improbable-eng/thanos/pull/873

Full logs to relevant components

Anything else we need to know
Linux 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

All 5 comments

As the changelog says, I can also bump index-cache-size up to something very large to revert to 0.3.1's behaviour.

⚠️ WARNING ⚠️ The #873 fix fixes the actual handling of index-cache-size. Handling of the limit for this cache was broken, so it was unbounded all the time. From this release the actual value matters and it is extremely low by default. To "revert" to the old behaviour (no boundary), use a large enough value.

Yes, you should definitely try that. What is index-cache-size set to now?

Possibly the store does not have enough space for the cache, so it has to remove the oldest entries all the time.

@FUSAKLA thanks for the reply! I have it set to 16GB now, and don't have anymore issues related to this.

Is there a formula to compute how big index-cache-size needs to be, relative to the number and size of the blocks?

Possibly the store does not have enough space for the cache, so it has to remove the oldest entries all the time.

If that's the case, should store report a warning when the configured LRU cache size is too small for the working set?

Great to hear! Well, it depends on the queries you send, the time ranges, how many series... there are a lot of factors. Not sure there is one universal formula.

Hmm, not sure about the warning. It's still valid behavior, aligned with the configured cache size, but I see the motivation.
I think watching the cache metrics would suit you better. There are:

  • thanos_store_index_cache_items_size_bytes
  • thanos_store_index_cache_items
  • thanos_store_index_cache_hits_total
  • thanos_store_index_cache_items_overflowed_total
  • thanos_store_index_cache_requests_total
  • thanos_store_index_cache_items_added_total
  • thanos_store_index_cache_items_evicted_total

Those should hopefully tell you how big the cache needs to be.
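As a rough, unofficial sketch of how one might read those counters (the `sample` struct and all numbers below are hypothetical; this is not a formula from the Thanos docs), one can derive a hit ratio and an eviction rate from two scrapes:

```go
// Illustrative sizing check from two samples of the index-cache counters.
// A cache whose items_size_bytes sits flat at the configured limit while
// evictions keep climbing is too small for the working set.
package main

import "fmt"

// sample holds counter values scraped at one point in time; the field names
// mirror the metric names, but the struct itself is hypothetical.
type sample struct {
	hits, requests, evicted, sizeBytes float64
}

// hitRatio is the fraction of cache requests served from the cache
// between the two scrapes.
func hitRatio(prev, cur sample) float64 {
	dr := cur.requests - prev.requests
	if dr == 0 {
		return 0
	}
	return (cur.hits - prev.hits) / dr
}

// evictionsPerSec is how fast the LRU is churning between the two scrapes.
func evictionsPerSec(prev, cur sample, seconds float64) float64 {
	return (cur.evicted - prev.evicted) / seconds
}

func main() {
	prev := sample{hits: 1000, requests: 5000, evicted: 40000, sizeBytes: 250e6}
	cur := sample{hits: 1200, requests: 6000, evicted: 90000, sizeBytes: 250e6}
	fmt.Printf("hit ratio: %.2f, evictions/s: %.0f\n",
		hitRatio(prev, cur), evictionsPerSec(prev, cur, 60))
}
```

If the hit ratio stays low while evictions climb and thanos_store_index_cache_items_size_bytes is flat at the configured limit, the cache is smaller than the working set, and raising index-cache-size (as done above with 16GB) is the likely fix.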

@FUSAKLA those metrics are definitely useful. Thank you!

I can definitely see the 250 MB plateau that caused the constant CPU churn while it constantly tried to evict from the index cache. Going to close this issue now :)

