Prometheus: Implement strategies to limit memory usage.

Created on 21 Jan 2015  ·  90 comments  ·  Source: prometheus/prometheus

Currently, Prometheus simply limits the chunks in memory to a fixed number.

However, this number doesn't directly imply the total memory usage as many other things take memory as well.

Prometheus could measure its own memory consumption and (optionally) evict chunks early if it needs too much memory.

It's non-trivial to measure "actual" memory consumption in a platform-independent way.

kind/enhancement

Most helpful comment

Branch https://github.com/prometheus/prometheus/tree/beorn7/storage currently contains the implementation of an experimental flag -storage.local.target-heap-size and obsoletes -storage.local.max-chunks-to-persist and -storage.local.memory-chunks. It's running with great success on a couple of SoundCloud servers right now, but I have to polish it a bit before submitting it for review.

All 90 comments

I too have had issues with this. I've found Prometheus very quickly eats up a lot of RAM, and it can't easily be managed.

@sammcj The problem here is that there is no standard way to get a Go program's actual memory consumption. The heap sizes reported in http://golang.org/pkg/runtime/#MemStats are usually off by a factor of 3 or so from the actual resident memory. This can be due to a number of things: memory fragmentation, Go metadata overhead, or the Go runtime not returning pages to the OS very eagerly. A proper solution needs yet to be found.

One thing you can tune right now is how many sample chunks to keep in RAM. See this flag:

  -storage.local.memory-chunks=1048576: How many chunks to keep in memory. While the size of a chunk is 1kiB, the total memory usage will be significantly higher than this value * 1kiB. Furthermore, for various reasons, more chunks might have to be kept in memory temporarily.

Keep in mind this is only one (albeit major) factor in RAM usage. Other factors are:

  • number of queries
  • number of time series
  • frequency of samples per time series
  • shape / length of label sets
  • etc.

Also, there are the various queues (Prometheus's own ones, like the sample ingestion queue, but also Go- and OS-internal ones, like queued-up network queries or whatever). I have the idea of implementing a kind of memory chaperon that would not only evict evictable chunks but also throttle/reject queries and sample ingestion to keep total memory usage (or the amount of free memory on the machine) within limits. But that's all highly non-trivial stuff...

There are by now many things that may take memory, and there are many knobs to turn to tweak it. I changed the name of the issue to something more generic.

One piece of good news already: the ingestion queue is gone, so there will no longer be wild RAM usage jumps if ingestion piles up scrapes.

I am running with a retention of 4 hours and the default "storage.local.memory-chunks" on version 0.13.1-fb3b464. While @juliusv said I should expect the memory used to be more than 1GB, I am seeing it run out with the Docker container limit set at 2.5GB. Basically it looks like a memory leak, because on restart all of the memory goes back down and then slowly creeps back up over time. Is there any formula that could give me a good idea of what to set the memory limit to? And is there any way to figure out whether there is a leak somewhere, or whether it is just because more and more data is coming in?

@a86c6f7964 One thing to start out with: if you configure Prometheus to monitor itself (I'd always recommend it), does the metric prometheus_local_storage_memory_chunks go up at the same rate as the memory usage you're seeing? Or does it plateau at the configured maximum while the memory usage continues to go up? Checking prometheus_local_storage_memory_series would also be interesting to see how many series are current (not archived) in memory. If those are plateauing, and the memory usage is still going up, we'll have to dig deeper.

Yeah, it was going up. It got to almost 1 million, so maybe it just needs a little more memory.

@a86c6f7964 Retention is fundamentally a bad way to get memory usage under control. It will only affect memory usage if _all_ your chunks fit into memory. Retention is meant to limit disk usage.

Please refer to http://prometheus.io/docs/operating/storage/#memory-usage for a starter. Applying the rule of thumb given there, you should set -storage.local.memory-chunks to 800,000 at most if you have only 2.5GiB available. The default is 1M, which will almost definitely make your Prometheus use more than 2.5GiB in steady state.

I recommend starting with -storage.local.memory-chunks=500000 and a retention tailored to your disk size (possibly many days or weeks).

The problem here is that "what's my memory usage?" or "how much memory is free on the system?" are highly non-trivial questions. See http://www.redhat.com/advice/tips/meminfo.html/ as a starter...

I'm currently running prometheus (0.15.1) on a bare metal server with 64GB memory, default settings (except retention, one week) and around 750 compute servers to be scraped every 30s. The server is dedicated to prometheus.

We have observed that memory consumption keeps going up until the machine is not responding anymore. It takes around two days to reach this point, and killing the Prometheus process does not free all memory immediately. As suggested by @juliusv, I monitored prometheus_local_storage_memory_chunks. It started at 1.242443e+06 and ended up in a plateau around 1.86631e+06, please see below. My question is, what should I look at to get more information about this growth and where it is coming from?

Mem used 34549836 KiB
prometheus_local_storage_memory_chunks 1.964022e+06
prometheus_local_storage_memory_series 1.822449e+06

Mem used 38098228 KiB
prometheus_local_storage_memory_chunks 2.013611e+06
prometheus_local_storage_memory_series 1.648374e+06

Mem used 41139708 KiB
prometheus_local_storage_memory_chunks 2.062455e+06
prometheus_local_storage_memory_series 1.472947e+06

Mem used 53843712 KiB
top: 1431 prometh+  20   0 21.967g 0.015t   7968 S  99.4 23.9   2189:20 prometheus  
prometheus_local_storage_memory_chunks 1.899084e+06
prometheus_local_storage_memory_series 1.653677e+06

Mem used 56187240 KiB
top: 1431 prometh+  20   0 22.384g 0.015t   7968 S  86.8 25.0   2441:44 prometheus
prometheus_local_storage_memory_chunks 1.969578e+06
prometheus_local_storage_memory_series 1.518563e+06

Mem used 63289448 KiB
top: 1431 prometh+  20   0 23.886g 0.017t   7972 S  88.5 27.4   3461:40 prometheus   
prometheus_local_storage_memory_chunks 1.86631e+06
prometheus_local_storage_memory_series 1.518586e+06

@pousa Yeah, sounds like that kind of server should normally not use that much RAM.

Some things to dig into:

  • How much query traffic does the machine get? sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m]))
  • What's the sample ingestion rate? rate(prometheus_local_storage_ingested_samples_total[5m]). With 1.5 million series and a 30s scrape rate, I'd expect roughly 50k samples per second.
  • What's the type of monitored jobs? Are they node exporters or something else?
  • Doing a heap profile via go tool pprof http://prometheus-host:9090/debug/pprof/heap could be interesting to see what section of memory is growing over time (web in the resulting pprof shell will open an SVG graph in the browser).

I would say that the machine does not get that much query traffic. I had to reboot it in the afternoon (again memory problems) and right now it has:

sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) = 0.029629190678656617
rate(prometheus_local_storage_ingested_samples_total[5m]) = 12478.366666666667

They are node exporters and a collectd plugin developed by us that basically aggregates data from /proc/PID/stat for each job. In our data center, a job is basically a parallel application from the HPC domain, composed of processes/threads.

Great! Thanks for the pprof tip, I will do it and post it here later.

@pousa That sounds like a low number of queries and very reasonable ingestion rate for that kind of server. Though that makes me wonder, if you have 1.5 million series active in memory, I would have expected ~50k samples per second ingestion rate (at 30s scrape interval) instead of 12k/s, unless your series are frequently changing, so that only a small subset of active memory series gets an update on every scrape. If you are monitoring jobs via /proc/PID/stat, and these jobs are labeled by their pid, I wonder if that is what's leading to frequent churn in series (by PIDs changing all the time?). Still not sure how exactly that would lead to your memory woes though.

Your memory usage is shooting up, while memory chunks and series are staying pretty constant. Weird!

@juliusv The ingestion rate stays around this value, and the number of monitored jobs is around 7k. I do not monitor PIDs explicitly, but JOBIDs (sets of PIDs). However, they also change quite a lot. Jobs have a maximum duration of 4h, 24h or 120h, and it is common to have jobs that run for a few minutes.

Yep, that is why I posted here. I could not understand this either. I still have to run pprof, will do this today.

@juliusv I ran pprof and saw only 3GB being used. Need to investigate more...

@pousa The Go-reported heap size is always smaller than what the OS sees in terms of resident memory usage (due to internal metadata overhead and memory fragmentation), but that's usually a factor of 2 or so. I don't see how it would report 3GB but then fill up 64GB in reality. Odd!

@juliusv Thanks for the information. Indeed odd. I replicated the service on a different server today and I'm monitoring both servers/Prometheus instances. I want to see if this could somehow be related to the server and not Prometheus itself, since top reports only half of the memory being used by Prometheus.

I have a similar issue to @pousa: the EC2 instance repeatedly runs out of memory after about 2-4 days. I do have much less memory than @pousa, but I am wondering what minimum/recommended memory capacity is required for running Prometheus long-term. Is it possible for Prometheus to control its memory usage automatically instead of exhausting all the memory on the instance until it dies?

@killercentury I still have the problem. I tried building it with a newer version of Go, but no luck. I also looked into Go runtime environment variables but could not find anything.

Hi everybody, please read http://prometheus.io/docs/operating/storage/ .

@killercentury If you have very little memory (less than ~4GiB), you want to _reduce_ -storage.local.memory-chunks from its default value.

@pousa With 1.5M time series, you want to _increase_ -storage.local.memory-chunks to something like 5M to make Prometheus work properly. -storage.local.max-chunks-to-persist should be increased then, too. Each time series should be able to keep at least 3 chunks in memory, ideally more. Also, if -storage.local.max-chunks-to-persist is not high enough, Prometheus will attempt a lot of little disk ops, which will slow everything else down and might increase RAM usage a lot because queues fill up. That's especially true with 7k targets. If everything slows down, this might easily result in a spiral of death.

Once you have tweaked the two flags mentioned above (perhaps to even higher settings), I would next increase the scraping interval to something very high (like 3min or so) to check if things improve. Then you can incrementally reduce the interval until you see problems arising.

(In different news: 7k is a very high number of targets. Sharding of some kind might be required. But that's a different story.)

And yes, ideally all of these values would auto-tweak themselves. However, that's highly non-trivial and not a priority to implement right now.

@beorn7 I will try tweaking those flags and if needed the scrape interval. Thanks!

@beorn7 Changing those flags and increasing a bit the scraping interval allow our prometheus instances to run without running out of memory. Thanks! However, I still have long timings to get back results from expressions... sometimes even get timeouts. Concerning sharding, we were already doing it.

Expensive queries can be pre-computed with recording rules: http://prometheus.io/docs/querying/rules/

To try out very expensive queries, you can increase the query timeout via the -query.timeout flag.

(Obviously, that's all now off-topic and has nothing to do with memory usage anymore. ;)

@beorn7 Thanks, I will try the flag. We already have rules in place, and I was talking about very simple queries (e.g single metrics). But, as you said, this is a different topic ;)

If a single time series takes a long time to query, then we are kind of on-topic again. Because the time is most likely needed to load chunks from disk. Tweaking the flags discussed here, you can maximize the number of chunks Prometheus keeps in memory, and thereby avoid loading in chunks from disk.
But in different news, loading a single series from disk should be very fast (because all the data is in a single file, one seek only). So I guess your server is very busy and overloaded anyway, so that everything is slow.

Are we really talking about single _series_ and not single metric names, but with multiple series?


@juliusv and @beorn7 I'm talking about single metric names, e.g. virtual memory of jobs or processor load on servers. But this should not take that much time, right? My timeout is set to 3m, and timeouts only happen when I see with top that all memory on the server is being used.

@pousa If the server is swapping, all bets are off. If it's just using a lot of memory and not swapping, it really depends on how many series the metric consists of, and what your exact query is (graph or tabular?). A single metric may still consist of tens of thousands of time series or more, in which case a very busy server might become too slow to even do a tabular query for all series matching that metric name (though tens of thousands normally still works fast and fine in the tabular view). Some more details about the query would be interesting...

@juliusv I will have to check if the server is swapping when this happens. I don't see any timeouts now; they usually start when all memory is being used... and that takes around 2-3 days after I start Prometheus on the server.

Concerning the query, we only use tabular queries. And what we use most here is something very simple like:

collectd_jobs_vm{encl="45"}

which shows virtual memory for all jobs running on chassis 45. And this is enough to get timeout when server has all memory being used.

Ok, assuming there are not a hundred thousand jobs on chassis 45, I'd expect this query to always complete in a reasonable amount of time unless the machine is swapping (or the server is otherwise so incredibly overloaded that nothing really works anymore). So yeah, check what the si and so columns of vmstat 1 say next time that happens :)

I'm experiencing almost the same problem as @pousa, on 0.20.0. Much smaller setup though, only about 50 nodes with 30s interval.

prometheus_local_storage_memory_chunks = 1048996
prometheus_local_storage_memory_series = 897363
rate(prometheus_local_storage_ingested_samples_total[5m]) = 6463
sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) = 0.0169

pprof.prometheus.inuse_objects.inuse_space.001.pb.gz
prom-metrics.txt

Looks reasonable, yet my prometheus instance is using upwards of 26Gi of rss. When I do a heap dump with debug/pprof, it only shows 4Gi in usage.

I haven't tried the solution of increasing max chunks, although it seems counterintuitive to me as the docs say it increases memory usage. Although it may fix things, I'd really like to understand what is consuming the ~22Gi of out-of-heap memory, so I can plan my instance size properly and know how to tune things. Any ideas where to start looking?

Do you by any chance either have only a handful of targets that are really large, or are federating or otherwise querying large amounts of data?

Biggest targets are kubelet which exports cadvisor metrics: a simple wc -l shows 2258 lines (includes comments). No federation, although we do pull from a pushgateway but that has only about 1k lines.

That's quite odd then. Can you repeatedly take heap snapshots and find what's taking up the space? Go only gives up memory every 5m or so.

I'd like to second this issue. Having quite the same issues even with a small scale environment of one EC2 node, running 15 docker containers, scraped by cAdvisor + host metrics.

rate(prometheus_local_storage_ingested_samples_total[5m]) = 371
prometheus_local_storage_memory_chunks = 524140
prometheus_local_storage_memory_series = 2409

I set the -storage.local.memory-chunks=524288 and Prometheus container used ~220MB RAM after recreation. Over the period of 1 week, RAM usage constantly climbed up to 1.3GB as of now.

To me it seems like Prometheus is eating up memory as long as it can. If I would have to make an educated guess, I'd say this is some sort of memory leak.

Here is a graph of Prometheus own memory usage. The purple graph is the old container with default storage.local.memory-chunks and the yellow graph is the new one with that setting halved to 524288

[graph: Prometheus container memory usage, old vs. halved memory-chunks setting]

My next idea would be to limit the Prometheus Docker container's max memory. However, I'm not sure how Prometheus will react. If it's really a memory leak, it could deadlock :scream:

@philicious 1.3GiB RAM usage is very reasonable with 500k chunks in memory. The memory usage will climb until Prometheus has maxed out the configured number of memory chunks. Then it should stabilize, with spikes while running queries, scrapes, or whatever else requires RAM temporarily. If you limit RAM usage hard, Prometheus will simply crash, i.e. Prometheus does not detect available memory in any way. The only lever you have is the storage.local.memory-chunks flag, which is not rocket science: RAM in KiB divided by 5 as the flag value should work out fine in most cases.

@jsravn The tens of GiB used is really weird. Something irregular is going on here. A run-away scrape, as Brian suspected, as an example. Or some of the more exotic features going wild. Are you using remote storage or a special kind of service discovery? It's unlikely this has to do with storing ingested samples in RAM.

@beorn7 Oh, OK. I read https://prometheus.io/docs/operating/storage/ but that's vague about the relation between RAM usage and that flag.
So if I got you right, one can approximately expect (storage.local.memory-chunks / 1024) * 5 ≈ max RAM used in MiB?

Strictly speaking, each chunk will only take 1k of RAM. But then there is some overhead for managing the chunks, and then there is a whole lot of other things the server is doing, most notably serving queries. Each time series in memory has a footprint, too, which becomes relevant if you have a lot of time series with relatively few chunks in memory. The x5 multiplier just turned out to be a threshold that is rarely crossed. In most cases, usage will be clearly below that; in extreme cases, above.
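The rule of thumb above (total RSS rarely exceeds 5x the chunk memory, at 1KiB per chunk) can be inverted to pick a flag value from a RAM budget. A minimal sketch; memoryChunksFor is a hypothetical helper name, not part of Prometheus:

```go
package main

import "fmt"

// memoryChunksFor derives a value for -storage.local.memory-chunks from an
// available RAM budget, per the rule of thumb discussed in this thread:
// each chunk takes 1KiB, and total RSS tends to stay under roughly 5x the
// chunk memory. The safety factor and function are illustrative only.
func memoryChunksFor(availableRAMBytes uint64) uint64 {
	const chunkSize = 1024 // bytes per chunk
	const safetyFactor = 5 // observed RSS / chunk memory, worst case
	return availableRAMBytes / chunkSize / safetyFactor
}

func main() {
	// e.g. a container limited to 2.5 GiB:
	fmt.Println(memoryChunksFor(5 * 512 * 1024 * 1024)) // prints 524288
}
```

That matches the advice earlier in the thread of roughly 500k-800k chunks for a 2.5GiB container.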

Having something smarter than that is the whole point of this issue. However, in practice, tweaking for your available memory is rarely a problem (speaking as somebody who is in charge of 50+ production Prometheus servers with very different loads).

@beorn7 Yeah, OK, understood. Usually I either don't care and just give it more RAM, or even tweak that flag upwards. Unfortunately, for the first time, I have to squeeze out some memory. Now, thanks to you, I have a better understanding of that flag. Maybe you could add this x5 multiplier rule of thumb to the aforementioned docs page.

The doc states "As a rule of thumb, you should have at least three times more RAM available than needed by the memory chunks alone."

Adding something like "4x is a safer bet, and 5x is almost certainly but not always safe" sounds a bit like Monty Python... ;)

@beorn7

The tens of GiB used is really weird. Something irregular is going on here. A run-away scrape, as Brian suspected, as an example. Or some of the more exotic features going wild. Are you using remote storage or a special kind of service discovery? It's unlikely this has to do with storing ingested samples in RAM.

We're using the aws node discovery. I could try turning that off and see what happens. I'm away for a couple days but I'll give it a shot when I get back. We're writing to a provisioned iops EBS in AWS.

Here's a couple more screenshots. Here's the container memory usage:
[screenshot: container memory usage, 2016-07-24]

Here's disk i/o. It seems high to me (70MB/s sustained writes), but not sure. We provisioned 2000 iops on the EBS, so there's headroom left, and queue sizes are low, so I don't think it's a bottleneck.
[screenshot: disk I/O, 2016-07-24]

I'll try a few things:

  • increase max memory chunks
  • disable aws node discovery, and do it manually instead
  • get heap dumps as requested every 5 minutes

Thanks for the help so far.

get heap dumps as requested every 5 minutes

I'd like you to take heap dumps as often as you can get. When you get one that shows the 26GiB of usage in Go, please send it on.

Had a crash, and noticed this in the logs:

time="2016-07-26T15:20:28Z" level=info msg="File scan complete. 10092420 series found." source="crashrecovery.go:81"

It's taking a long time to do crash recovery as well, > 15 minutes. For 50 nodes, 10 million time series sounds quite wrong.

edit: Actually, it may make sense for us, since cadvisor (kubelet) generates new metrics for new containers, and we're constantly deploying new versions.

Ok I took continuous heap dumps, enabled gctrace=1 and restarted prometheus.

Here's first 5 minutes rss usage:
[screenshot: RSS usage over the first 5 minutes, 2016-07-26]

I did notice cadvisor seems to be reporting a larger container memory usage than the system though, but it's still high (21.8Gi):

# ps aux | grep prom
root     14613  316 70.1 22824660 22806488 ?   Ssl  19:09  37:31 /bin/prometheus -config.file=/etc/prometheus/prometheus.yml -alertmanager.url=http://alertmanager -log.level=debug -web.external-url=http://prometheus-default-tools.k8s.api.bskyb.com -storage.local.retention=360h0m0s

Largest collection shows 14Gi, and target heap hovers around 8-10:

gc 41 @148.186s 6%: 74+1032+0.93 ms clock, 599+2167/2065/2951+7.4 ms cpu, 9404->9426->5986 MB, 9431 MB goal, 8 P
scvg0: inuse: 6363, idle: 7996, sys: 14359, released: 0, consumed: 14359 (MB)
gc 42 @152.086s 6%: 7.0+1165+0.65 ms clock, 42+5814/2330/183+3.9 ms cpu, 12369->12370->8385 MB, 12370 MB goal, 8 P
gc 43 @166.613s 6%: 25+2241+1.0 ms clock, 103+2721/4482/303+4.3 ms cpu, 14867->16218->6073 MB, 16771 MB goal, 8 P
gc 44 @170.886s 6%: 1.1+1023+0.72 ms clock, 9.5+307/1941/4909+5.7 ms cpu, 8375->8772->6420 MB, 9448 MB goal, 8 P
time="2016-07-26T19:12:20Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539" 
gc 45 @181.074s 6%: 2.4+1137+1.0 ms clock, 12+4520/2258/173+5.0 ms cpu, 11755->12011->4152 MB, 12049 MB goal, 8 P
gc 46 @184.955s 7%: 4.7+1187+0.72 ms clock, 37+2644/2238/2261+5.7 ms cpu, 7601->7823->4221 MB, 7796 MB goal, 8 P

2016-07-26-prom-logs-startup.txt

Largest heap dump was 7380MiB:

Fetching profile from http://prom/debug/pprof/heap
Saved profile in /home/me/pprof/pprof.prom.inuse_objects.inuse_space.353.pb.gz
         0     0% 99.39%  7380.74MB 99.67%  runtime.goexit

pprof.prom.inuse_space.353.pb.gz

I also did continuous heap dumps on an earlier, longer running prometheus and got a 9081MiB heap as the largest (with it usually being about 5-6):
pprof.prom.inuse_space.088.pb.gz

Ok, the reason cadvisor shows higher memory usage is that it includes file caches for the process in addition to RSS. So about 6Gi of file cache:

# cat /sys/fs/cgroup/memory/docker/f9574512ad67ab6303907c33b522794cbe151ce6b77ad873854e9c6a433c8a50/memory.stat
cache 6561488896
rss 24723570688
rss_huge 16739467264
mapped_file 7610368
swap 0
pgpgin 57215766
pgpgout 53660702
pgfault 2006917
pgmajfault 50
inactive_anon 0
active_anon 24723570688
inactive_file 5335183360
active_file 1226301440
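To see how much of a container's accounted memory is page cache rather than anonymous memory, the memory.stat values can be parsed and compared. A sketch for cgroup v1 output like the above; parseMemoryStat is a hypothetical helper, not part of any tool mentioned here:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryStat splits a cgroup v1 memory.stat blob into a map, so that
// rss (anonymous memory) can be compared against cache (page cache), which
// cgroup accounting lumps into the container's total memory usage.
func parseMemoryStat(stat string) map[string]uint64 {
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(stat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			out[fields[0]] = v
		}
	}
	return out
}

func main() {
	// Values taken from the memory.stat dump in the comment above.
	stat := "cache 6561488896\nrss 24723570688\nswap 0\n"
	m := parseMemoryStat(stat)
	fmt.Printf("rss: %.1f GiB, cache: %.1f GiB\n",
		float64(m["rss"])/(1<<30), float64(m["cache"])/(1<<30))
}
```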

@jsravn With almost a million time series in your checkpoint and 10M series on disk altogether, you should definitely configure more than the default 1M memory chunks, more like 3M. Otherwise, you will have a lot of eviction and reloading. Perhaps that causes the GC to not keep up; the GC amount looks pretty big to me. Please read https://prometheus.io/docs/operating/storage/ and apply the tweaks recommended there.

@beorn7 I tried doubling it to 2M memory chunks, but it didn't seem to make a difference. Looking at the metrics, the process hits >20Gi rss long before it even reaches 1M memory chunks, it's usually about 700K-800K when it hits 20G+ then it slowly increases up to the chunk limit.

At a bit of a loss now; I spent a lot of time trying all sorts of combinations. Disabling all scrape config + alerts brings memory usage down to 10Gi. Turning on either alerts or scrape config seems to put it back. I've also raised the scrape interval to 60s and the retention time to 180 hours, but that hasn't seemed to make much difference at all.

Ok finally got somewhere. I was able to reduce memory usage by about 30-40% by disabling huge pages, e.g. echo never > /sys/kernel/mm/transparent_hugepage/enabled. prometheus rss has dropped to a steady 11-13Gi vs 18-22Gi+. This sounds like the hugepage problems golang 1.5 had (https://github.com/golang/go/issues/8832). I'm not sure why it happens in my case though - maybe due to the large number of new time series being created because of containers spinning up.

This seems to have finally got prometheus's memory usage under control for me:

echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

Which seems to let the go runtime's MADV_HUGEPAGE/MADV_NOHUGEPAGE operate correctly. With these set to always (default on RHEL7), it doesn't seem to play nicely with golang's gc and I get that large rss usage after a while.

I'll keep prometheus running for a few days with these settings and see how it goes. So far, this has dropped my mem usage to 7-10Gi, and cpu usage has dropped down to 1-2 core usage from 3-4.

Final thing to mention is that it seems prometheus alloc patterns (at least in my use case) lead to a lot of fragmentation in the go 1.6 heap. This is probably why the hugepages wastes so much memory as well. I'm not sure what can be done about that. Even though my heap has dropped to ~9Gi, over half of that is idle and the go scavenger is unable to reclaim it. Here's an example of gctrace=1:

scvg33: 5096 MB released
scvg33: inuse: 2144, idle: 7333, sys: 9477, released: 5096, consumed: 4381 (MB)
scvg34: inuse: 3472, idle: 6005, sys: 9477, released: 0, consumed: 9477 (MB)
scvg35: inuse: 2762, idle: 6942, sys: 9704, released: 0, consumed: 9704 (MB)
scvg36: inuse: 3207, idle: 6591, sys: 9799, released: 0, consumed: 9799 (MB)
scvg37: inuse: 2822, idle: 6976, sys: 9799, released: 0, consumed: 9799 (MB)
scvg38: 1770 MB released
scvg38: inuse: 2496, idle: 7302, sys: 9799, released: 1770, consumed: 8028 (MB)
scvg39: inuse: 2175, idle: 7623, sys: 9799, released: 0, consumed: 9799 (MB)
scvg40: inuse: 2370, idle: 7429, sys: 9799, released: 0, consumed: 9799 (MB)
scvg41: 5348 MB released
scvg41: inuse: 2980, idle: 6818, sys: 9799, released: 5348, consumed: 4450 (MB)
scvg42: 372 MB released
scvg42: inuse: 2828, idle: 6970, sys: 9799, released: 5720, consumed: 4078 (MB)
scvg43: 2 MB released
scvg43: inuse: 2195, idle: 7603, sys: 9799, released: 25, consumed: 9773 (MB)
scvg44: inuse: 2763, idle: 7035, sys: 9799, released: 5, consumed: 9793 (MB)
scvg45: 5623 MB released
scvg45: inuse: 2208, idle: 7590, sys: 9799, released: 5626, consumed: 4172 (MB)
scvg46: 121 MB released
scvg46: inuse: 3420, idle: 6378, sys: 9799, released: 5107, consumed: 4691 (MB)
scvg47: inuse: 2784, idle: 7014, sys: 9799, released: 0, consumed: 9799 (MB)
scvg48: inuse: 2462, idle: 7337, sys: 9799, released: 0, consumed: 9799 (MB)
scvg49: 3663 MB released
scvg49: inuse: 3347, idle: 6451, sys: 9799, released: 3663, consumed: 6135 (MB)
scvg50: inuse: 2746, idle: 7052, sys: 9799, released: 3663, consumed: 6135 (MB)
scvg51: 59 MB released
scvg51: inuse: 3226, idle: 6572, sys: 9799, released: 59, consumed: 9739 (MB)
scvg52: inuse: 3331, idle: 7589, sys: 10920, released: 0, consumed: 10920 (MB)
scvg53: inuse: 3278, idle: 7641, sys: 10920, released: 0, consumed: 10920 (MB)
scvg54: inuse: 2404, idle: 8516, sys: 10920, released: 0, consumed: 10920 (MB)
scvg55: 83 MB released
scvg55: inuse: 2297, idle: 8622, sys: 10920, released: 83, consumed: 10837 (MB)
scvg56: inuse: 3014, idle: 7905, sys: 10920, released: 0, consumed: 10920 (MB)
scvg57: 2891 MB released
scvg57: inuse: 2770, idle: 8150, sys: 10920, released: 2891, consumed: 8028 (MB)
scvg58: inuse: 3542, idle: 7378, sys: 10920, released: 0, consumed: 10920 (MB)
scvg59: inuse: 3169, idle: 7750, sys: 10920, released: 0, consumed: 10920 (MB)
scvg60: 93 MB released
scvg60: inuse: 2483, idle: 8436, sys: 10920, released: 93, consumed: 10827 (MB)
scvg61: inuse: 2856, idle: 8064, sys: 10920, released: 76, consumed: 10844 (MB)
scvg62: inuse: 3581, idle: 7339, sys: 10920, released: 0, consumed: 10920 (MB)
scvg63: inuse: 3132, idle: 7788, sys: 10920, released: 0, consumed: 10920 (MB)
scvg64: inuse: 2802, idle: 8117, sys: 10920, released: 0, consumed: 10920 (MB)
scvg65: 6606 MB released
scvg65: inuse: 3306, idle: 7614, sys: 10920, released: 6606, consumed: 4313 (MB)
scvg66: 46 MB released
scvg66: inuse: 2889, idle: 8031, sys: 10920, released: 6653, consumed: 4267 (MB)
scvg67: 17 MB released
scvg67: inuse: 2882, idle: 8037, sys: 10920, released: 36, consumed: 10883 (MB)
scvg68: inuse: 3250, idle: 7670, sys: 10920, released: 0, consumed: 10920 (MB)
scvg69: 5924 MB released
scvg69: inuse: 2847, idle: 8072, sys: 10920, released: 5924, consumed: 4995 (MB)

I don't have experience with Go and hugepages.

I haven't seen a problem with excessive GC or heap fragmentation with Prometheus in quite some time. So this is indeed something special with your setup, possibly related to hugepages, or the AWS node discovery.

Prometheus definitely does a lot of small constant size allocations. At some point, we managed those in a free list, but found that it has diminishing returns. The GC seems to be quite happy with constant size allocation (rationale is that it doesn't cause any fragmentation). Perhaps that changes with hugepages.

In any case, this issue is off-topic here. Feel free to file an issue about the hugepages findings. It might be worth trying out a sync.Pool for chunks in that case.

Thanks for all your research work.

@beorn7 It's either me or Prometheus, but it's consuming 2.25GB RAM although I have -storage.local.memory-chunks=393216, which should max out between 1.1-1.9GB (3-5 multiplier). Should I further reduce memory-chunks? I need to bring the RAM usage down :/

@philicious See https://github.com/prometheus/prometheus/issues/1836#issuecomment-236131816

The fewer memory chunks you have, the more impact the baseline memory footprint and all the other things that need memory have.

I understand that these problems are quite hard to solve. But I think having sane defaults should be quite easy to implement. If you have a recommendation of using 1/5 of the available system memory, you should also set this as a default value instead of a fixed value. This would help new users and ensure that their servers do not crash after a few days.

@runningman84 That would indeed be nice, however:

If you have a recommendation of using 1/5 of the available system memory, you should also set this as a default value instead of a fixed value.

Implementing this depends on knowing two things:

  • How much memory is available.
  • How much memory Prometheus would use, given certain settings and external parameters (query rates etc.).

Both are non-trivial to determine, as described further up in this issue.

I played a bit with heuristics to predict the memory usage of a Prometheus server based on the memory time series and the memory chunks. My data set were the many Prometheus servers at SoundCloud, with data over many weeks.

Result: It's really hard. If you take 10k per memory series and 3k per chunk, and add 1GiB "for the pot", you have a pretty solid prediction for most cases, i.e. the RAM usage estimated by that formula is way too high in most cases, but at least it's rare that you need more RAM. _However,_ it's still easy to cause higher memory usage through many expensive and long-running rule evaluations or heavy query load. If you are short on the provided RAM, you could easily trigger an OOM.
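The back-of-the-envelope heuristic above (10 KiB per memory series, 3 KiB per memory chunk, plus 1 GiB "for the pot") can be written out as a small estimator. The constants are the rough numbers from this comment, explicitly described as a loose upper bound, not a guarantee:

```go
package main

import "fmt"

// estimateRAM returns a rough upper-bound RAM estimate in bytes, using
// the heuristic discussed above: 10 KiB per memory series, 3 KiB per
// memory chunk, plus a flat 1 GiB of baseline overhead.
func estimateRAM(memorySeries, memoryChunks uint64) uint64 {
	const (
		perSeries = 10 * 1024 // 10 KiB per memory series
		perChunk  = 3 * 1024  // 3 KiB per memory chunk
		baseline  = 1 << 30   // 1 GiB "for the pot"
	)
	return memorySeries*perSeries + memoryChunks*perChunk + baseline
}

func main() {
	// Example: 2 million series and 13.1 million chunks in memory.
	est := estimateRAM(2000000, 13107200)
	fmt.Printf("%.1f GiB\n", float64(est)/(1<<30))
}
```

As the comment stresses, heavy query load can blow past any such estimate.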

Given that the actual RAM usage depends on so many things, it would be really great to let Prometheus manage things itself, with only one parameter given: how much RAM Prometheus is allowed to use.

I have a much better idea how to accomplish that than when this issue was filed, but it will still be pretty complex.

Just wanted to thank everyone in this thread for the great info. I am using prometheus to monitor about 200 containers via cadvisor and I have been struggling with configuring Prometheus to get its memory usage under control. I may have to restart the stack (docker compose makes this easy) at periodic intervals to avoid the increasing memory usage if I cannot find a configuration that is stable for my current environment.

I don't think the flag -storage.local.memory-chunks works in my case. I'm running the Prometheus v1.1.1 Docker container inside a Kubernetes cluster v1.2.0, with -storage.local.memory-chunks=1048576 and -storage.local.max-chunks-to-persist=524288, but this is what Prometheus reports after 14 hours of running:

# HELP prometheus_local_storage_memory_chunkdescs The current number of chunk descriptors in memory.
# TYPE prometheus_local_storage_memory_chunkdescs gauge
prometheus_local_storage_memory_chunkdescs 2.451975e+06
# HELP prometheus_local_storage_memory_chunks The current number of chunks in memory, excluding cloned chunks (i.e. chunks without a descriptor).
# TYPE prometheus_local_storage_memory_chunks gauge
prometheus_local_storage_memory_chunks 1.154092e+06
# HELP prometheus_local_storage_memory_series The current number of series in memory.
# TYPE prometheus_local_storage_memory_series gauge
prometheus_local_storage_memory_series 2.033516e+06

So the memory chunks and series in memory still grow, much more than the assigned value.

The number of memory series is determined by the targets you scrape. Since each memory series needs at least one chunk in memory (preferably more), you cannot have fewer memory chunks than series.
https://prometheus.io/docs/operating/storage/ explains the context a bit.

@beorn7 What I understand is that the memory will grow as the number of series grow and there is no way around for that. Prometheus can take up to half of our total RAM assigned to application servers. So unfortunate that we cannot use prometheus, it is just what we need for our in-depth monitoring.

The more you ask Prometheus to do, the more RAM it'll use. This is going to be the same with any realtime monitoring solution.

@ntquyen When Prometheus becomes too large to run in a single process, you can scale it by running multiple processes (using sharding) and merging a subset of data together in a single master process (using federation).

@brian-brazil What I want to ask prometheus is that:

  • OK, I have 3 million points, but please take only 500k into memory and not more. or:
  • OK, I assigned to you 10GB RAM, please use half of it to store active points and not more. And:
  • With those I gave you, you should handle queries with 1h range. If not, I'm still ok if the queries run slowly, but please don't crash.

Though it might sound hard to support (due to some unsolved limitation in Go?), it seems reasonable to ask, right?

The Prometheus storage engine isn't designed to work in that way. If something is in active use it must be kept in memory, and everything flows from that.

In case this helps others: I have resolved the memory usage issue for my use/deployment with a combination of tuning the prometheus configs, and by restarting the Prometheus container once every 24 hours. We don't lose data this way, memory usage is constrained to the desired levels. Been running stable for weeks like this.

@randomInteger Restarting Prometheus every 24h really shouldn't be needed. The only "creeping" growth in memory usage of Prometheus would come from a too high setting of -storage.local.memory-chunks (and the related -storage.local.max-chunks-to-persist). Memory usage will flatten out at some point (once the configured limit of chunks is kept in memory), and you can control via that flag (roughly) at which level it does.
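For reference, the two Prometheus 1.x flags discussed above are set on the command line. The values below are purely illustrative (they mirror a configuration mentioned earlier in this thread, where max-chunks-to-persist is roughly half of memory-chunks), not recommendations:

```shell
prometheus \
  -storage.local.memory-chunks=1048576 \
  -storage.local.max-chunks-to-persist=524288
```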

Yeah I am still working on the tuning. I am in a tricky deployment here where I have to manage hundreds of containers on each host, and resources are a bit constrained. I was not able to get the memory usage down to where I wanted it via those two configuration variables alone, so I am using this rather silly bandaid until I can circle back around and put the proper cycles into figuring out how to get this right without needing to restart the container.

@juliusv If you read my comments above and @brian-brazil's answers, you'll see that the high memory usage doesn't come only from the storage.local.memory-chunks setting. It actually comes from the number of active series. To put it another way: the number of active series determines the memory usage, not storage.local.memory-chunks.

I think we have to do some deeper analysis here to identify bottlenecks in the case of many active time series. The more dynamic the environment, the higher the time series churn, the more we encounter this issue.
Maybe we already did – back-of-the-envelope or by runtime investigation? @beorn7 @juliusv?

We all know I'm not a friend of storage tuning flags dictating memory usage (with limited accuracy), but at least it's static to a degree.
But memory usage highly fluctuating depending on number of time series in various states is inherently a runtime property. Unlike the chunk flags this is not just an undesirable step from an operational perspective but quite literally unmanageable. At least if we agree that restart cycles, between which we have a secondary process inspecting and reconfiguring from the outside, are not scalable with checkpoint/restore taking up to 15 minutes for sufficiently large servers.

Our in-memory held data should be the largest part of memory usage – even considering a moderate querying load.
There's certainly a per series management overhead and buffers used for ingestion. However, even knowing the code quite well, I believe we must be missing something if our actual sample data only accounts for about 25% of memory usage. With that number shrinking as the number of series grows.

If there are any ideas where this memory goes to, I'm sure we can optimize in the right places.

I did some research as reported in https://github.com/prometheus/prometheus/issues/455#issuecomment-248358605 above. I essentially tried to model the memory usage with two linear variables: chunks in memory and timeseries in memory. Without any luck at all. There is definitely more creating variable memory usage. Candidates include: number of targets, queries, service discovery, …

It would obviously be great to understand the memory consumption pattern in more detail. But we can pretty safely assume that they are complicated enough so that you cannot create a simple rule of how to set the flags (or how to auto-set them). My plans are therefore more into the direction to make chunk eviction depend on memory pressure so that you ultimately tell Prometheus how much RAM it may take, and then the server tries to dynamically balance memory chunks (and even persistence efforts) accordingly. Since these are the only levers we have, we don't really have to know _where_ the memory is used (outside of finding memory leaks or optimizing code) as long as we know _how much_ is used.

Sorry, I missed that comment.

My general point: If there's a place where we can fix algorithmic complexity or a significant constant factor of our memory usage, it is well worth knowing about it.

For example, a chunk is 1KB with in-mem chunks fully pre-allocated. Then there's obviously management around it – but why does that add up to 3KB?
Why does management of a single series need 10KB of memory, which is 10x as much as max data of its active chunk and in the K8S case, more than most series ever accumulate in total?

That's not to say that there aren't very good reasons for that. But it's a complex system by now and chances are we missed an opportunity for baseline improvement so far.

For example, a chunk is 1KB with in-mem chunks fully pre-allocated. Then there's obviously management around it – but why does that add up to 3KB?
Why does management of a single series need 10KB of memory, which is 10x as much as max data of its active chunk and in the K8S case, more than most series ever accumulate in total?

My research resulted in the conclusion that there is no linear relationship like the above. The memory certainly goes somewhere, but it doesn't make sense to say that every memory chunk we add adds 3kiB of RAM usage or each series adds 10kiB of RAM usage.

More results from ongoing investigation:
I run two servers with exactly the same config and load, with the only exception that the one is the active one serving dashboard queries, while the other one is a hot spare. Both have the same number of memory chunks and time series (i.e. the dashboard queries are not super-long-term, so they don't keep otherwise archived series in RAM or similar). The active server has around 55GiB RAM usage, the hot spare 35GiB. Query load can thus create a significant memory footprint, even if it is not resulting in more memory chunks or memory series.

We are also seeing extreme memory usage by Prometheus at Discourse, 55GB of RAM. This actually caused our https termination machines to have small blip-like interruptions in service due to extreme memory pressure. :(

What a kick in the pants, seeing Jeff Atwood's reply pop up in my inbox because of my involvement in this thread. I am a huge coding-horror fan.

Back on topic: I have been playing with trying to tune Prometheus, but nothing I do seems to stop the ever-creeping memory consumption. I am still restarting Prometheus' container once every 24 hours, and that is keeping memory use in check while still continuing to harvest/store/serve the data I'm collecting. It's not a great solution, but it's an OK workaround in this instance.

If there is anything I can do from my end to provide you with more data on this issue, please let me know.

I've implemented a number of improvements to Prometheus ingestion that'll be in 1.5.0 and, in the best case, will cut memory usage by 33%. That slow growth is likely chunkDescs, which depending on your setup could take weeks to stabilise; with my changes, things should now stabilise about 12 hours after your chunks fill.

https://www.robustperception.io/how-much-ram-does-my-prometheus-need-for-ingestion/ has more information.

We finally managed to run Prometheus stably in our production GKE cluster with the following settings:

10 core cpu request
14 core cpu limit
64gb memory request
80gb memory limit
500gb pd-ssd

-storage.local.retention=168h0m0s
-storage.local.memory-chunks=13107200 (approx 64gb * 1024 * 1024 / 5)

Earlier tests, with local memory chunks calculated by dividing by 3 or 4, constantly ran us into out-of-memory kills; since using 5, this no longer seems to happen. Memory usage slowly grows after start but flattens off somewhere around 60 GB.
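The divide-by-5 sizing above works out as follows (simple arithmetic mirroring the comment's calculation; note that exact integer division of 64 GiB in 1 KiB chunks gives 13421772, so the 13107200 used above appears to be rounded down to 12.5 Mi):

```go
package main

import "fmt"

func main() {
	// 64 GiB of RAM expressed in KiB (one chunk is 1 KiB), divided by
	// the empirical overhead factor of 5 described above.
	const ramKiB = 64 * 1024 * 1024
	const overheadFactor = 5
	fmt.Println(ramKiB / overheadFactor) // prints 13421772
}
```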

We do experience an occasional hang; for that we just added -storage.local.max-chunks-to-persist=8738132, but we still have to see whether this has a positive effect on stability.

In the GKE cluster itself we have about 40 4-core nodes running 750 pods, getting scraped each 15 seconds. Each node runs the node-exporter, so there's a wealth of information coming from Kubernetes itself and those node exporters.

Ingestion isn't a problem at all for Prometheus, the querying side is more tricky. We can see clearly in cpu pressure - and to a lesser extent memory usage - whether people have their Grafana dashboards opened during the day. It drops to a really low level when the office goes empty.

The quick math indicates that you needed ~3.9KB per chunk with 1.4 just to handle ingestion, so 5 with queries isn't surprising.

With our current settings it seems more like 4.6kb per chunk (ending up at 60GB used memory). But part of that memory usage could be for querying or disk cache, or any other resident memory usage by the application I guess?

Is it possible to see the separate types of memory usage in Prometheus' own metrics?

Branch https://github.com/prometheus/prometheus/tree/beorn7/storage currently contains the implementation of an experimental flag -storage.local.target-heap-size and obsoletes -storage.local.max-chunks-to-persist and -storage.local.memory-chunks. It's running with great success on a couple of SoundCloud servers right now, but I have to polish it a bit before submitting it for review.

That looks promising. One thing I noted during benchmarking was a ~10% overhead for memory usage above what purely the heap used, so reducing the number the user provides accordingly might be an idea.

We did see solid improvements (reduction in memory usage) after we deployed 1.5 on our infra thanks @brian-brazil -- keep the improvements coming! Let us know how we can help.

@beorn7 https://github.com/golang/go/issues/16843 and linked discussions may be of interest.

Thanks for the pointer. The linked proposal document aligns very well with my research (and it even mentions Prometheus explicitly as a use case :).

I noticed that :)

The design also looks fairly similar to what your proposal is. Having two very similar control systems running on top of each other may cause undesirable interactions.

It's merged! Will be released in 1.6.0.

--storage.local.target-heap-size seems to have been removed in 2.0, is there any strategy to limit memory usage nowadays?

See https://groups.google.com/forum/#!topic/prometheus-users/jU0Ghd_SyrQ

It makes more sense to ask questions like this on the prometheus-users mailing list rather than in a GitHub issue (in particular if it is a _closed_ GitHub issue). On the mailing list, more people are available to potentially respond to your question, and the whole community can benefit from the answers provided.
