Is it possible to evaluate the overhead induced by cAdvisor + storage drivers, in terms of memory for example, compared to a non-container solution?
I mean, are there any tests that cover that?
I'm also wondering if there is a way to run cAdvisor so that it is more lightweight (like other Prometheus containers).
The fact that it consumes quite a lot of CPU & RAM while not being probed or looked at is concerning, and maybe my setup is wrong...
@RRAlex I think nobody has tested the impact of cAdvisor on system overhead. We would have to capture resource usage metrics from cAdvisor and compare them against Linux monitoring tools.
We have done manual profiling of cAdvisor, along with some tuning, as part of scaling Kubernetes. However, for Kubernetes we were only interested in the container manager performance, since we don't run the full standalone version. We decided we could meet our performance goals by lowering the resolution of collected stats. Since most of the CPU time is spent scraping metrics, increasing the scraping interval from 1s to 10s roughly cuts the CPU usage by 90%.
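As a back-of-the-envelope check of that claim (my own arithmetic, not from cAdvisor's code), assuming scraping CPU cost scales linearly with scrape frequency:

```shell
# Scraping CPU is roughly proportional to scrape frequency,
# so the remaining cost is old_interval / new_interval.
old_ms=1000   # 1s housekeeping interval
new_ms=10000  # 10s housekeeping interval
echo "remaining scraping CPU: $((100 * old_ms / new_ms))%"  # prints 10%, i.e. ~90% saved
```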
Just providing a little data point: we're using cAdvisor in standalone mode across about 1000 instances here. On average, it's using 0.2% of CPU and 20MB of RAM. There are a few outliers of course, but we've never really had problems with cAdvisor performance.
As @timstclair mentioned, tuning down the collection interval is helpful here. In our case, we use the following settings:
"--housekeeping_interval=30s" \
"--global_housekeeping_interval=2m" \
"--disable_metrics=disk,tcp" \
"--enable_load_reader" \
"--load_reader_interval=5s"
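For reference, a standalone launch with those flags might look like the sketch below. The volume mounts and image name are the usual ones from cAdvisor's README, not something specific to our setup; adjust paths and the image tag for your environment:

```shell
# Run standalone cAdvisor with reduced housekeeping frequency
# and the expensive disk/tcp metrics disabled.
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest \
  --housekeeping_interval=30s \
  --global_housekeeping_interval=2m \
  --disable_metrics=disk,tcp \
  --enable_load_reader \
  --load_reader_interval=5s
```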
In our experience, the tcp and disk metrics can be very expensive (though that largely depends on what your containers are doing), but the rest (CPU, LA, Memory) is very cheap.
Perfect. From my side, I tested the memory overhead induced by plugging my containers into cAdvisor and InfluxDB. It was negligible.
I compared the following values:
These 2 values were roughly equal. Does that approach sound reasonable?
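For anyone wanting to do a similar comparison: cAdvisor's `/metrics` endpoint exposes `container_memory_usage_bytes` in the Prometheus text format, which can be compared against the kernel's own cgroup accounting. A minimal parsing sketch (the metric name is real cAdvisor output; the `id` label and value here are made up for illustration):

```shell
# One line of Prometheus-format output, as served by cAdvisor's /metrics endpoint.
# In practice you'd fetch this with: curl -s http://localhost:8080/metrics
sample='container_memory_usage_bytes{id="/docker/abc123"} 20971520'

# The value is the last whitespace-separated field of the line.
bytes=$(echo "$sample" | awk '{print $NF}')
echo "memory usage: $((bytes / 1024 / 1024)) MiB"  # prints 20 MiB
```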
Thanks all, I indeed started playing with the housekeeping settings _after_ my initial comment and realized I could tone down its requirements quite a lot...
Everything is much smoother now! :-)
edit: in reply to @mboussaa below:
--allow_dynamic_housekeeping=true --housekeeping_interval=10s
It might vary over time, but that worked for now.
I'm just starting to set up Prometheus, as I need to integrate cAdvisor et al. on prod instances. :)
@RRAlex can you share your new settings, so that I can use them in the future?
Leaving this open in case anyone has any interest in adding performance-related testing, or documenting ways to lower cAdvisor's resource consumption.