Cadvisor: kubernetes 1.3, systemd 229, memory working_set is radically different from `free`

Created on 4 Nov 2016 · 6 comments · Source: google/cadvisor

At least with kubernetes 1.3 and systemd 229, the memory usage reported by cadvisor is radically different from the usage indicated by `free`.

`free` reports 1778024 KiB used (-/+ buffers/cache), while heapster/cadvisor report a working set of 2282536960 bytes.
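Note that the two figures are in different units: `free` reports KiB while cadvisor reports bytes. A quick conversion using the numbers from this report shows the gap is real and not just a unit mismatch:

```sh
# free's "-/+ buffers/cache" used figure, in KiB
free_used_kib=1778024
# cadvisor's memory/working_set, in bytes
working_set_bytes=2282536960

echo $((free_used_kib * 1024))      # free's used figure in bytes (1820696576, ~1736 MiB)
echo $((working_set_bytes / 1024))  # cadvisor's working set in KiB (2229040, ~2176 MiB)
```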

The following data was extracted from the cluster via heapster and elasticsearch, together with the output of `free` captured at the same time:

```json
{
  "_index": "heapster-2016.11.04",
  "_type": "memory",
  "_id": "f3f9d8c2-a2ab-11e6-9941-2aa42b43b130",
  "_score": null,
  "_source": {
    "MemoryMetricsTimestamp": "2016-11-04T16:30:00Z",
    "Metrics": {
      "memory/limit": {
        "value": 1472200704
      },
      "memory/major_page_faults": {
        "value": 151
      },
      "memory/major_page_faults_rate": {
        "value": 0
      },
      "memory/node_allocatable": {
        "value": 3950506000
      },
      "memory/node_capacity": {
        "value": 3950506000
      },
      "memory/node_reservation": {
        "value": 0.31692135
      },
      "memory/node_utilization": {
        "value": 0.81257236
      },
      "memory/page_faults": {
        "value": 37131328
      },
      "memory/page_faults_rate": {
        "value": 10.287301
      },
      "memory/request": {
        "value": 1251999744
      },
      "memory/usage": {
        "value": 3210072064
      },
      "memory/working_set": {
        "value": 2282536960
      }
    },
    "MetricsTags": {
      "host_id": "i-79c274f2",
      "hostname": "ip-10-0-0-67.eu-west-1.compute.internal",
      "labels": "beta.kubernetes.io/arch:amd64,beta.kubernetes.io/instance-type:m3.medium,beta.kubernetes.io/os:linux,failure-domain.beta.kubernetes.io/region:eu-west-1,failure-domain.beta.kubernetes.io/zone:eu-west-1a,kubernetes.io/hostname:ip-10-0-0-67.eu-west-1.compute.internal",
      "nodename": "ip-10-0-0-67.eu-west-1.compute.internal",
      "type": "node"
    }
  },
  "fields": {
    "MemoryMetricsTimestamp": [
      1478277000000
    ]
  },
  "highlight": {
    "MetricsTags.hostname.raw": [
      "@kibana-highlighted-field@ip-10-0-0-67.eu-west-1.compute.internal@/kibana-highlighted-field@"
    ],
    "MetricsTags.type": [
      "@kibana-highlighted-field@node@/kibana-highlighted-field@"
    ],
    "MetricsTags.nodename.raw": [
      "@kibana-highlighted-field@ip-10-0-0-67.eu-west-1.compute.internal@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1478277000000
  ]
}
```
```
ip-10-0-0-67 core # free
             total       used       free     shared    buffers     cached
Mem:       3857916    3751744     106172       1460     237904    1735816
-/+ buffers/cache:    1778024    2079892
Swap:            0          0          0
```


All 6 comments

1460 KiB + 237904 KiB + 1735816 KiB = 1975180 KiB = 2022584320 bytes
2282536960 - 2022584320 = 259952640

So it's off by about 250 MiB. Could you post the output of the summary API on that node? (localhost:10255/stats/summary)

And the output of free at the same time.

The following is taken from kubernetes/kubernetes.github.io#2892, and seems relevant here.

The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out of resource decisions are made local to the end user pod part of the cgroup hierarchy as well as the root node. The following script simulates the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (i.e. # of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that memory is reclaimable under pressure.

```sh
#!/bin/bash

# This script reproduces what the kubelet does
# to calculate memory.available relative to the root cgroup.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=$memory_usage_in_bytes
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ]; then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"
```

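The memory.available figure derived this way is what the kubelet compares against its eviction thresholds. A minimal sketch of that comparison, with a hypothetical threshold and a hypothetical memory.available value (neither is from this report):

```sh
# Hypothetical hard eviction threshold, e.g. --eviction-hard=memory.available<100Mi
threshold_bytes=$((100 * 1024 * 1024))

# Hypothetical memory.available value, as the script above would compute it
memory_available_in_bytes=1667969040

if [ "$memory_available_in_bytes" -lt "$threshold_bytes" ]; then
    echo "below threshold: eviction signal would fire"
else
    echo "above threshold: no eviction"
fi
```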
Is the decision to include active page cache in working set memory deliberate and arguably correct, or do cAdvisor devs believe it should be excluded? Looking at kubernetes release 1.5.3, it appears that cAdvisor itself is what's subtracting total inactive file memory from memory.usage_in_bytes when calculating working set (cadvisor/container/libcontainer/helpers.go func setMemoryStats).

This is interesting to me because of how it impacts kubelet's eviction settings. It's pretty difficult to choose a memory eviction threshold that won't erroneously trip since predicting the amount of active page cache at any time, in general, is not really possible. Since the kernel can reclaim the page cache (use this for context around "erroneously" above), it seems like it shouldn't count against available memory.

This might be something that should be pushed back up into kubelet, but I wonder if the general case of including the active page cache, which it appears to be cAdvisor is responsible for, is generally unexpected?

(I ended up opening this against kubelet/kubernetes https://github.com/kubernetes/kubernetes/issues/43916)

predicting the amount of active page cache at any time, in general, is not really possible.

I contend that in situations where the file cache is heavily used (for example, by a database), it's quite easy to predict the size of the active file list: it's simply all the RAM not occupied by anonymous pages, kernel slab, and a few other minor things, multiplied by a fraction that starts at 0.5 and approaches 1 as total RAM increases.

In other words, for an IO workload where cache performance is critical, all the "extra" RAM (that is, RAM which could be reclaimed without significant detriment to performance) ends up in the active file LRU list. This is the opposite of what's likely desirable for some definition of a "working set".
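A rough illustration of that heuristic, where every number except this node's MemTotal is hypothetical (including the fraction, which per the comment above would sit between 0.5 and 1 depending on total RAM):

```sh
total_kib=3857916    # this node's MemTotal, from the free output above
anon_kib=1500000     # hypothetical anonymous pages
slab_kib=100000      # hypothetical kernel slab
fraction_pct=77      # hypothetical fraction (starts near 50, approaches 100 as RAM grows)

# RAM not pinned by anonymous pages or slab ends up available to the page cache
spare_kib=$((total_kib - anon_kib - slab_kib))
est_active_file_kib=$((spare_kib * fraction_pct / 100))
echo "predicted active file list: ${est_active_file_kib} KiB"
```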

More detail at https://github.com/kubernetes/kubernetes/issues/43916#issuecomment-393228487.
