Our cAdvisor reports different containers each time we query the /metrics route. The problem is consistent across various environments and VMs. I initially found #1635 and thought it was the same issue, but the linked #1572 explains that cAdvisor seems to pick up two systemd slices for the container, which is not the case according to my logs. So I'm filing a separate issue, just to be sure.
17:50 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
98
17:51 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
18
17:51 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
98
17:51 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
cAdvisor on :8701 is started as follows: $ sudo /opt/cadvisor/bin/cadvisor -port 8701 -logtostderr -v=10
Neither dockerd nor cadvisor print any logs during those requests.
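The shell loop above can be reproduced without curl/fgrep. Here is a minimal Python sketch; the sample payload and counts below are illustrative, not taken from this host:

```python
# Minimal reproduction helper for the flapping series counts seen above.
# count_series() parses a Prometheus text-format payload and counts the
# sample lines of one metric family; in practice the payload would come
# from e.g. urllib.request.urlopen("http://<host>:8701/metrics").
METRIC = "container_cpu_usage_seconds_total"

def count_series(payload: str, metric: str) -> int:
    # "# HELP" / "# TYPE" comment lines start with '#', never with the
    # family name, so a simple prefix check counts only the sample lines.
    return sum(1 for line in payload.splitlines() if line.startswith(metric))

# Illustrative payload; on an affected host, repeated scrapes return
# wildly different counts (98, 18, 98, ... in the session above).
sample = """\
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{id="/"} 3678820.6
container_cpu_usage_seconds_total{id="/docker/f7ba91df74c8"} 222.96
"""
print(count_series(sample, METRIC))  # prints 2
```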
I0725 17:02:09.462596 109834 storagedriver.go:50] Caching stats in memory for 2m0s
I0725 17:02:09.462727 109834 manager.go:143] cAdvisor running in container: "/"
W0725 17:02:09.496040 109834 manager.go:151] unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
I0725 17:02:09.531430 109834 fs.go:117] Filesystem partitions: map[/dev/dm-0:{mountpoint:/ major:254 minor:0 fsType:ext4 blockSize:0} /dev/mapper/rs--pre--docker--012--vg-var:{mountpoint:/var/lib/docker/aufs major:254 minor:2 fsType:ext4 blockSize:0} /dev/mapper/rs--pre--docker--012--vg-varlog:{mountpoint:/var/log major:254 minor:3 fsType:ext4 blockSize:0}]
I0725 17:02:09.534803 109834 manager.go:198] Machine: {NumCores:8 CpuFrequency:2397223 MemoryCapacity:38034182144 MachineID:c63b565c3eea4c1bab8cc5d972595a51 SystemUUID:423B1F3E-804D-219F-8D0B-EECB74C81279 BootID:9b2c8857-539f-4adf-b2b5-c8e2672968b8 Filesystems:[{Device:/dev/mapper/rs--pre--docker--012--vg-var DeviceMajor:254 DeviceMinor:2 Capacity:40179982336 Type:vfs Inodes:2501856 HasInodes:true} {Device:/dev/mapper/rs--pre--docker--012--vg-varlog DeviceMajor:254 DeviceMinor:3 Capacity:20020748288 Type:vfs Inodes:1250928 HasInodes:true} {Device:/dev/dm-0 DeviceMajor:254 DeviceMinor:0 Capacity:12366823424 Type:vfs Inodes:775200 HasInodes:true}] DiskMap:map[254:1:{Name:dm-1 Major:254 Minor:1 Size:1023410176 Scheduler:none} 254:2:{Name:dm-2 Major:254 Minor:2 Size:40957378560 Scheduler:none} 254:3:{Name:dm-3 Major:254 Minor:3 Size:20476592128 Scheduler:none} 8:0:{Name:sda Major:8 Minor:0 Size:75161927680 Scheduler:cfq} 254:0:{Name:dm-0 Major:254 Minor:0 Size:12700352512 Scheduler:none}] NetworkDevices:[{Name:eth0 MacAddress:00:50:56:bb:37:43 Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:38034182144 Cores:[{Id:0 Threads:[0] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:2 Memory:0 Cores:[{Id:0 Threads:[1] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:4 Memory:0 Cores:[{Id:0 Threads:[2] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:6 Memory:0 Cores:[{Id:0 Threads:[3] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:8 Memory:0 Cores:[{Id:0 Threads:[4] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 
Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:10 Memory:0 Cores:[{Id:0 Threads:[5] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:12 Memory:0 Cores:[{Id:0 Threads:[6] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:14 Memory:0 Cores:[{Id:0 Threads:[7] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}
I0725 17:02:09.535661 109834 manager.go:204] Version: {KernelVersion:3.16.0-4-amd64 ContainerOsVersion:Debian GNU/Linux 8 (jessie) DockerVersion:1.13.1 DockerAPIVersion:1.26 CadvisorVersion:v0.26.1 CadvisorRevision:d19cc94}
I0725 17:02:09.577920 109834 factory.go:351] Registering Docker factory
W0725 17:02:09.577951 109834 manager.go:247] Registration of the rkt container factory failed: unable to communicate with Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
I0725 17:02:09.577957 109834 factory.go:54] Registering systemd factory
I0725 17:02:09.578235 109834 factory.go:86] Registering Raw factory
I0725 17:02:09.578542 109834 manager.go:1121] Started watching for new ooms in manager
I0725 17:02:09.579461 109834 oomparser.go:185] oomparser using systemd
I0725 17:02:09.579565 109834 factory.go:116] Factory "docker" was unable to handle container "/"
I0725 17:02:09.579582 109834 factory.go:105] Error trying to work out if we can handle /: / not handled by systemd handler
I0725 17:02:09.579586 109834 factory.go:116] Factory "systemd" was unable to handle container "/"
I0725 17:02:09.579592 109834 factory.go:112] Using factory "raw" for container "/"
I0725 17:02:09.579959 109834 manager.go:913] Added container: "/" (aliases: [], namespace: "")
I0725 17:02:09.580102 109834 handler.go:325] Added event &{/ 2017-07-22 16:40:48.746304841 +0200 CEST containerCreation {<nil>}}
I0725 17:02:09.580139 109834 manager.go:288] Starting recovery of all containers
I0725 17:02:09.580237 109834 container.go:407] Start housekeeping for container "/"
Example: I am missing the metrics for f7ba91df74c8. cAdvisor mentions the container ID only once in its logs:
I0725 17:02:09.693203 109834 factory.go:112] Using factory "docker" for container "/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855"
I0725 17:02:09.695423 109834 manager.go:913] Added container: "/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855" (aliases: [containernameredacted f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855], namespace: "docker")
I0725 17:02:09.695640 109834 handler.go:325] Added event &{/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855 2017-07-25 16:20:00.930924661 +0200 CEST containerCreation {<nil>}}
I0725 17:02:09.695779 109834 container.go:407] Start housekeeping for container "/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855"
cadvisor_version_info{cadvisorRevision="d19cc94",cadvisorVersion="v0.26.1",dockerVersion="1.13.1",kernelVersion="3.16.0-4-amd64",osVersion="Debian GNU/Linux 8 (jessie)"} 1
We are running an old docker swarm setup with consul, consul-template and nginx per host. No Kubernetes.
We're observing the same behavior in the Kubernetes 1.7.0 kubelet (port 4194) and in the docker image for v0.26.1.
Versions:
docker: 1.12.6
Kubelet: v1.7.0+coreos.0
OS: CoreOS Linux 1409.7.0
Kernel Version: 4.11.11-coreos
I ran cAdvisor on Kubernetes using the following DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: default
  labels:
    app: "cadvisor"
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: "cadvisor"
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '4194'
        prometheus.io/path: '/metrics'
    spec:
      containers:
      - name: "cadvisor"
        image: "google/cadvisor:v0.26.1"
        args:
        - "-port=4194"
        - "-logtostderr"
        livenessProbe:
          httpGet:
            path: /api
            port: 4194
        volumeMounts:
        - name: root
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: var-lib-docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
        resources:
          limits:
            cpu: 500.0m
            memory: 256Mi
          requests:
            cpu: 250.0m
            memory: 128Mi
      restartPolicy: Always
      volumes:
      - name: "root"
        hostPath:
          path: /
      - name: "var-run"
        hostPath:
          path: /var/run
      - name: "sys"
        hostPath:
          path: /sys
      - name: "var-lib-docker"
        hostPath:
          path: /var/lib/docker
      - name: "docker-socket"
        hostPath:
          path: /var/run/docker.sock
And this is what it looked like in Prometheus:

Running the binary without root permissions fixes the problem, but then container labels are missing. Using the -docker-only flag or accessing docker via tcp/ip does not change the initial behavior.
@zeisss @micahhausler are you both running Prometheus 2.0? In 1.x versions there is no staleness handling to catch the flapping metrics, so the bug should have no immediately visible effect there.
In general it's definitely wrong behavior by cAdvisor though that violates the /metrics contract.
This seems to be a recent regression. @derekwaynecarr @timothysc any idea what could have caused this?
@fabxc I'm using Prometheus 1.5.2 and cAdvisor on the host machine, and I also have this problem.
As @zeisss said, if I run cAdvisor without root permissions, this fixes the problem, except that container labels are missing.
Worst of all, with this bug Prometheus sometimes loses some container metrics entirely... In Grafana my graph of running containers looks like this:

And I get alerts from Alertmanager that containers are down, while in fact all containers are working the whole time.
We currently have a workaround: running cAdvisor as a dedicated non-root user. This is OK for us, as having the CPU and memory graphs is already a win. But AFAICT this mode is missing the docker container labels as well as the network and disk I/O metrics.
@fabxc No, we are still running a 1.x Prometheus version - but having Prometheus work around this bug in cAdvisor is not a good solution IMO.
We are currently in the process of updating our DEV cluster to Docker 17.06-ce, where we still see this behavior when cAdvisor runs as root (/opt/cadvisor/bin/cadvisor -port 8701 -logtostderr):
$ while true; do curl -sS docker-host:8701/metrics | fgrep container_cpu_system_seconds_total | wc -l; sleep 1; done
28
28
9
9
5
6
^C
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="057293a",cadvisorVersion="v0.26.0.20+057293a1796d6a-dirty",dockerVersion="17.06.0-ce",kernelVersion="3.16.0-4-amd64",osVersion="Debian GNU/Linux 8 (jessie)"} 1
I have the same issue with Kubernetes 1.7.2 & 1.7.3.
I have the exact same problem as @DexterHD; it drives me crazy, my container-down alert spams me with false alerts all the time.

I just started to explore cAdvisor and seem to have the same issue using InfluxDB:

Having the same issue with docker 17.06, prometheus and docker swarm.
Running v0.24.1 solved it for me
cc @grobie
/cc
Same thing: 0.26 and 0.26.1 are unusable with Prometheus (in our case 1.7.x).
They expose a random number of metrics - a different number of series on the /metrics path at any given moment. Had to go back to the good old 0.25.
Docker 17.03/17.06.
@Hermain @roman-vynar According to release notes 0.26 "Bug: Fix prometheus metrics."
So when reverting to 0.25, one misses out on whatever they fixed (while at the same time they broke something and introduced the gaps)? I can't find the Prometheus-related commit that went into v0.26, so I can't see what was "fixed".
Do we have an ETA on fixing this? No devs in this issue? And no assignee?
According to https://github.com/google/cadvisor/issues/1690#issuecomment-313597011 the fix in 0.26.1 isn't working / incomplete, maybe this is the same problem?
Does this problem happen on a cAdvisor built from master, which includes #1679?
~If someone can point me to how to build hyperkube with a custom cAdvisor commit, I'd like to run some tests.~ I think I found out how to do this.
Thanks.
I hit the same problem using cAdvisor 0.26.1 and Prometheus 1.7.1, but it's OK when I downgrade cAdvisor to v0.25.0, and it's also OK with cAdvisor 0.26.1 and Prometheus 1.5.3. I'm a little confused; it seems to be a compatibility issue.
Seeing the same high-level symptoms: for me it's the labels that are missing, not the containers. And when the labels are missing I get a lot more lines for other cgroups.
I'm running Kubernetes 1.7.3 on Ubuntu (Linux ip-172-20-3-76 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux).
Two examples from the same kubelet on the same machine, a few seconds apart:
Example 1:
# curl -s 127.0.0.1:10255/metrics/cadvisor | grep container_cpu_user_seconds_total
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 3.6788206e+06
container_cpu_user_seconds_total{id="/init.scope"} 69.43
container_cpu_user_seconds_total{id="/kubepods"} 3.49797001e+06
container_cpu_user_seconds_total{id="/kubepods/besteffort"} 162742.99
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod13eacef1-8342-11e7-9534-0a97ed59c75e"} 69.47
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod5f43c843-7db5-11e7-9534-0a97ed59c75e"} 703.82
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod6b2e45d7-7db5-11e7-9534-0a97ed59c75e"} 70.04
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e"} 363.18
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod965b711b-8262-11e7-9534-0a97ed59c75e"} 5.9
container_cpu_user_seconds_total{id="/kubepods/besteffort/podd2b82b9c-8355-11e7-9534-0a97ed59c75e"} 35733.13
container_cpu_user_seconds_total{id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e"} 150.78
container_cpu_user_seconds_total{id="/kubepods/burstable"} 3.33525364e+06
container_cpu_user_seconds_total{id="/kubepods/burstable/pod53559243-7db5-11e7-9534-0a97ed59c75e"} 276743.3
container_cpu_user_seconds_total{id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e"} 105958.75
container_cpu_user_seconds_total{id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee"} 366.77
container_cpu_user_seconds_total{id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee/8d2eb34023eab40d08ba6e4be149e315c3844749f8321f44be2dcda024534757/\"\""} 366.65
container_cpu_user_seconds_total{id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e"} 434974.97
container_cpu_user_seconds_total{id="/kubepods/burstable/podcb5d3cc0-8364-11e7-9534-0a97ed59c75e"} 891563
container_cpu_user_seconds_total{id="/kubepods/burstable/podcf18531c-8365-11e7-9534-0a97ed59c75e"} 17225.18
container_cpu_user_seconds_total{id="/system.slice"} 151482.27
container_cpu_user_seconds_total{id="/system.slice/acpid.service"} 0
container_cpu_user_seconds_total{id="/system.slice/apparmor.service"} 0
container_cpu_user_seconds_total{id="/system.slice/apport.service"} 0
container_cpu_user_seconds_total{id="/system.slice/atd.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cgroupfs-mount.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cloud-config.service"} 0.32
container_cpu_user_seconds_total{id="/system.slice/cloud-final.service"} 0.37
container_cpu_user_seconds_total{id="/system.slice/cloud-init-local.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cloud-init.service"} 0.63
container_cpu_user_seconds_total{id="/system.slice/console-setup.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cron.service"} 25.49
container_cpu_user_seconds_total{id="/system.slice/dbus.service"} 14.82
container_cpu_user_seconds_total{id="/system.slice/docker.service"} 94117.92
container_cpu_user_seconds_total{id="/system.slice/ebtables.service"} 0
container_cpu_user_seconds_total{id="/system.slice/grub-common.service"} 0
container_cpu_user_seconds_total{id="/system.slice/[email protected]"} 0
container_cpu_user_seconds_total{id="/system.slice/[email protected]"} 0.79
container_cpu_user_seconds_total{id="/system.slice/irqbalance.service"} 40.56
container_cpu_user_seconds_total{id="/system.slice/iscsid.service"} 1.69
container_cpu_user_seconds_total{id="/system.slice/keyboard-setup.service"} 0
container_cpu_user_seconds_total{id="/system.slice/kmod-static-nodes.service"} 0
container_cpu_user_seconds_total{id="/system.slice/kubelet.service"} 21323.06
container_cpu_user_seconds_total{id="/system.slice/lvm2-lvmetad.service"} 8.94
container_cpu_user_seconds_total{id="/system.slice/lvm2-monitor.service"} 0
container_cpu_user_seconds_total{id="/system.slice/lxcfs.service"} 0.37
container_cpu_user_seconds_total{id="/system.slice/lxd-containers.service"} 0
container_cpu_user_seconds_total{id="/system.slice/mdadm.service"} 0.02
container_cpu_user_seconds_total{id="/system.slice/networking.service"} 0
container_cpu_user_seconds_total{id="/system.slice/ondemand.service"} 0
container_cpu_user_seconds_total{id="/system.slice/open-iscsi.service"} 0
container_cpu_user_seconds_total{id="/system.slice/polkitd.service"} 3.63
container_cpu_user_seconds_total{id="/system.slice/rc-local.service"} 0
container_cpu_user_seconds_total{id="/system.slice/resolvconf.service"} 0
container_cpu_user_seconds_total{id="/system.slice/rsyslog.service"} 100.82
container_cpu_user_seconds_total{id="/system.slice/setvtrgb.service"} 0
container_cpu_user_seconds_total{id="/system.slice/snapd.firstboot.service"} 0
container_cpu_user_seconds_total{id="/system.slice/snapd.service"} 0.04
container_cpu_user_seconds_total{id="/system.slice/ssh.service"} 51.39
container_cpu_user_seconds_total{id="/system.slice/system-getty.slice"} 0
container_cpu_user_seconds_total{id="/system.slice/system-serial\\x2dgetty.slice"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-journal-flush.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-journald.service"} 489.31
container_cpu_user_seconds_total{id="/system.slice/systemd-logind.service"} 3.02
container_cpu_user_seconds_total{id="/system.slice/systemd-modules-load.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-random-seed.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-remount-fs.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-sysctl.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-timesyncd.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-tmpfiles-setup-dev.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-tmpfiles-setup.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-udev-trigger.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-udevd.service"} 0.52
container_cpu_user_seconds_total{id="/system.slice/systemd-update-utmp.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-user-sessions.service"} 0
container_cpu_user_seconds_total{id="/system.slice/ufw.service"} 0
container_cpu_user_seconds_total{id="/user.slice"} 29270.98
Example 2:
# curl -s 127.0.0.1:10255/metrics/cadvisor | grep container_cpu_user_seconds_total
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod5f43c843-7db5-11e7-9534-0a97ed59c75e/e49ec1309ec25475a7edd8c4dd6d7003fef3f7debd053b234716649d920ac15f",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_prom-node-exporter-w4nvq_monitoring_5f43c843-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="prom-node-exporter-w4nvq"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod6b2e45d7-7db5-11e7-9534-0a97ed59c75e/2bf50e4b99aaf24eb05a61b9808d9e60d4fd78ba47ac7669ce29bb3f8c862501",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_reboot-required-rn9h4_monitoring_6b2e45d7-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="reboot-required-rn9h4"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e/f3a1a656eabae83bb3a50206d7278b154fe1ddf2521e6a0bfd31667642867968",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_memcached-296817331-t3q5v_kube-system_94ad7fd4-8351-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="memcached-296817331-t3q5v"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod965b711b-8262-11e7-9534-0a97ed59c75e/c6e3b1012a1e607e4d164233f96a4c2ef83f377fc9dfb82e0dab7fc218e4e72a",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_kured-wp23j_kube-system_965b711b-8262-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="kured-wp23j"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/podd2b82b9c-8355-11e7-9534-0a97ed59c75e/e62bf79dd1981e285df9138a057b481357a5be6e464b43235e1335ac33bcf00b",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluxd-3608285890-x4bz7_kube-system_d2b82b9c-8355-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="fluxd-3608285890-x4bz7"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e/e618f7cb1f3ec97f463ed9f97143890b80c730f53075a127d9f59714aab35163",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_nats-651776541-6vrk3_scope_e4c7eace-8352-11e7-9534-0a97ed59c75e_0",namespace="scope",pod_name="nats-651776541-6vrk3"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pod53559243-7db5-11e7-9534-0a97ed59c75e/44f9f0113185f75f827eca36a42a7d2f91e166594c63eb2efecc7155eda03a70",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_scope-probe-master-3cktj_kube-system_53559243-7db5-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="scope-probe-master-3cktj"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e/9f9696a06e93a617a4e606731a474966c139681eef1a66344f0d06c965c68e47",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_authfe-1607895901-bjnd9_default_55af46fe-834c-11e7-9534-0a97ed59c75e_0",namespace="default",pod_name="authfe-1607895901-bjnd9"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee/7c3dc6bb8bb540224ca1f6d121d5fe2c5df0606ce5d45e7a0c802c29765c6625",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_kube-proxy-ip-172-20-3-76.ec2.internal_kube-system_7964f3e653196edee64f6bad72589dee_1",namespace="kube-system",pod_name="kube-proxy-ip-172-20-3-76.ec2.internal"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e/3f5329bc7772496d70821ea9c9bc80045af6c29299c42d74e2d27baf8c3cc72a",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_prometheus-2177618048-kgczb_monitoring_c7af9dff-8364-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="prometheus-2177618048-kgczb"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/podcb5d3cc0-8364-11e7-9534-0a97ed59c75e/decb876fb0dad43964deed741609ee45d3cf9049ae9f3ac934aefd596695302c",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluxsvc-438909710-2jtz8_fluxy_cb5d3cc0-8364-11e7-9534-0a97ed59c75e_0",namespace="fluxy",pod_name="fluxsvc-438909710-2jtz8"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/podcf18531c-8365-11e7-9534-0a97ed59c75e/133181c676d51606d4fa3d7d5c7e7455535636d30c5629526f0ba0cac5fcb522",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluentd-loggly-z9jp4_monitoring_cf18531c-8365-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="fluentd-loggly-z9jp4"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pode84e93da-865d-11e7-940d-12467a080e24/dabbd2c12e2d2666dd818b0c44be54760a701bdaf850ee4804b32efd36c42754",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_collection-3392593966-7nxpw_scope_e84e93da-865d-11e7-940d-12467a080e24_0",namespace="scope",pod_name="collection-3392593966-7nxpw"} 0
container_cpu_user_seconds_total{container_name="authfe",id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e/c796e0b2c3afc41e1ed6750c9dc9f5550e19efe25f0aa717fe4f9b2578c16c67",image="quay.io/weaveworks/authfe@sha256:c82cb113d15e20f65690aa3ca7f3374ae7ed2257dee2bc131bd61b1ac2bf180a",name="k8s_authfe_authfe-1607895901-bjnd9_default_55af46fe-834c-11e7-9534-0a97ed59c75e_0",namespace="default",pod_name="authfe-1607895901-bjnd9"} 64953.6
container_cpu_user_seconds_total{container_name="billing-ingester",id="/kubepods/burstable/pode84e93da-865d-11e7-940d-12467a080e24/50c82895bd84971bc6b8b9f5873512710ab06f754a0e0d3261bc20a2fddd4533",image="quay.io/weaveworks/billing-ingester@sha256:5fd857a96cac13e9f96678e63a07633af45de0e83a34e8ef28f627cf0589a042",name="k8s_billing-ingester_collection-3392593966-7nxpw_scope_e84e93da-865d-11e7-940d-12467a080e24_0",namespace="scope",pod_name="collection-3392593966-7nxpw"} 187.34
container_cpu_user_seconds_total{container_name="collection",id="/kubepods/burstable/pode84e93da-865d-11e7-940d-12467a080e24/f33f520aa2ed2c6f2277064fed34c4797ddf76a0a0bef25309348517cb1c4030",image="quay.io/weaveworks/scope@sha256:45be0490dba82f68a20faba8994cde307e9ace863a310196ba91401122bda4f8",name="k8s_collection_collection-3392593966-7nxpw_scope_e84e93da-865d-11e7-940d-12467a080e24_0",namespace="scope",pod_name="collection-3392593966-7nxpw"} 5411.72
container_cpu_user_seconds_total{container_name="exporter",id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e/cabd4c16d300232a8b823bd5a9553816ff7f0830c6d91634651b4f723035664f",image="prom/memcached-exporter@sha256:b814aa209e2d5969be2ab4c65b5eda547ba657fd81ba47f48b980d20b14befb7",name="k8s_exporter_memcached-296817331-t3q5v_kube-system_94ad7fd4-8351-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="memcached-296817331-t3q5v"} 142.5
container_cpu_user_seconds_total{container_name="exporter",id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e/81fd164c5cc91a483b73ada15ce13f19d3171fc6beddc940fc2b6e747141905d",image="tomwilkie/nats_exporter@sha256:189354d9c966f94d9685009250dc360582baf02f76ecbaa2233e15cff2bc8f7f",name="k8s_exporter_nats-651776541-6vrk3_scope_e4c7eace-8352-11e7-9534-0a97ed59c75e_0",namespace="scope",pod_name="nats-651776541-6vrk3"} 107.62
container_cpu_user_seconds_total{container_name="fluentd-loggly",id="/kubepods/burstable/podcf18531c-8365-11e7-9534-0a97ed59c75e/6fe6a67e02419f47a21854b73734042c0d457d42704be4302356180e4f357935",image="quay.io/weaveworks/fluentd-loggly@sha256:19a02a2f8627573572cc2ee3c706aa4ccdab0f59c3a04e577d28035681d30ddc",name="k8s_fluentd-loggly_fluentd-loggly-z9jp4_monitoring_cf18531c-8365-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="fluentd-loggly-z9jp4"} 17386.12
container_cpu_user_seconds_total{container_name="flux",id="/kubepods/besteffort/podd2b82b9c-8355-11e7-9534-0a97ed59c75e/d4ef6d20b97c7f0fefc9d13c0f4b94290eb661035bf21b7f07f38acdd18cb85d",image="quay.io/weaveworks/flux@sha256:e462c0a7c316f5986b3808360dc7c8c269466033c75a1b9553aa8175e02646f7",name="k8s_flux_fluxd-3608285890-x4bz7_kube-system_d2b82b9c-8355-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="fluxd-3608285890-x4bz7"} 36097.96
container_cpu_user_seconds_total{container_name="fluxsvc",id="/kubepods/burstable/podcb5d3cc0-8364-11e7-9534-0a97ed59c75e/aa00624319b1a96a18e0a4717f13e7456e558fea8b84e2694dc8d2b168a44d3d",image="quay.io/weaveworks/fluxsvc@sha256:8d91991f2f6894def54afda4b4afb858b0502ed841a7188db48210b94bfdae4a",name="k8s_fluxsvc_fluxsvc-438909710-2jtz8_fluxy_cb5d3cc0-8364-11e7-9534-0a97ed59c75e_0",namespace="fluxy",pod_name="fluxsvc-438909710-2jtz8"} 897247.03
container_cpu_user_seconds_total{container_name="kube-proxy",id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee/8d2eb34023eab40d08ba6e4be149e315c3844749f8321f44be2dcda024534757",image="gcr.io/google_containers/kube-proxy-amd64@sha256:dba7121df9f74b40901fb655053af369f58c82c3636d8125986ce474a759be80",name="k8s_kube-proxy_kube-proxy-ip-172-20-3-76.ec2.internal_kube-system_7964f3e653196edee64f6bad72589dee_1",namespace="kube-system",pod_name="kube-proxy-ip-172-20-3-76.ec2.internal"} 368.98
container_cpu_user_seconds_total{container_name="kured",id="/kubepods/besteffort/pod965b711b-8262-11e7-9534-0a97ed59c75e/12b3c19d2f114a6a111fdc0375bb0c27fb9e108c166e6f674aeddcd5178faa0b",image="weaveworks/kured@sha256:305b073cd3fff9ba0f21a570ee8a9c018d30274fc35045134164c762f44828e0",name="k8s_kured_kured-wp23j_kube-system_965b711b-8262-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="kured-wp23j"} 5.91
container_cpu_user_seconds_total{container_name="logging",id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e/8d7e46f3d99d2f13b04b7e07a4f1062e82450f02f8f7f03c8fb33a83f0248857",image="quay.io/weaveworks/logging@sha256:63c4e6783884e6fcdd24026606756748e5913ab4978efa61ed09034074ddbe27",name="k8s_logging_authfe-1607895901-bjnd9_default_55af46fe-834c-11e7-9534-0a97ed59c75e_0",namespace="default",pod_name="authfe-1607895901-bjnd9"} 41780.76
container_cpu_user_seconds_total{container_name="memcached",id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e/e5d81ddecc6a587e55491e837db3ed46f274e3b02c764f4d6d1ca2e6228fbe0c",image="memcached@sha256:00b68b00139155817a8b1d69d74865563def06b3af1e6fc79ac541a1b2f6b961",name="k8s_memcached_memcached-296817331-t3q5v_kube-system_94ad7fd4-8351-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="memcached-296817331-t3q5v"} 222.96
container_cpu_user_seconds_total{container_name="nats",id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e/511ce33319ecc50b928e3dda7025d643c310a5573d89596f89798496d9868342",image="nats@sha256:2dfb204c4d8ca4391dbe25028099535745b3a73d0cf443ca20a7e2504ba93b26",name="k8s_nats_nats-651776541-6vrk3_scope_e4c7eace-8352-11e7-9534-0a97ed59c75e_0",namespace="scope",pod_name="nats-651776541-6vrk3"} 44.25
container_cpu_user_seconds_total{container_name="prom-node-exporter",id="/kubepods/besteffort/pod5f43c843-7db5-11e7-9534-0a97ed59c75e/1ceb1514b5339c67c70ec37d609d361d5ba656ee3697a12de0918f9902d0a134",image="weaveworks/node_exporter@sha256:4f0c14e89da784857570185c4b9f57acb20f4331ef10e013731ac9274243a5a8",name="k8s_prom-node-exporter_prom-node-exporter-w4nvq_monitoring_5f43c843-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="prom-node-exporter-w4nvq"} 707.54
container_cpu_user_seconds_total{container_name="prom-run",id="/kubepods/besteffort/pod6b2e45d7-7db5-11e7-9534-0a97ed59c75e/75468eaf52cf3577dbb462d586fc5aa49a3f5a151fb668a734f8e99f825c1fc5",image="quay.io/weaveworks/docker-ansible@sha256:452d1249e40650249beb700349c7deee26c15da2621e8590f3d56033babb890b",name="k8s_prom-run_reboot-required-rn9h4_monitoring_6b2e45d7-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="reboot-required-rn9h4"} 70.57
container_cpu_user_seconds_total{container_name="prometheus",id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e/e4e3b4f6285c9a12415f347aadbf150c6d782e6b881d2701d4257bf3a4de2651",image="prom/prometheus@sha256:4bf7ad89d607dd8de2f0cff1df554269bff19fe0f18ee482660f7a5dc685d549",name="k8s_prometheus_prometheus-2177618048-kgczb_monitoring_c7af9dff-8364-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="prometheus-2177618048-kgczb"} 438158.08
container_cpu_user_seconds_total{container_name="scope-probe",id="/kubepods/burstable/pod53559243-7db5-11e7-9534-0a97ed59c75e/e57413febbcc1c28321ccb99df3bf30b9d6555a1db62b743d1b4ee877f23346b",image="quay.io/weaveworks/scope@sha256:bc6ee4a4a568f8075573a8ac44c27759307fce355c22ad66acb1e944b6361b62",name="k8s_scope-probe_scope-probe-master-3cktj_kube-system_53559243-7db5-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="scope-probe-master-3cktj"} 278471.28
container_cpu_user_seconds_total{container_name="watch",id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e/fe6cdaa2c542c90cbca951cd97952d35c8c42fcd5e8f452030369a98e27c9b3f",image="weaveworks/watch@sha256:bb113953e19fff158de017c447be337aa7a3709c3223aeeab4a5bae50ee6f159",name="k8s_watch_prometheus-2177618048-kgczb_monitoring_c7af9dff-8364-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="prometheus-2177618048-kgczb"} 0.1
Other metric families in the same scrape are consistently fine, e.g. container_fs_inodes_free.
I think I figured out what is going wrong.
The function DefaultContainerLabels() conditionally adds various metric labels derived from the container - name, image, etc. Inside kubelet this function is called containerPrometheusLabels(), but it is essentially the same.
However, when it receives the metrics, Prometheus checks that all metrics in the same family have the same label set, and rejects those that do not.
Since containers are collected in (somewhat) random order, depending on which kind is seen first you get one set of metrics or the other.
Changing the container labels function to always add the same set of labels, adding "" when it doesn't have a real value, eliminates the issue in my testing.
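The fix described here can be sketched as a minimal stand-alone illustration (the `containerInfo` struct and its field set are assumptions for the example, not cAdvisor's actual types):

```go
package main

import "fmt"

// containerInfo is an illustrative stand-in for cAdvisor's container data.
type containerInfo struct {
	Name, Image, PodName, Namespace string
}

// containerLabels always returns the same set of label keys for every
// container, using "" where a container has no real value. Keeping the key
// set identical across all containers is what satisfies the Prometheus
// client's label-consistency check.
func containerLabels(c containerInfo) map[string]string {
	return map[string]string{
		"name":      c.Name,
		"image":     c.Image,
		"pod_name":  c.PodName,
		"namespace": c.Namespace,
	}
}

func main() {
	withPod := containerLabels(containerInfo{Name: "nats", Image: "nats:latest", PodName: "nats-1", Namespace: "scope"})
	rawCgroup := containerLabels(containerInfo{Name: "system.slice"}) // no image, no pod
	fmt.Println(len(withPod), len(rawCgroup))                        // same key count for both
}
```

With this shape, a raw cgroup and a Kubernetes pod container expose the same label set, so neither ordering of collection causes the other's metrics to be rejected.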
Thanks @bboreham! Can you submit a PR with your fix? I will try and get this in the 1.8 release.
@dashpole this also needs to be fixed in the Kubernetes 1.7.x series, or it will be impossible to collect useful container metrics for anyone who relies on the Prometheus format.
If the real fix is too complex for a cherry-pick, there is an option that can be passed when initialising the Prometheus client that turns off these validations and restores the previous behaviour. Of course, not producing invalid metrics in the first place is preferable.
Shouldn't these errors have shown up in log files all over the place?
Related discussions are ongoing in kubernetes/kube-state-metrics#194 – I thought there is a knob but in reality it's just using a hacked up Prometheus client. I think @bboreham's fix is the right way to go and should be cherry-picked both onto the cAdvisor and Kubernetes release branches.
@matthiasr I can cherry-pick this to 1.7.
We had some errors that were introduced when we updated to prometheus v0.8.0 (#1680), but were not sure what the root cause was until now. Because the checks were introduced recently, we couldn't point to a change in cAdvisor that caused the inconsistency, and I hadn't had a chance to look into this myself yet.
Aha, another knob. Looks like #1679 is not sufficient, since the issue still persists?
Yes, I think that just made it report an incomplete set of metrics instead of none at all.
Actually it looks like the function in cAdvisor is more complicated, as it copies all Docker labels, etc.
I will make a PR for kubelet.
To clarify, so far my testing has been in a stand-alone program bringing in parts of kubelet to find out what it was doing.
It looks like #51473 fixes this in Kubernetes, and I'll cherry-pick it into the 1.7 branch. However, this won't fix it for stand-alone cAdvisor, as that uses the DefaultContainerLabels function. It would appear that exposing container labels as Prometheus labels is considered an anti-pattern. I am not quite sure what the best way forward is in that respect.
seems like InfluxDB has similar issues: https://github.com/google/cadvisor/issues/1730
I know nothing of stand-alone cAdvisor usage. Are there any users reading this?
Could we have a pre-defined set of container labels which are copied as metric labels?
once we have the label whitelist #1730, we can ensure all whitelisted labels are present as metric labels, or set them to ""
But would that whitelist be mandatory?
Even for an unbounded set of labels, I think the following would work:

This way, the consistency condition will be fulfilled, but because the Prometheus server actually treats empty values the same as the label not being present, this maintains current behavior at query time.

This is a bit inefficient (multiple passes over the data, lots of copies and allocations), maybe someone can come up with a better solution?
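The two-pass idea can be sketched like this (a self-contained illustration with a simplified `metric` type, not the actual client_golang data model): first collect the union of label names across the metric family, then fill each metric's missing labels with `""`.

```go
package main

import (
	"fmt"
	"sort"
)

// metric is a simplified stand-in for a sample in one metric family.
type metric struct {
	Labels map[string]string
	Value  float64
}

// fillLabels makes every metric in a family carry the same label-name set by
// adding empty-valued entries for any name another metric in the family uses.
func fillLabels(family []metric) {
	// Pass 1: union of all label names seen in the family.
	union := map[string]struct{}{}
	for _, m := range family {
		for name := range m.Labels {
			union[name] = struct{}{}
		}
	}
	// Pass 2: add missing labels with "" so the sets are identical.
	for _, m := range family {
		for name := range union {
			if _, ok := m.Labels[name]; !ok {
				m.Labels[name] = ""
			}
		}
	}
}

func labelNames(m metric) []string {
	names := make([]string, 0, len(m.Labels))
	for name := range m.Labels {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	family := []metric{
		{Labels: map[string]string{"name": "nats", "image": "nats:latest"}, Value: 44.25},
		{Labels: map[string]string{"name": "raw"}, Value: 1.5}, // no image label
	}
	fillLabels(family)
	fmt.Println(labelNames(family[0])) // both metrics now expose the same names
	fmt.Println(labelNames(family[1]))
}
```

Since the server treats `foo=""` the same as `foo` being absent, queries behave as before; only the exposition becomes consistent.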
Does the set of labels need to be consistent across time, or just at a single point in time? That would only work if we don't need consistency across time.
Only at each point in time.
It's only checked by the Prometheus client library at each scrape, but surely the desire from Prometheus is that labelling be consistent over all samples for the same metric?
It's not even "across time"; consider two samples from different machines taken at the same time.
In principle, yes, but there's a limit to how far into the future you can predict which labels there will be. Things get wonky if this changes all the time, and, depending on the queries, potentially at the point of change; but at some point things do need to change.
An approach that is pretty common (and that kube-state-metrics uses) is to contain the variable label sets in a separate "foo_labels" metric. This metric would need to deal with this variability but all the other metrics would have a fixed set of labels. This pushes the responsibility for getting labels and actual metrics together to query time, with the hope that at that point you know which labels you want. This kind of joining is possible in Prometheus, but I don't know if there are other systems that consume this endpoint; and if you want to do this kind of fundamental change at all.
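The query-time join this describes can be sketched in PromQL (the metric and label names here are hypothetical, for illustration only):

```promql
# Join a fixed-label usage metric with a separate info-style metric that
# carries the variable container labels, pulling "label_app" in at query time.
sum by (label_app) (
  rate(container_cpu_usage_seconds_total[5m])
  * on (id) group_left (label_app)
  container_labels
)
```

This mirrors how kube-state-metrics exposes `kube_pod_labels` for joining against its other metrics.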
So yeah, this is fundamentally the same problem as the kube-state-metric issue referenced above.
I see three possible paths to a solution:
1. client_golang could provide a LabelFixingRegistry that auto-adds missing labels with empty strings, i.e. if some metrics have foo="bar" but others don't, it would attach foo="" to those other metrics. A variant on this would be to not provide this tooling in client_golang and ask the user of the package to implement it themselves, as already suggested above. I believe the latter would be in line with what @brian-brazil said in kubernetes/kube-state-metrics#194, namely that we should not make it easy for users to create labels with essentially inconsistent label dimensions.
2. Munge all the container labels into a single Prometheus label with some syntax convention, like @matthiasr suggested above, e.g. `container_labels="foo:bar,dings:bums"`.
3. Officially give up on requiring label consistency (which client_golang currently enforces). I would still make it an explicit opt-in, i.e. you would need to use a LenientRegistry that allows label inconsistencies. The downside compared to the first option is that we would need to change our contract about exposition. Also, the empty-valued labels created by the solution in the first bullet point would be a nice marker of where the inconsistencies happen.

Looking forward to feedback. If we go for a solution that requires either a LenientRegistry or a LabelFixingRegistry, this tooling would need to be provided in client_golang, which most likely boils down to myself coding it.
However, when it receives the metrics, Prometheus checks that all metrics in the same family have the same label set, and rejects those that do not.
This sounds like a bug on the Prometheus side. This should cause the whole scrape to fail, not silently drop metrics. Partial data is to be avoided.
Munge all the container labels into a single Prometheus label with some syntax convention, like @matthiasr suggested above, e.g. `container_labels = "foo:bar,dings:bums"`
I don't think that's a great idea, labels should be represented as labels and we generally try to dissuade users from building up structure inside label values. Having non-trivial relabelling rules doesn't really help anyone.
A variant on this would be to not provide this tooling in client_golang and ask the user of the package to implement it themselves, as already suggested above.
Yes, this is what I'd go for. It shouldn't be too many lines of code. Likely most of these labels should also be moved to a per-container _info metric rather than being on all time series.
I believe the latter would be in line with what @brian-brazil said in kubernetes/kube-state-metrics#194, namely that we should not make it easy for users to create labels with essentially inconsistent label dimensions.
Yes, the client library guidelines are very clear about not allowing this for direct instrumentation, so it'd be in the spirit of the guidelines not to allow this given that Go already does label consistency checks. The Go client is the only client currently checking for this sort of inconsistency, though with the 2.0 scrape parser being laxer I can see other clients starting to have some checks to make up for that.
OK, nobody likes approach 1. Fair enough. @matthiasr also told me that was not what he meant. My bad for not reading carefully enough.
However, when it receives the metrics, Prometheus checks that all metrics in the same family have the same label set, and rejects those that do not.
This sounds like a bug on the Prometheus side. This should cause the whole scrape to fail, not silently drop metrics. Partial data is to be avoided.
I guess, “Prometheus” above means “the Prometheus client library”. The default behavior is indeed to fail the whole scrape, but you can explicitly set a _continue on error_ behavior to still serve as many metrics as possible, see https://godoc.org/github.com/prometheus/client_golang/prometheus/promhttp#HandlerErrorHandling .
I like the approach of a …_info metric with all those container labels instead of assigning them everywhere. That's also how Kubernetes labels are handled. However, this approach is orthogonal to solving the consistency problem (it just reduces it to only one metric family).
Let's figure out if we want support for this in client_golang at all. If not, you know what to do in cAdvisor, and the same has to happen in kube-state-metrics. I'll document that approach in client_golang then.
If we feel we should have support in client_golang, the question would be between LenientRegistry (easy to implement, five lines of code or something) or the LabelFixingRegistry (slightly more complicated). In any case, it would be opt-in with a lot of warning signs attached to it.
I'm for the LabelFixingRegistry in the library. We already have two concrete examples at hand that need this, and if we don't solve it generally we'll just end up with several badly copy-pasted versions of the same thing.
I guess, “Prometheus” above means “the Prometheus client library”. The default behavior is indeed to fail the whole scrape, but you can explicitly set a continue on error behavior to still serve as many metrics as possible, see https://godoc.org/github.com/prometheus/client_golang/prometheus/promhttp#HandlerErrorHandling
Yes, that's what I meant. In that case it's a cAdvisor bug that it sets ContinueOnError rather than using the default HTTPErrorOnError, as that was hiding this problem. This was introduced in #1679.
However, this approach is orthogonal to solving the consistency problem (it just reduces it to only one metric family).
Agreed.
Let's figure out if we want support for this in client_golang at all.
I would say no, there are only two use cases so far, which I don't think is enough. Even with warning signs users will use it where it doesn't apply, just like ContinueOnError was used here to paper over a problem rather than fixing it.
I've implemented related code in the past by hand, it's not particularly complicated to write. It's standard data munging.
If we feel we should have support in client_golang, the question would be between LenientRegistry (easy to implement, five lines of code or something
If we go for it I'd go for this, but it'd feel weird that there'd now be the default registry settings, the lenient registry and the pedantic registry.
cAdvisor devs, how do you feel about implementing the "fill up with empty-valued labels to reach label consistency" as done in https://github.com/vladimirvivien/kubernetes/commit/8935d66160f5a53306c914c57f718aad58a8b508 ?
@brancz as the main https://github.com/kubernetes/kube-state-metrics/ dev, how do you feel about implementing it in parallel?
Just trying to test the waters if we want/need support in the Prometheus client_golang for that.
For those stuck on 0.25.0 because of this issue, I've cherry-picked (04fc089) the patch to kube-state-metrics mentioned above (https://github.com/google/cadvisor/issues/1704#issuecomment-325418911) onto cadvisor's local copy of client_golang/prometheus/registry.go. This simply voids the labels consistency checking introduced in 0.26.0. I also pushed an image with the workaround to docker.io/camptocamp/cadvisor:v0.27.1_with-workaround-for-1704
NB: this is merely a workaround until a proper fix is available in a release!
We're observing the same behavior with version 0.27.0 and Docker 17.06.1.
Metrics always contain cAdvisor, alertmanager and Prometheus, but every couple of minutes our application containers' metrics are missing.
Could you please let us know if (and when) a fix will be available?
@mfournier workaround URL is broken.
Thanks.
After several discussions I had with various people, I came to the conclusion we want to support "label filling" within the Prometheus Go client. You can track progress here: https://github.com/prometheus/client_golang/issues/355
I've looked into this, and there looks to be a simpler solution.
I believe that using the approach at https://github.com/kubernetes/kubernetes/pull/51473 in cAdvisor would be sufficient to resolve the issue here. That is, in DefaultContainerLabels, produce an empty string for the missing labels.
Is there something I'm missing?
Ah, I see. It's the container.Spec.Labels and container.Spec.Envs which need extra handling.
I've put together https://github.com/google/cadvisor/pull/1831 which I believe will fix this.
The fix is released in version v0.28.3
Thank you all ♥️