Cadvisor: prometheus metric for container healthcheck status

Created on 6 Feb 2019 · 11 Comments · Source: google/cadvisor

Hi,

As far as I know, no metrics are available for healthcheck status of a container.

I see a metric about the "up" state of a container (container_last_seen), but nothing about what Docker reports in State.Health.Status.

This status isn't really a metric because it returns a string, but I would guess that a boolean for each possible value would be useful (running, healthy, unhealthy for the ones I know).
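To make the "one boolean per value" idea concrete, here is a minimal Python sketch that turns a health status string into one 0/1 gauge line per state, in the Prometheus text exposition format. The metric name container_health_status and the helper health_gauges are hypothetical; cAdvisor exposes nothing like this. The state list assumes the values Docker reports under State.Health.Status (starting, healthy, unhealthy).

```python
# Hypothetical metric name, used only for illustration; cAdvisor does not expose it.
METRIC = "container_health_status"

# Assumed set of states, taken from what Docker reports in State.Health.Status.
STATES = ("starting", "healthy", "unhealthy")


def health_gauges(name, status):
    """Return one exposition-format line per state: 1 for the current state, 0 otherwise."""
    lines = []
    for state in STATES:
        value = 1 if state == status else 0
        lines.append(f'{METRIC}{{name="{name}",state="{state}"}} {value}')
    return lines


if __name__ == "__main__":
    print("\n".join(health_gauges("foo", "healthy")))
```

A container without a healthcheck (or with a status outside the list) would simply get 0 on every line, which keeps the series present instead of making it disappear.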

kind/enhancement

All 11 comments

Does an equivalent exist for all container runtimes cAdvisor supports (mesos, containerd, rkt, docker)?

We usually try and stay away from spec-based metrics, as they tend to be runtime-specific, and generate large numbers of metric streams for each container.

I'm not familiar with all the specifications that exist at this time. I'm under the impression (and could be wrong) that the OCI had proposed, or would propose, something standard for this.

So unfortunately, I have no idea.

What I need is a metric about the work produced by a container, rather than the state of a process (container_tasks_state) or whether the container is up or not.

The healthcheck instruction and the related statistics in Docker help to figure out whether a container actually does what it should, and I don't see any metrics about that for now.

This would be one very useful addition.

Has anyone found a workaround?

I am also looking to accomplish this.

The kubelet does have this kind of metric: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/prober_manager.go#L38.
Those metrics are registered at /metrics/probes on the kubelet's port.

But that doesn't help anyone not using kubernetes...

I'm not sure if cAdvisor should take on metrics collection for probes, as it isn't the one performing them. I believe we currently only fetch the container's information from Docker at container creation time, so this would require us to poll the runtime for the information. I'm not sure we can provide accurate cumulative probe metrics based on sampling the state. It seems like we are bound to miss probe failures.
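The concern about sampling can be illustrated with a toy Python simulation. It is not cAdvisor code: the timeline is made-up data with one entry per probe run, and the function names are hypothetical. It only shows that reading the state every few probes can miss a short failure entirely.

```python
def actual_failures(timeline):
    """Count every probe run that reported 'unhealthy'."""
    return sum(1 for status in timeline if status == "unhealthy")


def observed_failures(timeline, sample_every):
    """Count 'unhealthy' results seen when only every sample_every-th probe run is read."""
    samples = timeline[::sample_every]
    return sum(1 for status in samples if status == "unhealthy")


if __name__ == "__main__":
    # Made-up timeline: ten probe runs, with a single failure on the fifth run.
    timeline = ["healthy"] * 4 + ["unhealthy"] + ["healthy"] * 5
    print("actual:", actual_failures(timeline))
    print("observed when sampling every 3rd run:", observed_failures(timeline, 3))
```

With this timeline, sampling every third run reads indices 0, 3, 6 and 9, all of which are healthy, so the single failure is never observed.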

Hi Team,

Any advice, update, or workaround here would be very helpful for everyone. We need this "health_check" metric very badly.

Hi everybody! sum(time() - container_last_seen) by (name) is a workaround for me, but sometimes it works really badly.

Also, for alerts sum(rate(container_last_seen{name=~".+"}[5m])) by (container_label_com_docker_compose_service) < 1, with 15s scrapes helps me to stop crying all day.

It's hard to create alerts based on metrics that disappear, and it also goes against Prometheus best practices. I still don't understand why we can't just use absent() and move on, but you can read more about it here:

https://www.robustperception.io/existential-issues-with-metrics

Recently, a coworker discovered this exporter:

https://github.com/prometheus-net/docker_exporter

It exposes a very valuable metric, docker_container_running_state, which won't disappear when the container stops!

Here's an example:

$ sudo docker run \
    --name docker_exporter \
    --detach \
    --restart always \
    --volume /var/run/docker.sock:/var/run/docker.sock \
    --publish 9417:9417 \
    prometheusnet/docker_exporter
$ sudo docker create --name foo -it ubuntu sleep 10
$ sudo docker start foo
$ curl -s localhost:9417/metrics | grep state
docker_container_running_state{name="foo"} 1
docker_container_running_state{name="docker_exporter"} 1
# wait ten seconds
$ curl -s localhost:9417/metrics | grep state
docker_container_running_state{name="foo"} 0
docker_container_running_state{name="docker_exporter"} 1
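Because the series stays present with a value of 0, stopped containers can be picked out directly from the scrape output. Below is a minimal Python sketch that does this for text shaped like the transcript above. The helper stopped_containers is hypothetical, and the regular expression assumes each line carries only a name label, as in the example; real exporter output may include more labels.

```python
import re

# Assumes lines look exactly like the example output: a single name label and a numeric value.
LINE = re.compile(
    r'^docker_container_running_state\{name="([^"]+)"\}\s+(\d+(?:\.\d+)?)$'
)


def stopped_containers(metrics_text):
    """Return the names of containers whose running state is 0."""
    stopped = []
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if match and float(match.group(2)) == 0:
            stopped.append(match.group(1))
    return stopped


if __name__ == "__main__":
    sample = (
        'docker_container_running_state{name="foo"} 0\n'
        'docker_container_running_state{name="docker_exporter"} 1\n'
    )
    print(stopped_containers(sample))
```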

Healthchecking should be added to the above repo when https://github.com/prometheus-net/docker_exporter/pull/11 is merged.
