Nomad: Feature Request: Add specification-based metrics

Created on 10 May 2018 · 7 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Operating system and Environment details

Debian Linux 8.7

Issue

We've been using the Prometheus-exported job metrics, which have been really useful. However, it would be even more useful if we could get metrics based on the latest evaluated job specification. There may be other useful metrics, but the ones that immediately come to mind are:

per-task-group group_count (to validate running allocations == intended allocations)
per-task desired resource metrics (cpu, iops, memory, network mbits)
Memory in particular would be useful in order to determine tasks that are close to getting OOM-killed.

theme/client theme/metrics type/enhancement

Source

gmichalec-pandora

👍7

All 7 comments

I second this request and am quite surprised this wasn't already an included metric, especially since Nomad has such strict hard memory caps.

In order for
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage
or
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss

to be actually useful, we need a way to determine the percentage of memory the allocation has used or has free. Whether I have to calculate it against another metric that just has .memory.allocated, or it comes as a percentage like memory.total_percent, doesn't matter, but this is a really important metric for alleviating the pain of Nomad's strict memory caps.

I personally suggest the following, to be consistent with the other allocs metrics. I think it should just show the allocated memory as defined in the latest successful deployment (which should be documented):
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.allocated
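
The percentage calculation described above can be sketched as follows. This is only an illustration: it assumes the proposed memory.allocated value and the existing memory.rss value are both reported in bytes, and the function name memory_used_percent is hypothetical.

```python
def memory_used_percent(rss_bytes: float, allocated_bytes: float) -> float:
    """Return memory usage as a percentage of the task's allocated memory.

    Assumption: both values are in bytes, e.g. taken from the .memory.rss
    metric and the proposed .memory.allocated metric for the same task.
    """
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    return 100.0 * rss_bytes / allocated_bytes


# Example: a task using 384 MiB of a 512 MiB allocation.
MIB = 1024 * 1024
print(memory_used_percent(384 * MIB, 512 * MIB))  # 75.0
```

A task approaching 100% here would be the one at risk of being OOM-killed.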

mlehner616 on 30 Jun 2018

Anyone?

Having the ability to compare current alloc resource consumption (especially memory) against the limit for those resources would be incredibly useful.

nvx on 29 Mar 2019

Sorry for the lack of response. I quickly added an allocated metric in bytes in #5492. There's a Linux binary attached if anyone is willing to test.

schmichael on 29 Mar 2019

That looks like it'll solve my use case!

nvx on 3 Apr 2019

Having the memory metric is great, but it would still be nice to get other metrics based on the intent of the submitted job spec. Here are the metrics we are 'backfilling' via a spec-polling process:

nomad_job_spec_task_cpu_allocation_mhz (cpu resources requested per task)
nomad_job_spec_task_network_allocation_mbits (network resources requested per task)
nomad_job_spec_task_group_count (allocation count requested per task group)
nomad_job_submit_time (the submitTime of the job)

submit_time is very useful for annotating deploys on dashboards.
Having the group_count is also very useful for creating alerts based on running vs. expected allocation counts.
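
A spec-polling process of this kind could look roughly like the sketch below. It is not the actual implementation used in this comment. It assumes the job has already been fetched as JSON (for example from Nomad's job read endpoint) and that the document carries ID, SubmitTime, TaskGroups, Count, Tasks, and Resources fields. The function name spec_metrics and the label names job, task_group, and task are assumptions for illustration.

```python
def spec_metrics(job: dict) -> list[str]:
    """Turn a job specification (as a dict) into Prometheus text-format lines.

    Assumptions: field names follow the Nomad job JSON layout, and
    SubmitTime is passed through in whatever unit the API reports.
    """
    job_id = job["ID"]
    lines = [f'nomad_job_submit_time{{job="{job_id}"}} {job["SubmitTime"]}']
    for group in job.get("TaskGroups") or []:
        group_labels = f'job="{job_id}",task_group="{group["Name"]}"'
        lines.append(
            f'nomad_job_spec_task_group_count{{{group_labels}}} {group["Count"]}'
        )
        for task in group.get("Tasks") or []:
            resources = task.get("Resources") or {}
            task_labels = f'{group_labels},task="{task["Name"]}"'
            cpu = resources.get("CPU", 0)
            # Sum the bandwidth requested across all network blocks of the task.
            mbits = sum(n.get("MBits", 0) for n in resources.get("Networks") or [])
            lines.append(
                f'nomad_job_spec_task_cpu_allocation_mhz{{{task_labels}}} {cpu}'
            )
            lines.append(
                f'nomad_job_spec_task_network_allocation_mbits{{{task_labels}}} {mbits}'
            )
    return lines


# Hypothetical job document, trimmed to the fields used above.
example_job = {
    "ID": "doppler",
    "SubmitTime": 1570000000000000000,
    "TaskGroups": [
        {
            "Name": "web",
            "Count": 3,
            "Tasks": [
                {
                    "Name": "server",
                    "Resources": {"CPU": 500, "MemoryMB": 256, "Networks": [{"MBits": 10}]},
                }
            ],
        }
    ],
}

for line in spec_metrics(example_job):
    print(line)
```

The resulting lines could then be served to Prometheus by any exporter or pushed through an intermediary; that part is left out here.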

gmichalec-pandora on 2 Oct 2019

👍3

Just to add a real-world use case for these, here's an example of an alerting query we use to notify us when a service has less than 60% of its desired allocations reporting as healthy in Consul:
sum(consul_health_service_status{job="consul-exporter", service_name="doppler"}) / max(nomad_job_spec_task_group_count{exported_job="doppler"}) < 0.6
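
The arithmetic behind that query can be sketched in a few lines. This assumes each healthy service instance contributes 1 to the sum; the function name below_threshold is hypothetical, and returning False for a zero desired count is an illustrative choice.

```python
def below_threshold(healthy_instances: int, desired_count: int,
                    threshold: float = 0.6) -> bool:
    """Return True when the healthy/desired ratio falls below the threshold.

    healthy_instances stands in for the summed Consul health status and
    desired_count for the job spec's task group count.
    """
    if desired_count <= 0:
        # Nothing is expected to run, so there is nothing to alert on.
        return False
    return healthy_instances / desired_count < threshold


print(below_threshold(1, 3))  # True: 1/3 is below 0.6
print(below_threshold(2, 3))  # False: 2/3 is above 0.6
```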

gmichalec-pandora on 21 Oct 2019

Following suit with https://github.com/hashicorp/nomad/pull/5492, which exposed per-task memory allocated metrics, I have opened https://github.com/hashicorp/nomad/pull/6784 to expose per-task CPU allocated metrics.