Nomad: Prometheus metric data job label name conflict

Created on 21 Feb 2019 · 15 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.9.0-beta2

Operating system and Environment details

CentOS Linux release 7.6.1810 (Core)

Issue

Nomad's Prometheus metric data carries a label named job, which conflicts with a label name used by Prometheus itself.
The Prometheus server has a default job label.

Nomad Server logs (if appropriate)

nomad: 2019-02-17T03:09:15.387+0800 [INFO ] http.prometheus_handler: error gathering metrics: 35 error(s) occurred:
nomad: * collected metric nomad_nomad_job_summary_queued label: nomad: * collected metric nomad_nomad_job_summary_queued label:

stage/needs-investigation theme/metrics type/bug

All 15 comments

@chenjpu 👋 - Is the default job label a Prometheus default, or is it one you've added in your configuration?

It's the Prometheus default.
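
For context, Prometheus attaches a job label (taken from the scrape configuration) to every series it scrapes. When a scraped sample already carries a label with the same name, the honor_labels scrape option decides which value wins: by default the scraped value is moved to exported_job, and with honor_labels set the scraped value is kept. The following is only a simplified simulation of that rule, not Prometheus code; the function name AttachTargetLabels and all label values are made up for illustration.

```go
package main

import "fmt"

// AttachTargetLabels is a hypothetical helper that mimics, in simplified form,
// how target labels are merged into a scraped sample's labels.
func AttachTargetLabels(scraped, target map[string]string, honorLabels bool) map[string]string {
	out := make(map[string]string, len(scraped)+len(target))
	for k, v := range scraped {
		out[k] = v
	}
	for k, v := range target {
		if existing, clash := scraped[k]; clash {
			if honorLabels {
				// Keep the scraped value and ignore the target's value.
				continue
			}
			// Move the scraped value aside under an "exported_" prefix.
			out["exported_"+k] = existing
		}
		out[k] = v
	}
	return out
}

func main() {
	// Illustrative values only.
	scraped := map[string]string{"job": "example-job"}
	target := map[string]string{"job": "nomad", "instance": "10.0.0.1:4646"}

	fmt.Println(AttachTargetLabels(scraped, target, false))
	fmt.Println(AttachTargetLabels(scraped, target, true))
}
```

Either way the name clash alone is resolved by Prometheus at scrape time, which is consistent with the later finding in this thread that the logged error has a different cause.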

Interesting - I thought Prometheus namespaced all of its default labels?

AFAIK you can use relabel_configs to rename collected metrics, though? (My Prometheus knowledge is pretty high-level.) https://github.com/prometheus/prometheus/blob/c7d83b2b6a08048e1bfa046f9fd63125ae327e02/config/testdata/conf.good.yml#L56-L60

I have set the honor_labels parameter, but the error is still logged:
has label dimensions inconsistent with previously collected metrics in the same metric family
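
To make the error message concrete: it complains that two samples sharing one metric name do not carry the same set of label names. Below is a minimal sketch of such a check. It is not the client_golang implementation; the Sample type and CheckLabelDimensions function are hypothetical, and the label names other than job are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a hypothetical stand-in for one exported time series.
type Sample struct {
	Name   string
	Labels map[string]string
}

// labelKey returns the sample's label names, sorted and joined by commas.
func labelKey(s Sample) string {
	names := make([]string, 0, len(s.Labels))
	for n := range s.Labels {
		names = append(names, n)
	}
	sort.Strings(names)
	return strings.Join(names, ",")
}

// CheckLabelDimensions reports every sample whose label names differ from
// those of the first sample seen with the same metric name.
func CheckLabelDimensions(samples []Sample) []error {
	seen := map[string]string{}
	var errs []error
	for _, s := range samples {
		key := labelKey(s)
		first, ok := seen[s.Name]
		if !ok {
			seen[s.Name] = key
			continue
		}
		if first != key {
			errs = append(errs, fmt.Errorf(
				"metric %s: labels {%s} inconsistent with previously seen {%s}",
				s.Name, key, first))
		}
	}
	return errs
}

func main() {
	// "task_group" and "parent_id" are assumed label names, not taken from the logs.
	samples := []Sample{
		{Name: "nomad_nomad_job_summary_queued",
			Labels: map[string]string{"job": "web", "task_group": "api"}},
		{Name: "nomad_nomad_job_summary_queued",
			Labels: map[string]string{"job": "backup", "task_group": "dump", "parent_id": "backup"}},
	}
	for _, err := range CheckLabelDimensions(samples) {
		fmt.Println(err)
	}
}
```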

I found that other projects have had similar problems (https://github.com/prometheus/influxdb_exporter/issues/23).

Besides, the above error does not occur on Nomad 0.8.7 :)

client_golang (0.9.0 / 2018-10-15) mentions that inconsistent label dimensions are now allowed.

I ran into this as well, but just relabeled the job label coming from Nomad to job_name.
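
In Prometheus itself this kind of rename would be done with relabeling rules in the scrape configuration. Purely to illustrate the effect of the workaround on a sample's label set, here is a small sketch; RenameLabel is a hypothetical helper and the label values are invented.

```go
package main

import "fmt"

// RenameLabel returns a copy of labels in which the label `from` is renamed
// to `to`. If `from` is absent, the copy is identical to the input. If `to`
// already exists, it is overwritten by the renamed value.
func RenameLabel(labels map[string]string, from, to string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if k != from {
			out[k] = v
		}
	}
	if v, ok := labels[from]; ok {
		out[to] = v
	}
	return out
}

func main() {
	// Illustrative label values only.
	labels := map[string]string{"job": "example-job", "task_group": "api"}
	fmt.Println(RenameLabel(labels, "job", "job_name"))
}
```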

I did a simple test: after I upgraded client_golang to 0.9.0 (2018-10-15), the problem was solved.

We are running Nomad 0.9.1 and still see this issue. Nomad logs are flooded with a similar error.

I just upgraded to nomad 0.9.1 today from 0.8.4 and found that I am only getting this error in our environment where we are using periodic/batch jobs. In our other environments where we only have service type jobs, we do not encounter this error and resultant issues with prometheus metrics collection.
I did not get this error before the upgrade.
I am happy to provide logs or more information if it would be useful.

I'm also observing this issue when upgrading 0.8.3 -> 0.9.1. Some additional details:

  • This only appears to affect the nomad_nomad_job_summary_* metrics.
  • Temporarily setting the prometheus_metrics configuration to false does not resolve the issue.

An update about this issue -
If left running with prometheus_metrics = true, the cluster leader will eventually kill any running allocations on the cluster. Disabling prometheus_metrics and restarting all masters causes allocations to restart and jobs to recover.

Hi,
I think the problem is in the method iterateJobSummaryMetrics()
https://github.com/hashicorp/nomad/blob/master/nomad/leader.go#L648

Depending on the task type, we inject different labels, but the Prometheus library does not seem to be compatible with this. All metrics with the same name should have the same labels.

For example, the labels for a service task are:

label: label: label: label:

And the labels for a sync task are:

label: label: label: label:

As I see it, two solutions are possible:

  • add all labels for all metrics and add a new label "job_type" with value periodic, sync or
  • add a suffix/prefix to the metric name depending on the type

What do you think?
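
The first option could look roughly like the sketch below: every sample gets the same full set of label names, with empty values where a label does not apply, plus a job_type label. This is not Nomad's code. NormalizeLabels and the allLabels list are hypothetical, and the label names in that list are assumptions, since the real ones depend on what iterateJobSummaryMetrics() emits.

```go
package main

import "fmt"

// allLabels is an assumed full label set; the real names in Nomad may differ.
var allLabels = []string{"job", "task_group", "parent_id", "periodic_id"}

// NormalizeLabels returns a label map containing every name in allLabels
// (names missing from the input get the empty string) plus a "job_type"
// label. Input labels that are not listed in allLabels are dropped.
func NormalizeLabels(labels map[string]string, jobType string) map[string]string {
	out := make(map[string]string, len(allLabels)+1)
	for _, name := range allLabels {
		out[name] = labels[name] // a missing key yields ""
	}
	out["job_type"] = jobType
	return out
}

func main() {
	// Illustrative values only.
	service := NormalizeLabels(
		map[string]string{"job": "web", "task_group": "api"}, "service")
	periodic := NormalizeLabels(
		map[string]string{"job": "backup", "task_group": "dump", "parent_id": "backup"}, "periodic")

	// Both samples now expose the same label names.
	fmt.Println(len(service) == len(periodic))
	fmt.Println(service)
	fmt.Println(periodic)
}
```

The trade-off is extra empty-valued labels on every series, in exchange for a single metric name that works across job types; the second option avoids the empty labels but splits the data over several metric names.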

Up! Will we see this fix in the next release? :)

Hello,

The problem is still present. When pushing data to the Prometheus Pushgateway, the job label gets rewritten.

Only renaming "job" to "job_name" helps us.

Hi @stremovsky. Sorry to hear that. You're on a version of Nomad that's 0.9.5 or later?
