Nomad: [improvement] Add prefix_filter option for telemetry

Created on 13 Jan 2018 · 6 Comments · Source: hashicorp/nomad

Summary

Consul has the "prefix_filter" option to allow filtering of the telemetry, which is great for controlling costs and noise.

Background

From the mailing list:
https://groups.google.com/forum/#!topic/nomad-tool/a_JWDUzwQJg

When we deployed Nomad 0.7, we noticed that our metrics provider bill jumped up significantly. This particular metrics provider bills on the number of unique metrics submitted.

One of the scenarios I found was that the job name of periodic jobs was used in the metric name. Since the job name changes for every invocation, this created something like 45000 unique metrics streams in a single reporting period vs our standard ~2000.

They look like this generally:
"nomad.nomad.job_summary.complete.jobname-periodic-1513948800.jobname"

There are similar metrics emitted for the non-periodic jobs, but those are less of an issue because the name of the job and thus the name of the metric does not change with every invocation.

Those look like this:
"nomad.nomad.job_summary.complete.api.api-nginx"
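To make the difference concrete, here is a minimal Python sketch of how per-invocation job names multiply the number of unique streams. The regular expression and the helper names `collapse_periodic` and `unique_streams` are assumptions for illustration, derived only from the two example names above; this is not something Nomad does itself.

```python
import re

# Assumed pattern: periodic child jobs carry a "-periodic-<unix timestamp>"
# suffix, as in the example metric name above.
PERIODIC_SUFFIX = re.compile(r"-periodic-\d+")


def collapse_periodic(metric_name):
    """Strip the per-invocation suffix so every run maps to one stream."""
    return PERIODIC_SUFFIX.sub("", metric_name)


def unique_streams(metric_names):
    """Count the distinct names a provider billing per unique metric would see."""
    return len(set(metric_names))


# Three invocations of the same periodic job. The first name comes from the
# example above; the other two timestamps are made up.
raw = [
    "nomad.nomad.job_summary.complete.jobname-periodic-1513948800.jobname",
    "nomad.nomad.job_summary.complete.jobname-periodic-1513952400.jobname",
    "nomad.nomad.job_summary.complete.jobname-periodic-1513956000.jobname",
]

print(unique_streams(raw))  # one stream per invocation
print(unique_streams(collapse_periodic(n) for n in raw))  # one stream in total
```

A non-periodic name such as the api-nginx example passes through `collapse_periodic` unchanged, which matches the observation that those metrics stay stable across invocations.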

This job_summary metric is also not documented with the other list of telemetry.

I would be able to use a prefix_filter option to turn off all of the summary metrics that are causing the problems.
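As a rough sketch of the behaviour being requested, the Python function below applies a list of prefix rules to a metric name. The rule syntax ("+" to allow, "-" to block) follows the format Consul documents for its prefix_filter option. The function name `is_allowed`, the `default_allow` parameter, and the choice that the longest matching prefix wins are assumptions of this sketch, not a description of how Nomad or Consul implement filtering.

```python
def is_allowed(metric_name, prefix_filter, default_allow=True):
    """Return True if the metric should be emitted.

    Each rule is "+<prefix>" (allow) or "-<prefix>" (block). In this sketch
    the longest matching prefix wins (an assumption), and metrics that match
    no rule fall back to default_allow.
    """
    best_len = -1
    decision = default_allow
    for rule in prefix_filter:
        if not rule or rule[0] not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': %r" % (rule,))
        sign, prefix = rule[0], rule[1:]
        if metric_name.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            decision = sign == "+"
    return decision


# Block every job summary metric, as proposed in this issue.
rules = ["-nomad.nomad.job_summary"]

print(is_allowed(
    "nomad.nomad.job_summary.complete.jobname-periodic-1513948800.jobname",
    rules,
))
# "nomad.example.other_metric" is a hypothetical name used only for illustration.
print(is_allowed("nomad.example.other_metric", rules))
```

With a rule list like this, the periodic job summary streams would be dropped at the agent, while everything else would still be emitted by default.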

Labels: theme/metrics, type/enhancement

All 6 comments

Have you tried using a relay in the middle?
Try Telegraf with a proper filter configuration.

@aconte76 Telegraf looks great! Thanks for the suggestion.

FWIW, I still think there is some value in having this option as part of Nomad (and Consul and likely Vault) as all are capable of emitting large numbers of metrics streams that we cannot control otherwise.

Regarding the suggestion of relays and filtering: it adds _a lot_ of complexity. It is definitely much simpler and more effective to have the agents control what types of metrics are emitted.

Related to #4186

@roman-cnd I agree. We are still working through a solution using relays and sending a boatload of money to our old metrics provider. I ran into an issue where Telegraf doesn't have a statsd output, so even using it as a filter for statsd streams isn't possible yet. There is an open issue where another Telegraf community member is planning to build the statsd output plugin, but it's not ready today.

@roman-cnd @stevenscg Perhaps that is true, but I don't agree: a relay makes it possible to ingest different metrics formats and filter them without too much complexity. The killer application, though, is aggregation at different levels (local/remote), which as a consequence lets thousands of servers use very little bandwidth.

For those looking to save some $$, this pipeline works okay:

nomad metrics -> statsd (from collectd) + ugly collectd rewrites to strip node/host info -> cloudwatch collect agent (with filters for what to publish).
