Hey guys,
I recently started using prometheus and I enjoy the simplicity. I want to begin to understand what it would take to implement prometheus support within Netdata. I think this is a great idea because it lets netdata keep its distributed nature while gaining persistence in prometheus. Centralized graphing (not monitoring) can then happen with grafana. Netdata is already a treasure trove of metrics - making this a worthwhile project.
Prometheus expects a REST endpoint to exist which publishes metrics, labels, and values. It will poll this endpoint at a configurable interval and ingest the metrics during that poll.
To get the ball rolling, how are you currently serving http in Netdata? Is this an embedded sockets server in C ?
Hi, thanks! Nice that you like netdata!
So, you want prometheus to pull data from netdata? netdata has an API. Check: https://github.com/firehol/netdata/wiki/REST-API-v1
But, I also see that data can be pushed to prometheus: https://github.com/prometheus/pushgateway
This one seems very simple to be supported in netdata. Check: https://github.com/firehol/netdata/wiki/netdata-backends#adding-more-backends
Is this an embedded sockets server in C ?
yeap!
Prometheus developer here. If you want to integrate the best way would be to expose our text exposition format over HTTP. The spec is at https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details. Collectd and telegraf do this for example.
@brian-brazil thanks for joining this discussion.
So, for prometheus, is it better to send:
prefix.host.chart{metric="dimension"} value timestamp
or
prefix.host.chart.dimension value timestamp
where:
prefix is user controlled (by default netdata)
host is the hostname of the machine (also user defined)
chart is the chart id (netdata also maintains names for charts)
dimension is the dimension id of the chart (netdata also maintains names for dimensions)
value is a long double number
timestamp can be ms
All of the above are alphanumeric, possibly with ., - and _.
Also, does prometheus respond something back?
@brian-brazil another question:
I see an HTTP header is needed. So, netdata will send an HTTP header on connect, and then stream metrics forever. The HTTP header has to be present only on socket connection. Is that right?
Which URL and HTTP method should be used?
I pushed some code to implement this. However, there is no HTTP header yet, so it won't work. Once we add the HTTP header we can test it.
@ktsaou
Are you leaning towards implementing an endpoint on the netdata API which will export these metrics?
I see exporters written as middleware for Prometheus, but I believe integrating it into the native API is nicer, to maintain its distributed nature.
So, for prometheus, is it better to send:
prefix.chartname{dimensionname="dimensionvalue"} value
Prefix should be hardcoded. As you have a per-host daemon, Prometheus would optimally be hitting each of these so it'll be applying the instance label on its end.
Generally avoid timestamps, the way it's meant to work is that you'll fetch data when the request comes in.
Also, does prometheus respond something back?
I see an HTTP header is needed. So, netdata will send an HTTP header on connect, and then stream metrics forever. The HTTP header has to be present only on socket connection. Is that right?
Prometheus is a pull based system. You'll receive a GET, usually to /metrics. Setting the appropriate content-type header is encouraged in your response.
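To make the pull model concrete, here is a minimal sketch in Python (an illustration only; netdata's actual server is the embedded C web server discussed above, and the metric name here is made up):

```python
# Minimal sketch of the pull model Prometheus expects: answer GET /metrics
# with the text exposition format and the version 0.0.4 content type.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A single hypothetical metric in the text exposition format.
        body = b'example_metric{label="demo"} 42\n'
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    srv = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Prometheus would then scrape this endpoint on its own schedule; the server just answers each GET with the current values.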
Oh... I see. You want it the other way around. Prometheus to pull the data from netdata.
you'll fetch data when the request comes in
It is a bit more complex than that: netdata has an internal round-robin database with 1k-5k metrics per server, all updated every second. At the rate we add metrics, netdata will reach 10k metrics per server in a few months.
If prometheus hits netdata servers once every 10 seconds, it will get back the value of the latest 1 second out of 10. Prometheus will miss 9 seconds of data.
netdata can reduce the data. So, if you hit each netdata once every 10 seconds, you can ask netdata to give you the average of the last 10 seconds. Actually, you can give to netdata the starting and the ending timestamps and the grouping method (average, min, max, sum, etc).
netdata has already an API for all that. The netdata dashboards use it (this is how zooming out netdata charts is so fast). It is plain JSON and is documented here: https://github.com/firehol/netdata/wiki/REST-API-v1
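Based on the API page above, a single reduced value could be requested with a URL like this (a sketch; parameter names as documented on that page):

```
# average of the last 10 seconds of the system.cpu chart, as one point
http://netdata.ip:19999/api/v1/data?chart=system.cpu&after=-10&points=1&group=average
```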
So, can prometheus adapt to the netdata API?
If this is too complex for prometheus, I would suggest to do it the other way around: netdata to push data to prometheus. The backend module in netdata already knows how to handle this properly, without stressing each server and without losing data if the backend is down for a short period of time. It also provides alarms in case the backend server loses data.
netdata can reduce the data. So, if you hit each netdata once every 10 seconds, you can ask netdata to give you the average of the last 10 seconds. Actually, you can give to netdata the starting and the ending timestamps and the grouping method (average, min, max, sum, etc).
We'd prefer to get the raw data, and handle the processing on our end. Do you have the notion of counters and gauges?
If this is too complex for prometheus, I would suggest to do it the other way around: netdata to push data to prometheus.
Prometheus can't be operated in that fashion.
We'd prefer to get the raw data, and handle the processing on our end. Do you have the notion of counters and gauges?
Yes of course, but netdata interpolates everything to provide a fixed step time-series database, ready to be used in charts.
netdata also maintains the last collected value, as collected, and the exact time in microseconds it was collected. I guess this is what interests you. Unfortunately, there is no API to get these...
So, what you need, is an API call to get all the metrics, as collected.
Let me check how hard that can be...
Almost done.
The URL will be:
http://netdata.ip:19999/api/v1/allmetrics?format=prometheus with a GET.
The output header will be:
HTTP/1.1 200 OK
Connection: keep-alive
Server: NetData Embedded HTTP Server
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Type: text/plain; charset=utf-8
Date: Tue, 03 Jan 2017 18:35:58 GMT
Cache-Control: no-cache
Expires: Tue, 03 Jan 2017 18:35:59 GMT
Content-Encoding: gzip
Transfer-Encoding: chunked
keep-alive and gzip are controlled by the caller.
The text output will be something like:
# HELP net_packets.enp4s0.received packets/s
# TYPE net_packets.enp4s0.received counter
net_packets.enp4s0.received 9768385.0000000 1483468558004
# HELP net_packets.enp4s0.sent packets/s
# TYPE net_packets.enp4s0.sent counter
net_packets.enp4s0.sent 7795059.0000000 1483468558004
# HELP net_packets.enp4s0.multicast packets/s
# TYPE net_packets.enp4s0.multicast counter
net_packets.enp4s0.multicast 1224642.0000000 1483468558004
# HELP net.enp4s0.received kilobits/s
# TYPE net.enp4s0.received counter
net.enp4s0.received 80013394.5546875 1483468558004
# HELP net.enp4s0.sent kilobits/s
# TYPE net.enp4s0.sent counter
net.enp4s0.sent 9992442.7734375 1483468558004
I haven't used the format you suggested, because I couldn't figure out what to put in the 2 values requested on each line. If you want it in a different way, please let me know.
The HELP line states the netdata units. For counter metrics the rate is invalid (since you get raw data), but the rest is ok (in the above example, the values given are kilobits since boot, not kilobits/s). If this is misleading I can remove the entire line.
Any comments?
Content-Type: text/plain; charset=utf-8
This should be text/plain; version=0.0.4
# HELP net_packets.enp4s0.received packets/s
# TYPE net_packets.enp4s0.received counter
net_packets.enp4s0.received 9768385.0000000 1483468558004
If that's a counter then it's not /s. Periods are not permitted in metric names, you should use underscores instead. The host should also be in a label called "instance".
# HELP net.enp4s0.received kilobits/s
Can you normalise to bytes?
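The renaming rules Brian describes (underscores instead of periods, no dots in metric names) could be sketched like this (a hypothetical helper, not netdata's actual code):

```python
import re

# Hypothetical helper: build a legal Prometheus metric name from a netdata
# chart id and dimension id. Metric names must match
# [a-zA-Z_:][a-zA-Z0-9_:]*, so dots and dashes become underscores.
def prometheus_name(chart: str, dimension: str) -> str:
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", f"{chart}_{dimension}")
    if name[:1].isdigit():  # names may not start with a digit
        name = "_" + name
    return name
```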
ok, you didn't read my comment about the units. Never mind, I removed the units.
here is the body:
# TYPE net_packets_enp4s0_received counter
net_packets_enp4s0_received{instance="costa_pc"} 10056899 1483476360007
# TYPE net_packets_enp4s0_sent counter
net_packets_enp4s0_sent{instance="costa_pc"} 8006892 1483476360007
# TYPE net_packets_enp4s0_multicast counter
net_packets_enp4s0_multicast{instance="costa_pc"} 1248042 1483476360007
# TYPE net_enp4s0_received counter
net_enp4s0_received{instance="costa_pc"} 10553733016 1483476360007
# TYPE net_enp4s0_sent counter
net_enp4s0_sent{instance="costa_pc"} 1304670083 1483476360007
# TYPE system_load_load1 gauge
system_load_load1{instance="costa_pc"} 2530 1483476359000
# TYPE system_load_load5 gauge
system_load_load5{instance="costa_pc"} 2400 1483476359000
# TYPE system_load_load15 gauge
system_load_load15{instance="costa_pc"} 2330 1483476359000
Now the data are raw (as collected, whatever is collected). I removed the HELP line because netdata does not have any means to know what the value is (it knows how to evaluate it to something else - but it does not have a text representation of what it is).
Here is the HTTP response header:
HTTP/1.1 200 OK
Connection: keep-alive
Server: NetData Embedded HTTP Server
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Type: text/plain; version=0.0.4
Date: Tue, 03 Jan 2017 20:38:26 GMT
Cache-Control: no-cache
Expires: Tue, 03 Jan 2017 20:38:27 GMT
Content-Encoding: gzip
Transfer-Encoding: chunked
Is it ok now?
wait... a dot is left in the dimension names...
ok, updated it.
Please review it.
That looks mostly okay. enp4s0 should be a label I think.
enp4s0 should be a label I think.
hm... netdata maintains a context for each metric, and a family. Normally context + family = chart id (i.e. context net.bandwidth + family enp4s0 gives the final chart). But I can't guarantee that there will be no overlaps.
Do you want to try it?
If you've a full metrics output I can look at we can see what the tradeoff is, as I suspect many of the families wouldn't be labels.
ok.
this is the original: http://pastebin.com/ThJWU3uv
this is with families: http://pastebin.com/9geNGH8m
keep in mind that netdata maintains these for each chart:
type is the first level of the chart menu
family is the second level of the chart menu
context is the prototype of the chart (i.e. all mysql bandwidth charts have context = mysql.net)
type.id uniquely identifies the chart - there are variations on how this is formed. In a few cases it is type_family.id, in others it is type_id.family (yes, I know it is stupid - this is why we need to normalize them #807)
type.name is a friendly alias for the chart (in most cases this is chart.id too)
units is the units of the chart
title is the title of the chart
Then, for dimensions:
id uniquely identifies the dimension within the chart
name is an alias for id, that can also be used to identify the dimension (an example where this is used is the interrupts chart - the id is the interrupt number, while the name is composed from the devices using this interrupt).
I have not committed the new code. Please let me know which one you prefer to merge.
I had to merge it.
Let me know and I'll update it to your preference.
That's a mix of things that should and shouldn't be labels. Unless you've got some additional metadata to help sort that out, not having them as labels is probably best.
https://www.youtube.com/watch?v=KXq5ibSj2qA covers some of the theory.
So, you suggest to leave it only with instance?
Yes.
ok. thanks!
This is already merged.
Can anyone test it please?
This is great. The amount of data exposed is exciting.
I can definitely help test within the next few days.
On Thu, Jan 5, 2017 at 7:14 AM Costa Tsaousis notifications@github.com
wrote:
Closed #1497 https://github.com/firehol/netdata/issues/1497.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/firehol/netdata/issues/1497#event-912559532, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AFYalorjUvCTFCCCK-FOVDprbjxS-J-2ks5rPN6UgaJpZM4LZJew
.
Because this is coming from a concentrator that's setting an instance label, make sure to set honor_labels: true
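For reference, honor_labels is set per scrape job; a minimal sketch (the target address is illustrative):

```yaml
scrape_configs:
  - job_name: netdata
    honor_labels: true    # keep the instance label set by the concentrator
    static_configs:
      - targets: ['concentrator.example.com:19999']
```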
@ldelossa I would appreciate a few instructions being added to the backends wiki page of netdata. Currently, prometheus support does not appear in the netdata docs.
I can definitely give a hand with that, and a quick 'How to' tutorial on getting it all set up. Just give me a few days; the end of this week is proving to be a little hectic. I'm eager to try this out.
It looks like most of your metrics (from the examples) are going to be gauges and not incrementing counters? If so, that just means we have to graph these queries a little differently than the current Node Exporter that prometheus officially supports.
nice!
It looks like most of your metrics (from the examples) are going to be gauges and not incrementing counters
no, this is not the case. netdata exposes the metrics to prometheus the same way it collects them. absolute netdata metrics are gauge in prometheus, incremental netdata metrics are counter in prometheus.
Okay awesome.
Finally had some time to start testing. I have a test box up and everything hooked up. Give me a day or so to graph some stuff and come up with a small tutorial.
You'll need to use params to pass the format parameter.
Oh really? It seems to be working like this, but let me adjust. Not at the computer at this moment.
If it does, there's a bug on the netdata HTTP request parsing side, as it's getting /api/v1/allmetrics%3Fformat=prometheus as the path.
@ldelossa thank you for the wiki page! You may need to link it to the backends page.
Regarding the URL parsing, could you please check netdata's access.log? I think the question mark that separates the query string from the path should be escaped. But let's check what netdata receives...
@brian-brazil @ktsaou
Sorry about the huge delay there, had some stuff to take care of. I'm actually returning to this now and it's working rather well for me. I'm using it with consul at this point - this is the prometheus.yml file that works for me.
```yaml
# my global config
global:
  scrape_interval: 5s      # Set the scrape interval to every 5 seconds. Default is every 1 minute.
  evaluation_interval: 5s  # Evaluate rules every 5 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['0.0.0.0:9090']

  - job_name: netdata
    metrics_path: "/api/v1/allmetrics?format=prometheus"
    consul_sd_configs:
      - server: 'moosive-consul-01:8500'
        token: "**************"
        services: ['netdata']
    relabel_configs:
      - source_labels: [__meta_consul_node]
        regex: (.+)
        target_label: instance
        replacement: '${1}'
```
I do not see any issues as of right now. Brian, where is the param directive you're saying would cause an issue?
The metrics_path should be just /api/v1/allmetrics and the format passed as a param.
so attempting this:
```yaml
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['0.0.0.0:9090']

  - job_name: netdata
    metrics_path: "/api/v1/allmetrics"
    params: "format=prometheus"
    consul_sd_configs:
      - server: 'moosive-consul-01:8500'
        token:
        services: ['netdata']
    relabel_configs:
      - source_labels: [__meta_consul_node]
        regex: (.+)
        target_label: instance
        replacement: '${1}'
```
I get an unmarshal error.
Mar 12 15:50:38 moosive-stats-01 prometheus[1544]: time="2017-03-12T15:50:38Z" level=error msg="Error loading config: couldn't load configuration (-config.file=/opt/prometheus/prometheus.yml): yaml: unmarshal errors:\n line 31: cannot unmarshal !!str `format=...` into url.Values" source="main.go:150"
Params is a map of lists, so:

```yaml
params:
  format: [prometheus]
```
ahh you're totally right, just went to your docs
```yaml
# Optional HTTP URL parameters.
params:
  [ <string>: [<string>, ...] ]
```
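Putting that fix into the netdata job from the earlier config, the working section would look like this (a sketch; consul details as in the original post):

```yaml
  - job_name: netdata
    metrics_path: "/api/v1/allmetrics"
    params:
      format: [prometheus]
    consul_sd_configs:
      - server: 'moosive-consul-01:8500'
        services: ['netdata']
```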
Thanks for your patience. We are back in business.
If that's the preferred configuration I'll make sure to do this moving forward (it works fine). But is there a pending issue with netdata's implementation, if it's working fine the way I originally posted?
If the other way works with netdata, then your HTTP request parser is buggy.
Okay @ktsaou let me know if I can assist in confirming this or not. I have an env I can spin up and spin down pretty quickly to help confirm if there's an issue or not.
Just been trying out the /api/v1/allmetrics?format=prometheus endpoint with netdata 1.6.0-634-g0322e79e_rolling, and I have a couple of comments:
It looks like the # TYPE annotations have gone, is this intentional? (I saw the note about removing # HELP)
The "instance" label value contains the hostname with dots converted to underscore. As far as I understand, this isn't required - or even a good idea.
According to the docs, metric names and label names cannot include dots, but "Label values may contain any Unicode characters"
I see this is also mentioned in https://github.com/firehol/netdata/issues/2410#issuecomment-312487448
Otherwise this looks very useful indeed, thank you!
Aside: as you probably know, there's also a native protocol-buffers based exposition format which would be more bandwidth friendly and probably quicker to encode too. This may be worth implementing once this has settled down a bit.
Aside: as you probably know, there's also a native protocol-buffers based exposition format which would be more bandwidth friendly and probably quicker to encode too. This may be worth implementing once this has settled down a bit.
It's about the same once there's compression, the wins from proto are pretty small and at present it's unclear if Prometheus 2.0 will support proto.
It looks like the # TYPE annotations have gone, is this intentional? (I saw the note about removing # HELP)
Check the docs, they are still there, but not enabled by default (to save bandwidth, since prometheus does not use them anyway): https://github.com/firehol/netdata/wiki/Using-Netdata-with-Prometheus#streaming-data-from-upstream-hosts
The "instance" label value contains the hostname with dots converted to underscore. As far as I understand, this isn't required - or even a good idea.
hm... I see in the code we convert all non-alphanumeric characters to _. This is done for hostnames, charts and dimensions. Keep in mind this is not unicode text, but plain old ASCII. I don't mind doing the conversion differently, but I don't know which one is correct. So, please open a github issue with the specs and I'll be glad to adapt it.
Check the docs, they are still there, but not enabled by default (to save bandwidth, since prometheus does not use them anyway)
Thanks for the pointer, but the docs don't appear to agree with reality. Firstly, it says they are suppressed for format=prometheus_all_hosts, but I'm using format=prometheus. Furthermore, earlier in the page it gives this example:
So for example, let's say we would like to query the metrics for system.cpu.user. We would head to the prometheus metrics at the link, find 'system_cpu_user', and see:
~~~
# TYPE system_cpu_user counter
system_cpu_user{instance="netdata_test"} 198838 1484610710070
~~~
I would be grateful if you could edit the page to correct these!
Since you are editing it, I am adding another option names=yes|no to push chart and dimension names instead of IDs. This is useful if you want to get the metrics with meaningful names (QoS dimensions will have names instead of ids, disks will have device mapper names instead of ids, etc).
Have made the first edit. I'd need to see the change to understand what names=yes would do. Right now I get disk_ops_dm_8_reads, would it be something else?
Yes, it would be disk_ops_NAME_reads where NAME is the dm-8 name as shown in /dev/mapper.
You can install https://github.com/ktsaou/netdata to see it in action.
Thanks, wiki updated.
Question: is a netdata chart name always of the form type.id ? If so, it could be split automatically, to generate output like this:
netdata_disk_ops{id="dm_8", dimension="reads", instance="foo.example.com"}
This is I think closer to the spirit of prometheus, although I would ask the prometheus experts here to confirm.
Furthermore, this would also make the names=yes|no option redundant, because you can include both the name and the id as separate labels:
netdata_disk_ops{id="dm_8", name="myvol", dimension="reads", instance="foo.example.com"}
Note that although this generates a little more network traffic, it doesn't require any more storage in prometheus, because this is still a single time series - a time series being defined as a series of values which all have the same metric name and identical set of label name/value pairs.
netdata_disk_ops{id="dm_8", dimension="reads", instance="foo.example.com"}
Reads belongs in the metric name here.
Furthermore, this would also make the names=yes|no option redundant, because you can include both the name and the id as separate labels:
One of id and name would be seen as redundant here. Extra labels mean more complexity for the users every time they use the expression.
Also question for prometheus experts: does netdata even need to include an instance label at all? Aren't the job and instance labels added automatically?
Reads belongs in the metric name here
netdata_disk_ops:reads{...} then?
One of id and name would be seen as redundant here. Extra labels mean more complexity for the users every time they use the expression.
So generate either id="dm_8" or name="myvol" dependent on the setting of names=no|yes, is that what you'd suggest?
netdata_disk_ops:reads{...} then?
Colon has a different meaning by convention, netdata_disk_ops_read is probably best here.
Aren't the job and instance labels added automatically?
If this is information about a single host, then these labels should not be included.
So generate either id="dm_8" or name="myvol" dependent on the setting of names=no|yes, is that what you'd suggest?
I'd try to avoid a setting, as it means users can't easily share rules, alerts and dashboards.
If you are 100% certain the name is unique in all cases use that, otherwise use the id.
If you are 100% certain the name is unique in all cases use that, otherwise use the id.
The default is now exactly that. I added the option because the default is now different from what it used to be. People will probably freak out and need an option to get back the old behaviour. So names=no is for backwards compatibility.
netdata has the following properties per chart:
id - unfortunately it serves 3 purposes: it defines the chart application (e.g. mysql), the application instance (e.g. mysql_local or mysql_db2) and the chart type (mysql_local.io, mysql_db2.io). However, there is another format: disk_ops.sda (it should be disk_sda.ops). I know it is stupid, but unfortunately this is how it is today. The main menu of the dashboard is controlled by this and there are heuristics in javascript to parse them (my shame).
name is a more friendly id.
context - this is the same with above with the application instance removed. So it is mysql.io or disk.ops. Alarm templates use this and it is correct in all cases (or alarm templates would have been impossible).
family is the submenu of the dashboard. Unfortunately, this is again used differently in several cases. For example disks and network interfaces have the disk or the network interface. But mysql uses it just to group multiple chart together and postgres uses both (groups charts, and provide different sections for different databases).
units is the units of the chart
Then, for dimensions:
id that uniquely identifies the dimension within the chart
name a human friendly id (also unique)
We could send:
CONTEXT{dimension="DIMENSION" chart="CHART", family="FAMILY", instance="HOST"}
or
CONTEXT:DIMENSION{chart="CHART", family="FAMILY", instance="HOST"}
where DIMENSION is either dimension id or name and CHART is either chart id or name.
I don't know if these would be better though.
We have added a [backend].host tags = option for opentsdb. These are propagated with metrics streamed from netdata to netdata. Many people suggest that these tags are a nice solution to several problems. They are currently ignored for prometheus. Do you want me to send them?
id/name sounds like a single label to me. Family and units would be part of the metric name. Chart is redundant with id/name.
We have added a [backend].host tags = option for opentsdb.
Presuming this is a per-host thing, the way to do it would be to send a single time series called "netdata_host_tags" or similar with the value 1, and all the tags as labels.
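In the text exposition format, such an "info-style" metric would look something like this (tag names are illustrative):

```
# TYPE netdata_host_tags gauge
netdata_host_tags{datacenter="dc1",role="webserver"} 1
```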
id/name sounds like a single label to me. Family and units would be part of the metric name. Chart is redundant with id/name.
Can you give a example format, like above? I'll post back a few examples to see how they look in practice.
Presuming this is a per-host thing, the way to do it would be to send a single time series called "netdata_host_tags" or similar with the value 1, and all the tags as labels.
The response may have multiple hosts.
Can you give a example format, like above? I'll post back a few examples to see how they look in practice.
CONTEXT_FAMILY_UNITS{name="NAME",instance="INSTANCE"}
The response may have multiple hosts.
You can distinguish them with instance labels if that's the case.
CONTEXT_FAMILY_UNITS{name="NAME",instance="INSTANCE"}
NAME is the chart name or the dimension name?
UNITS might be "events/s". Are you sure you want this?
NAME is the chart name or the dimension name?
name as you described above.
UNITS might be "events/s". Are you sure you want this?
If the units differ from what is actually being sent, it may be best not to send them.
name as you described above.
hm... I don't follow. There are 2 names: chart and dimension. Their combination is unique within each host.
If the units differ from what is actually being sent, it may be best not to send them.
Yes, it would for incremental dimensions. So:
CONTEXT_FAMILY{chart="NAME", dimension="NAME", instance="INSTANCE"}
?
Ah, you never defined what CHART was. What does it contain?
hm... I think I did:
where DIMENSION is either dimension id or name and CHART is either chart id or name.
I totally understand your confusion however. netdata is a bit controversial, since it tries to merge time-series terminology and visualisation methodologies...
That's the bit that confused me. What's the difference between a DIMENSION and a CHART?
ok. I thought this was the problem.
netdata organises metrics in collections called charts. Each chart has the properties I gave above (id, name, context, family, units).
Then each chart contains metrics called dimensions. All the dimensions of a chart have the same units of measurement and should be contextually in the same category (ie. the metrics for disk bandwidth are read and write and they are both in the same chart).
Each metric however could have a different algorithm (counter, gauge, etc), even in the same chart. Since the internal time-series database of netdata has a fixed step (i.e. per second), netdata uses the algorithms to interpolate the collected values and find the exact value for each slot (so it normalizes them).
In all other backends we support, the user can select whether they want normalized metrics or raw metrics (as collected). In prometheus, we send only raw, to emulate other data collectors (e.g. collectd). But we could send values from the netdata database (i.e. always gauge, even if the source is a counter). EDIT: in this case we could also send the units.
Added a little note at the end of my previous post.
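The per-slot interpolation described above could be sketched like this (an illustration of the idea only, not netdata's actual algorithm, which also handles the different dimension algorithms and collection gaps):

```python
# Illustrative sketch: linearly interpolate irregularly-timed samples onto
# fixed one-second slot boundaries, the way a fixed-step database needs.
def interpolate_to_slots(samples, start, end):
    """samples: sorted list of (time_seconds, value), with
    samples[0][0] <= start; returns one value per whole second
    in [start, end]."""
    out = []
    i = 0
    for t in range(start, end + 1):
        # advance to the last sample at or before slot boundary t
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        t0, v0 = samples[i]
        if i + 1 < len(samples):
            t1, v1 = samples[i + 1]
            frac = (t - t0) / (t1 - t0)
            out.append(v0 + (v1 - v0) * frac)
        else:
            out.append(v0)  # no later sample: hold the last value
    return out
```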
It sounds like CHART would be part of the metric name, and DIMENSION is a label.
ok,
Check this by example:
CHART: mysql_db1.io
CONTEXT: mysql.io
FAMILY: bandwidth
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
CHART: mysql_db2.io
CONTEXT: mysql.io
FAMILY: bandwidth
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
CHART: disk_io.sda
CONTEXT: disk.io
FAMILY: sda
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
CHART: disk_io.sdb
CONTEXT: disk.io
FAMILY: sdb
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
Here it looks like CHART is a label, DIMENSION is part of the metric name, and CONTEXT is the metric name.
FAMILY is a mix, so metric name.
ok, so:
CONTEXT_FAMILY_DIMENSION{chart="CHART", instance="HOST" , tag1="TAG1", ...}
I have to admit that sending FAMILY as part of the metric name bothers me. But I can't find the reason yet. I'll try to provide an example of the whole stream before merging it.
Should I send instance="HOST" only when the response includes multiple hosts? I am not sure about any side effects of this. If I don't send it, how will prometheus know which host it is (where will it get the host name)?
I will send netdata_host_tags{tag1="TAG1", ...} if there is only one host in the response, otherwise I will append them to the labels.
I will continue to send metrics as collected. This means UNITS cannot be added. Do you believe I should provide an option to support also normalized metrics (all gauges, with units)?
Should I send instance="HOST" only when the response includes multiple hosts?
Send it when the response can return multiple hosts. Sometimes sending a label and sometimes not is difficult to deal with.
If I don't send it, how will prometheus know which host it is?
Service discovery & relabelling on our end takes care of that. We know before ever talking to a target what it is.
Do you believe I should provide an option to support also normalized metrics (all gauges, with units)?
I think the current way is best.
Send it when the response can return multiple hosts. Sometimes sending a label and sometimes not is difficult to deal with.
ok. This means that for the same host, at t1 you may or may not receive it, and at t2 you may or may not receive it. Example: you query host A. At t1, host A hosts multiple netdata databases (e.g. its own and host B's), so you get it. At t2, the streaming netdata B has stopped sending metrics (administratively stopped) and you will not receive it any more. At t3 you will receive it again (streaming netdata B was restarted).
If the above is problematic for you, it may be better to always send it.
There are single and multi host modes, so my first thought was:
format=prometheus never sends the label
format=prometheus_all_hosts always sends the label
But then I thought it would maybe be better as:
format=prometheus_all_hosts sends the label, except when label = this host (i.e. only send the label for remotely-collected data)
In that case, switching between the two will not affect data being sent for the local host.
perfect. So localhost will never send instance=.
I have to admit that sending FAMILY as metric name bothers me. But I can't find the reason yet
Perhaps because of the duplication?
CHART: disk_io.sda
CONTEXT: disk.io
FAMILY: sda
In this case, it seems the FAMILY is really behaving as a label which duplicates information from the CHART label. If it's just a logical grouping of graphs to display, but the graphs themselves already have unique CHART names, then there's no need to include the FAMILY at all.
But are there other cases where the FAMILY is needed to uniquely identify a data series?
Everything is in PR #2436
Regarding the family, check this:
disk is the family. I am sure it is not needed there.
# HELP netdata chart "cgroup_graphite.throttle_serviced_ops", context "cgroup.throttle_serviced_ops", family "disk", dimension "read", value * 1 / 1 delta gives operations/s (counter)
# TYPE cgroup_throttle_serviced_ops_disk_read counter
cgroup_throttle_serviced_ops_disk_read{chart="cgroup_graphite.throttle_serviced_ops"} 8598 1499544850459
and
# HELP netdata chart "cgroup_condescending_panini.throttle_serviced_ops", context "cgroup.throttle_serviced_ops", family "disk", dimension "read", value * 1 / 1 delta gives operations/s (counter)
# TYPE cgroup_throttle_serviced_ops_disk_read counter
cgroup_throttle_serviced_ops_disk_read{chart="cgroup_condescending_panini.throttle_serviced_ops"} 0 1499544975615
families are /home and fedora-swap. I am not sure if it provides anything useful. You know.
# HELP netdata chart "disk.fedora_home", context "disk.io", family "/home", dimension "reads", value * 512 / 1024 delta gives kilobytes/s (counter)
# TYPE disk_io__home_reads counter
disk_io__home_reads{chart="disk.fedora_home"} 1345066 1499544771455
# HELP netdata chart "disk.fedora_swap", context "disk.io", family "fedora-swap", dimension "reads", value * 512 / 1024 delta gives kilobytes/s (counter)
# TYPE disk_io_fedora_swap_reads counter
disk_io_fedora_swap_reads{chart="disk.fedora_swap"} 4584 1499544771455
# HELP netdata chart "net_packets.wlp2s0", context "net.packets", family "wlp2s0", dimension "received", value * 1 / 1 delta gives packets/s (counter)
# TYPE net_packets_wlp2s0_received counter
net_packets_wlp2s0_received{chart="net_packets.wlp2s0"} 372685 1499544975450
# HELP netdata chart "net_packets.vpn0", context "net.packets", family "vpn0", dimension "received", value * 1 / 1 delta gives packets/s (counter)
# TYPE net_packets_vpn0_received counter
net_packets_vpn0_received{chart="net_packets.vpn0"} 61666 1499544975450
So, do we need the family in the metric? I understand that all the information exists in the chart, so it seems redundant. But you know...
families are /home and fedora-swap. I am not sure if it provides anything useful.
Again I would defer to prometheus experts; but to me these are really the same metric disk_io_reads, just different instances of it on different devices. Ditto for network interfaces: these are instances of net_packets_received, just looking at different interfaces. Putting the interface in the metric name detracts from the uniformity of the metric.
Aside: I believe that the # HELP lines should give the metric name as the next token, in the same way as # TYPE.
So I am thinking:
# HELP net_packets_received netdata context "net.packets", dimension "received", value * 1 / 1 delta gives packets/s (counter)
# TYPE net_packets_received counter
net_packets_received{chart="net_packets.wlp2s0"} 372685 1499544975450
I don't really like the duplication of net_packets. Does the chart name always have the context as prefix? If so it would be nice to strip it.
net_packets_received{chart="wlp2s0"} 372685 1499544975450
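As an aside, a minimal sketch of how one of these text-format lines breaks down into metric name, labels, value, and timestamp. The regex is a simplified illustration of the format, not a full spec-compliant parser:

```python
import re

# A sample line from the output above.
LINE = 'net_packets_received{chart="net_packets.wlp2s0"} 372685 1499544975450'

# Simplified pattern: metric name, optional {labels}, value, optional timestamp.
m = re.match(
    r'(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional label set
    r'\s+(?P<value>\S+)'                    # sample value
    r'(?:\s+(?P<ts>\d+))?$',                # optional timestamp
    LINE,
)

name = m.group('name')
labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group('labels') or ''))
value = float(m.group('value'))
timestamp_ms = int(m.group('ts'))  # Prometheus timestamps are in milliseconds

print(name, labels, value, timestamp_ms)
```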
Just pushed a commit in PR #2436 to fix the HELP line, so that it starts with the metric name.
Unfortunately, chart ids/names are not uniform. I can only extract the application instance with heuristics (which unfortunately may not work for custom plugins).
Unfortunately, chart ids/names are not uniform. I can only extract the application instance with heuristics (which unfortunately may not work for custom plugins).
I'd be happy to wait for #807 and have a flag day when all the metrics change name.
I would also be happy with internal metrics getting the new labels in prometheus now, even if custom plugins didn't fit the heuristics, and custom plugins changing later.
format=prometheus_all_hosts sends the label, except when label = this host (i.e. only send label for remotely-collected data)
That'd mean that in a multi-host setup, the host you scraped would end up with the Prometheus instance label rather than the netdata one. I can see that causing inconsistencies.
That'd mean that in a multi-host setup, the host you scraped would end up with the Prometheus instance label rather than the netdata one. I can see that causing inconsistencies.
ok, I am fixing that now. When multiple hosts are sent, instance="HOSTNAME" will always be set.
fixed it.
instance="HOSTNAME" and HOST TAGS now follow the same rule:
if netdata is called with format=prometheus_all_hosts the response has them embedded on each metric.
if netdata is called with format=prometheus, instance="HOSTNAME" is not sent at all and HOST TAGS are expressed as netdata_host_tags{HOST_TAGS} at the top of the response.
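For reference, a hedged sketch of how a prometheus.yml scrape job could target each mode. The target addresses are made-up examples; honor_labels keeps the instance labels netdata sends in the all-hosts mode rather than letting Prometheus overwrite them:

```yaml
scrape_configs:
  # Single-host mode: no instance label in the response.
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    static_configs:
      - targets: ['localhost:19999']

  # Multi-host mode: instance="HOSTNAME" is embedded on each metric.
  - job_name: 'netdata-parent'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus_all_hosts]
    honor_labels: true
    static_configs:
      - targets: ['parent.example.com:19999']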
if netdata is called with format=prometheus_all_hosts the response has them embedded on each metric.
HOST TAGS should not be on every metric in this case, only in netdata_host_tags.
This is so there's consistent behaviour across the two modes.
HOST TAGS should not be on every metric in this case, only in netdata_host_tags.
This is so there's consistent behaviour across the two modes.
ok, but how should I send host tags of multiple hosts? Each host has its own host tags.
ok, but how should I send host tags of multiple hosts? Each host has its own host tags.
They'd have a netdata_host_tags each distinguished by instance label.
They'd have a netdata_host_tags each distinguished by instance label
done.
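A rough sketch of what the start of a format=prometheus_all_hosts response could then look like. The host names, tag names, and the netdata_host_tags sample value here are illustrative assumptions, not actual netdata output:

```
netdata_host_tags{environment="prod",instance="web01"} 1 1499544975450
netdata_host_tags{environment="staging",instance="db01"} 1 1499544975450
net_packets_wlp2s0_received{chart="net_packets.wlp2s0",instance="web01"} 372685 1499544975450
```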
@ktsaou Beginning to review the netdata changes now. Will be creating a tutorial for using Netdata and Prometheus with the new formats. Can you please provide me with all the URL parameters applicable to the /api/v1/allmetrics?format=prometheus endpoint and their meanings? I would like this to go into the tutorial as well.
@ldelossa nice!
I think I have documented everything at the wiki page: https://github.com/firehol/netdata/wiki/Using-Netdata-with-Prometheus#netdata-support-for-prometheus
The source code that is parsing the URL parameters is this: https://github.com/firehol/netdata/blob/2d86d96378ab3320d1c5e47fd1c1b1290795b63c/src/web_api_v1.c#L224-L262
@ldelossa if something is not clear enough, just ask. I'll be glad to help...
Great! Thanks a lot. Excited about more support. Will be compiling this tutorial soon.
@ktsaou https://docs.google.com/document/d/1PRj7ov2A47EVc2YDtwCE3bZVEc6kGVMlHh3i6dX9kgM/edit?usp=sharing Tutorial for netdata/prometheus/grafana. This is being reviewed by my company's editors (Vimeo), but here's the first draft.
nice!
It would be great if we can host this on the netdata wiki.
Would you like that?
No problem with me.
ok, would you like to turn it into a wiki page?
Then, we can link it to the wiki main menu.
@ktsaou https://github.com/firehol/netdata/wiki/Netdata,-Prometheus,-and-Grafana-Stack
Linked it:

Thanks for sharing your work!
@brian-brazil sorry for bringing up an old issue, but if tags are only sent in netdata_host_tags, how are we supposed to use these tags when doing queries in Prometheus?
Let's say we have 3 servers, each with db01 as the instance name, and as a tag we set the environment (prod, dev, staging). How can we query Prometheus for only the servers with environment prod?
I fail to find this in the Prometheus docs.
I think target-specific labels will do (Prometheus adds such labels to all metric series from a specific target/host).
I guess what you are trying to do is impossible.
@ilyam8 that's a bummer - will need to find another way then :( Since servers could be split up by either environment or even a type (db, web, proxy, etc.) - UNLESS netdata has a way I can make each metric carry additional labels like environment (despite that not being what is "right" according to the Prometheus devs).
I'm curious if someone else knows how to achieve this, because I don't think I'm the only one who works with different environments, clusters, or server types.
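For anyone landing here later: one hedged option is the usual PromQL "info metric" join, which copies a label from netdata_host_tags onto another series at query time. This sketch assumes netdata_host_tags carries an environment label and shares the instance label with the metric being joined:

```promql
net_packets_wlp2s0_received
  * on(instance) group_left(environment)
  netdata_host_tags{environment="prod"}
```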
so adding labels on rx (at scrape time) doesn't work?
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config
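The static_config approach linked above lets each scrape target carry extra labels such as environment. A minimal sketch, with made-up hostnames and label values:

```yaml
scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    static_configs:
      # Every series scraped from a target gets that target's labels.
      - targets: ['db01.prod.example.com:19999']
        labels:
          environment: 'prod'
      - targets: ['db01.staging.example.com:19999']
        labels:
          environment: 'staging'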