Hey guys,
I recently started using prometheus and I enjoy the simplicity. I want to begin to understand what it would take to implement prometheus support within Netdata. I think this is a great idea because it lets netdata keep its distributed nature while gaining persistence in prometheus. Centralized graphing (not monitoring) can then happen with grafana. Netdata is already a treasure trove of metrics - making this a worthwhile project.
Prometheus expects a REST endpoint to exist which publishes metrics, labels, and values. It will poll this endpoint at a configurable interval and ingest the metrics during that poll.
To get the ball rolling, how are you currently serving http in Netdata? Is this an embedded sockets server in C ?
Hi, thanks! Nice that you like netdata!
So, you want prometheus to pull data from netdata? netdata has an API. Check: https://github.com/firehol/netdata/wiki/REST-API-v1
But, I also see that data can be pushed to prometheus: https://github.com/prometheus/pushgateway
This one seems very simple to be supported in netdata. Check: https://github.com/firehol/netdata/wiki/netdata-backends#adding-more-backends
Is this an embedded sockets server in C ?
yeap!
Prometheus developer here. If you want to integrate the best way would be to expose our text exposition format over HTTP. The spec is at https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details. Collectd and telegraf do this for example.
@brian-brazil thanks for joining this discussion.
So, for prometheus, is it better to send:
prefix.host.chart{metric="dimension"} value timestamp
or
prefix.host.chart.dimension value timestamp
where:
prefix is user controlled (by default netdata)
host is the hostname of the machine (also user defined)
chart is the chart id (netdata also maintains names for charts)
dimension is the dimension id of the chart (netdata also maintains names for dimensions)
value is a long double number
timestamp can be ms
All of the above are alphanumeric, possibly with ., - and _.
Also, does prometheus respond something back?
@brian-brazil another question:
I see an HTTP header is needed. So, netdata will send an HTTP header on connect, and then stream metrics forever. The HTTP header has to be present only on socket connection. Is that right?
Which URL and HTTP method should be used?
I pushed some code to implement this. However, there is no HTTP header yet, so it won't work. Once we add the HTTP header we can test it.
@ktsaou
Are you leaning towards implementing an endpoint on the netdata API which will export these metrics?
I see exporters written as middleware for Prometheus, but I believe integrating it into the native API is nicer, to maintain its distributed nature.
So, for prometheus, is it better to send:
prefix.chartname{dimensionname="dimensionvalue"} value
Prefix should be hardcoded. As you have a per-host daemon, Prometheus would optimally be hitting each of these so it'll be applying the instance label on its end.
Generally avoid timestamps, the way it's meant to work is that you'll fetch data when the request comes in.
Also, does prometheus respond something back?
I see an HTTP header is needed. So, netdata will send an HTTP header on connect, and then stream metrics forever. The HTTP header has to be present only on socket connection. Is that right?
Prometheus is a pull based system. You'll receive a GET, usually to /metrics. Setting the appropriate content-type header is encouraged in your response.
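To make the pull model concrete, here is a minimal sketch in Python (an illustration only; netdata's actual server is the embedded C web server discussed above, and the metric name here is made up):

```python
# Minimal sketch of the pull model Prometheus expects: answer GET /metrics
# with the text exposition format and the version 0.0.4 content type.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A single hypothetical metric in the text exposition format.
        body = b'example_metric{label="demo"} 42\n'
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    srv = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Prometheus would then scrape this endpoint on its own schedule; the server just answers each GET with the current values.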
Oh... I see. You want it the other way around. Prometheus to pull the data from netdata.
you'll fetch data when the request comes in
It is a bit more complex than that: netdata has an internal round-robin database with 1k-5k metrics per server, all updated every second. At the rate we add metrics, netdata will reach 10k metrics per server in a few months.
If prometheus hits netdata servers once every 10 seconds, it will get back the value of the latest 1 second out of 10. Prometheus will miss 9 seconds of data.
netdata can reduce the data. So, if you hit each netdata once every 10 seconds, you can ask netdata to give you the average of the last 10 seconds. Actually, you can give to netdata the starting and the ending timestamps and the grouping method (average, min, max, sum, etc).
netdata has already an API for all that. The netdata dashboards use it (this is how zooming out netdata charts is so fast). It is plain JSON and is documented here: https://github.com/firehol/netdata/wiki/REST-API-v1
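Based on the API page above, a single reduced value could be requested with a URL like this (a sketch; parameter names as documented on that page):

```
# average of the last 10 seconds of the system.cpu chart, as one point
http://netdata.ip:19999/api/v1/data?chart=system.cpu&after=-10&points=1&group=average
```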
So, can prometheus adapt to the netdata API?
If this is too complex for prometheus, I would suggest to do it the other way around: netdata to push data to prometheus. The backend module in netdata already knows how to handle this properly, without stressing each server and without losing data if the backend is down for a short period of time. It also provides alarms in case the backend server loses data.
netdata can reduce the data. So, if you hit each netdata once every 10 seconds, you can ask netdata to give you the average of the last 10 seconds. Actually, you can give to netdata the starting and the ending timestamps and the grouping method (average, min, max, sum, etc).
We'd prefer to get the raw data, and handle the processing on our end. Do you have the notion of counters and gauges?
If this is too complex for prometheus, I would suggest to do it the other way around: netdata to push data to prometheus.
Prometheus can't be operated in that fashion.
We'd prefer to get the raw data, and handle the processing on our end. Do you have the notion of counters and gauges?
Yes of course, but netdata interpolates everything to provide a fixed step time-series database, ready to be used in charts.
netdata also maintains the last collected value, as collected, and the exact time in microseconds it was collected. I guess this is what interests you. Unfortunately, there is no API to get these...
So, what you need, is an API call to get all the metrics, as collected.
Let me check how hard that can be...
Almost done.
The URL will be:
http://netdata.ip:19999/api/v1/allmetrics?format=prometheus with a GET.
The output header will be:
HTTP/1.1 200 OK
Connection: keep-alive
Server: NetData Embedded HTTP Server
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Type: text/plain; charset=utf-8
Date: Tue, 03 Jan 2017 18:35:58 GMT
Cache-Control: no-cache
Expires: Tue, 03 Jan 2017 18:35:59 GMT
Content-Encoding: gzip
Transfer-Encoding: chunked
keep-alive and gzip are controlled by the caller.
The text output will be something like:
# HELP net_packets.enp4s0.received packets/s
# TYPE net_packets.enp4s0.received counter
net_packets.enp4s0.received 9768385.0000000 1483468558004
# HELP net_packets.enp4s0.sent packets/s
# TYPE net_packets.enp4s0.sent counter
net_packets.enp4s0.sent 7795059.0000000 1483468558004
# HELP net_packets.enp4s0.multicast packets/s
# TYPE net_packets.enp4s0.multicast counter
net_packets.enp4s0.multicast 1224642.0000000 1483468558004
# HELP net.enp4s0.received kilobits/s
# TYPE net.enp4s0.received counter
net.enp4s0.received 80013394.5546875 1483468558004
# HELP net.enp4s0.sent kilobits/s
# TYPE net.enp4s0.sent counter
net.enp4s0.sent 9992442.7734375 1483468558004
I haven't used the format you suggested, because I couldn't figure out what to put in the 2 values requested on each line. If you want it in a different way, please let me know.
The HELP line states the netdata units. For counter metrics the rate is invalid (since you get raw data), but the rest is ok (in the above example, the values given are kilobits since boot, not kilobits/s). If this is misleading I can remove the entire line.
Any comments?
Content-Type: text/plain; charset=utf-8
This should be text/plain; version=0.0.4
# HELP net_packets.enp4s0.received packets/s
# TYPE net_packets.enp4s0.received counter
net_packets.enp4s0.received 9768385.0000000 1483468558004
If that's a counter then it's not /s. Periods are not permitted in metric names, you should use underscores instead. The host should also be in a label called "instance".
# HELP net.enp4s0.received kilobits/s
Can you normalise to bytes?
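The renaming rules Brian describes (underscores instead of periods, no dots in metric names) could be sketched like this (a hypothetical helper, not netdata's actual code):

```python
import re

# Hypothetical helper: build a legal Prometheus metric name from a netdata
# chart id and dimension id. Metric names must match
# [a-zA-Z_:][a-zA-Z0-9_:]*, so dots and dashes become underscores.
def prometheus_name(chart: str, dimension: str) -> str:
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", f"{chart}_{dimension}")
    if name[:1].isdigit():  # names may not start with a digit
        name = "_" + name
    return name
```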
ok, you didn't read my comment about the units. Never mind, I removed the units.
here is the body:
# TYPE net_packets_enp4s0_received counter
net_packets_enp4s0_received{instance="costa_pc"} 10056899 1483476360007
# TYPE net_packets_enp4s0_sent counter
net_packets_enp4s0_sent{instance="costa_pc"} 8006892 1483476360007
# TYPE net_packets_enp4s0_multicast counter
net_packets_enp4s0_multicast{instance="costa_pc"} 1248042 1483476360007
# TYPE net_enp4s0_received counter
net_enp4s0_received{instance="costa_pc"} 10553733016 1483476360007
# TYPE net_enp4s0_sent counter
net_enp4s0_sent{instance="costa_pc"} 1304670083 1483476360007
# TYPE system_load_load1 gauge
system_load_load1{instance="costa_pc"} 2530 1483476359000
# TYPE system_load_load5 gauge
system_load_load5{instance="costa_pc"} 2400 1483476359000
# TYPE system_load_load15 gauge
system_load_load15{instance="costa_pc"} 2330 1483476359000
Now the data are raw (as collected, whatever is collected). I removed the HELP line because netdata does not have any means to know what the value is (it knows how to evaluate it to something else - but it does not have a text representation of what it is).
Here is the HTTP response header:
HTTP/1.1 200 OK
Connection: keep-alive
Server: NetData Embedded HTTP Server
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Type: text/plain; version=0.0.4
Date: Tue, 03 Jan 2017 20:38:26 GMT
Cache-Control: no-cache
Expires: Tue, 03 Jan 2017 20:38:27 GMT
Content-Encoding: gzip
Transfer-Encoding: chunked
Is it ok now?
wait... a dot is left in the dimension names...
ok, updated it.
Please review it.
That looks mostly okay. enp4s0 should be a label I think.
enp4s0 should be a label I think.
hm... netdata maintains a context for each metric, and a family. Normally context + family = chart id (i.e. context net.bandwidth + family enp4s0 gives the final chart). But I can't guarantee that there will be no overlaps.
Do you want to try it?
If you've a full metrics output I can look at we can see what the tradeoff is, as I suspect many of the families wouldn't be labels.
ok.
this is the original: http://pastebin.com/ThJWU3uv
this is with families: http://pastebin.com/9geNGH8m
keep in mind that netdata maintains these for each chart:
type is the first level of the chart menu
family is the second level of the chart menu
context is the prototype of the chart (i.e. all mysql bandwidth charts have context = mysql.net)
type.id uniquely identifies the chart - there are variations on how this is formed. In a few cases it is type_family.id, in others it is type_id.family (yes, I know it is stupid - this is why we need to normalize them #807)
type.name is a friendly alias for the chart (in most cases this is chart.id too)
units is the units of the chart
title is the title of the chart
Then, for dimensions:
id uniquely identifies the dimension within the chart
name is an alias for id, that can also be used to identify the dimension (an example where this is used is the interrupts chart - the id is the interrupt number, while the name is composed from the devices using this interrupt).
I have not committed the new code. Please let me know which one you prefer to merge.
I had to merge it.
Let me know and I'll update it to your preference.
That's a mix of things that should and shouldn't be labels. Unless you've got some additional metadata to help sort that out, not having them as labels is probably best.
https://www.youtube.com/watch?v=KXq5ibSj2qA covers some of the theory.
So, you suggest to leave it only with instance?
Yes.
ok. thanks!
This is already merged.
Can anyone test it please?
This is great. The amount of data exposed is exciting.
I can definitely help test within the next few days.
On Thu, Jan 5, 2017 at 7:14 AM Costa Tsaousis notifications@github.com
wrote:
Closed #1497 https://github.com/firehol/netdata/issues/1497.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/firehol/netdata/issues/1497#event-912559532, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AFYalorjUvCTFCCCK-FOVDprbjxS-J-2ks5rPN6UgaJpZM4LZJew
.
Because this is coming from a concentrator that's setting an instance label, make sure to set honor_labels: true
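For reference, honor_labels is set per scrape job; a minimal sketch (the target address is illustrative):

```yaml
scrape_configs:
  - job_name: netdata
    honor_labels: true    # keep the instance label set by the concentrator
    static_configs:
      - targets: ['concentrator.example.com:19999']
```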
@ldelossa I would appreciate a few instructions being added to the backends wiki page of netdata. Currently, prometheus support does not appear in the netdata docs.
I can definitely give a hand with that, and a quick 'How to' tutorial on getting it all set up. Just give me a few days; the end of this week is proving to be a little hectic. I'm eager to try this out.
It looks like most of your metrics (from the examples) are going to be gauges and not incrementing counters? If so, that just means we have to graph these queries a little differently than the current Node Exporter that prometheus officially supports.
nice!
It looks like most of your metrics (from the examples) are going to be gauges and not incrementing counters
no, this is not the case. netdata exposes the metrics to prometheus the same way it collects them. absolute netdata metrics are gauge in prometheus, incremental netdata metrics are counter in prometheus.
Okay awesome.
Finally had some time to start testing. I have a test box up and everything hooked up. Give me a day or so to graph some stuff and come up with a small tutorial.
You'll need to use params to pass the format parameter.
Oh really? It seems to be working like this, but let me adjust. Not at the computer at this moment.
If it does, there's a bug on the netdata HTTP request parsing side, as it's getting /api/v1/allmetrics%3Fformat=prometheus as the path.
@ldelossa thank you for the wiki page! You may need to link it to the backends page.
Regarding the URL parsing, could you please check netdata's access.log? I think the question mark that separates the query string from the path should be escaped. But let's check what netdata receives...
@brian-brazil @ktsaou
Sorry about the huge delay there, had some stuff to take care of. I'm actually returning to this now and it's working rather well for me. I'm using it with consul at this point - this is the prometheus.yml file that works for me.
```yaml
# my global config
global:
  scrape_interval: 5s      # Set the scrape interval to every 5 seconds. Default is every 1 minute.
  evaluation_interval: 5s  # Evaluate rules every 5 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['0.0.0.0:9090']

  - job_name: netdata
    metrics_path: "/api/v1/allmetrics?format=prometheus"
    consul_sd_configs:
      - server: 'moosive-consul-01:8500'
        token: "**************"
        services: ['netdata']
    relabel_configs:
      - source_labels: [__meta_consul_node]
        regex: (.+)
        target_label: instance
        replacement: '${1}'
```
I do not see any issues as of right now. Brian, where is the param directive you're saying would cause an issue?
The metrics_path should be just /api/v1/allmetrics and the format passed as a param.
so attempting this:
```yaml
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['0.0.0.0:9090']

  - job_name: netdata
    metrics_path: "/api/v1/allmetrics"
    params: "format=prometheus"
    consul_sd_configs:
      - server: 'moosive-consul-01:8500'
        token:
        services: ['netdata']
    relabel_configs:
      - source_labels: [__meta_consul_node]
        regex: (.+)
        target_label: instance
        replacement: '${1}'
```
I get an unmarshal error.
Mar 12 15:50:38 moosive-stats-01 prometheus[1544]: time="2017-03-12T15:50:38Z" level=error msg="Error loading config: couldn't load configuration (-config.file=/opt/prometheus/prometheus.yml): yaml: unmarshal errors:\n line 31: cannot unmarshal !!str `format=...` into url.Values" source="main.go:150"
Params is a map of lists, so:

```yaml
params:
  format: [prometheus]
```
ahh you're totally right, just went to your docs
```yaml
# Optional HTTP URL parameters.
params:
  [ <string>: [<string>, ...] ]
```
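Putting that fix into the netdata job from the earlier config, the working section would look like this (a sketch; consul details as in the original post):

```yaml
  - job_name: netdata
    metrics_path: "/api/v1/allmetrics"
    params:
      format: [prometheus]
    consul_sd_configs:
      - server: 'moosive-consul-01:8500'
        services: ['netdata']
```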
Thanks for your patience. We are back in business.
If that's the preferred configuration I'll make sure to do this moving forward (it works fine). But is there a pending issue with netdata's implementation, if it's working fine the way I originally posted?
If the other way works with netdata, then your HTTP request parser is buggy.
Okay @ktsaou let me know if I can assist in confirming this or not. I have an env I can spin up and spin down pretty quickly to help confirm if there's an issue or not.
Just been trying out the /api/v1/allmetrics?format=prometheus endpoint with netdata 1.6.0-634-g0322e79e_rolling, and I have a couple of comments:
It looks like the # TYPE annotations have gone, is this intentional? (I saw the note about removing # HELP)
The "instance" label value contains the hostname with dots converted to underscore. As far as I understand, this isn't required - or even a good idea.
According to the docs, metric names and label names cannot include dots, but "Label values may contain any Unicode characters"
I see this is also mentioned in https://github.com/firehol/netdata/issues/2410#issuecomment-312487448
Otherwise this looks very useful indeed, thank you!
Aside: as you probably know, there's also a native protocol-buffers based exposition format which would be more bandwidth friendly and probably quicker to encode too. This may be worth implementing once this has settled down a bit.
Aside: as you probably know, there's also a native protocol-buffers based exposition format which would be more bandwidth friendly and probably quicker to encode too. This may be worth implementing once this has settled down a bit.
It's about the same once there's compression, the wins from proto are pretty small and at present it's unclear if Prometheus 2.0 will support proto.
It looks like the # TYPE annotations have gone, is this intentional? (I saw the note about removing # HELP)
Check the docs, they are still there, but not enabled by default (to save bandwidth, since prometheus does not use them anyway): https://github.com/firehol/netdata/wiki/Using-Netdata-with-Prometheus#streaming-data-from-upstream-hosts
The "instance" label value contains the hostname with dots converted to underscore. As far as I understand, this isn't required - or even a good idea.
hm... I see in the code we convert all non-alphanumeric characters to _. This is done for hostnames, charts and dimensions. Keep in mind this is not unicode text, but plain old ASCII. I don't mind doing the conversion differently, but I don't know which one is correct. So, please open a github issue with the specs and I'll be glad to adapt it.
Check the docs, they are still there, but not enabled by default (to save bandwidth, since prometheus does not use them anyway)
Thanks for the pointer, but the docs don't appear to agree with reality. Firstly, it says they are suppressed for format=prometheus_all_hosts, but I'm using format=prometheus. Furthermore, earlier in the page it gives this example:
So for example, let's say we would like to query the metrics for system.cpu.user. We would head to the prometheus metrics at the link, find 'system_cpu_user', and see:
~~~
# TYPE system_cpu_user counter
system_cpu_user{instance="netdata_test"} 198838 1484610710070
~~~
I would be grateful if you could edit the page to correct these!
Since you are editing it, I am adding another option names=yes|no to push chart and dimension names instead of IDs. This is useful if you want to get the metrics with meaningful names (QoS dimensions will have names instead of ids, disks will have device mapper names instead of ids, etc).
Have made the first edit. I'd need to see the change to understand what names=yes would do. Right now I get disk_ops_dm_8_reads, would it be something else?
Yes, it would be disk_ops_NAME_reads where NAME is the dm-8 name as shown in /dev/mapper.
You can install https://github.com/ktsaou/netdata to see it in action.
Thanks, wiki updated.
Question: is a netdata chart name always of the form type.id ? If so, it could be split automatically, to generate output like this:
netdata_disk_ops{id="dm_8", dimension="reads", instance="foo.example.com"}
This is I think closer to the spirit of prometheus, although I would ask the prometheus experts here to confirm.
Furthermore, this would also make the names=yes|no option redundant, because you can include both the name and the id as separate labels:
netdata_disk_ops{id="dm_8", name="myvol", dimension="reads", instance="foo.example.com"}
Note that although this generates a little more network traffic, it doesn't require any more storage in prometheus, because this is still a single time series - a time series being defined as a series of values which all have the same metric name and identical set of label name/value pairs.
netdata_disk_ops{id="dm_8", dimension="reads", instance="foo.example.com"}
Reads belongs in the metric name here.
Furthermore, this would also make the names=yes|no option redundant, because you can include both the name and the id as separate labels:
One of id and name would be seen as redundant here. Extra labels mean more complexity for the users every time they use the expression.
Also question for prometheus experts: does netdata even need to include an instance label at all? Aren't the job and instance labels added automatically?
Reads belongs in the metric name here
netdata_disk_ops:reads{...} then?
One of id and name would be seen as redundant here. Extra labels mean more complexity for the users every time they use the expression.
So generate either id="dm_8" or name="myvol" dependent on the setting of names=no|yes, is that what you'd suggest?
netdata_disk_ops:reads{...} then?
Colon has a different meaning by convention, netdata_disk_ops_read is probably best here.
Aren't the job and instance labels added automatically?
If this is information about a single host, then these labels should not be included.
So generate either id="dm_8" or name="myvol" dependent on the setting of names=no|yes, is that what you'd suggest?
I'd try to avoid a setting, as it means users can't easily share rules, alerts and dashboards.
If you are 100% certain the name is unique in all cases use that, otherwise use the id.
If you are 100% certain the name is unique in all cases use that, otherwise use the id.
The default is now exactly that. I added the option because the default is now different from what it used to be. People will probably freak out and need an option to get back the old behaviour. So names=no is for backwards compatibility.
netdata has the following properties per chart:
id - unfortunately it serves 3 purposes: it defines the chart application (e.g. mysql), the application instance (e.g. mysql_local or mysql_db2) and the chart type (mysql_local.io, mysql_db2.io). However, there is another format: disk_ops.sda (it should be disk_sda.ops). I know it is stupid, but unfortunately this is how it is today. The main menu of the dashboard is controlled by this and there are heuristics in javascript to parse them (my shame).
name is a more friendly id.
context - this is the same with above with the application instance removed. So it is mysql.io or disk.ops. Alarm templates use this and it is correct in all cases (or alarm templates would have been impossible).
family is the submenu of the dashboard. Unfortunately, this is again used differently in several cases. For example disks and network interfaces have the disk or the network interface. But mysql uses it just to group multiple chart together and postgres uses both (groups charts, and provide different sections for different databases).
units is the units of the chart
Then, for dimensions:
id that uniquely identifies the dimension within the chart
name a human friendly id (also unique)
We could send:
CONTEXT{dimension="DIMENSION" chart="CHART", family="FAMILY", instance="HOST"}
or
CONTEXT:DIMENSION{chart="CHART", family="FAMILY", instance="HOST"}
where DIMENSION is either dimension id or name and CHART is either chart id or name.
I don't know if these would be better though.
We have added a [backend].host tags = option for opentsdb. These are propagated with metrics streamed from netdata to netdata. Many people suggest that these tags are a nice solution to several problems. They are currently ignored for prometheus. Do you want me to send them?
id/name sounds like a single label to me. Family and units would be part of the metric name. Chart is redundant with id/name.
We have added a [backend].host tags = option for opentsdb.
Presuming this is a per-host thing, the way to do it would be to send a single time series called "netdata_host_tags" or similar with the value 1, and all the tags as labels.
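In the text exposition format, such an "info-style" metric would look something like this (tag names are illustrative):

```
# TYPE netdata_host_tags gauge
netdata_host_tags{datacenter="dc1",role="webserver"} 1
```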
id/name sounds like a single label to me. Family and units would be part of the metric name. Chart is redundant with id/name.
Can you give a example format, like above? I'll post back a few examples to see how they look in practice.
Presuming this is a per-host thing, the way to do it would be to send a single time series called "netdata_host_tags" or similar with the value 1, and all the tags as labels.
The response may have multiple hosts.
Can you give a example format, like above? I'll post back a few examples to see how they look in practice.
CONTEXT_FAMILY_UNITS{name="NAME",instance="INSTANCE"}
The response may have multiple hosts.
You can distinguish them with instance labels if that's the case.
CONTEXT_FAMILY_UNITS{name="NAME",instance="INSTANCE"}
NAME is the chart name or the dimension name?
UNITS might be "events/s". Are you sure you want this?
NAME is the chart name or the dimension name?
name as you described above.
UNITS might be "events/s". Are you sure you want this?
If the units differ from what is actually being sent, it may be best not to send them.
name as you described above.
hm... I don't follow. There are 2 names: chart and dimension. Their combination is unique within each host.
If the units differ from what is actually being sent, it may be best not to send them.
Yes, it would for incremental dimensions. So:
CONTEXT_FAMILY{chart="NAME", dimension="NAME", instance="INSTANCE"}
?
Ah, you never defined what CHART was. What does it contain?
hm... I think I did:
where DIMENSION is either dimension id or name and CHART is either chart id or name.
I totally understand your confusion however. netdata is a bit controversial, since it tries to merge time-series terminology and visualisation methodologies...
That's the bit that confused me. What's the difference between a DIMENSION and a CHART?
ok. I thought this was the problem.
netdata organises metrics in collections called charts. Each chart has the properties I gave above (id, name, context, family, units).
Then each chart contains metrics called dimensions. All the dimensions of a chart have the same units of measurement and should be contextually in the same category (ie. the metrics for disk bandwidth are read and write and they are both in the same chart).
Each metric however could have a different algorithm (counter, gauge, etc), even in the same chart. Since the internal time-series database of netdata has a fixed step (i.e. per second), netdata uses the algorithms to interpolate the collected values and find the exact value for each slot (so it normalizes them).
In all other backends we support, the user can select whether they want normalized metrics or raw metrics (as collected). In prometheus, we send only raw, to emulate other data collectors (e.g. collectd). But we could send values from the netdata database (i.e. always gauge, even if the source is a counter). EDIT: in this case we could also send the units.
Added a little note at the end of my previous post.
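The per-slot interpolation described above could be sketched like this (an illustration of the idea only, not netdata's actual algorithm, which also handles the different dimension algorithms and collection gaps):

```python
# Illustrative sketch: linearly interpolate irregularly-timed samples onto
# fixed one-second slot boundaries, the way a fixed-step database needs.
def interpolate_to_slots(samples, start, end):
    """samples: sorted list of (time_seconds, value), with
    samples[0][0] <= start; returns one value per whole second
    in [start, end]."""
    out = []
    i = 0
    for t in range(start, end + 1):
        # advance to the last sample at or before slot boundary t
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        t0, v0 = samples[i]
        if i + 1 < len(samples):
            t1, v1 = samples[i + 1]
            frac = (t - t0) / (t1 - t0)
            out.append(v0 + (v1 - v0) * frac)
        else:
            out.append(v0)  # no later sample: hold the last value
    return out
```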
It sounds like CHART would be part of the metric name, and DIMENSION is a label.
ok,
Check this by example:
CHART: mysql_db1.io
CONTEXT: mysql.io
FAMILY: bandwidth
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
CHART: mysql_db2.io
CONTEXT: mysql.io
FAMILY: bandwidth
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
CHART: disk_io.sda
CONTEXT: disk.io
FAMILY: sda
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
CHART: disk_io.sdb
CONTEXT: disk.io
FAMILY: sdb
UNITS: MB/s
DIMENSION1: read
DIMENSION2: write
Here it looks like CHART is a label, DIMENSION is part of the metric name, and CONTEXT is the metric name.
FAMILY is a mix, so metric name.
ok, so:
CONTEXT_FAMILY_DIMENSION{chart="CHART", instance="HOST" , tag1="TAG1", ...}
I have to admit that sending FAMILY as part of the metric name bothers me. But I can't find the reason yet. I'll try to provide an example of the whole stream before merging it.
Should I send instance="HOST" only when the response includes multiple hosts? I am not sure about any side effects of this. If I don't send it, how will prometheus know which host it is (where will it get the host name)?
I will send netdata_host_tags{tag1="TAG1", ...} if there is only one host in the response, otherwise I will append them to the labels.
I will continue to send metrics as collected. This means UNITS cannot be added. Do you believe I should provide an option to support also normalized metrics (all gauges, with units)?
Should I send instance="HOST" only when the response includes multiple hosts?
Send it when the response can return multiple hosts. Sometimes sending a label and sometimes not is difficult to deal with.
If I don't send it, how will prometheus know which host it is?
Service discovery & relabelling on our end takes care of that. We know before ever talking to a target what it is.
Do you believe I should provide an option to support also normalized metrics (all gauges, with units)?
I think the current way is best.
Send it when the response can return multiple hosts. Sometimes sending a label and sometimes not is difficult to deal with.
ok. This means that for the same host, at t1 you may or may not receive it, and at t2 you may or may not receive it. Example: you query host A. At t1, host A hosts multiple netdata databases (e.g. its own and host B's), so you get it. At t2, the streaming netdata B has stopped sending metrics (administratively stopped) and you will not receive it any more. At t3 you will receive it again (streaming netdata B was restarted).
If the above is problematic for you, it may be better to always send it.
There are single and multi host modes, so my first thought was:
format=prometheus never sends the label
format=prometheus_all_hosts always sends the label
But then I thought it would maybe be better as:
format=prometheus_all_hosts sends the label, except when label = this host (i.e. only send the label for remotely-collected data)
In that case, switching between the two will not affect data being sent for the local host.
perfect. So localhost will never send instance=.
I have to admit that sending FAMILY as metric name bothers me. But I can't find the reason yet
Perhaps because of the duplication?
CHART: disk_io.sda
CONTEXT: disk.io
FAMILY: sda
In this case, it seems the FAMILY is really behaving as a label which duplicates information from the CHART label. If it's just a logical grouping of graphs to display, but the graphs themselves already have unique CHART names, then there's no need to include the FAMILY at all.
But are there other cases where the FAMILY is needed to uniquely identify a data series?
Everything is in PR #2436
Regarding the family, check this:
disk is the family. I am sure it is not needed there.
# HELP netdata chart "cgroup_graphite.throttle_serviced_ops", context "cgroup.throttle_serviced_ops", family "disk", dimension "read", value * 1 / 1 delta gives operations/s (counter)
# TYPE cgroup_throttle_serviced_ops_disk_read counter
cgroup_throttle_serviced_ops_disk_read{chart="cgroup_graphite.throttle_serviced_ops"} 8598 1499544850459
and
# HELP netdata chart "cgroup_condescending_panini.throttle_serviced_ops", context "cgroup.throttle_serviced_ops", family "disk", dimension "read", value * 1 / 1 delta gives operations/s (counter)
# TYPE cgroup_throttle_serviced_ops_disk_read counter
cgroup_throttle_serviced_ops_disk_read{chart="cgroup_condescending_panini.throttle_serviced_ops"} 0 1499544975615
families are /home and fedora-swap. I am not sure if it provides anything useful. You know.
# HELP netdata chart "disk.fedora_home", context "disk.io", family "/home", dimension "reads", value * 512 / 1024 delta gives kilobytes/s (counter)
# TYPE disk_io__home_reads counter
disk_io__home_reads{chart="disk.fedora_home"} 1345066 1499544771455
# HELP netdata chart "disk.fedora_swap", context "disk.io", family "fedora-swap", dimension "reads", value * 512 / 1024 delta gives kilobytes/s (counter)
# TYPE disk_io_fedora_swap_reads counter
disk_io_fedora_swap_reads{chart="disk.fedora_swap"} 4584 1499544771455
# HELP netdata chart "net_packets.wlp2s0", context "net.packets", family "wlp2s0", dimension "received", value * 1 / 1 delta gives packets/s (counter)
# TYPE net_packets_wlp2s0_received counter
net_packets_wlp2s0_received{chart="net_packets.wlp2s0"} 372685 1499544975450
# HELP netdata chart "net_packets.vpn0", context "net.packets", family "vpn0", dimension "received", value * 1 / 1 delta gives packets/s (counter)
# TYPE net_packets_vpn0_received counter
net_packets_vpn0_received{chart="net_packets.vpn0"} 61666 1499544975450
So, do we need the family in the metric? I understand that all the information exists in the chart, so it seems redundant. But you know...
families are /home and fedora-swap. I am not sure if it provides anything useful.
Again I would defer to prometheus experts; but to me these are really the same metric disk_io_reads, just different instances of it on different devices. Ditto for network interfaces: these are instances of net_packets_received, just looking at different interfaces. Putting the interface in the metric name detracts from the uniformity of the metric.
Aside: I believe that the # HELP lines should give the metric name as the next token, in the same way as # TYPE.
So I am thinking:
# HELP net_packets_received netdata context "net.packets", dimension "received", value * 1 / 1 delta gives packets/s (counter)
# TYPE net_packets_received counter
net_packets_received{chart="net_packets.wlp2s0"} 372685 1499544975450
I don't really like the duplication of net_packets. Does the chart name always have the context as prefix? If so it would be nice to strip it.
net_packets_received{chart="wlp2s0"} 372685 1499544975450
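As an aside, a minimal sketch of how one of these text-format lines breaks down into metric name, labels, value, and timestamp. The regex is a simplified illustration of the format, not a full spec-compliant parser:

```python
import re

# A sample line from the output above.
LINE = 'net_packets_received{chart="net_packets.wlp2s0"} 372685 1499544975450'

# Simplified pattern: metric name, optional {labels}, value, optional timestamp.
m = re.match(
    r'(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional label set
    r'\s+(?P<value>\S+)'                    # sample value
    r'(?:\s+(?P<ts>\d+))?$',                # optional timestamp
    LINE,
)

name = m.group('name')
labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group('labels') or ''))
value = float(m.group('value'))
timestamp_ms = int(m.group('ts'))  # Prometheus timestamps are in milliseconds

print(name, labels, value, timestamp_ms)
```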
Just pushed a commit in PR #2436 to fix the HELP line, so that it starts with the metric name.
Unfortunately, chart ids/names are not uniform. I can only extract the application instance with heuristics (which unfortunately may not work for custom plugins).
Unfortunately, chart ids/names are not uniform. I can only extract the application instance with heuristics (which unfortunately may not work for custom plugins).
I'd be happy to wait for #807 and have a flag day when all the metrics change name.
I would also be happy with internal metrics getting the new labels in prometheus now, even if custom plugins didn't fit the heuristics, and custom plugins changing later.
format=prometheus_all_hosts sends the label, except when label = this host (i.e. only send label for remotely-collected data)
That'd mean that in a multi-host setup, the host you scraped would end up with the Prometheus instance label rather than the netdata one. I can see that causing inconsistencies.
That'd mean that in a multi-host setup, the host you scraped would end up with the Prometheus instance label rather than the netdata one. I can see that causing inconsistencies.
ok, I am fixing that now. When multiple hosts are sent, instance="HOSTNAME" will always be set.
fixed it.
instance="HOSTNAME" and HOST TAGS now follow the same rule:
if netdata is called with format=prometheus_all_hosts the response has them embedded on each metric.
if netdata is called with format=prometheus, instance="HOSTNAME" is not sent at all and HOST TAGS are expressed as netdata_host_tags{HOST_TAGS} at the top of the response.
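For reference, a hedged sketch of how a prometheus.yml scrape job could target each mode. The target addresses are made-up examples; honor_labels keeps the instance labels netdata sends in the all-hosts mode rather than letting Prometheus overwrite them:

```yaml
scrape_configs:
  # Single-host mode: no instance label in the response.
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    static_configs:
      - targets: ['localhost:19999']

  # Multi-host mode: instance="HOSTNAME" is embedded on each metric.
  - job_name: 'netdata-parent'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus_all_hosts]
    honor_labels: true
    static_configs:
      - targets: ['parent.example.com:19999']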
if netdata is called with format=prometheus_all_hosts the response has them embedded on each metric.
HOST TAGS should not be on every metric in this case, only in netdata_host_tags.
This is so there's consistent behaviour across the two modes.
HOST TAGS should not be on every metric in this case, only in netdata_host_tags.
This is so there's consistent behaviour across the two modes.
ok, but how should I send host tags of multiple hosts? Each host has its own host tags.
ok, but how should I send host tags of multiple hosts? Each host has its own host tags.
They'd have a netdata_host_tags each distinguished by instance label.
They'd have a netdata_host_tags each distinguished by instance label
done.
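A rough sketch of what the start of a format=prometheus_all_hosts response could then look like. The host names, tag names, and the netdata_host_tags sample value here are illustrative assumptions, not actual netdata output:

```
netdata_host_tags{environment="prod",instance="web01"} 1 1499544975450
netdata_host_tags{environment="staging",instance="db01"} 1 1499544975450
net_packets_wlp2s0_received{chart="net_packets.wlp2s0",instance="web01"} 372685 1499544975450
```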
@ktsaou Beginning to review the netdata changes now. Will be creating a tutorial for using Netdata and Prometheus with the new formats. Can you please provide me with all the URL parameters applicable to the /api/v1/allmetrics?format=prometheus endpoint and their meanings? I would like this to go into the tutorial as well.
@ldelossa nice!
I think I have documented everything at the wiki page: https://github.com/firehol/netdata/wiki/Using-Netdata-with-Prometheus#netdata-support-for-prometheus
The source code that is parsing the URL parameters is this: https://github.com/firehol/netdata/blob/2d86d96378ab3320d1c5e47fd1c1b1290795b63c/src/web_api_v1.c#L224-L262
@ldelossa if something is not clear enough, just ask. I'll be glad to help...
Great! Thanks a lot. Excited about more support. Will be compiling this tutorial soon.
@ktsaou https://docs.google.com/document/d/1PRj7ov2A47EVc2YDtwCE3bZVEc6kGVMlHh3i6dX9kgM/edit?usp=sharing Tutorial for netdata/prometheus/grafana. This is being reviewed by my company's editors (Vimeo), but here's the first draft.
nice!
It would be great if we can host this on the netdata wiki.
Would you like that?
No problem with me.
ok, would you like to turn it into a wiki page?
Then, we can link it to the wiki main menu.
@ktsaou https://github.com/firehol/netdata/wiki/Netdata,-Prometheus,-and-Grafana-Stack
Linked it:

Thanks for sharing your work!
@brian-brazil sorry for bringing up an old issue, but if tags are only sent in netdata_host_tags, how are we supposed to use these tags when doing queries in Prometheus?
Let's say we have 3 servers, each with db01 as the instance name, and as a tag we set the environment (prod, dev, staging). How can we query Prometheus for only the servers with environment prod?
I fail to find this in the Prometheus docs.
I think target-specific labels will do (Prometheus adds such labels to all metric series from a specific target/host).
I guess what you are trying to do is impossible.
@ilyam8 that's a bummer - will need to find another way then :( Since servers could be split up by either environment or even a type (db, web, proxy, etc.) - UNLESS netdata has a way I can make each metric carry additional labels like environment (despite that not being what is "right" according to the Prometheus devs).
I'm curious if someone else knows how to achieve this, because I don't think I'm the only one who works with different environments, clusters, or server types.
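For anyone landing here later: one hedged option is the usual PromQL "info metric" join, which copies a label from netdata_host_tags onto another series at query time. This sketch assumes netdata_host_tags carries an environment label and shares the instance label with the metric being joined:

```promql
net_packets_wlp2s0_received
  * on(instance) group_left(environment)
  netdata_host_tags{environment="prod"}
```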
so adding labels on rx (at scrape time) doesn't work?
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config
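The static_config approach linked above lets each scrape target carry extra labels such as environment. A minimal sketch, with made-up hostnames and label values:

```yaml
scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    static_configs:
      # Every series scraped from a target gets that target's labels.
      - targets: ['db01.prod.example.com:19999']
        labels:
          environment: 'prod'
      - targets: ['db01.staging.example.com:19999']
        labels:
          environment: 'staging'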