telegraf 🚀 - Support netflow data

what about sflow?

haomingz on 12 Sep 2015

👍7

Good idea +1

JulienChampseix on 17 Sep 2015

Bear in mind that if src/dest address and src/dest port are to be stored as tags (which would be useful, to allow e.g. GROUP BY dest_port), the cardinality of this data will be very high. A colleague of mine recently managed to bring InfluxDB to its knees by inserting such data.

Even just the (worst case) src/dest port combinations alone add up to around 2.1 billion (65536 * 65537 / 2). Now throw in some random src/dest addresses, and you have a recipe for a _lot_ of series.

In reality, the selection of destination ports is going to be a much smaller set than the full 65536 range, and the source ports are most likely not going to be outside typical ephemeral port ranges. Let's say we have a range of ~30000 source ports, and range of ~1000 dest ports. This is still 30 million combinations.

Ports and IP addresses are not the only thing that NOC admins are likely to want to group by. Consider also AS numbers, protocol numbers, TCP flags.

dswarbrick on 1 Feb 2016

👍3

+1 ( cisco, juniper, huawei, mikrotik, etc)

jcmartins on 5 Aug 2016

👍2

Speaking as an employee of a large MSP, I share the concerns of @dswarbrick

What information do we want Telegraf to gather exactly? Are we looking to create a series for every flow which NetFlow advises of? That's going to generate a lot of short series.

Could someone present a mock-up of how a Line Protocol entry would look for the sort of data they want logged?

rossdotpink on 5 Aug 2016

There is some support for NetFlow v9 in golang via https://github.com/fln/nf9packet

rossdotpink on 5 Aug 2016

I would like analyze my traffic to know what I need priority/QoS
Ex:
more http or smtp?
more torrent or ftp?
And BGP peers and DDOS analyze
see https://github.com/pavel-odintsov/fastnetmon

jcmartins on 5 Aug 2016

+1

skares on 18 Aug 2016

https://github.com/wimtie/sflow-go

Here an sflow implentation in golang

aderumier on 30 Aug 2016

I have written a separate netflow/sflow collector in go which offers a bit more config per sender.

I'll refit it for telegraph, but there are some things to consider:

1) sFlow is not an issue: has static fields
2) NetFlow 9/IPFIX use templates:

we have lots of Junipers that don't fill in sampling intervals and thus we have to store the sampling rate per host in the collector and multiply ourselves. So you'd need to specify config overrides per host.

Gonna try to do this over the weekend.

fbettag on 16 Sep 2016

A couple thoughts here:

Because of the high cardinality of the data, it may not be prudent to dump sflow data directly into influxdb for the until that's addressed. A compromise would be to use something like nfdump for capture and then query that for summarized data.
There's an argument to be made that one should focus on ipfix rather than sflow: https://www.plixer.com/blog/sflow/sflow-vs-ipfix/

paulstuart on 16 Sep 2016

@paulstuart don't see this as a flame, nor do i want to open a discussion here about SFlow vs IPFIX. But here are some pointers out of my own experience of running my own ISP (AS34868) since 2005:

All major IXPs (Internet Exchange Points - like AMSIX were we're connected to, or DECIX) use SFlow. Why? Because SFlow incurs almost 0 overhead to CPUs/Hardware it samples on. Compare this to NetFlow which takes 30-50% of CPU of a Router just to sample 1 out of N packets, that's HUGE.

Also NetFlow suffers from Implementation fragmentation. As i wrote before, some implementations export their sampling rate, while others don't. This is especially nasty if you rely on auto-discovery where you don't know the sampling rate of the fresh IP that just started sending you their Flows. In SFlow there is a fixed set of fields (much like NetFlow 1,2,5,6), so there can't be a missing sampling interval.

Us for example, we use both. We use SFlow to monitor statistics from all switches as to incur fewest CPU Overhead possible, while running NetFlow on our Border-Routers. Why do we do that? Because Juniper Switches can do SFlow and NetFlow, while their big border routers (mx-series) only can do NetFlow..

Given all these, there are very good reasons why people might use SFlow over NetFlow or vice versa. All depending on your vendor. IMHO the project that unifies collection shouldn't force a decision on some people.

fbettag on 17 Sep 2016

I concur that netflow data fields (such as source IP, dest IP, source port, dest port) should not be stored as tags, due to the explosion of time series that will result and the possibility of DoS. I think the tags should be limited to things like the source device which generated the flow record. [^1]

This implies that all the useful fields of the netflow record need to be stored as a field set [^2]

Queries which group by address or port will effectively be full table scans for the time period being interrogated. This seems acceptable to me - it's no worse than nfdump anyway.

It needs to deal with IPv4 and IPv6 addresses. Seems like influxdb's integer type isn't up to handling 128-bit addresses:

~~~

insert netflow1,source=16909060 srcip=338288524927261089654018896841347694593,dstip=338288524927261089654018896841347694594,bytes=1000
select * from netflow1
name: netflow1
time bytes dstip source srcip
---- ----- ----- ------ -----
2017-05-07T12:47:57.939570956Z 1000 3.382885249272611e+38 16909060 3.382885249272611e+38
~~~

But using hex strings works fine, and indeed is much easier to view and manipulate by humans.

~~~

insert netflow2,source=c00002c8 srcip="fe800000000000000000000000000001",dstip="fe800000000000000000000000000002",bytes=1000
select * from netflow2
name: netflow2
time bytes dstip source srcip
---- ----- ----- ------ -----
2017-05-07T12:49:40.828415341Z 1000 fe800000000000000000000000000002 c00002c8 fe800000000000000000000000000001
~~~

IP range queries can be transformed as appropriate. e.g. "source within 192.0.2.128/28" becomes `source >= 'c0000280' and source <= 'c00002ff'", without needing an IP data type in influxdb. Unfortunately, greater-than/less-than operators don't appear to work on strings (tested with influxdb v1.2.3)

~~~

select count(*) from netflow2 where srcip='fe800000000000000000000000000001'
name: netflow2
time count_bytes count_dstip count_srcip
---- ----------- ----------- -----------
1970-01-01T00:00:00Z 1 1 1

select count(*) from netflow2 where srcip>'fe800000000000000000000000000000'

select count(*) from netflow2 where source!='ffffffff'
name: netflow2
time count_bytes count_dstip count_srcip
---- ----------- ----------- -----------
1970-01-01T00:00:00Z 1 1 1

select count(*) from netflow2 where source>'c0000000'

~~~

[^1] Possibly the IP protocol (TCP/UDP/ICMP/ESP/OSPF...) could be a tag, as that has a cardinality no more than 256, and usually only 4 or 5.

[^2] Or else, the whole netflow record is stored as some sort of structured blob like JSON. But that's not very useful since influxdb doesn't offer any filtering or continuous query aggregation on JSON fields

candlerb on 7 May 2017

@candlerb A couple of things:

You need to append an "i" suffix to numerical values if you want to insert them as integers, otherwise InfluxDB will store them as 64 bit floats, which are even less suitable for IPv6 than 64 bit integers ;-)
Have you read this yet? https://www.influxdata.com/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/

dswarbrick on 8 May 2017

You need to append an "i" suffix to numerical values if you want to insert them as integers,

Thanks for the correction. I had already been through the InfluxQL reference where it says

~
int_lit = ( "1" … "9" ) { digit } .
~

But I eventually found it in the line protocol reference

Have you read this yet?

Interesting, thanks again. The main benefit here seems to be with time series churn.

Netflow creates one record for each "flow" of traffic, roughly defined as packets sharing the same src ip, dst ip, src port, dst port and protocol. The flow record contains the total number of packets and number of bytes.

Now, I still don't believe that the right way to handle netflow data is to record (src ip, dst ip, src port, dst port) as tags.

Consider each separate DNS query or web request, even between the same client/server pair: it has a different src port. This would mean essentially one time series per flow! So in a 5 minute period with (say) 10,000 flows - common for a small network - you'll create 10,000 distinct time series, each containing one data point. That seems mad.

I do note that the article says: "We’ll have to continue to optimize the query language and engine to work with large sets of series." But one time series per data point seems a very extreme case of that.

You could store (say) only src port and dst port as tags, which would at least limit you to 2^32 distinct time series (still larger than your target of one billion). For a very high cost, you've optimised one small class of netflow queries - those by port.

But to be honest, most of the netflow queries I do are based on addresses. A typical query is: "summarise all flows between time T1 and T2 grouped by dest IP address and sorted by number of bytes descending". That identifies which hosts are the largest consumers of bandwidth.

Even with an inverted index on dest IP address (cardinality 2^32), unless that index contains the field being summed (i.e. bytes) then this will require a full table scan over the range T1 to T2 anyway. So the dest IP may as well be a field rather than a tag.

I don't actually see this as a problem. If you break your time into 5 minute windows, then a continuous query could calculate this at the end of each window - a feature which influxdb already has. Do this for the queries you most frequently run, and you'll have instant answers to them.

candlerb on 9 May 2017

👍2

Is this dead? I have a significant performance issue with Logstash and its Netflow codec plugin that thus far I have been unable to solve, and I was hoping a Go plugin in Telegraf would be faster. Also seeing as how we now have a native Elasticsearch output in Telegraf (along with a now-GA release of TSI in InfluxDB), makes questions of storage of this high cardinality data somewhat less pressing.

jasonkeller on 4 Aug 2017

It someone is interested in pushing this forward we first need to design a concrete data model (measurements, tags, fields) and a list of sample queries. TSI should help significantly especially if you are keeping only a limited history.

With respect to the different protocols: netflow, sflow, ipfix, etc. It is probably best to design the plugin around just one of these, we can always add a different plugin for the others.

danielnelson on 7 Aug 2017

There is already a very good, mature NetFlow / sFlow / IPFIX collector called pmacct (http://pmacct.net/), which can aggregate and output to a variety of targets including PostgreSQL / MySQL, MongoDB, Kafka, AMQP etc.

IMHO, it would make more sense to add an InfluxDB target to pmacct, than to add NetFlow / sFlow / IPFIX inputs to Telegraf. The thing that has been holding this back in the past was InfluxDB's lack of support for high cardinality data, not lack of a plugin in Telegraf.

dswarbrick on 7 Aug 2017

👍4

What might be a better approach than an InfluxDB target in pmacct (nfacctd) would be a Telegraf target plugin into pmacctd.

Hank45 on 9 Nov 2017

If there were an InfluxDB target in pmacct you could send to Telegraf using the http_listener input, or directly to InfluxDB. Still I think many would prefer if Telegraf could collect this data directly without the need for additional tools, so I would be happy to accept a pull request.

danielnelson on 9 Nov 2017

Why not just use the "print" functionality in pmacct to write to a FIFO pipe, from which a Python script or Go app would read, parse the JSON, then write the data to InfluxDB using the official client libs? I've previously seen this method work in production, albeit a couple of years before InfluxDB's TSI index type made it possible to handle such high cardinality data.

print_output_file: /path/to/fifo_pipe
print_output: json

It's a little bit of a detour, but much less so than requiring Telegraf, and probably sufficient until pmacct supports writing to InfluxDB directly.

I think the biggest problem remaining with storing this type of data in InfluxDB is that the DB can neither natively store IP addresses, nor really understand them. Most network admins are going to want to aggregate on CIDR netmask or IPv6 prefix length boundaries, and there is no obvious way how InfluxDB could do that. The aggregation would have to be done by pmacct prior to writing the data, which may not always be desirable.

dswarbrick on 9 Nov 2017

@jcmartins thanks for noticing FastNetMon! Yes, it work pretty well and could export data into Clickhouse like pmacct/nfsen

pavel-odintsov on 30 Nov 2017

I guess that's not happening huh? :(

dels78 on 2 Sep 2018

@dels78, If you have the means and time, we welcome pull requests. We do still need to "design a concrete data model" as stated in this comment.

glinton on 5 Sep 2018

There is a fairly mature project called Elastiflow that does a very good job of this type of telemetry data which uses the Elastic stack: https://github.com/robcowart/elastiflow

jeremycohoe on 27 Nov 2019

👍1

@danielnelson here is the tool that we use in our environment: https://github.com/VerizonDigital/vflow

littlespace on 23 Jan 2020

👀1 👍1

Thanks for the ElastiFlow shout out @jeremycohoe !

Just an FYI... I am nearing completion of an entirely new ElastiFlow collector, which happens to also be written in Go. So in the very near future there will be NO MORE LOGSTASH!!! I will also be expanding the solution to additional data stores beyond Elasticsearch. This will eventually include InfluxDB as well.

Stay tuned to the ElastiFlow repo for more information later next month.

robcowart on 29 May 2020

🎉3 👍2

@robcowart: Sounds great! How will the new ElastiFlow collector differ from the existing filebeat netflow module?

candlerb on 9 Jun 2020

@candlerb There will be a lot of differences...

Flow support. Filebeat does not support sFlow, and the new ElastiFlow collector will support both sFlow v5 and v2 (useful in the SDN space).
Filebeat only supports a little over 1200 different information elements (i.e. "fields"). The new ElastiFlow collector will support 6200+. These aren't just a bunch of useless IEs either. Filebeat only supports 66 Cisco IEs. The new ElastiFlow collector will support 1066 different Cisco IEs.
Then comes the question of what is actually meant by "support"? Filebeat provides almost zero translation or enrichment of the raw data. This leaves it up to the user looking at the data to know and remember that "waasoptimization_segment": 16 actually means... Pass-Through or that "tcp_control_bits": 19 means that the flags SYN, ACK, FIN were set, indicating a normally terminated connection. Currently 183 "translators" are already completed for the new ElastiFlow collector, and I am not even finished yet. This will provide users with a greatly more insightful experience.

I could go on and on. I prefer not to attack the work of others, but IMO the Filebeat Netflow input is not even as good the Logstash Netflow codec when it comes to the quality of the data produced. This leaves it way behind the current Logstash-based ElastiFlow. However this new collector will take it to yet another level.

Perhaps even more important, especially for this thread, will be the addition of additional outputs other than just Elasticsearch. Initially support will be Elasticsearch, Splunk and Kafka. To be followed by InfluxDB, TimescaleDB and ClickHouse. Grafana will be used to provide dashboards for this second bunch of data sources. I am trying hard to have something beta-ready by the end of the month, but it may be July instead.

robcowart on 9 Jun 2020

Thanks for the info. It was unclear to me in the Elastic Stack whether the filebeat netflow input was intended to replace the logstash netflow input. I saw that the logstash netflow module has been deprecated, which kind-of implies they expect people to migrate over to filebeat - but if it has less functionality than the logstash input then that doesn't make much sense.

I'm looking forward to seeing your new project in action. Logstash is definitely the hoariest part of the Elastic stack and I'll be pleased to see it gone :-)

candlerb on 9 Jun 2020

FYI we've since added sflow support. https://github.com/influxdata/telegraf/tree/master/plugins/inputs/sflow

ssoroka on 6 Jul 2020

👍2

We haven't seen too much action around this since our release of SFlow. If anyone would like to build a plugin for NetFlow we would love to have it.

sjwang90 on 19 Sep 2020

If there is interest around a Netflow plugin, we would love someone to champion a Netflow plugin and try it on different devices. Every implementation we've seen is very different.

There are a couple starting points:

PR #4808
Copying the plugin code(_this is in not a tested, working plugin_) that was removed from #6427.
1. We would only have this as an external plugin since this plugin uses code generation. Telegraf does not do that elsewhere and it's not part of the build process.

sjwang90 on 16 Oct 2020

Telegraf: Support netflow data

Most helpful comment

All 33 comments

Related issues