Influxdb: Allow DISTINCT function to operate on tags

Created on 28 Aug 2015  ·  80Comments  ·  Source: influxdata/influxdb

I would like to have following feature:

Since the new release 0.9.3 tags are resulting as own columns if you use SELECT * FROM measurement

Currently, it's not possible to use commands on this columns. An example:

SELECT * FROM measurements

returns:

time tagA tagB value
xxx M N 0.3
xxx M O 0.4
xxx P R 0.2

I want to do a query like:

SELECT count(distinct(tagA)) FROM measurements

The result is

2 (M+P)

Anyone else need this feature?

1.x arequeries flutriaged in progress kinfeature-request

Most helpful comment

@jsternberg I don't think the proposal from #7195 accurately addresses the issue here... The desired fix (highlighted in this issue) should be part of the SELECT syntax, not an entirely new query.

For instance, I have measurements that use device hardware addresses as the tag, and I want to get a count of all unique devices along with averages of a few fields. For instance, if I wanted to get the total count of connected wireless clients and how much bandwidth was used in the last day in 5 minute intervals:

    COUNT(DISTINCT(macAddress)), <-- This is a tag
    SUM(downloadBytes),
    SUM(uploadBytes)
FROM "wirelessClients"
WHERE time > NOW() - 1d
GROUP BY time(5m)

This is what we are hoping to achieve. The proposal in #7195 will not address this use case.

All 80 comments

Idea also exist in #1815 at "Not currently implemented (might in the future, but no promises)"

@The-Nik are you asking for DISTINCT to support tags, or are you asking for the same functionality that SELECT * used to do? To get the similar SELECT * behavior, just include a GROUP BY *

I ask for DISTINCT to support tags for counting the different tag values :+1:

@The-Nik you can use SHOW TAG VALUES plus some shell to get what you want:

$ influx -execute 'show tag values with key = a' -database mydb will print a list of all tag values associated with the key a on the database mydb. The output has two header lines, so if you pass it to wc -l just subtract 2 for the actual count:

$ influx -execute 'show tag values with key = a' -database mydb | wc -l and then subtract 2 from the output.

Yeah, this is a good way. But in my case, I need the number in Grafana in a single stat panel. In Grafana, there are some aggregate fuctions but no "count". So InfluxDB has to serve the exact value or I build something in my Grafana like a count-function ;-)

+1; being able to quickly summarise the distinct number of datasets / tags directly from the influx SQL would be very handy; e.g. a grafana panel of the number of sensors I have reporting data over time.

+1: I have a similar use case to yee379 in mind.

+1 on being able to count distinct tags.

I also feel like this speaks to the deeper issue of providing guidance on what should be a tag vs. a value. For a schema-less DB there's sure a lot of subtlety around defining your schema! :)

I would also like to see this. We use the cpu and load plugins which themselves don't explicitly provide the cpu count. They do provide each cpu as an "instance" tag e.g. a box with 32 cpus will collect metrics on 32 individual cpus tagging them with with their instance number. If I could get the total count of cpus from the tags, then the load numbers would have a little more context in our graphs Grafana and Chronograph.

+1, any progress on it ?

+1. this would makes things quite a bit simpler for some tasks at hand.

+1

+1 Really need this. Tag should also support a kind of normal SELECT search, which can be handled by Grafana.

+1
Desperately need this

+ 1

Anybody as a solution to count my hosts in Grafana through Influxdb query language ?

+1

+1

+1

++++++1 This would really help pulling some of our metrics much much easier

+1

Perhaps a better way to accomplish the same goals: https://github.com/influxdata/influxdb/issues/5668

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1
The use case where I need this feature : to monitor the number of sources (in my case wireless temperature sensors) that have submitted at least one value in -say- the last 30 minutes, i.e. that are alive.

+1. You can fake this feature by duplicating the tag value in a field, but that's a pretty bogus solution.

+1

+1

+1

+1

+1

+1 we really need it !

+1

+1 , it would be really helpful.

+1

+1 @beckettsean any progress on this feature?

+1

+1

+1

+1
To clarify my support. We have a fixed but changing number of 'systems' identified by a TAG. We would like to count the number of active systems within time period and display that number in grafana.

We're going to implement this with #7195.

@jsternberg I don't think the proposal from #7195 accurately addresses the issue here... The desired fix (highlighted in this issue) should be part of the SELECT syntax, not an entirely new query.

For instance, I have measurements that use device hardware addresses as the tag, and I want to get a count of all unique devices along with averages of a few fields. For instance, if I wanted to get the total count of connected wireless clients and how much bandwidth was used in the last day in 5 minute intervals:

    COUNT(DISTINCT(macAddress)), <-- This is a tag
    SUM(downloadBytes),
    SUM(uploadBytes)
FROM "wirelessClients"
WHERE time > NOW() - 1d
GROUP BY time(5m)

This is what we are hoping to achieve. The proposal in #7195 will not address this use case.

I'm going to reopen this and think about it over the weekend. I'm still not sure we can do what you're requesting, but it sounds different enough from SHOW CARDINALITY that it deserves some discussion about the issue before closing.

This would be valuable. +1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1 Definitely a limiting factor on one of our key usecases.

+1

+1

+1

+1

+1 (as described by @cnelissen)

+1

+1

Please leave +1 comments to adding a 👍 to the top post using a reaction. Leaving a message notifies everybody who is participating in this conversation and doesn't add anything to the discussion.

My need is to count number of unique tags with tag filters in Grafana. I can count fields but it gives incorrect answer. SHOW SERIES cannot be limited enough like return only one TAG which I could the distinct + count.

SELECT count("Incoming_Answers_2xxx") FROM "Realm-day" WHERE "INSTANCE" =~ /IPXDEA/ AND "ANSWERHOST" =~ /dtag/ AND "REALM" =~ /mcc2/ AND time > '2017-02-09T00:00:00Z';

name: Realm-day

time count
1486598400000000001 204

There I would like to have
SELECT count(distict(REALM)) FROM "Realm-day" WHERE "INSTANCE" =~ /IPXDEA/ AND "ANSWERHOST" =~ /dtag/ AND "REALM" =~ /mcc2/ AND time > '2017-02-09T00:00:00Z';

name: Realm-day

time count
1486598400000000001 34

Or someting like SHOW SERIES COUNT(DISTICT(TAG("REALM"))) FROM "Realm-day" WHERE "INSTANCE" =~ /IPXDEA/ AND "ANSWERHOST" =~ /dtag/ AND "REALM" =~ /mcc2/ AND time > '2017-02-09T00:00:00Z';

I managed to achieve this by using subqueries in influxdb 1.2

Eg. getting number of hosts from telegraf in grafana:

select count(tot) from (SELECT mean("used") as tot FROM "mem" WHERE $timeFilter GROUP BY "host" fill(null))

I'm using a measurement and a field I know it will always be present, it could be anything.

If you get no data for a host for the time period won't it be missed ? I don't think this can be 100% relied upon ?

On 10 Feb 2017, at 16:04, lpic notifications@github.com wrote:

I managed to achieve this by using subqueries in influxdb 1.2

Eg. getting number of hosts from telegraf in grafana:

select count(tot) from (SELECT mean("used") as tot FROM "mem" WHERE $timeFilter GROUP BY "host" fill(null))

I'm using a measurement and a field I know it will always be present, it could be anything.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

Yes, but that's what I expected. If there is no data for a particular host during the selected query period I don't want to consider it. You can remove or maybe increase this time restriction in the WHERE clause, but then I guess the query can be quite slow.

Ok so it's slightly different use case, I think most want a distinct list of tag key values regardless of time period.

i.e. I'd want all time across a year of data for example potentially peta bytes of data where the series count might be 2m cardinality.

On 10 Feb 2017, at 16:25, lpic notifications@github.com wrote:

Yes, but that's what I expected. If there is no data for a particular host during the selected query period I don't want to consider it. You can remove or maybe increase this time restriction in the WHERE clause, but then I guess the query can be quite slow.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

+1

In my case I needed to display the number of sensors which reported within a time interval (to indicate confidence of the mean). I managed to work around it with a subquery, but it's a bit filthy:

SELECT count("first") FROM (
  SELECT first("value") FROM "temperature"
  WHERE "topic" =~ /hub0[1234567]\/sensors\/\d+\/temperature/ AND $timeFilter
  GROUP BY time($interval), topic
)
WHERE $timeFilter
GROUP BY time($interval)

+1

+1

I'm locking this to prevent further 👍 messages. We will be discussing this to figure out the feasibility of the request and create a timeline. Please push the "Subscribe" button instead to get any updates about this feature.

WIP: there's some work completed to allow distinct / count against a tag key and tag value.

> select distinct(_tagKey) from httpd
name: httpd
time distinct
---- --------
0    bind
0    hostname

> select count(distinct(_tagKey)) from httpd
name: httpd
time count
---- -----
0    2

But there are still wrong answers being resolved:

> select _tagKey,_tagValue from tsm1_wal
name: tsm1_wal
time _tagKey         _tagValue
---- -------         ---------
0    database        _internal
0    engine          tsm1
0    hostname        nuc
0    id              1
0    path            /home/rbetts/.influxdb/data/_internal/monitor/1
0    retentionPolicy monitor
0    walPath         /home/rbetts/.influxdb/wal/_internal/monitor/1
> select _tagKey,_tagValue from tsm1_wal^C
> select _tagKey,_tagValue from tsm1_wal where _tagKey=engine
name: tsm1_wal
time _tagKey         _tagValue
---- -------         ---------
0    database        
0    engine          
0    hostname        
0    id              
0    path            
0    retentionPolicy 
0    walPath    
Was this page helpful?
0 / 5 - 0 ratings