Influxdb: Allow DISTINCT function to operate on tags

Created on 28 Aug 2015 · 80 Comments · Source: influxdata/influxdb

I would like to request the following feature:

Since release 0.9.3, tags are returned as their own columns when you run SELECT * FROM measurement.

Currently, it is not possible to apply functions to these columns. An example:

SELECT * FROM measurements

returns:

time tagA tagB value
xxx M N 0.3
xxx M O 0.4
xxx P R 0.2

I want to do a query like:

SELECT count(distinct(tagA)) FROM measurements

The expected result would be:

2 (M+P)
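
To make the requested semantics concrete, here is a minimal Python sketch (not InfluxDB code) that applies the idea of count(distinct(tagA)) to the three sample rows above. The row dictionaries and the helper name `count_distinct` are illustrative assumptions, not part of any InfluxDB API.

```python
# Sample rows from the SELECT * output above; timestamps are elided in the original.
rows = [
    {"tagA": "M", "tagB": "N", "value": 0.3},
    {"tagA": "M", "tagB": "O", "value": 0.4},
    {"tagA": "P", "tagB": "R", "value": 0.2},
]


def count_distinct(rows, key):
    """Count the distinct values of `key` across rows (hypothetical helper)."""
    return len({row[key] for row in rows if key in row})


print(count_distinct(rows, "tagA"))  # 2 (M and P)
```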

Anyone else need this feature?

1.x · area/queries · flux/triaged · in progress · kind/feature-request

TechniclabErdmann

👍76

Most helpful comment

@jsternberg I don't think the proposal from #7195 accurately addresses the issue here... The desired fix (highlighted in this issue) should be part of the SELECT syntax, not an entirely new query.

For instance, I have measurements that use device hardware addresses as the tag, and I want to get a count of all unique devices along with averages of a few fields. Suppose I wanted to get the total count of connected wireless clients and how much bandwidth was used in the last day, in 5-minute intervals:

SELECT
    COUNT(DISTINCT(macAddress)), <-- This is a tag
    SUM(downloadBytes),
    SUM(uploadBytes)
FROM "wirelessClients"
WHERE time > NOW() - 1d
GROUP BY time(5m)

This is what we are hoping to achieve. The proposal in #7195 will not address this use case.

cnelissen on 10 Sep 2016

👍9
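
As a rough illustration of what the query in the comment above is asking for, the following Python sketch groups points into 5-minute windows and, per window, counts distinct tag values while summing two fields. It is only a model of the desired result, not how InfluxDB evaluates queries. The sample points, the epoch-second timestamps, and the function name `summarize` are assumptions made for the example.

```python
from collections import defaultdict

WINDOW = 5 * 60  # seconds, mirroring GROUP BY time(5m)


def summarize(points, window=WINDOW):
    """Per window: (distinct macAddress count, sum of downloads, sum of uploads)."""
    buckets = defaultdict(lambda: {"macs": set(), "down": 0, "up": 0})
    for p in points:
        start = p["time"] - p["time"] % window  # start of the window this point falls in
        bucket = buckets[start]
        bucket["macs"].add(p["macAddress"])
        bucket["down"] += p["downloadBytes"]
        bucket["up"] += p["uploadBytes"]
    return {
        start: (len(b["macs"]), b["down"], b["up"])
        for start, b in sorted(buckets.items())
    }


# Hypothetical data: timestamps are seconds since an arbitrary origin.
points = [
    {"time": 0, "macAddress": "aa", "downloadBytes": 10, "uploadBytes": 1},
    {"time": 60, "macAddress": "aa", "downloadBytes": 20, "uploadBytes": 2},
    {"time": 120, "macAddress": "bb", "downloadBytes": 5, "uploadBytes": 3},
    {"time": 300, "macAddress": "bb", "downloadBytes": 7, "uploadBytes": 4},
]

print(summarize(points))  # {0: (2, 35, 6), 300: (1, 7, 4)}
```

The point of the example is that the distinct count and the sums come out of the same grouping pass, which is why a separate cardinality command does not cover this use case.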

All 80 comments

This idea also appears in #1815 under "Not currently implemented (might in the future, but no promises)".

TechniclabErdmann on 28 Aug 2015

@The-Nik are you asking for DISTINCT to support tags, or are you asking for the same functionality that SELECT * used to provide? To get behavior similar to the old SELECT *, just include a GROUP BY *

beckettsean on 28 Aug 2015

I am asking for DISTINCT to support tags, so that I can count the distinct tag values :+1:

TechniclabErdmann on 31 Aug 2015

@The-Nik you can use SHOW TAG VALUES plus some shell to get what you want:

$ influx -execute 'show tag values with key = a' -database mydb

will print a list of all tag values associated with the key a on the database mydb. The output has two header lines, so if you pass it to wc -l just subtract 2 for the actual count:

$ influx -execute 'show tag values with key = a' -database mydb | wc -l

and then subtract 2 from the output.

beckettsean on 1 Sep 2015
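
The "count lines, then subtract the two header lines" step from the comment above can also be done in a small script. This Python sketch assumes, as the comment states, that the CLI output starts with two header lines followed by one line per tag value; the sample text and the function names are hypothetical, and the `influx` invocation simply reuses the command quoted above.

```python
import subprocess


def count_tag_values(cli_output, header_lines=2):
    """Count data rows in CLI output, skipping the leading header lines.

    Blank lines are ignored, which differs slightly from a plain `wc -l`.
    """
    lines = [line for line in cli_output.splitlines() if line.strip()]
    return max(len(lines) - header_lines, 0)


def fetch_tag_values_output(key="a", database="mydb"):
    """Run the influx command from the comment above (requires the influx CLI)."""
    result = subprocess.run(
        ["influx", "-execute", f"show tag values with key = {key}", "-database", database],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical output shape: two header lines, then one line per tag value.
sample = "header line 1\nheader line 2\nvalue1\nvalue2\nvalue3\n"
print(count_tag_values(sample))  # 3
```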

Yeah, this is a good approach. But in my case, I need the number in Grafana in a single stat panel. In Grafana, there are some aggregate functions but no "count". So either InfluxDB has to serve the exact value, or I have to build something like a count function in Grafana myself ;-)

TechniclabErdmann on 3 Sep 2015

👍7 ❤1

+1; being able to quickly summarise the distinct number of datasets / tags directly from the influx SQL would be very handy; e.g. a grafana panel of the number of sensors I have reporting data over time.

yee379 on 30 Sep 2015

+1: I have a similar use case to yee379 in mind.

edennis-sge on 2 Oct 2015

+1 on being able to count distinct tags.

I also feel like this speaks to the deeper issue of providing guidance on what should be a tag vs. a value. For a schema-less DB there's sure a lot of subtlety around defining your schema! :)

jakefoster on 2 Oct 2015

👍2

I would also like to see this. We use the cpu and load plugins, which themselves don't explicitly provide the CPU count. They do provide each CPU as an "instance" tag, e.g. a box with 32 CPUs will collect metrics on 32 individual CPUs, tagging them with their instance number. If I could get the total count of CPUs from the tags, then the load numbers would have a little more context in our Grafana and Chronograf graphs.

morganda on 7 Oct 2015

+1, any progress on this?

JulienChampseix on 19 Nov 2015

+1. This would make things quite a bit simpler for some tasks at hand.

RobertAtomic on 20 Nov 2015

rafael84 on 10 Dec 2015

+1 Really need this. Tags should also support a kind of normal SELECT search, which can be handled by Grafana.

ohmystack on 27 Jan 2016

+1
Desperately need this

thepolina on 1 Feb 2016

+1

Does anybody have a solution for counting my hosts in Grafana through the InfluxDB query language?

Guibod on 6 Feb 2016

tomhallam on 11 Feb 2016

selzoc on 11 Feb 2016

mosoto on 11 Feb 2016

++++++1 This would make pulling some of our metrics much, much easier

davidgardner11 on 11 Feb 2016

brumfb on 11 Feb 2016

Perhaps a better way to accomplish the same goals: https://github.com/influxdata/influxdb/issues/5668

beckettsean on 13 Feb 2016

supershal on 18 Feb 2016

matt-snider on 3 Mar 2016

cmasekera on 12 Mar 2016

gvohra on 7 Apr 2016

cnelissen on 12 Apr 2016

jwestboston on 13 Apr 2016

gabtastic on 14 Apr 2016

skv2602 on 18 Apr 2016

drewdavies on 21 Apr 2016

+1
My use case for this feature: monitoring the number of sources (in my case, wireless temperature sensors) that have submitted at least one value in, say, the last 30 minutes, i.e. that are alive.

remipannequin on 27 Apr 2016

👍4

+1. You can fake this feature by duplicating the tag value in a field, but that's a pretty bogus solution.

bricsuc on 12 May 2016

bertpig on 14 May 2016

jckwon on 6 Jul 2016

saboteurkid on 7 Jul 2016

gregorg on 7 Jul 2016

smeapng on 11 Jul 2016

+1, we really need it!

J-Mx on 12 Jul 2016

erowan on 19 Jul 2016

+1, it would be really helpful.

sachinrase on 20 Jul 2016

bmundt on 20 Jul 2016

+1 @beckettsean any progress on this feature?

andyxning on 26 Jul 2016

xo4n on 26 Aug 2016

bruce-jin on 29 Aug 2016

cloudnull on 7 Sep 2016

+1
To clarify my support: we have a fixed but changing number of 'systems' identified by a tag. We would like to count the number of active systems within a time period and display that number in Grafana.

donnut on 8 Sep 2016

👍4

We're going to implement this with #7195.

jsternberg on 9 Sep 2016

I'm going to reopen this and think about it over the weekend. I'm still not sure we can do what you're requesting, but it sounds different enough from SHOW CARDINALITY that it deserves some discussion about the issue before closing.

jsternberg on 10 Sep 2016

This would be valuable. +1

yoyomikeyc on 30 Sep 2016

dineshvenkat on 4 Oct 2016

ahvetskovich on 5 Oct 2016

Tiinusen on 14 Oct 2016

jregovic on 17 Oct 2016

stevenh on 26 Oct 2016

aaskey on 4 Nov 2016

laggyone on 15 Nov 2016

gloyka on 24 Nov 2016

biker73 on 24 Nov 2016

leforsman on 29 Nov 2016

+1 Definitely a limiting factor on one of our key usecases.

strongpauly on 13 Dec 2016

lobando on 3 Jan 2017

bbala-github on 5 Jan 2017

orangle on 6 Jan 2017

damarnez on 20 Jan 2017

+1 (as described by @cnelissen)

SimSimY on 23 Jan 2017

lucadistefano on 26 Jan 2017

JamesClonk on 29 Jan 2017

Please limit +1 comments to adding a 👍 reaction to the top post. Leaving a message notifies everybody who is participating in this conversation and doesn't add anything to the discussion.

jsternberg on 29 Jan 2017

👍2

My need is to count the number of unique tag values, with tag filters applied, in Grafana. I can count fields, but that gives an incorrect answer. SHOW SERIES cannot be narrowed enough, for example to return only one tag on which I could then apply distinct + count.

SELECT count("Incoming_Answers_2xxx") FROM "Realm-day" WHERE "INSTANCE" =~ /IPXDEA/ AND "ANSWERHOST" =~ /dtag/ AND "REALM" =~ /mcc2/ AND time > '2017-02-09T00:00:00Z';

name: Realm-day

time count
1486598400000000001 204

Instead, I would like to have:
SELECT count(distinct(REALM)) FROM "Realm-day" WHERE "INSTANCE" =~ /IPXDEA/ AND "ANSWERHOST" =~ /dtag/ AND "REALM" =~ /mcc2/ AND time > '2017-02-09T00:00:00Z';

name: Realm-day

time count
1486598400000000001 34

Or something like SHOW SERIES COUNT(DISTINCT(TAG("REALM"))) FROM "Realm-day" WHERE "INSTANCE" =~ /IPXDEA/ AND "ANSWERHOST" =~ /dtag/ AND "REALM" =~ /mcc2/ AND time > '2017-02-09T00:00:00Z';

joriws on 10 Feb 2017

I managed to achieve this by using subqueries in InfluxDB 1.2.

E.g., getting the number of hosts from Telegraf in Grafana:

select count(tot) from (SELECT mean("used") as tot FROM "mem" WHERE $timeFilter GROUP BY "host" fill(null))

I'm using a measurement and a field that I know will always be present; it could be anything.

lpic10 on 10 Feb 2017

If you get no data for a host during the time period, won't it be missed? I don't think this can be 100% relied upon.

biker73 on 10 Feb 2017

Yes, but that's what I expected. If there is no data for a particular host during the selected query period I don't want to consider it. You can remove or maybe increase this time restriction in the WHERE clause, but then I guess the query can be quite slow.

lpic10 on 10 Feb 2017
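
The exchange above hinges on the fact that a "group by host, then count the groups" approach only sees hosts that have at least one point inside the query window. The following Python sketch models that behavior under stated assumptions: the sample points, the integer timestamps, and the function name `count_hosts` are hypothetical and only loosely mirror the subquery (an inner mean per host, an outer count of the resulting rows).

```python
from collections import defaultdict


def count_hosts(points, start, end):
    """Count hosts that produce a per-host mean within [start, end)."""
    per_host = defaultdict(list)
    for p in points:
        if start <= p["time"] < end:  # stands in for the time filter
            per_host[p["host"]].append(p["used"])
    # Inner step: one mean per host; outer step: count those rows.
    means = {host: sum(vals) / len(vals) for host, vals in per_host.items()}
    return len(means)


# Hypothetical points: web2 reports only early in the period.
points = [
    {"time": 10, "host": "web1", "used": 1.0},
    {"time": 20, "host": "web2", "used": 2.0},
    {"time": 95, "host": "web1", "used": 1.5},
]

print(count_hosts(points, 0, 100))   # 2
print(count_hosts(points, 50, 100))  # 1, because web2 has no points in this window
```

Whether the second result is a bug or the desired behavior depends on the use case: counting hosts that are currently reporting, or counting every host ever seen.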

OK, so it's a slightly different use case. I think most people want a distinct list of tag values regardless of time period.

I.e., I'd want it across all time, for example over a year of data, potentially petabytes, where the series cardinality might be 2 million.

biker73 on 10 Feb 2017

juddgaddie on 10 Mar 2017

In my case I needed to display the number of sensors which reported within a time interval (to indicate confidence of the mean). I managed to work around it with a subquery, but it's a bit filthy:

SELECT count("first") FROM (
  SELECT first("value") FROM "temperature"
  WHERE "topic" =~ /hub0[1234567]\/sensors\/\d+\/temperature/ AND $timeFilter
  GROUP BY time($interval), topic
)
WHERE $timeFilter
GROUP BY time($interval)

samjetski on 15 Mar 2017

lin-credible on 21 Mar 2017

ampersand8 on 27 Mar 2017

I'm locking this to prevent further 👍 messages. We will be discussing this to figure out the feasibility of the request and create a timeline. Please push the "Subscribe" button instead to get any updates about this feature.

jsternberg on 28 Mar 2017

WIP: there's some work completed to allow distinct / count against a tag key and tag value.

> select distinct(_tagKey) from httpd
name: httpd
time distinct
---- --------
0    bind
0    hostname

> select count(distinct(_tagKey)) from httpd
name: httpd
time count
---- -----
0    2

But some queries still return wrong answers:

> select _tagKey,_tagValue from tsm1_wal
name: tsm1_wal
time _tagKey         _tagValue
---- -------         ---------
0    database        _internal
0    engine          tsm1
0    hostname        nuc
0    id              1
0    path            /home/rbetts/.influxdb/data/_internal/monitor/1
0    retentionPolicy monitor
0    walPath         /home/rbetts/.influxdb/wal/_internal/monitor/1
> select _tagKey,_tagValue from tsm1_wal^C
> select _tagKey,_tagValue from tsm1_wal where _tagKey=engine
name: tsm1_wal
time _tagKey         _tagValue
---- -------         ---------
0    database        
0    engine          
0    hostname        
0    id              
0    path            
0    retentionPolicy 
0    walPath