These all returned null when I tried them, but it would be very useful to be able to query for these.
Also "at least 1 non-time field must be queried" seems like a strange limitation.
+1
I need to keep statistics to check data quality; min_time and max_time are crucial for that.
@ericremoreynolds can you expand? Can you give an example of finding the maximum time? I think the maximum time is just last() of some point. I think #5926 might be what you need.
I guess you might want to do a query like
SELECT min(time) FROM whatever_measurement WHERE some_condition
without having to explicitly reference some non-time column in the select clause.
I don't know if that makes sense?
Wouldn't the minimum time just be the first time for a field? In situations where you don't have a GROUP BY time(...), you can do this to find the minimum time of a field.
SELECT first(value) FROM cpu
Yeah, sure, but maybe in some situations you don't want to reference a specific field but just query the timestamp of the first/last point, regardless of which fields it has values for.
Just a thought; I guess there are indeed approximate workarounds, as you say.
@pauldix what do you think about this? I've seen a few other issues where people don't really seem to care about which field they're selecting on. While it will be slower than a single column access, do we want to try and support this kind of query?
@jsternberg Thanks for considering this. But why should the query be slower? In the worst case, a "SELECT MIN(dt) FROM cpu" would perform exactly the same internal work as "SELECT FIRST(value) FROM cpu", which also returns the timestamp. But since the value itself is not returned, it should theoretically even be (marginally) faster (fewer values to transfer and thus less encoding/decoding of results). It would also make the query more readable, since it states exactly what we want to select.
So without knowing a lot about the internal handling of queries, I would guess that converting MIN(dt)/MAX(dt) to equivalent/existing FIRST/LAST queries should not result in additional overhead or performance loss.
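To illustrate the proposed rewrite, here is a sketch; the MIN(time) form is hypothetical and not valid InfluxQL today, and value is just an assumed field name:
-- hypothetical: not valid InfluxQL today
SELECT MIN(time) FROM cpu
-- existing query that already returns the earliest timestamp of the
-- field alongside its value
SELECT FIRST(value) FROM cpu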
@plinyGo because InfluxDB is a columnar database. So to do this you'd be hitting every field and then taking the aggregates against those. Because we don't store these statistics as a whole, you'd end up doing range scans for every field under that measurement.
@jsternberg doing this as a query would be a horrible experience because of this. The only tractable way I can think of to do this would be to keep these summary statistics, but we don't have anything on the roadmap for that yet.
I think it's difficult enough to implement that we'd really need to think about what the need is here and how we could engineer this to be performant.
I want to measure lag time itself, so having time, aggregates of time, and time arithmetic expressions in my SELECT clause all make sense. There is literally no other field I'm interested in. Completely different visualizations will concern themselves with the other fields. But my use case is visualizing the change in lag over time.
So how do I even get lag? Let's say I have several sensors as clients that report measured fields and their metadata tags once per minute. However, if there are memory, computational, or network latency contentions, my process puts backpressure on chiming in to Influx. The result for a sensor might look like reporting in at 1:02:03, 1:03:04, 1:06:11, 1:08:45, 1:18:12, 11:19:13, etc.
My console wants to visualize lag over time and alert when any sensor is behind by more than 15 minutes. A SingleStat in Grafana would be nice. Green when <10m, Orange when >=10m, and Red when >=15m. When my alarm goes off, a nice thing would be a graph to see if lag on that sensor is spiky vs. creeping up over time vs. how it fares vs. other sensors. This can help me differentiate between a troublesome sensor vs. network-to-Influx issues for all sensors.
What I don't want is the sensor itself querying Influx, doing the time logic, and then reporting those values into a different measurement to Influx. The lag detection has to be decoupled from the sensor itself. I can't detect a troublesome sensor if I'm depending on it to reliably report its own lag! I can have a dedicated client/process do this for all sensors on their behalf, but I'm having trouble justifying this when I feel Grafana or a Continuous Query should be able to do this.
What I need is: SELECT last(now() - time) FROM sensors WHERE sensor_id =~ /$sensor_id$/ AND time < now() - 4h
This lets me Template and Repeat on sensor_id.
I can't do SELECT last(some_field) FROM sensors ... because I want to visualize the value of the time difference (lag size/duration) across time, not the value of some_field. You can't cheat and declare victory just because there's a timestamp _somewhere_ in the JSON result. I need the _difference_ returned as a first-class citizen so tools can visualize on it.
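For contrast, a sketch of the closest query that seems valid today (reusing the some_field placeholder from above): it returns the timestamp of the last point per sensor, but the now() - time subtraction still has to happen in the client or in Grafana, which is exactly the gap described here.
select last(some_field) from sensors where sensor_id =~ /$sensor_id$/ group by sensor_id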
+1
We really need this type of query. The inability to get the primary key is absurd.
+1
I'm going to close this out. There is no time without a corresponding field. Under the covers Influx is a columnar database where a column is:
measurement,tagset,field
# for example
cpu,host=serverA,region=west,usage_user
That combination represents a series or a column. Within that range are value, time tuples. This isn't like a row oriented database where time is stored as a column value per row. Time is stored for each series (or column).
Thus it makes no sense to grab the time without a field. If you want to get the first or last time, which are equivalent to min and max respectively, do it by asking for a specific field like this:
-- first value and time returned
select usage_user from cpu limit 1
-- last value and time returned
select usage_user from cpu order by time desc limit 1
You can also add a where time > ... and time < ... clause to find the first and last times in a given range. So you can already get the summary statistics you're looking for.
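Following that suggestion, a sketch of the range-bounded variants (the RFC3339 bounds are placeholders to substitute with real times):
-- first value and time within the range
select usage_user from cpu where time > '2018-01-01T00:00:00Z' and time < '2018-02-01T00:00:00Z' limit 1
-- last value and time within the range
select usage_user from cpu where time > '2018-01-01T00:00:00Z' and time < '2018-02-01T00:00:00Z' order by time desc limit 1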
Please reconsider the close issue decision:
How can I get the min(time) and max(time) if I query a time range with grouping, e.g. GROUP BY time(1d)?
The query result will always use startOfDay as the returned timestamp of the daily aggregated entries, so I have no way to know the time of the first and last items within each aggregation window.
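To illustrate the behaviour being described (measurement and field names are assumed): with GROUP BY time(1d), the time column of each returned row is the bucket start (midnight), not the timestamp of the first point in that bucket.
select first(value) from my_measurement where time > now() - 7d group by time(1d)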
Is this issue still open? I need to know the period of time covered in a group by query.
select min(time), max(time), sum(item_price) as price, sum(view) as view from statistics group by item_invoice
But while it sums price and view, it doesn't return the max and min of time!
name: statistics
tags: item_invoice=
time  min  max  price  view
----  ---  ---  -----  ----
0               7000   83

name: statistics
tags: item_invoice=067a1302-bac4-42cb-409f-c047b14417c3
time  min  max  price  view
----  ---  ---  -----  ----
0               2500   7

name: statistics
tags: item_invoice=cb9ed298-e355-4f6f-52ba-41b15f4afca7
time  min  max  price  view
----  ---  ---  -----  ----
0               6500   4
Edit: I had to add the timestamp as a field value to get min and max working! (Not the best way to achieve such a simple thing.)
I have a similar issue relating to time: when doing a time-bounded query, I don't want the timestamp of the beginning of the timeframe; I want to select which timestamp (first/last) of the series gets reported.
select last(value) from table where time >= now() - 3h group by "host"
will always give me the time as now-3h. It's not showing me the first or last timestamp; it's repeating the now()-3h value for each host. When looking at the data programmatically, this does nothing to help understand the true timeframe the data represents.
@rickalm: the query you show works for me (influxdb 1.5.2, Ubuntu 16.04)
In the following example, I stopped collecting SNMP data from a particular host by IP address (10.12.255.1) and started collecting it by name instead (gw.example.com), so the "agent_host" tag changed.
> select last(ifHCInOctets) from interface where time >= now() - 3h group by agent_host
name: interface
tags: agent_host=10.12.255.1
time last
---- ----
2018-05-26T14:35:40Z 41823387483
name: interface
tags: agent_host=gw.example.com
time last
---- ----
2018-05-26T15:09:40Z 41827702059
name: interface
tags: agent_host=sw1.example.com
time last
---- ----
2018-05-26T15:08:41Z 0
The switchover was made about 40 minutes ago. If I reduce the time range then that host drops out completely:
> select last(ifHCInOctets) from interface where time >= now() - 30m group by agent_host
name: interface
tags: agent_host=gw.example.com
time last
---- ----
2018-05-26T15:09:40Z 41827702059
name: interface
tags: agent_host=sw1.example.com
time last
---- ----
2018-05-26T15:08:41Z 20981708378
In your case, maybe there are additional time series which are messing up the query. I suggest you try group by * instead of group by "host"; this may show you additional time series that you weren't expecting. group by "host" will give you the last() entry across all time series which have the same "host" tag.
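For example, the same query with group by * would look like this (everything else unchanged):
> select last(ifHCInOctets) from interface where time >= now() - 3h group by *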
+1
just use ClickHouse
Having the same issue with aggregate queries. Without being able to get the min or max of the timestamps (or even just the first or last of them), I would need to do the aggregation manually outside of Influx to preserve that data.
As it is now, aggregate queries completely mangle the original timestamps:
> select last(time), count(count) as count from count_progress_list where time >= 1564689300000000000 and time <= 1564689600000000000
name: count_progress_list
time                 last  count
----                 ----  -----
1564689300000000000        1281
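One hedged workaround, if the goal is only to recover the real timestamp of the last point: since last() on a regular field returns that point's actual time, a second query over the same range can fetch it, at the cost of an extra round trip:
> select last("count") from count_progress_list where time >= 1564689300000000000 and time <= 1564689600000000000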
I have a similar problem.
I want to watch for the first datapoint occurrence in a time range, to calculate e.g. when the user switched the monitoring device on today.
The returned value could be the number of minutes between the start of the time range and the timestamp of the first captured datapoint.
A solution might involve ELAPSED(), but both approaches seem to be impossible.
I think a lot of people have similar problems.
I would also like to be able to work with min and max of the timestamps of a query.
I use Influx to log statistics from machine lines in manufacturing, such as machine cycle time per minute. The KPIs are machine speed over time, total machine cycles per day, and total run-time per day, with the results displayed on a Grafana dashboard.
Getting the total machine cycles is fairly easy, but I can't figure out a good way to display the time span between the first and last registrations per day. Maybe working directly with Influx isn't the best approach, and I should pull the data into a SQL database and do the calculations from there.
But I would much rather be able to do proper statistical analysis directly on the measurements.
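A hedged sketch of one workaround, in the spirit of the earlier comment about writing the timestamp into a field: if each point also stores its own epoch timestamp as a regular field (here assumed to be called ts, in a measurement assumed to be called machine_cycles), the per-day span becomes an ordinary aggregation, since InfluxQL allows arithmetic between aggregates:
-- ts is an assumed field holding each point's own epoch timestamp
select max(ts) - min(ts) from machine_cycles where time > now() - 7d group by time(1d)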
I also make statistics in manufacturing :-)
I think the solution that best follows the spirit of InfluxQL is extending the ELAPSED() function,
allowing calculation of time elapsed not only from datapoint to datapoint, but also from the start of the time range to a datapoint and from a datapoint to the end of the time range.
A parameter could be used: one already defines the time unit, so why not add a second one? It might even be the default behaviour of ELAPSED() when only a single data point is passed to it.
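A purely hypothetical sketch of what such an extension might look like; the third argument is invented here for illustration and nothing like it exists in InfluxQL today (my_measurement and value are assumed names):
-- today: ELAPSED() only measures between consecutive points
SELECT ELAPSED(value, 1m) FROM my_measurement WHERE time > now() - 1d
-- hypothetical: measure from the start of the WHERE time range to the
-- first point instead
SELECT ELAPSED(value, 1m, 'from_range_start') FROM my_measurement WHERE time > now() - 1d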
+1
+1
Use case: I want to be able to select the first and last timestamps per day of my solar panel measurements. The panels obviously don't produce at night, so the converter shuts off and stops sending data. For checking availability of the data through a cron job, I want to know at what time the solar panels switch on for each day of the year (and then use this to do some sine-fitting).
I find it interesting that the time is truncated to the start of the group, although I can see there are arguments for this. Maybe it is possible to have two time columns: one for the start of the group and one for the actual selected data point (be it first(), last(), min(), max(), or perhaps others)?
For my use case I found a bit of a workaround, valid only when there is a data column which is strictly increasing within the time group (for example the kWh produced that day): use the bottom(field_key, 1) or top(field_key, 1) functions. Sharing for those interested!
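A sketch of that workaround (kwh_today is an assumed increasing field and panels an assumed measurement); top() and bottom() keep the original timestamp of the selected point even inside a GROUP BY time() bucket:
-- earliest point per day: the minimum of an increasing field
select bottom(kwh_today, 1) from panels where time > now() - 30d group by time(1d)
-- latest point per day: its maximum
select top(kwh_today, 1) from panels where time > now() - 30d group by time(1d)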
any update?...
I really think minimum stored time and maximum stored time (or first and last) are very much required for a time series database, without having to know the field name. It is quite frustrating that this feature doesn't get the HIGHEST priority. This is the reason I switched to Prometheus.
any solution?
+1
Need first(time) ~ last(time) group by tag
+1