Influxdb: Support lag variables

Created on 18 Dec 2013 · 24 comments · Source: influxdata/influxdb

We should support bringing in lag variables so people can calculate changes and the like. For example, say you have the following data:

time, value:
1, 6
2, 7
3, 4

If you run this query: select value, lag(value, 1) from some_series, you would get:

time, value, value_lag_1
1, 6, null
2, 7, 6
3, 4, 7

Then you could run a query like select value - lag(value, 1) as change from some_series and get:

time, change
1, 6
2, 1
3, -3

The argument to the lag function tells it how many lagging points you want to include. For example if you do select value, lag(value, 2) you'd get:

time, value, value_lag_1, value_lag_2
1, 6, null, null
2, 7, 6, null
3, 4, 7, 6
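The proposed semantics can be sketched in a few lines of plain Python (a minimal illustration, not InfluxDB code; the lag helper and column layout are assumptions matching the tables above):

```python
# Sketch of the proposed lag(value, n): shift the series back n positions,
# front-padding with null (None) where no earlier point exists.
def lag(values, n):
    return ([None] * n + list(values))[:len(values)]

values = [6, 7, 4]
times = [1, 2, 3]

# time, value, value_lag_1, value_lag_2 -- matches the table above
rows = list(zip(times, values, lag(values, 1), lag(values, 2)))
```

With this, the change query is just the value column minus the shifted column, with the null rows handled explicitly.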
area/functions kind/feature-request

Most helpful comment

This can be used as a stopgap solution:
https://github.com/maxsivanov/influxdb-timeshift-proxy

All 24 comments

One important thing to include is handling counter wrap-around, e.g. when you record packets sent on an interface, it may eventually wrap around from MAXINT to 0. It should also handle resets nicely, i.e. when a system reboots counters will reset.

@pmenglund can you elaborate? How are counter resets and overflow relevant to the lag function? Conceptually, lag is a join of a series with a delayed version of itself.

It isn't relevant to lag per se - however, the lag function would be used to calculate, e.g., bytes/s through a network interface, and the byte counter in the Linux kernel will eventually wrap from MAXINT to 0, which will show up as a huge negative rate change.

If you alert on big changes in bytes/s through a network interface, this will generate a false positive.

I'm just suggesting taking this into account when designing the lag function, so it can be dealt with inside InfluxDB.

If you have this series (using an unsigned byte):

1, 0
2, 100
3, 200
4, 45
5, 145

then the computed change (value - lag(value, 1)) would be

1, 0, null
2, 100, 100
3, 200, 100
4, 45, -155
5, 145, 100

while the actual change at time 4 is 101 (56 counts up to the wrap, plus 45 after it).
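A wrap-aware delta for this case can be sketched as follows (a minimal illustration; the 8-bit modulus matches the unsigned-byte example above, while real interface counters would use 2**32 or 2**64):

```python
MODULUS = 256  # unsigned byte, as in the example above

def wrap_aware_delta(prev, curr, modulus=MODULUS):
    """Delta between consecutive counter samples, assuming at most one wrap."""
    d = curr - prev
    return d + modulus if d < 0 else d

series = [0, 100, 200, 45, 145]
deltas = [wrap_aware_delta(a, b) for a, b in zip(series, series[1:])]
# -> [100, 100, 101, 100]
```

The negative raw difference at the wrapped step is the signal: adding the modulus back recovers the true increment, but this only works if the counter cannot advance by more than the modulus between samples.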

@pauldix - delta >>> lag ?

Which makes it select value - delta(value, 5) from some_series group by time(2m)

time, delta_10m
1, null
3, null
5, null
7, null
9, null
11, 1
13, 3
etc..

Perhaps the column name can be generated by including the units and auto-calculating its size (10m above)?

@pmenglund - Would a new function CounterResets() be useful to cover the MAXINT-crossing case? CounterResets could return a count of the number of "zero"-crossings in a series of points.

If CounterResets returns nonzero, it can be used as a signal to suppress alerts.

This type of function should also be able to handle counters that return to zero for other reasons, like a reset/reboot of equipment, or a binary was restarted.

A function delta_absolute could compose the delta concept with CounterResets: only accumulate positive delta values between points, even for intervals that crossed zero. It would need to assume "value is non-decreasing, except across CounterResets".
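Neither function exists in InfluxDB; as a hypothetical sketch of the semantics described above (the names counter_resets and delta_absolute come from the proposal, the implementation is an assumption):

```python
def counter_resets(values):
    """Count the steps where the counter went down (resets/wrap-arounds)."""
    return sum(1 for a, b in zip(values, values[1:]) if b < a)

def delta_absolute(values):
    """Accumulate only the non-negative consecutive deltas, skipping resets."""
    return sum(b - a for a, b in zip(values, values[1:]) if b >= a)

series = [0, 100, 200, 45, 145]  # one reset between 200 and 45
counter_resets(series)   # -> 1
delta_absolute(series)   # -> 300
```

One design choice to settle: this sketch drops the reset step entirely, so any counts accumulated across the reset are lost; a wrap-aware variant could instead add the post-reset value.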

@tamsky yes, that should work

Has lag already been implemented as difference?
http://influxdb.com/docs/v0.7/api/aggregate_functions.html#difference

Just chiming in: I also see a lot of value in implementing lag variables. It's an important feature to have when trying to implement InfluxDB as a replacement for things such as rrdtool.

I presume this is for series where you have one record at each given time interval.
For event time series, this probably only makes sense after you do a group by time, effectively turning it into the above.

One thing that often annoys me about graphite is that it's hard to know how long ago the previous record was, due to retention periods: different series have different settings, and it also depends on which timeframe you're querying. With InfluxDB shard spaces the problem surface is smaller, but it still exists.
Solutions:
A) Maybe lag(value, 1m) makes more sense, but it gets ugly when the interval of the stored series doesn't match up.
B) For fixed-interval series, the interval could be recorded in the series metadata (there are other uses for this too). The client can then always know the interval, and at that point the point-count notation works too. I'm in favor of this.

@tamsky: no, not really because it only supports the difference between the first and the last values per interval.

I'm also in dire need of this feature. But just to clarify (I'm not yet accustomed to thinking in sets and SQL), take the OP's query but with addition instead of subtraction: select value + lag(value, 1). You'd then get (original values were 6, 7, 4):

time, change
1, 6
2, 13
3, 17

Correct? I'm currently doing customer payment analysis and would need this to find the cumulative number of customers (new ones in the time period plus those from the previous period).
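For reference, value + lag(value, 1) would give 6, 13, 11 (each point plus only its immediate predecessor), while the 6, 13, 17 written above corresponds to a running total. A quick sketch contrasting the two (plain Python; treating the missing first lag as 0, matching the tables earlier):

```python
from itertools import accumulate

values = [6, 7, 4]

# value + lag(value, 1): each point plus its immediate predecessor
pairwise = [v + (values[i - 1] if i > 0 else 0) for i, v in enumerate(values)]
# -> [6, 13, 11]

# running total, which is what the 6, 13, 17 sequence corresponds to
running = list(accumulate(values))
# -> [6, 13, 17]
```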

:+1:

Yes please! :-)

+1 for this. Would be excellent for being able to look at metrics compared to previous (day|week|etc).

Just curious: what's the difference between this and the difference(value) function?

Also, will either the lag(value) function or the difference(value) function support extrapolating to regular intervals?

so if my data is

Time, Value
1,5
3,3

Is it possible to normalize the time values:
1,5
2,4
3,3
etc...
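That normalization amounts to linear interpolation onto a fixed time grid; a minimal client-side sketch (plain Python; the interpolate helper is an assumption, not an existing InfluxDB feature):

```python
def interpolate(points, grid):
    """points: sorted (time, value) pairs; grid: target timestamps."""
    out = []
    for t in grid:
        # find the pair of samples bracketing the target timestamp
        for (t0, v0), (t1, v1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                frac = (t - t0) / (t1 - t0) if t1 != t0 else 0.0
                out.append((t, v0 + frac * (v1 - v0)))
                break
    return out

interpolate([(1, 5), (3, 3)], [1, 2, 3])
# -> [(1, 5.0), (2, 4.0), (3, 3.0)]
```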

It would be great to have a way to calculate a 'real delta' using lag or a similar function that computes only the increment between each sample and its predecessor.

I don't like using difference(value) because you are forced to aggregate with GROUP BY, and you lose information when you aggregate.

For instance, in the following time series, each minute it increases 10, so the series looks like

Time, Value,
0, 10,
1, 20,
2, 30,
3, 40,

When doing difference(value) with group by time(2m), the result looks like

Time, Value,
0, 10, <--- the difference between the samples at min 1 and min 0
2, 10, <--- the difference between the samples at min 3 and min 2

Which seems ok, but you are missing the delta between min 2 and min 1, and in some scenarios this is not acceptable :(

Does anyone have another idea for how to compute the delta between each pair of consecutive samples? Or is there a timeline for when the lag function will be available?

Thanks in advance.
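Until such a function lands, the per-sample delta can at least be computed client-side after fetching the raw points (a minimal sketch, plain Python):

```python
samples = [(0, 10), (1, 20), (2, 30), (3, 40)]  # (time, value) pairs from above

# difference between every pair of consecutive samples -- including the
# min 1 -> min 2 step that the GROUP BY version drops
deltas = [(t1, v1 - v0) for (t0, v0), (t1, v1) in zip(samples, samples[1:])]
# -> [(1, 10), (2, 10), (3, 10)]
```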

+1

@jwilder does the derivative function without a GROUP BY effectively give us the lag feature as described?

@beckettsean No. Lag looks different at first glance.

Will this lag function work on the time column? I need to calculate the time difference between events so I can work out how long a metric stays in a certain state.

At the moment my best solution is to write this time difference back to the previous record, but a lag function would be more elegant.

Would love to see full support of windowing functions (lag, lead, rolling average/sum, etc) since they would all likely utilize similar internals. Full-fledged syntax would definitely be preferred (e.g. SELECT time, cluster, node, val, LAG(val, 1) OVER(PARTITION BY cluster, node ORDER BY time) FROM series) where partitioning is optional and the ordering defaults to time but allows other tags.
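The PARTITION BY semantics can be sketched in plain Python (an illustration of the proposed behavior, not existing InfluxDB code; the sample rows and the lag_over helper are assumptions):

```python
from collections import defaultdict

# (cluster, node, time, val) rows -- sample data
rows = [
    ("c1", "n1", 1, 6), ("c1", "n1", 2, 7),
    ("c1", "n2", 1, 3), ("c1", "n2", 2, 9),
]

def lag_over(rows):
    """LAG(val, 1) OVER (PARTITION BY cluster, node ORDER BY time)."""
    parts = defaultdict(list)
    for cluster, node, t, val in sorted(rows, key=lambda r: r[2]):
        parts[(cluster, node)].append((t, val))
    out = []
    for (cluster, node), series in parts.items():
        prev = None  # each partition starts with a null lag
        for t, val in series:
            out.append((cluster, node, t, val, prev))
            prev = val
    return out

result = lag_over(rows)
```

The key point is that the lag resets at each partition boundary, so one node's values never leak into another node's lag column.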

+1

As mentioned in my post to the mailing list we are experimenting with simplifying our open GitHub Issues. This feature request has been rolled into an aggregate issue for all function requests, so that we can close this issue until we are ready to work on it.

You may continue to make comments here. Closing the issue does not mean we are rejecting this idea.

This can be used as a stopgap solution:
https://github.com/maxsivanov/influxdb-timeshift-proxy

+1
