Influxdb: Calculating total increase of a counter that keeps being reset

Created on 27 Jul 2016  Â·  32Comments  Â·  Source: influxdata/influxdb

Feature Request

Proposal: A new aggregate function, say, SUM_INCREASE or TOTAL_INCREASE or … . Used for calculating correct total increase for counters that are reset over the time span of the calculation.

Current behavior: None.

Desired behavior: TOTAL_INCREASE("value") would sum differences between consecutive values in the series only if their difference is positive. Example:

Series:      101, 102, 103, 104,    0,  1,  2, 10, 15,  7,  8, 10
Differences:       +1,  +1,  +1, -104, +1, +1, +8, +5, -8, +1, +2
Pos. diffs.:        1,   1,   1,    0,  1,  1,  8,  5,  0,  1,  2

TOTAL_INCREASE: 21 // sum of only positive differences

Use case: Currently, to calculate total bytes sent/received/written/read, we’re using LAST("value") - FIRST("value"), and this works generally — as long as the series is monotous. The problem arises when a system counter gets reset back to 0.

1.x arefunctions arequeries flutriaged kinfeature-request wontfix

Most helpful comment

You likely want to use a subquery for this.

SELECT cumulative_sum(non_negative_difference) FROM (SELECT non_negative_difference(value) FROM cpu))

Something like that will probably get you what you want. It might be worth combining these into an increase() function just for convenience though.

All 32 comments

It’s exactly increase() in Prometheus → https://prometheus.io/docs/querying/functions/#increase.

Well there is NON_NEGATIVE_DERIVATIVE(). While not exactly what you're after, it may be more useful.
One of the issues with LAST("value") - FIRST("value") is that when grouping by time, you're comparing within the grouping, not between groupings. For example if you end up with the groups [[101,102,103],[104,0,1],[2,10,15]], you'll get [2,-103,13], when you might actually want [-102,14]. The DIFFERENCE() function does this instead, but it still has the negative issue. The other problem with LAST() - FIRST() is that it only works if your data points are exactly the same time apart. If you're missing data points, or the time is slightly off, the resulting values become skewed. This is where DERIVATIVE() comes in, as it calculates the difference in value by the difference in time. And then if DERIVATIVE() works for you, you can use NON_NEGATIVE_DERIVATIVE(), which solves the negative issue you describe.
However, NON_NEGATIVE_DERIVATIVE doesn't show 0 for the periods where it would be negative. It instead drops the data point. I would also argue that this is the correct action, as logging a 0 would indicate you had 0 change, when that is not actually true. In these cases, you had change, but it can't be calculated because the counter reset, so the point is dropped.

For example:
The grouping issue with last() - first():

> select last("value")-first("value") from test where time > now() - 1h group by time(30s)
1469709000000000000 2
1469709030000000000 -103
1469709060000000000 13
1469709090000000000 3

The issue solved by difference

> select difference(last("value")) from test where time > now() - 1h group by time(10s)
name: test
----------
time            difference
1469709010000000000 1
1469709020000000000 1
1469709030000000000 1
1469709040000000000 -104
1469709050000000000 1
1469709060000000000 1
1469709070000000000 8
1469709080000000000 5
1469709090000000000 -8
1469709100000000000 1
1469709110000000000 2

Using derivative (shows the same thing as difference since our metrics are exactly 10s apart)::

> select derivative(last("value")) from test where time > now() - 1h group by time(10s)

name: test
----------
time            derivative
1469709010000000000 1
1469709020000000000 1
1469709030000000000 1
1469709040000000000 -104
1469709050000000000 1
1469709060000000000 1
1469709070000000000 8
1469709080000000000 5
1469709090000000000 -8
1469709100000000000 1
1469709110000000000 2

The working solution:

> select non_negative_derivative(last("value")) from test where time > now() - 1h group by time(10s)
name: test
----------
time            non_negative_derivative
1469709010000000000 1
1469709020000000000 1
1469709030000000000 1
1469709050000000000 1
1469709060000000000 1
1469709070000000000 8
1469709080000000000 5
1469709100000000000 1
1469709110000000000 2
> select non_negative_derivative(last("value")) from test where time > now() - 1h group by time(20s)

name: test
----------
time            non_negative_derivative
1469709020000000000 2
1469709060000000000 9
1469709100000000000 3

> select non_negative_derivative(last("value"),1s) from test where time > now() - 1h group by time(20s)
name: test
----------
time            non_negative_derivative
1469709020000000000 0.1
1469709060000000000 0.45
1469709100000000000 0.15

(that last one shows change per second grouped into 20 second batches)

@phemmer, thank you for your input. :smiley_cat: These differences and positive differences were really an implementation detail. What I really want in the end is 21. For a better description you can look at https://prometheus.io/docs/querying/functions/#increase:

increase() calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.

The following example expression returns the number of HTTP requests as measured over the last 5 minutes, per time series […]

So, SELECT increase("value") FROM test WHERE … would return 21.

Kay, I implemented it myself. Patch in https://github.com/michalrus/dotfiles/commit/b07e45e6a20f750ac43b98a3ae4e759a1d673180.

Side note: why would you choose this language for the project? It’s insane. Copy and paste all over the place. :confounded:

@michalrus I think this is similar to non_negative_derivative(), but we don't have a non_negative_difference(). I think increase() is fine though as a name so it's similar to Prometheus.

@jsternberg the way it's implemented in the PR isn't the same as how difference() _(and derivative(), and non_negative_derivative())_ operates. difference() will take the first point, and the last point, and get the difference. The PR for increase() (#7093) walks through each point between first and last, and tallies the total increments between each point. If the points are constantly increasing, the two would have the same result. But if any point is less than the previous, the results differ. We'll need to provide very clear documentation on this, as because of the similarity, the behavior is very likely to trip people up.

Though I think it might be good to provide a real-world use case for such a function. Nothing really against it, but I can't think of any use case that isn't satisfied, or better suited by non_negative_derivative().

In case I was misunderstood, I was saying derivative() -> difference() is the same as non_negative_derivative() -> increase(). I'll be sure to look at that PR soon, but I think that's what this issue is about, right?

I get exactly what you were saying, but I don't think that's what this issue is asking for, and is not what the PR implements.

@jsternberg, @phemmer, hmm, I would be a bit similar to non_negative_difference, if the latter existed, yes.

But!

Having non_negative_difference, how would you get 21 in my first example? You would have to do something like:

SELECT sum(non_negative_difference("value")) FROM "test" WHERE time > now() - 1h

And that is not possible, as sum() expects a real data field, not another function’s result on the data field. Try that with SELECT sum(difference("value")). The error is:

{
  "error": "error parsing query: expected field argument in sum()",
  "message": "error parsing query: expected field argument in sum()"
}

Now, the most clear use case is the one I presented:

  • you’ve got an increasing counter in your system, e.g. bytes transferred out of a network interface, e.g. eth0,
  • the counter is sometimes reset back to zero,
  • you want to get a total value sent by the interface in some period of time,
  • with the patch/PR you do:

SELECT increase("value") FROM "interface_tx" WHERE "type" = 'if_octets' AND "instance" = 'eth0' AND $timeFilter

And _this_ works correctly. If you have another way to obtain this value, either now, or after implementing non_negative_difference, I’d be more than happy to use your way.

Cheers!

Bump, https://github.com/influxdata/influxdb/pull/7093 is closed, why?

How do you display network interface rates, people? :P Without spikes in the derivative of counter sum resulting from resets. It’s amazing that nobody else needs this.

@michalrus I think you're looking for non_negative_difference() in #8235.

Wow, nice, thanks! I’ll try that and maybe get rid of my patch finally.

Actually if you want "rate", you likely want derivative. Difference gives you amount, but not rate.

Ah, forgive my not being careful. I’ve got rate already.

I need increase() as defined by Prometheus to display total bytes sent/received of a resetting counter for a given time span. How to do this?

Maybe sum(non_negative_difference( would work…

You likely want to use a subquery for this.

SELECT cumulative_sum(non_negative_difference) FROM (SELECT non_negative_difference(value) FROM cpu))

Something like that will probably get you what you want. It might be worth combining these into an increase() function just for convenience though.

Wow, nice! (Again!)

I’ll try that out. :) Thank you!

Even if you’ve worked around this being invalid expression with a recording rule, the real problem is what happens when one of the servers restarts. The counters from the restarted server will reset to 0, the sum will decrease. I watch this vedeo to an article Bluestacks TextNow Photomath

@mounk That's what the non_negative_* functions are for. non_negative_difference() and non_negative_derivative()

Totally agree with the need for this function.
I understand the use of combined "cumulative_sum" and "non_negative_difference", but I'm not able to use with Group by time. I think is not possible.
What I want to achieve, for any counter variable, is the sum of positive increments for each time interval.

@tecmatia-dp I recommend asking those types of questions on https://community.influxdata.com/ . The solution to your problem is to use a subquery.

So I tried the suggestion with the subquery. My query is the following:

SELECT difference(cumulative_sum) FROM (SELECT cumulative_sum(non_negative_difference) FROM (SELECT non_negative_difference(value) FROM "totalBytesReceived" WHERE $timeFilter))

which works. Yet, the ultimate goal is to be able to see different time periods, so something along the lines of

SELECT difference(last(value)) FROM "totalBytesReceived" WHERE $timeFilter GROUP BY time($myInterval)

yet

SELECT difference(last(cumulative_sum)) FROM (SELECT cumulative_sum(non_negative_difference) FROM (SELECT non_negative_difference(value) FROM "totalBytesReceived" WHERE $timeFilter)) GROUP BY time($myInterval)

leads to the error message "aggregate function required inside the call to cumulative_sum" which does not make sense to me since query 1 works and produces similar output as a query like

SELECT difference(value) FROM "totalBytesReceived" WHERE $timeFilter

yet the addition of last() and GROUP BY time($myInterval)` leads to the error message. Any suggestions?

I am also facing a similar issue.

Based on @tobiasisenberg use case, I've tried to actually add an aggregate function inside the call to cumulative_sum, as the error indicated, but afterwards I got the same error, this time referring to the nested non_negative_difference function. Isn't the aforementioned condition enforced due to the GROUP BY clause? Is the GROUP BY clause propagating inwards in case there's a subquery, even though it's actually defined as part of the superquery? Is this expected? If so, why's that?

Is https://github.com/influxdata/platform/pull/459 going to solve the above issues? Will it permit the use of GROUP BY clauses?

My end goal is to be able to get the sum of non-negative differences per time bucket.

Hello,

I have noticed a change has been merged is there an example of how to use it?

last("value")-first("value")
equals
spread("value")

see:
https://docs.influxdata.com/influxdb/v1.7/query_language/functions/#spread

Well, this does not really help since we want the total increase over a time period, not the difference between a maximum and minimum value that is returned by the spread function.

Just use TimeScale… :roll_eyes:

@tobiasisenberg
I think the following query could possibly address the original question.This takes the 'net' measurement as example from telegraf and we're trying to compute the total bytes received in the interface "every hour" for the last 5 hours.

select sum(bytes_used) from (select (non_negative_difference(bytes_recv)) as bytes_used from net where time > now() - 5h and host='myhost' and interface='wlan0') group by time(1h);

I think whats needed is sum of non_negative_difference and not cumulative_sum to get the total bytes received/sent .

Sorry for the long silence, I only now got to testing your suggestion. And I am amazed: it works! I adjusted it to my queries in Grafana as follows, and it works exactly as expected:

SELECT sum(bytes_used) FROM (SELECT non_negative_difference(value) AS bytes_used FROM totalBytesReceived WHERE $timeFilter) GROUP BY time($myInterval)

SELECT sum(bytes_used) FROM (SELECT non_negative_difference(value) AS bytes_used FROM totalBytesSent WHERE $timeFilter) GROUP BY time($myInterval)

Thanks a lot, your help is much appreciated!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

You likely want to use a subquery for this.

SELECT cumulative_sum(non_negative_difference) FROM (SELECT non_negative_difference(value) FROM cpu))

Something like that will probably get you what you want. It might be worth combining these into an increase() function just for convenience though.

This only works if counter always resets to zero. If the counter resets to other non-zero value (less than previous) this value gets lost. See example in influxdata/ifql#217.

My use case in embedded world:

  • Microcontroller counts number of _some low level events_ with high frequency.
  • Collecting application on PC reads this counter from time to time and writes it to InfluxDB.
  • If MCU gets reset or counter overflows, it will start from zero again.
  • It is very likely that on next read-out counter will not be zero anymore.

Is it possible now (after 4 years) to handle this situation with InfluxQL?

after 4 years

A sign to switch to other software :P

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dtouzeau picture dtouzeau  Â·  3Comments

dandv picture dandv  Â·  3Comments

robinjha picture robinjha  Â·  3Comments

jayannah picture jayannah  Â·  3Comments

Witee picture Witee  Â·  3Comments