Kibana: [lens] Add "counter rate" for monotonically increasing numbers

Created on 25 Sep 2019 · 18 comments · Source: elastic/kibana

Goal statement

Lens should support the "positive growth rate" aggregation, used for showing the rate of increase of a monotonic counter such as network traffic, with time scaling in date units (1s, 60s, etc.).

Example

Visualizing network traffic per second, scaled down from 1-hour intervals. It would show 2 Mb/s.
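To make the scaling concrete, here is a minimal worked example (the 7.2 Gb counter increase is invented to produce the 2 Mb/s figure from the example above):

```python
# Hypothetical numbers: a monotonic traffic counter grows by 7.2 Gb
# within a single 1-hour date histogram bucket.
bucket_delta_bits = 7_200_000_000  # counter increase within the bucket
bucket_seconds = 3600              # 1-hour interval
rate_bits_per_second = bucket_delta_bits / bucket_seconds
print(rate_bits_per_second / 1_000_000)  # 2.0, i.e. 2 Mb/s
```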

Decisions to be made

The most correct way of calculating this value is to implement a new aggregation in Elasticsearch. Should we wait for this new aggregation, or implement a workaround in Kibana? The workaround will not produce the same numbers in all cases, because Elasticsearch can handle more edge cases than we can.

Decision:
For our first version we will implement the logic client side using the same approach as TSVB uses today.

UI

Rate will be a separate operation that can be chosen like sum/min/max/avg/... It allows picking a field and a time unit.
The operation should be called "Rate". A unit always has to be picked.

Screenshot 2020-10-12 at 09 31 49

As this operation requires a date histogram to make sense, we need to handle the case where no date histogram is available (both when a rate operation already exists and when it doesn't):

Screen Shot 2020-10-07 at 10 30 49 AM
Screen Shot 2020-10-07 at 10 30 57 AM

Implementation

This behavior will be implemented as a separate Lens-private expression function that calculates the derivative/"positive only" value based on the Elasticsearch max metric of the selected field.

Prior discussion

This is how the Infra UI calculates rates for monotonically increasing numbers like system.network.out.bytes for both the Inventory View and the Metrics Explorer. An added bonus is if we had rate as an option, the Metrics Explorer could link to Lens instead of TSVB:

{
  rate_max: { max: { field: '<field goes here>' } },
  rate_deriv: {
    derivative: {
      buckets_path: 'rate_max',
      gap_policy: 'skip',
      unit: '<user defined, defaults to 1s>',
    },
  },
  rate: {
    bucket_script: {
      buckets_path: { value: 'rate_deriv[normalized_value]' },
      script: {
        source: 'params.value > 0.0 ? params.value : 0.0',
        lang: 'painless',
      },
      gap_policy: 'skip',
    },
  },
}
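A rough client-side simulation of what the three stages compute (illustrative Python only, not Kibana code; Elasticsearch additionally handles gaps via gap_policy and normalizes the derivative to the requested unit):

```python
# Simulates max -> derivative (unit-normalized) -> positive-only bucket_script.
def simulate_rate(max_per_bucket, bucket_seconds, unit_seconds=1):
    rates = []
    for prev, curr in zip(max_per_bucket, max_per_bucket[1:]):
        normalized = (curr - prev) / bucket_seconds * unit_seconds  # rate_deriv
        rates.append(normalized if normalized > 0.0 else 0.0)       # rate
    return rates

# Per-bucket maxima with a counter reset in the last bucket, 1-minute buckets:
print(simulate_rate([100, 160, 40], bucket_seconds=60))  # [1.0, 0.0]
```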
Labels: Lens, LensDefault, KibanaApp, enhancement

Most helpful comment

I was talking with a colleague last week about how we should just add "rate" to TSVB that essentially does max, derivative, and positive_only since that's the TSVB formula we (Observability) use for Metrics Explorer.

All 18 comments

Pinging @elastic/kibana-app

@simianhacker Can you show an example of what this looks like when visualized?

I was able to set up the Infra UI with my Metricbeat data and use the rate aggregation to get this chart:

Screenshot 2019-10-02 14 28 54

Unlike the normal "Bytes" formatter, this is showing bytes per user defined interval.

Copying over details of https://github.com/elastic/kibana/issues/58189

Kibana's basic visualizations, or Lens, should have a way to convert a data count to a rate per unit of time (e.g. requests per second). This is the usual way of thinking about metrics ("each of my instances can do 500 RPS") and it should be made easily available.

As mentioned to @AlonaNadler, many metrics we look at and compare with other systems are typically expressed as a rate, e.g. "requests per second" or "clients per hour". Usually data in ES is of discrete form: you get 1 document per event (e.g. logs). How do you plot your "logs per second"?

There are some answers out there in Discuss but most of them are wrong:

  • use Derivative agg: this only works if your data is a counter that tallies up a value over time, which is not often the case.
  • use moving avg or bucket agg: the parameter is a fixed number of entries or time unit to average over. It just does smoothing, so you need to have a rate in the first place, it does not convert your count to a rate.
  • adjust kibana's time window: yea, you can get lucky and pick a window so that kibana decides to precisely split your count by that unit of time. Not workable :)
  • change the time interval in TSVB: this seems to work but is actually unusable because it makes data granularity extremely small. Looking at more than 30min will actually return an error.
  • the only way I found was to use scale_interval in Timelion.

Thanks!

@agirbal Side note... For TSVB, if you set the interval to something like >=10s then you still get all the zooming capabilities, but it prevents the buckets from going smaller than 10s. The minimum bucket size should be the same as the event/sample rate of your data. I always use this feature, and it would be the default behavior if there were a programmatic way to determine the event rate (starting in 7.5 we do this in the Metrics UI via the metricset.period field).

There are some other tricks you can employ, like using cumulative_sum and derivative together to scale the data. This is a good trick for "log rate": do a cumulative_sum on doc_count, then use a derivative to scale it down to 1 second.
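A minimal simulation of that cumulative_sum + derivative trick (illustrative Python, not TSVB code): the cumulative sum turns per-bucket doc counts into a monotonically increasing series, and the unit-normalized derivative scales it down to events per second:

```python
# Simulating the cumulative_sum + derivative "log rate" trick in plain Python
# (in Elasticsearch these are pipeline aggregations).
def log_rate(doc_counts, bucket_seconds):
    # cumulative_sum over doc_count produces a monotonic series
    cumulative = []
    total = 0
    for count in doc_counts:
        total += count
        cumulative.append(total)
    # derivative with unit scaling brings it down to events per second
    return [(curr - prev) / bucket_seconds
            for prev, curr in zip(cumulative, cumulative[1:])]

# 60 docs per 1-minute bucket -> 1 event per second
print(log_rate([60, 60, 120], bucket_seconds=60))  # [1.0, 2.0]
```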

The tricky part is abstracting all of this away from the user: using "rate" on a number that is not monotonically increasing will almost always produce something the user doesn't want, unless they know what they are doing. Until we have a concept of "number types" (i.e. counters, gauges, etc.) in Elasticsearch/Kibana it's going to be difficult to guide the user in the right direction. In Observability, we plan on using the field metadata to store information like this from the Metricbeat modules.

@simianhacker thanks for the info! I didn't think of that trick "cumulative sum + derivative", that's a good one, is it pretty much how it could get implemented under the hood anyway?

I think the feature of "sample rate" could cleanly be attached to the count function no? A user already has to decide whether to do count (which implies counter) or some field function like avg of field X which would work on gauge or rate field. Since you already have specialized fields for each Y-axis value type, you could just have this for the count one.

From there you can have another general setting smooth or step that would do the moving average and smooth out your graphs, and that can be applied to any function. I think one difficulty today is that we're mixing the 2 concepts which makes it more confusing to the user maybe.

It's a good idea to abstract / automate it all from the user if possible in the future, just I don't think this one is a very complex concept, would love to be able to simply do RPS in TSVB :)

I was talking with a colleague last week about how we should just add "rate" to TSVB that essentially does max, derivative, and positive_only since that's the TSVB formula we (Observability) use for Metrics Explorer.

Just created PR for adding rate to TSVB: #59843

@agirbal Your request might not actually be solved by this issue, but I think it is still important to track. I wrote up an issue describing what I would call Average event rate, which is different from the kind of rate that @simianhacker is describing in this request.

Tracking this in Elasticsearch because it might be possible to get a more correct implementation there. The main reason is that counter resets in the middle of a bucket should be handled, while the client-side implementation can only throw those samples away.

https://github.com/elastic/elasticsearch/issues/60619

@AlonaNadler @cchaos We have been calling this the "positive rate" in TSVB, but with a sentence indicating that this should only be used for monotonically increasing numbers. I would expect us to use exactly the same name and descriptive text.

Screen Shot 2020-08-20 at 12 10 08 PM

There are several decisions that are not finalized yet:

  • Naming and description: Needs input from @AlonaNadler
  • Expected form interface: Needs input from @cchaos
  • Technical question: Should we use the workaround with edge cases, or wait for Elasticsearch to implement the correct aggregation?

More questions will probably be raised once we get these first few.

The Elasticsearch team is investigating this feature now: https://github.com/elastic/elasticsearch/issues/60619

For our first version we will implement the logic client side using the same approach as TSVB uses today.

In terms of naming, my preference is "Positive rate" or "Counter rate", because these names clearly indicate that there is something unusual about this function. I am opposed to calling it "Rate" because it's a clearly confusing name (evidence is that we discussed the name and meaning for weeks). Using the word "growth" is also confusing because it means something different in a business context than what it means here.

Can we settle on the name "Positive rate"?

I believe we've settled on "Counter rate" after discussion. Updating the title.

Based on a suggestion by @exekias in the parallel Elasticsearch issue, I think we should slightly tweak the algorithm that TSVB is using. The main tweak: when the value decreases, we should use the new value instead of resetting the rate to zero. Expressed in pseudocode:

rate = 0
for previous, current in zip(values, values[1:]):
    if current >= previous:
        rate += current - previous
    else:
        # counter reset: add the new value instead of dropping the sample
        rate += current

For the types of counters that we've considered so far this algorithm is going to appear more correct by avoiding sudden drops to zero.
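To see the difference the tweak makes, a small comparison (illustrative Python with invented values; the inputs are per-bucket counter maxima around a reset):

```python
# Compares the TSVB behavior (discard the negative delta on reset) with the
# tweaked algorithm (use the new value on reset). Illustrative only.
def total_growth(values, use_new_value_on_reset):
    rate = 0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            rate += current - previous
        elif use_new_value_on_reset:
            rate += current  # tweaked: count the growth since the reset
        # else: TSVB-style, the sample contributes nothing
    return rate

values = [100, 150, 10, 60]  # the counter resets between 150 and 10
print(total_growth(values, use_new_value_on_reset=False))  # 100
print(total_growth(values, use_new_value_on_reset=True))   # 110
```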

Reopening as the UI part is still missing

