Cumulative distribution function, e.g.:
This function is invertible, e.g. you can swap the axes:
I am unable to visualise either function in Kibana build 6998, commit d029b34. I understand that the axes depend on each other, i.e. they must be about the same field and must be a pair of percentiles and percentile ranks. I'm aware that e.g. the Line Chart visualisation isolates the axes from each other.
This is why I propose a new visualisation type: Cumulative Distribution. This is very similar to #2704 which would also need a separate visualisation type. Maybe it can be generalised into a Distribution visualisation type.
Both of them only _need_ a single field as an input. Both would benefit greatly from Split Lines and Split Charts.
If Elasticsearch doesn't provide such capabilities, please let me know and I'll raise an issue there.
This is not something we want to introduce a new visualization for, rather this is a transformation on existing data as applied to a line chart. Can you explain some use cases? Give some examples on where you'd use this? Concrete questions it would solve?
E.g. A/B testing. These are actual comparisons we did using JMeter:
We need to compare response times for different configs/versions. Quantiles are the most meaningful.
We care about the entire spectrum, so picking a single percentile is not enough. We might give up some of the completeness and only pick a subset, e.g. P1, P25, P50, P75, P90, P99, but for 4 splits (per config/version) it would result in 6×4 = 24 lines, which would be absolutely unreadable.
Do we have any update or method or plugin to plot Cumulative distribution function (CDF) or probability density function(PDF) plot for the KPI?
+1. Much of the analysis we do is based on percentile distribution, exactly like @dagguh shows. Basically a histogram where the X-axis is the bucketed percentiles (e.g. p25, p50, p75) of a field, and the Y-axis uses some numeric function like the average of that same field or the median of some other field (counts would be equal between percentiles). This lets you answer questions like "what is the gain on the 25% of users who have the worst latency to our service?" It'd be super powerful.
This is an older ticket. Any chance this is now doable with pipeline aggs, and maybe Vega / Canvas visualizations?
I believe a CDF chart should be doable with the Percentiles aggregation in Elasticsearch. A CDF is just the "continuous" function describing percentiles at any arbitrary position.
So Kibana could ask the percentiles agg for the 0-100 percentiles in small increments (0, 5, 10, 15, ... 100), which will approximate the CDF. Smaller increments == better approximation. Asking for more percentiles is essentially free other than some minor computation and a larger response size. The percentile sketches collect all the information from the shards, and when we construct the response Elasticsearch interrogates the CDF of the sketches to generate specific percentiles. So asking for more percentiles just interrogates the sketch a bit more, which is mostly negligible (within reason) compared to building the sketch itself.
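As a rough illustration of the request described above, here is a minimal Python sketch that builds a percentiles search body with evenly spaced percents. The function name `build_cdf_request`, the agg name `cdf`, and the field name `value` are assumptions made for the example; only the `percentiles` agg with its `field` and `percents` parameters comes from Elasticsearch itself.

```python
import json


def build_cdf_request(field, step=5):
    """Build a search body asking the percentiles agg for 0..100 in `step` increments.

    `field` and `step` are illustrative parameters, not Kibana settings.
    """
    if not isinstance(step, int) or step <= 0 or step > 100:
        raise ValueError("step must be an integer in 1..100")
    # range() stops before 100, so append 100 explicitly to close the curve.
    percents = list(range(0, 100, step)) + [100]
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "cdf": {
                "percentiles": {"field": field, "percents": percents}
            }
        },
    }


body = build_cdf_request("value", step=5)
print(json.dumps(body, indent=2))
```

A smaller `step` gives a finer approximation of the curve at the cost of a larger response.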
It could also be done with PercentileRanks agg (which is basically the inverted chart), but that requires you to know the extents of data ahead of time. Would be easier to use Percentiles since you know the data is always 0-100, then invert the graph client-side if desired.
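For comparison, a hedged sketch of the PercentileRanks variant: it needs the data extents up front (for example from a prior `stats` agg), then asks for the rank of evenly spaced values between them. The function name, the agg name `ranks`, and the extents 1.0 and 80.0 are assumptions for illustration; `percentile_ranks` with `field` and `values` is the actual agg.

```python
def build_percentile_ranks_request(field, lo, hi, points=10):
    """Build a percentile_ranks body for `points` evenly spaced values in [lo, hi].

    `lo` and `hi` must be known ahead of time, which is the drawback noted above.
    """
    if points < 2:
        raise ValueError("need at least two points")
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / (points - 1)
    values = [lo + i * width for i in range(points)]
    values[-1] = hi  # pin the last value to avoid floating-point drift
    return {
        "size": 0,
        "aggs": {
            "ranks": {
                "percentile_ranks": {"field": field, "values": values}
            }
        },
    }


# Illustrative extents; in practice they would come from an earlier request.
ranks_body = build_percentile_ranks_request("value", lo=1.0, hi=80.0, points=10)
print(ranks_body["aggs"]["ranks"]["percentile_ranks"]["values"])
```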
I agree that a complete plot of the full CDF is very useful in many analyses.
@polyfractal cc @AlonaNadler the issue with the bare Percentiles agg is that Kibana would still need to do a 2-step request to ES, since it applies to your X-axis value X. The Y-axis is just getting the average of the Y value for that percentile interval of X (however granular it is).
Would there be a way for Kibana to choose to do a "quantile histogram", give a few parameters like the granularity or specific percentile value to bucket at, and then ES would do the full aggregation in 1 go?
I'm not sure I follow? A request like this basically gives you the CDF:
GET /test/_search
{
  "size": 0,
  "aggs": {
    "cdf": {
      "percentiles": {
        "field": "value",
        "percents": [ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ]
      }
    },
    "stats": {
      "stats": {
        "field": "value"
      }
    }
  }
}
{
  "aggregations" : {
    "cdf" : {
      "values" : {
        "10.0" : 2.6,
        "20.0" : 8.0,
        "30.0" : 15.0,
        "40.0" : 15.0,
        "50.0" : 15.0,
        "60.0" : 19.499999999999996,
        "70.0" : 20.8,
        "80.0" : 41.300000000000004,
        "90.0" : 67.99999999999999,
        "100.0" : 80.0
      }
    },
    "stats" : {
      "count" : 9,
      "min" : 1.0,
      "max" : 80.0,
      "avg" : 24.666666666666668,
      "sum" : 222.0
    }
  }
}
All Kibana needs to do is convert that into a line chart, e.g. a point at (10.0, 2.6), (20.0, 8.0), etc. It doesn't work with the current Kibana visualization setup because Visualization assumes you have to build the X axis out of bucketing aggs (which is accurate in most cases, just not here). I don't know the internal details about how hard that would be to adjust, but all the data is available in the percentiles response to build a CDF.
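To make the conversion concrete, here is a small Python sketch (not Kibana code) that turns a percentiles response into chart points. The helper names `cdf_points` and `inverted` are made up for the example; the sample values are copied from the response above.

```python
def cdf_points(response, agg_name="cdf"):
    """Return sorted (percent, value) pairs from a percentiles agg response."""
    values = response["aggregations"][agg_name]["values"]
    # Keys arrive as strings such as "10.0"; skip percentiles with no value.
    points = [(float(p), v) for p, v in values.items() if v is not None]
    points.sort()
    return points


def inverted(points):
    """Swap the axes: (value, percent), the percentile-ranks view of the same data."""
    return [(v, p) for p, v in points]


response = {
    "aggregations": {
        "cdf": {
            "values": {
                "10.0": 2.6, "20.0": 8.0, "30.0": 15.0, "40.0": 15.0,
                "50.0": 15.0, "60.0": 19.499999999999996, "70.0": 20.8,
                "80.0": 41.300000000000004, "90.0": 67.99999999999999,
                "100.0": 80.0,
            }
        }
    }
}

points = cdf_points(response)
print(points[:2])            # first two (percent, value) pairs
print(inverted(points)[:2])  # the same pairs with the axes swapped
```

Either list can be handed to any line-chart library as x/y pairs; the numeric sort matters because string keys would otherwise put "100.0" before "20.0".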
We can't make a "bucket" version of percentiles because it's one of those operations where you don't know the real percentile values until all the shards have been merged together. And at that point it's too late to collect documents into buckets because we're merging on the coordinating node. If we had multi-pass aggs it would be theoretically possible, but it would still require two passes (it'd just happen in ES).
If "bucketed" percentiles are needed today, it could be done by Kibana with two passes: one to get the percentiles, and a second to set up a range agg on those returned percentiles. But that's no longer really a CDF imo :)
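A minimal Python sketch of the second pass, assuming the first pass returned a percentiles `values` map like the one earlier in this thread. The function name, the agg names `by_percentile` and `avg_metric`, and the field names are hypothetical; the `range` agg with `from`/`to` and the `avg` sub-agg are standard Elasticsearch aggregations.

```python
def build_range_request(percentile_values, bucket_field, metric_field):
    """Build a range agg whose bucket edges are the percentiles from pass one."""
    # Repeated percentile values (e.g. 15.0 three times) would create empty
    # ranges, so keep each boundary once.
    bounds = sorted({v for v in percentile_values.values() if v is not None})
    if not bounds:
        raise ValueError("no percentile values to bucket on")
    # In the range agg, "from" is inclusive and "to" is exclusive.
    ranges = [{"to": bounds[0]}]
    ranges += [{"from": a, "to": b} for a, b in zip(bounds, bounds[1:])]
    ranges.append({"from": bounds[-1]})  # open-ended so the maximum is kept
    return {
        "size": 0,
        "aggs": {
            "by_percentile": {
                "range": {"field": bucket_field, "ranges": ranges},
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }


first_pass = {
    "10.0": 2.6, "20.0": 8.0, "30.0": 15.0, "40.0": 15.0, "50.0": 15.0,
    "60.0": 19.499999999999996, "70.0": 20.8, "80.0": 41.300000000000004,
    "90.0": 67.99999999999999, "100.0": 80.0,
}
range_body = build_range_request(first_pass, "value", "value")
print(len(range_body["aggs"]["by_percentile"]["range"]["ranges"]))
```

Because boundaries are deduplicated, heavily repeated values produce fewer buckets than requested percentiles, so the buckets no longer hold equal document counts.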
@polyfractal right, your last description is what I mean. Drawing a pure CDF is one thing, and you are right that it would answer the original premise of this ticket. But I think it'd be very limiting in what you can do with it. I attempted to describe a more generic approach that would let you do more interesting things here: https://github.com/elastic/elasticsearch/issues/50386
You could draw the CDF 2 ways:
A) as you describe: get a whole bunch of percentile points for the value and interpolate them into a line. Your Y-axis would probably select "avg of field A" and then the X-axis a new "percentile histogram" that does not select a field since it doesn't need one (just 0-100).
B) allow selecting what you want on the Y-axis, say "avg of field A", and then on the X-axis "Histogram" with a new "percentile of field B" option (instead of the typical range). With this solution you can achieve a CDF too (by picking the same field for both), but it's much more interesting because it lets you do any histogram as you normally would, but with values that are not X-axis friendly due to their distribution (typical long-tail prod system metrics).