Time Series Visual Builder (and Kibana) needs a "fit" pipeline aggregation similar to Timelion's "fit" function that allows the user to configure which fit algorithm to use. This functionality needs to exist as a pipeline aggregation so it can be composed with other pipeline aggs. Timelion currently supports several fit algorithms (see the link below).
For reference, here is the request for the "fit" feature for Time Series Visual Builder in Kibana: https://github.com/elastic/kibana/issues/11793
Here is a link to the Timelion fit functions: https://github.com/elastic/kibana/tree/master/src/core_plugins/timelion/server/fit_functions
CC: @eprothro
We discussed it in FixitFriday. We do not like adding made-up data to the results of aggregations, so it feels important to us to add a flag to buckets saying whether they contain actual data or are just interpolated. However, this would be an intrusive change, as all other pipeline aggregations would need to make sure they propagate this flag correctly. Given that the only benefit of this change seems to be making charts look nicer, we don't think it is worth implementing.
This means that TSVB or anything that uses pipeline aggregations will never have a fit function. The issue is that the "fit" needs to happen before you run a derivative (or other pipeline aggs). This is a disappointing stance for the ES team to take, because without this ES is really only good for running basic metrics (avg, max, min, etc., like Timelion does) and then doing everything else on your own.
If you say "interpolated" instead of "made-up" it makes things sound scientific and fancy.
Yeah, lemme explain a bit more.
In the original linked issue, the explicit goal is to prevent gaps in a chart... which feels like something that can be done client-side. But for additional processing you're right, it would have to be implemented in ES.
Unfortunately, additional processing on made-up (interpolated) data is really dangerous. This basically boils down to the Nyquist-Shannon sampling theorem: to resolve a certain frequency, we have to sample at at least twice that frequency. If you want to resolve a 1 Hz signal, you have to sample at 2 Hz.
For example, if data we collected was the black dots:
[image: sampled data points (black dots) with a low-frequency sine wave (blue) and a higher-frequency sine wave (red) both passing through them]
The low-frequency blue sine wave obviously fits our data. But we can also fit the higher-frequency red sine wave to the same data points. The higher frequencies are aliases of the low-frequency sine wave, and there are infinitely many more that we _could_ fit. But because they are undersampled, we can't determine whether any of them were actually present in the original signal.
Getting back to the request: if you attempt to calculate a derivative of these two sine waves at a granularity of 0.5, you'll get very different answers. Just eyeballing it shows the derivatives will be vastly different. So it isn't just filling in data for display; it would lead to fundamentally different results if used as part of the "processing pipeline".
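To make the aliasing point concrete, here's a small worked example (using a generic sampling rate f_s rather than any real dataset). Sampling at times t_n = n / f_s cannot distinguish a sine at frequency f from one at f + f_s, yet the derivatives of the two candidate fits differ even at the sample points:

\sin\big(2\pi (f + f_s)\, t_n\big) = \sin\big(2\pi f\, t_n + 2\pi n\big) = \sin\big(2\pi f\, t_n\big)

\frac{d}{dt}\sin(2\pi f t)\Big|_{t_n} = 2\pi f \cos(2\pi f\, t_n)
\qquad\text{vs.}\qquad
\frac{d}{dt}\sin\big(2\pi (f + f_s) t\big)\Big|_{t_n} = 2\pi (f + f_s)\cos(2\pi f\, t_n)

So the two fits agree on every sample, but their derivatives disagree by a factor of (f + f_s) / f, which is exactly why running further pipeline stages on interpolated points can be so misleading.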
Basically, once you start making up data you can't use that data for anything valid anyway, because it's all just fairy tales and pixie dust unfortunately :( If the gaps are sporadic, I don't think it's a big problem because you'll just have a few NaN and minor gaps. And if you're trying to zoom in past the sampling rate, we probably shouldn't encourage that behavior anyway.
That's just justification for not connecting dots blindly on a chart. Most users have domain knowledge and understand how their data should be interpreted (which is why we want the feature to include the ability to specify the method). Users also know whether or not linear interpolation between the black dots is appropriate.
I think there are two very different issues here.
Upsampling/interpolating for visual consumption is totally acceptable imo. It's just making a chart more pleasant to look at, or to smooth out gaps so as to not distract the user. That's fine, and also totally doable client-side after the final aggregation is done running.
Then there's upsampling/interpolating as a stage in the analysis process. That is inherently dangerous, and I don't think it matters how much domain knowledge you have... you simply cannot "fill in the blanks" with any reasonable accuracy. The whole point of the Nyquist rate is that you cannot know what's going on below the sampling rate threshold. And if the gaps are due to actually missing samples (not just from zooming into a graph)... the data is missing, and can't be treated as present for analysis imo. A derivative of made-up data is a made-up derivative.
I was originally pro-interpolation and would be happy to be convinced otherwise, but after chatting with the core group we had a hard time determining a valid use-case outside of purely visual charting.
Line charts with connected dots and dots connected through interpolation of the data are the same thing to the user (regardless of whether a derivative is applied or not). They both suffer from the problem of not being able to see whether it's just missing data. If the user says, "hey, I don't really care about what's happening in the middle, I just want to assume the average between the two," that should be their choice.
I guess the way I see it is: either Elasticsearch is the data model for Kibana, or it's not and we need to do our post-metric-aggregation operations (otherwise known as pipeline aggs) elsewhere, like Timelion does. Time Series Visual Builder was built on top of the Elasticsearch data model / aggregation system, so any features we need to change the data (even for visual consumption) should be done at the Elasticsearch layer.
I understand the whole discussion as "we don't see valid use-cases" and "it's dangerous, users don't know what they're doing" when talking about interpolation/filling in the blanks.
Therefore, I'd like to present my use case to inform the discussion (maybe you'll have a better way to achieve it?):
My goal is to sum (or average) various time series (coming from multiple sensors), as the sum has domain significance. You could imagine that the sum of the energy consumption reported by 50 individual sensors in a city represents the whole consumption of the city, to be displayed, used in other applications, etc...
Some sensors, for various reasons outside my control, have a different resolution (either because they are configured at different sampling rates or because they only send points when the value changes, which might not happen for hours). If I want to be able to make a coherent sum of those points (and do further complex processing), I need some kind of interpolation. I'd also like to fill in the blanks first, as an attempted correction is probably better than nothing in the aggregated result.
In my domain (energy consumption), I don't care about high frequencies; I know that important variations cannot hide below my sampling rates (except outages, but I can definitely live with that for my purposes).
I understand the complexity it involves, especially if you want to flag the "made-up" data (I'd like to have that flag, but frankly I could live without it), but not having the feature would require me to fetch all the data and do all the processing client-side, or to store pre-interpolated data, which are things (especially the first one) I'd rather avoid.
Another argument for filling gaps with "invented" data is when you have not sampled data but event-based data, i.e. something happens which adds a document to your index. If you want to know the count of events per interval (say, per day or per hour), then you will have gaps, even though there were 0 events. In such cases, "missing" data is equivalent to "0". Not sure if such data (non-uniform intervals between samples) qualifies as a time series?
This complicates matters if you want to calculate the average of your events across several buckets. A single event on the first day over a time range of 10 days will give an "overall average" of 1/day instead of 0.1/day, which is very unintuitive, if not to say incorrect.
Another situation is the filtering of documents: suddenly, buckets will not affect the average anymore if all documents of that bucket are excluded by your filter.
For instance:
Day 1: [ 2, 3, 5, 7 ]
Day 2: [ ]
Day 3: [ 1, 1, 2, 3, 5 ]
Day 4: [ 42 ]
With the current behavior, the average number of documents per day will give you 3.3 (== (4+5+1) / 3), when I'd expect it to be 2.5 (== (4+0+5+1) / 4).
Or, if you want to know how many documents on average per day are greater than 6, the result will be 1 (== (1+1) / 2; 7 and 42), not 0.5 (== (1+0+0+1) / 4). Especially with filtering this can be very confusing, because the number of buckets changes depending on the query.
But maybe this is also not something to handle in an aggregation, but in a step before it. If you use "count", then missing values/empty buckets should be counted as 0, not skipped. If you use other metrics (e.g. min or max), then maybe skipping them or interpolating would be the better option. It depends.
@GeoffreyOnRails Hard to say without more details (maybe we can explore this more in a Discuss forum thread and see if we can work something out?). Are you looking for a total sum from a time-period? A running-sum/cumulative sum? Or like a sum of a moving window? It feels like something that can be done without interpolation... in your case, the discrete events are all the data you have/need/want, so interpolating would actually pollute the calculations.
But I may be way off the mark. If you want to post up a Discuss thread with more details we can dive into it there, see if it's workable and maybe circle back to this issue.
@knittl For the case where missing data == "0", you can use the insert_zeros gap_policy. That does exactly what you want: it inserts a zero when a gap is found.
The original intention of the gap policy was exactly this: some kind of basic interpolation. We had plans to extend it with further policies (linear, avg, etc.), but in the end it made everything super complicated and we had the above-mentioned concerns about "made-up" data.
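For reference, here's a minimal sketch of what that looks like on a derivative pipeline agg (the index name metrics and the field/agg names are invented for illustration):
GET /metrics/_search
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h"
      },
      "aggs": {
        "bytes_sum": {
          "sum": { "field": "bytes" }
        },
        "bytes_deriv": {
          "derivative": {
            "buckets_path": "bytes_sum",
            "gap_policy": "insert_zeros"
          }
        }
      }
    }
  }
}
With gap_policy set to insert_zeros, an empty bucket is treated as 0 before the derivative is computed, instead of being skipped.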
For the filtering issue, this is usually doable with various combinations of pipeline aggs, _count and _bucket_count (and min_doc_count: 0, which is the default these days). E.g. the first example can be done with:
GET /test/_search
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "timestamp",
        "interval": 1
      }
    },
    "avg_docs_day": {
      "avg_bucket": {
        "buckets_path": "histo._count"
      }
    }
  }
}
yielding:
{
  "aggregations": {
    "histo": {
      "buckets": [
        {
          "key": 1,
          "doc_count": 4
        },
        {
          "key": 2,
          "doc_count": 0
        },
        {
          "key": 3,
          "doc_count": 5
        },
        {
          "key": 4,
          "doc_count": 1
        }
      ]
    },
    "avg_docs_day": {
      "value": 2.5
    }
  }
}
The second situation can probably be done with a filter aggregation inside each bucket for values > 6, and then it should work fine.
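Roughly something like this (an untested sketch; the field name value and the sub-agg name over_six are placeholders, not anything from your data):
GET /test/_search
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "timestamp",
        "interval": 1
      },
      "aggs": {
        "over_six": {
          "filter": {
            "range": { "value": { "gt": 6 } }
          }
        }
      }
    },
    "avg_over_six_per_day": {
      "avg_bucket": {
        "buckets_path": "histo>over_six._count"
      }
    }
  }
}
Since min_doc_count: 0 keeps the empty buckets, the over_six filter reports a _count of 0 for them, so with the sample data above the avg_bucket should work out to (1 + 0 + 0 + 1) / 4 = 0.5 rather than 1.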
These are obviously simplified examples, so if you want to open a discuss thread with examples closer to your troubles we can work through that too :)
@polyfractal Thanks. Yes, gap_policy could make this work. But there's at least one other ticket which discusses deprecating the insert_zeros gap policy: #24893.
I am unsure if a filter on the bucket would yield the correct result. My use case is tracking purchases of a certain item (which can be found in the webserver logs); currently, buckets without a matching item are simply ignored in the average calculation.
And maybe I should have mentioned: I am unable to find a way to specify the gap policy in Kibana('s Visual Builder). There's also a ticket discussing that missing feature: #11793
@polyfractal: The discrete events represent the power value. If I don't have an event, it's because it wasn't read, not because it didn't happen. I'm not trying to count a number of discrete events here.
Let's say I'm watching TV. It consumes 100W. My sensor says 100W at 10:00 and 100W at 10:30. I'm still consuming at 10:15, it's continuous, but the sampling doesn't send me an event for that. I can save what I've received in documents; I have two of them.
If my neighbor has a different sensor and does the same thing that I do, I'll receive his consumption of 100W at 10:00, 100W at 10:15, and 100W at 10:30. I can save what I've received in documents; I have three of them.
Now let's say I'm interested in knowing the consumption in my street (we have a street of two houses in this example, it's a small city :D).
If I want to have the sum of the consumption of the two houses, adding up the documents would lead to:
10:00 -> 200W
10:15 -> 100W
10:30 -> 200W
Obviously not what I want to visualize / send to other systems.
This is a very simple case, but we would have the same issues in more complex calculations (and with more data).
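For what it's worth, the naive per-interval sum that produces those misleading numbers would look roughly like this (the index power-readings and the fields street, power, and timestamp are invented for the sketch):
GET /power-readings/_search
{
  "size": 0,
  "query": {
    "term": { "street": "my-street" }
  },
  "aggs": {
    "per_interval": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "15m"
      },
      "aggs": {
        "street_power": {
          "sum": { "field": "power" }
        }
      }
    }
  }
}
Because my sensor has no document in the 10:15 bucket, its 100W simply disappears from that bucket's sum, which is exactly the gap that a fit/interpolation step before the sum would need to cover.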