Elasticsearch: Query documents before rollup

Created on 13 Feb 2019 · 9 comments · Source: elastic/elasticsearch

Describe the feature:
When rolling up data, it would be nice to filter documents with a query. That is, instead of rolling up all documents on an index (or index pattern), aggregate only those that match the query.

The reason behind this is that once you roll up data, you cannot query it, and it would probably be too complex to store aggregated data in a way that supports certain queries. But filtering during the rollup job should be "easier" (I hope!), and that would be really useful.
For example, if we are storing HTTP requests on an index, we could create a few rollup jobs:

  • all documents, to see the overall traffic and maybe an average response time
  • documents matching q=status:500 to track errors over time
  • documents matching q=url:checkout to track interesting endpoints, maybe with a sum for the "amount" field too, to see the evolution of sales, for example
  • ...

(Each of these rollup jobs would go to a different rollup index of course)
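
To make the request concrete, here is a rough sketch of what such a job definition might look like, based on the current rollup job API. The index, field, and job names are made up, and the `query` section is exactly the part that does not exist today; exact parameter names also vary by Elasticsearch version.

```
PUT _rollup/job/http_errors
{
  "index_pattern": "http-requests-*",
  "rollup_index": "rollup-http-errors",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "query": { "term": { "status": 500 } },   // hypothetical: rollup jobs do not accept a query today
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
  },
  "metrics": [
    { "field": "response_time", "metrics": [ "avg" ] }
  ]
}
```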

This would be so powerful! What do you think?
Thank you!

_I have found another issue that sounds similar to me, but I am not sure, so please feel free to close this one if the idea behind it is the same:_ https://github.com/elastic/elasticsearch/issues/34921
_I also posted this on your Discuss forum:_ https://discuss.elastic.co/t/filtering-documents-for-rollup/167417

:Analytics/Rollup  >enhancement

All 9 comments

Pinging @elastic/es-analytics-geo

@TheBronx thanks for opening this issue. From what I understand so far, what you are trying to do can already be achieved using filtered aliases. You would define different aliases for your subsets of documents and then point the rollup jobs at those. I haven't tried this in practice though, maybe @polyfractal has ideas about this or knows alternative approaches?
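
For reference, a minimal sketch of that approach, assuming made-up index, alias, and field names: create a filtered alias over the source indices, then use the alias as the rollup job's `index_pattern`.

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "http-requests-*",
        "alias": "http-requests-errors",
        "filter": { "term": { "status": 500 } }
      }
    }
  ]
}

# the rollup job would then use "index_pattern": "http-requests-errors"
```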

Okay, it actually works!
Creating the alias is a bit less "dummy friendly" than filling in an input field in Kibana, haha, but on the other hand it works now :smile:
The rollup job seems to be working fine, and the "overhead" of aliases is pretty much zero, right?
I didn't know about index aliases, thank you so much @cbuescher

Great to hear! Maybe there are even simpler ways that @polyfractal knows about, so let's wait a bit for his thoughts, but I think we can close this after that.

Filtered aliases would be the best (and I think only) way to do it right now. We made a decision not to allow filtering on the rollup job itself, to prevent a "mismatch" between the input data and the output rollup data. E.g. it might be confusing for a user consuming rollup data to see data missing if they aren't aware that the job itself was filtered.

We may loosen that restriction in the future. But until then, a filtered alias would be the best way to do it.

the "overhead" of aliases is pretty much none right?

That's correct, the alias itself is essentially free, so the only extra cost is adding the filter itself :)

It is me again, I just found a problem with this approach :cry:
The documentation for Elasticsearch aliases says:

In this case, the alias is a point-in-time alias that will group all current indices that match, it will not automatically update as new indices that match this pattern are added/removed.

And that is exactly what I did, because I am using Logstash:
(screenshot of the alias definition)
The alias is matching all the indices that existed when I created it (2019.02.13), but no new data is being aggregated after that. The rollup job runs every hour but it is not finding anything new, of course.
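
In other words (names illustrative), the pattern in the alias is resolved once, when the alias is created, so you can check the alias membership and it will never include indices created afterwards:

```
# lists only the indices that existed when the alias was created;
# indices created later (e.g. logstash-2019.02.14) are not added automatically,
# so the rollup job pointed at the alias stops seeing new data
GET /_alias/logstash-errors
```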

I would have to recreate the alias every day for this approach to work, right? Maybe this is not the best way to do it :laughing:
So if you are going to use filtered aliases in combination with rollup jobs, be careful: you cannot match indices before they are created, even if you use a pattern (logstash-*) that would match those new indices.

It was too good to be true. Any other ideas?

I agree with @TheBronx, this is really a missing feature and it would be very useful.
Aliases are stored on the indices, so they are not flexible enough.

BTW, in Data Transforms we can define a query, so it would be consistent to also have this option in rollup jobs.
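
For comparison, a transform's `source` does accept a query (sketch below with made-up index and field names), which is the kind of option this issue is asking for on rollup jobs:

```
PUT _transform/http_errors_by_url
{
  "source": {
    "index": "http-requests-*",
    "query": { "term": { "status": 500 } }
  },
  "dest": { "index": "http-errors-summary" },
  "pivot": {
    "group_by": {
      "url": { "terms": { "field": "url.keyword" } }
    },
    "aggregations": {
      "avg_response_time": { "avg": { "field": "response_time" } }
    }
  }
}
```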

I'm going to re-open this ticket as a placeholder. We're working on a big refactor of Rollup (changing how search works, integrating with ILM, etc) so this request is something we can reconsider in light of the new framework. It's a fairly common request so far over the lifetime of Rollup v1.

That said, I think a lot of the difficulties remain; could be trappy for the "consumer" of the rollup data if they don't know it has been filtered, and I'm not sure how it would work/look under the new setup. But now's the time to think through those things, hence the re-open :)

Thanks for re-opening!
Regarding the difficulty you mention: first, the person who does the rollup is often the same one who consumes it. And if the consumer is surprised, they can still talk to the producer :)
To me, this is not a difficulty, just a fact.
