Elasticsearch: Filtering on results of an aggregation

Created on 10 Dec 2013 · 38Comments · Source: elastic/elasticsearch

The case here can best be described by referring to the HAVING clause in SQL. Basically we need to be able do an aggregation but limit the results of the query based on the value of the aggregation. For example if I wanted to look at what age bands have an average height greater that 5 feet, my query would look something like this:

 "aggs": {
        "genders": {
            "terms": {
                "field": "gender"
            },
            "aggs": {
                "age_groups" : {
                    "range" : {
                        "field" : "age",
                        "ranges" : [
                            { "to" : 5 },
                            { "from" : 5, "to" : 10 },
                            { "from" : 10, "to" : 15 },
                            { "from" : 15}
                        ]
                    },
                    "aggs" : {
                        "avg_height" : { 
                            "avg" : { "field" : "height" } ,
                            "having" : { "from" : 60 }
                        }
                    }
                }
            }
        }
    }

Source

jrick1977

Most helpful comment

A filter similar to "having" would be useful, I have been trying to find groups of documents with more than a certain number of matches overall. (i.e. SELECT id ... GROUP BY id HAVING COUNT(id) > 4)

kyle-stachowiak-relativity on 21 Jun 2014

👍4

All 38 comments

We also need a way to filter a multi valued aggregate down to a single value so we don't have to get so much data back. We retrieve values like biggest month, smallest month, busiest month, slowest month, etc, and I was hoping to at least be able to do something like below, but there is no "size" field on the histograms, so even though the results are ordered the way I want, I have to get back thousands of results just so I can print a single number on a website.

{
    "query": { "match_all": {} },
    "aggs": {
        "biggest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "sum_of_po_totals": "desc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            },
            "aggs": {
                "sum_of_po_totals": {
                    "sum": { "field": "PO.IssuedAmount" }
                }
            }
        },
        "smallest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "sum_of_po_totals": "asc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            },
            "aggs": {
                "sum_of_po_totals": {
                    "sum": { "field": "PO.IssuedAmount" }
                }
            }
        },
        "busiest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "_count": "desc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            }
        },
        "slowest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "_count": "asc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            }
        }
    },
    "size": 0
}

Better yet, I would love to see the single value aggregations (or something like them) work on multi value aggregations as a filter (which is currently not being used for anything), so something like this:

{
    "query": { "match_all": {} },
    "aggs": {
        "biggest_month": {
            "max": { "field": "by_month.sum_of_po_totals" }
        },
        "smallest_month": {
            "min": { "field": "by_month.sum_of_po_totals" }
        },
        "busiest_month": {
            "max": { "field": "by_month.count_of_pos" }
        },
        "slowest_month": {
            "min": { "field": "by_month.count_of_pos" }
        },
        "aggs": {
            "by_month": {
                "date_histogram": {
                    "field": "PO.IssuedDate",
                    "interval": "month",
                    "format": "yyyy-MM"
                },
                "aggs": {
                    "sum_of_po_totals": {
                        "sum": { "field": "PO.IssuedAmount" }
                    },
                    "count_of_pos": {
                        "value_count": { "field": "PO.ID" }
                    }
                }
            }
        }
    },
    "size": 0
}

Of course, the "having" modifier mentioned originally would also be fantastic for doing range queries, etc on these aggregate results as well, and would also solve our issue if we could use a "max" aggregate in a "having" clause (even though I personally find it a little less intuitive to have it at the bottom of the hierarchy, but I'm guessing it would probably be easier to implement that way since it is still a single valued aggregate operating under a multi-valued bucket aggregate).

No matter what the solution, this is still a pretty massive hole for our needs in an otherwise awesome aggregations framework, which is unfortunate.

lusid on 16 Jan 2014

If I well understand this issue, you would like to have the equivalent of the facet_filter of the facets but for aggregations?

SeyZ on 13 Feb 2014

The filter functionality already provides something like facet_filter, in that it filters the results prior to aggregating them. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html

I think what we are looking for is a way to further filter the result of the aggregation after the fact (i.e. the MAX of the SUM).

lusid on 13 Feb 2014

👍4

Also, would it be possible to apply aggregation in the query filter? We need to limit search result based on the aggregation result. For Example, return all the employees in a department "dep1" whose salary is less than the average salary of a department "dep1".

kajal23 on 8 Apr 2014

👍1

A filter similar to "having" would be useful, I have been trying to find groups of documents with more than a certain number of matches overall. (i.e. SELECT id ... GROUP BY id HAVING COUNT(id) > 4)

kyle-stachowiak-relativity on 21 Jun 2014

👍4

Are there any plans to implement such a feature? We really need this for one of our "big data" projects. Are there any approaches to implement this via a plugin? I would be very thankful for any hints.