Elasticsearch: Filtering on results of an aggregation

Created on 10 Dec 2013  路  38Comments  路  Source: elastic/elasticsearch

The case here can best be described by referring to the HAVING clause in SQL. Basically we need to be able do an aggregation but limit the results of the query based on the value of the aggregation. For example if I wanted to look at what age bands have an average height greater that 5 feet, my query would look something like this:

 "aggs": {
        "genders": {
            "terms": {
                "field": "gender"
            },
            "aggs": {
                "age_groups" : {
                    "range" : {
                        "field" : "age",
                        "ranges" : [
                            { "to" : 5 },
                            { "from" : 5, "to" : 10 },
                            { "from" : 10, "to" : 15 },
                            { "from" : 15}
                        ]
                    },
                    "aggs" : {
                        "avg_height" : { 
                            "avg" : { "field" : "height" } ,
                            "having" : { "from" : 60 }
                        }
                    }
                }
            }
        }
    }

Most helpful comment

A filter similar to "having" would be useful, I have been trying to find groups of documents with more than a certain number of matches overall. (i.e. SELECT id ... GROUP BY id HAVING COUNT(id) > 4)

All 38 comments

We also need a way to filter a multi valued aggregate down to a single value so we don't have to get so much data back. We retrieve values like biggest month, smallest month, busiest month, slowest month, etc, and I was hoping to at least be able to do something like below, but there is no "size" field on the histograms, so even though the results are ordered the way I want, I have to get back thousands of results just so I can print a single number on a website.

{
    "query": { "match_all": {} },
    "aggs": {
        "biggest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "sum_of_po_totals": "desc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            },
            "aggs": {
                "sum_of_po_totals": {
                    "sum": { "field": "PO.IssuedAmount" }
                }
            }
        },
        "smallest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "sum_of_po_totals": "asc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            },
            "aggs": {
                "sum_of_po_totals": {
                    "sum": { "field": "PO.IssuedAmount" }
                }
            }
        },
        "busiest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "_count": "desc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            }
        },
        "slowest_month": {
            "date_histogram": {
                "field": "PO.IssuedDate",
                "interval": "month",
                "order": { "_count": "asc" },
                "format": "yyyy-MM",
                "size": 1  // no size???
            }
        }
    },
    "size": 0
}

Better yet, I would love to see the single value aggregations (or something like them) work on multi value aggregations as a filter (which is currently not being used for anything), so something like this:

{
    "query": { "match_all": {} },
    "aggs": {
        "biggest_month": {
            "max": { "field": "by_month.sum_of_po_totals" }
        },
        "smallest_month": {
            "min": { "field": "by_month.sum_of_po_totals" }
        },
        "busiest_month": {
            "max": { "field": "by_month.count_of_pos" }
        },
        "slowest_month": {
            "min": { "field": "by_month.count_of_pos" }
        },
        "aggs": {
            "by_month": {
                "date_histogram": {
                    "field": "PO.IssuedDate",
                    "interval": "month",
                    "format": "yyyy-MM"
                },
                "aggs": {
                    "sum_of_po_totals": {
                        "sum": { "field": "PO.IssuedAmount" }
                    },
                    "count_of_pos": {
                        "value_count": { "field": "PO.ID" }
                    }
                }
            }
        }
    },
    "size": 0
}

Of course, the "having" modifier mentioned originally would also be fantastic for doing range queries, etc on these aggregate results as well, and would also solve our issue if we could use a "max" aggregate in a "having" clause (even though I personally find it a little less intuitive to have it at the bottom of the hierarchy, but I'm guessing it would probably be easier to implement that way since it is still a single valued aggregate operating under a multi-valued bucket aggregate).

No matter what the solution, this is still a pretty massive hole for our needs in an otherwise awesome aggregations framework, which is unfortunate.

If I well understand this issue, you would like to have the equivalent of the facet_filter of the facets but for aggregations?

The filter functionality already provides something like facet_filter, in that it filters the results prior to aggregating them. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html

I think what we are looking for is a way to further filter the result of the aggregation after the fact (i.e. the MAX of the SUM).

Also, would it be possible to apply aggregation in the query filter? We need to limit search result based on the aggregation result. For Example, return all the employees in a department "dep1" whose salary is less than the average salary of a department "dep1".

A filter similar to "having" would be useful, I have been trying to find groups of documents with more than a certain number of matches overall. (i.e. SELECT id ... GROUP BY id HAVING COUNT(id) > 4)

Are there any plans to implement such a feature? We really need this for one of our "big data" projects. Are there any approaches to implement this via a plugin? I would be very thankful for any hints.

We are willing to provide financial support for this feature if that helps. Just let me know how to get in contact.

+1. This feature is very much needed. Happy to provide resources to help if necessary.

Mark. We also want to migrate our database to ES but stopped by this issue.

+1

+1

+1

+1

+1

+1

+1

+1

+1 as well.

+1

+1 this functionality would be very useful

+1

+1

+1

:+1:

+1

Closing in favour of #8110

+1

+1

+1

+1

+1

+1

1+

+1

+1

@yehosef have a look at the bucket selector agg

@clintongormley - great, thanks for the reference.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

rbraley picture rbraley  路  67Comments

geekpete picture geekpete  路  59Comments

bubo77 picture bubo77  路  43Comments

javanna picture javanna  路  72Comments

eryabitskiy picture eryabitskiy  路  94Comments