Elasticsearch: Support dynamic interval and fixed buckets for histogram aggregation

Created on 5 Feb 2015 · 19 comments · Source: elastic/elasticsearch

This is basically the opposite of how the histogram currently acts: fixed interval and dynamic number of buckets.

My use case is that I would like to build a price range style aggregation for filtering products:

$0 - $100
$101 - $200
$201 - $300
$301 - $400
$401 - $500

While you can do this with the range agg if you know the lower and upper bounds of all your prices in advance, I can't see a way to do it dynamically based on the current query. The range agg also has other shortcomings, such as generating buckets that may contain no documents.

Ideally, I'd like to be able to specify the number of ranges / buckets I want (eg. 5) with a dynamic interval calculated by the aggregation.

min_doc_count could also come into play here: with min_doc_count: 0 the requested number of buckets is guaranteed (empty buckets included), while with min_doc_count: 1 empty buckets are dropped and the requested number of buckets becomes a maximum number of potential buckets rather than a guaranteed amount.

Labels: :AnalyticAggregations, >enhancement

All 19 comments

It would indeed be a neat feature to dynamically guess the ranges. However, it seems difficult to do in a single request in a distributed environment, as we don't know the data on other shards. So for now as a workaround I would do two requests of the same query. The first one to get the ranges and the second one to perform the aggregation on those ranges.
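The client-side half of that workaround is simple: once the first request has returned the min/max (e.g. via min and max aggregations), derive a fixed interval and the range boundaries for the second request. A minimal sketch, in Python; the function name and signature are illustrative, not part of any Elasticsearch client API:

```python
def price_ranges(min_value, max_value, num_buckets):
    """Split [min_value, max_value] into num_buckets equal-width ranges,
    suitable for feeding into a subsequent range aggregation."""
    if max_value <= min_value:
        # Degenerate case: all documents share one value.
        return [(min_value, min_value + 1)]
    interval = (max_value - min_value) / num_buckets
    return [
        (min_value + i * interval, min_value + (i + 1) * interval)
        for i in range(num_buckets)
    ]

# Five buckets over prices $0-$500 -> $0-100, $100-200, ..., $400-500
print(price_ranges(0, 500, 5))
```

Each returned pair maps directly onto a `from`/`to` entry of a range aggregation in the second request.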

For more discussion on a similar feature request see: https://github.com/elasticsearch/elasticsearch/issues/9531

Yeah, I was really hoping to avoid separate queries, as performance is fairly critical for us.

Yes, but the second query should execute much faster thanks to the file system cache, not to mention that any filter in the query may also be cached.

Hi,
I am relatively new to Elasticsearch, but I found an interesting thing about date_histogram. We have an aggregation query with interval=week.

{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "MM-dd-yyyy",
        "interval": "week"
      }
    }
  }
}

If we use a filter with date ranges as follows, starting from Jan 1, 2015:

{
  "range": {
    "document.date":  {
      "from": "2015-01-01",
      "to": "2015-02-10"
     }
  }
}

In the result buckets we get an extra week from the previous year's December (starting Dec 29, 2014):

{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
  HERE => "key_as_string": "12-29-2014",
          "key": 1419811200000,
          "doc_count": 29
        },
        {
          "key_as_string": "01-05-2015",
          "key": 1420416000000,
          "doc_count": 39
        },
        {
          "key_as_string": "01-12-2015",
          "key": 1421020800000,
          "doc_count": 45
        },
        {
          "key_as_string": "01-19-2015",
          "key": 1421625600000,
          "doc_count": 55
        },
        {
          "key_as_string": "01-26-2015",
          "key": 1422230400000,
          "doc_count": 29
        }
      ]
    }
  }
}

Is this a standard behaviour of Elasticsearch date histogram aggregations or is this issue the part of current discussion?

@imanhodjaev I believe that is because the date math code includes the entire week instead of a partial week. It does not truncate the edges of the range.
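To illustrate: with a weekly interval, each bucket is keyed by the start of its calendar week (Monday, since Elasticsearch uses ISO weeks), so documents dated 2015-01-01 (a Thursday) land in a bucket keyed 2014-12-29. A quick check with Python's datetime, where `week_start` is just a helper for this demonstration:

```python
from datetime import date, timedelta

def week_start(d):
    """Return the Monday beginning d's ISO calendar week,
    mirroring how a weekly date_histogram keys its buckets."""
    return d - timedelta(days=d.weekday())

# 2015-01-01 is a Thursday, so its weekly bucket is keyed 2014-12-29,
# which is why the bucket from "last year" appears in the response above.
print(week_start(date(2015, 1, 1)))
```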

I'd love to vote for this issue. It's severely prohibiting me from attaining my use case of date_histograms. I'm doing date_histogram as a sub-aggregation to a terms aggregation, so I'm getting one histogram per term bucket. If I can't define the interval dynamically per-bucket, then I have to do hundreds of queries instead of just 1.

At the moment, I don't care much about cost to server-side computations. The client can't be expected to perform hundreds of separate queries in my real-time interactive application. So having this feature is critical.

I've attempted to work around it using groovy scripts to calculate the interval, which works, but once again I'm blocked, as I can't refer to the result of my script inside the "interval" argument of the date_histogram.

Is there another workaround that doesn't multiply queries? Or can this feature be reconsidered? Thanks!

+1 on the original use case of dynamic range histogram buckets for numeric (non-datetime fields). This is a pretty common use case for commerce applications.

Stalled by #12316. It is still possible to implement this feature by doing two requests as described above.

+1 for a workaround that doesn't use a two-step request to Elasticsearch. Auto-calculating the ranges looks like a great idea, although I can see that implementing this can be tricky given the number of things that we should consider.

Copying my comments from a duplicate issue:

This would be a boon for Kibana because it would allow us to easily auto scale the histogram visualization and protect users from accidentally creating too many buckets, crashing their browser in the process.

We could accomplish this by issuing two requests from the client, the first grabbing the min/max values for the given field. But since the aggregation may have a search query and/or filters associated with it we don't know how performant this would be. I imagine it would be more efficient to implement this option inside elasticsearch so the calculations can be done in a single request.
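For readable charts, the client-side calculation also wants to round the raw interval up to a "nice" number rather than use (max − min) / buckets directly. A hypothetical sketch of that rounding step (the function name and the 1/2/5 ladder are illustrative choices, not taken from Kibana's source):

```python
import math

def nice_interval(min_value, max_value, target_buckets):
    """Round (max - min) / target_buckets up to the nearest 1, 2, or 5
    times a power of ten, keeping bucket boundaries human-readable.
    Assumes max_value > min_value."""
    raw = (max_value - min_value) / target_buckets
    power = 10 ** math.floor(math.log10(raw))
    for mult in (1, 2, 5, 10):
        if mult * power >= raw:
            return mult * power

# Prices 0..480 with ~5 buckets -> a tidy $100 interval instead of $96.
print(nice_interval(0, 480, 5))
```

The resulting interval yields at most target_buckets buckets, which is exactly the "protect users from too many buckets" property mentioned above.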

This would be similar to MongoDB $bucketAuto aggregation. I would love to see something similar in ES.

This issue was discussed today, here is a brief summary: one way to implement it would be to request the min/max values first like @Bargs mentioned. Doing it in Elasticsearch would have little benefit compared to doing it on the client-side. Another way to implement it would be for each shard to adapt the interval on the fly and then at reducing time we would pick the larger interval. However the latter would require us to add ways to reuse buckets, which we do not have today, but may be required by other aggs such as clustering (including percentiles). The recommended way to work around the lack of support for dynamic intervals for now is to issue two requests to Elasticsearch: one to figure out the min/max values of the field, and another one with the histogram aggregation.
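The second approach (per-shard adaptive intervals, coarsened at reduce time) can be sketched as follows. This is a toy model, not Elasticsearch code, and it assumes every shard's interval evenly divides the coarsest one, so finer buckets fold cleanly into larger ones:

```python
from collections import defaultdict

def reduce_histograms(shard_results):
    """shard_results: list of (interval, {bucket_key: doc_count}) pairs,
    one per shard. Pick the largest interval and re-key every finer
    bucket into it, summing doc counts. Assumes each shard interval
    divides the largest interval."""
    target = max(interval for interval, _ in shard_results)
    merged = defaultdict(int)
    for _, buckets in shard_results:
        for key, count in buckets.items():
            merged[key // target * target] += count
    return target, dict(merged)

# Shard A bucketed at interval 10, shard B at 20; the reduce keeps 20.
print(reduce_histograms([(10, {0: 3, 10: 2, 20: 5}), (20, {0: 4, 20: 1})]))
```

The tricky part the comment alludes to is exactly this re-keying: real aggregations would need infrastructure to merge and reuse buckets, which did not exist at the time.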

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

I believe this issue should still be open. #28993 only supports date histograms. There needs to be a similar PR for non-date, regular histograms.

@melissachang you are absolutely right that we need to support numeric histograms, but I've created a separate issue for that, #31828. We can track that work using this new issue. Thank you!

Just for the sake of completeness, here's a link for Sturges' rule, which is a sound approach to determining bin counts based merely on sample count, assuming a distribution reasonably close to normal.
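Sturges' rule sets the bin count at k = ⌈log2 n⌉ + 1 for n samples, so it needs only the document count, which a single request already returns. A one-function sketch:

```python
import math

def sturges_bins(n):
    """Sturges' rule: suggested histogram bin count for n samples,
    assuming a roughly normal distribution. Requires n >= 1."""
    return int(math.ceil(math.log2(n))) + 1

# 1000 samples -> ceil(log2(1000)) + 1 = 11 bins
print(sturges_bins(1000))
```

The bucket count could then be combined with the min/max from the first request to derive the interval, as in the two-query workaround above.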

@monfera did you figure out a way to use Sturges' rule to determine the number of buckets? Someone else asked this question a few years ago as well: https://discuss.elastic.co/t/histogram-and-sturges-formula/14133

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

I am not sure if I got it right, but here https://github.com/findinpath/elastic-price-range-aggregation/blob/master/src/test/kotlin/com/findinpath/PriceRangeAggregationTest.kt#L119 you can find an implementation of the two-query approach suggested.

+1 for "Support dynamic interval and fixed buckets for histogram aggregation".
