Elasticsearch: Support dynamic interval and fixed buckets for histogram aggregation

Created on 5 Feb 2015 · 19 comments · Source: elastic/elasticsearch

This is basically the opposite of how the histogram currently acts: fixed interval and dynamic number of buckets.

My use case is that I would like to build a price range style aggregation for filtering products:

$0 - $100
$101 - $200
$201 - $300
$301 - $400
$401 - $500

While you can do this with the range agg if you know the lower and upper bounds of all your prices in advance, I can't see a way to do it dynamically based on the current query. The range agg also has other shortcomings, such as generating buckets that may contain no documents.

Ideally, I'd like to be able to specify the number of ranges / buckets I want (eg. 5) with a dynamic interval calculated by the aggregation.

min_doc_count could also come into play here: with min_doc_count: 0 the requested number of buckets is guaranteed (empty buckets included), while with min_doc_count: 1 empty buckets are dropped and the requested number of buckets becomes a maximum number of potential buckets rather than a guaranteed amount.

Labels: :AnalyticAggregations, >enhancement

All 19 comments

It would indeed be a neat feature to dynamically guess the ranges. However, it seems difficult to do in a single request in a distributed environment, as we don't know the data on other shards. So for now as a workaround I would do two requests of the same query. The first one to get the ranges and the second one to perform the aggregation on those ranges.
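The client-side half of that workaround is simple: once the first request has returned the min/max (e.g. via min and max aggregations), derive a fixed interval and the range boundaries for the second request. A minimal sketch, in Python; the function name and signature are illustrative, not part of any Elasticsearch client API:

```python
def price_ranges(min_value, max_value, num_buckets):
    """Split [min_value, max_value] into num_buckets equal-width ranges,
    suitable for feeding into a subsequent range aggregation."""
    if max_value <= min_value:
        # Degenerate case: all documents share one value.
        return [(min_value, min_value + 1)]
    interval = (max_value - min_value) / num_buckets
    return [
        (min_value + i * interval, min_value + (i + 1) * interval)
        for i in range(num_buckets)
    ]

# Five buckets over prices $0-$500 -> $0-100, $100-200, ..., $400-500
print(price_ranges(0, 500, 5))
```

Each returned pair maps directly onto a `from`/`to` entry of a range aggregation in the second request.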

For more discussion on a similar feature request see: https://github.com/elasticsearch/elasticsearch/issues/9531

Yeah, I was really hoping to avoid separate queries, as performance is fairly critical for us.

Yes, but the second query should execute much faster thanks to the file system cache, not to mention that any filter in the query may also be cached.

Hi,
I am relatively new to Elasticsearch, but I found an interesting thing about date_histogram. We have an aggregation query with interval=week.

{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "MM-dd-yyyy",
        "interval": "week"
      }
    }
  }
}

If we use a filter with date ranges as follows, starting from Jan 1, 2015:

{
  "range": {
    "document.date":  {
      "from": "2015-01-01",
      "to": "2015-02-10"
     }
  }
}

In the result buckets we get an extra week from the previous year's December (starting Dec 29, 2014):

{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
  HERE => "key_as_string": "12-29-2014",
          "key": 1419811200000,
          "doc_count": 29
        },
        {
          "key_as_string": "01-05-2015",
          "key": 1420416000000,
          "doc_count": 39
        },
        {
          "key_as_string": "01-12-2015",
          "key": 1421020800000,
          "doc_count": 45
        },
        {
          "key_as_string": "01-19-2015",
          "key": 1421625600000,
          "doc_count": 55
        },
        {
          "key_as_string": "01-26-2015",
          "key": 1422230400000,
          "doc_count": 29
        }
      ]
    }
  }
}

Is this a standard behaviour of Elasticsearch date histogram aggregations or is this issue the part of current discussion?

@imanhodjaev I believe that is because the date math code includes the entire week instead of a partial week. It does not truncate the edges of the range.
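To illustrate: with a weekly interval, each bucket is keyed by the start of its calendar week (Monday, since Elasticsearch uses ISO weeks), so documents dated 2015-01-01 (a Thursday) land in a bucket keyed 2014-12-29. A quick check with Python's datetime, where `week_start` is just a helper for this demonstration:

```python
from datetime import date, timedelta

def week_start(d):
    """Return the Monday beginning d's ISO calendar week,
    mirroring how a weekly date_histogram keys its buckets."""
    return d - timedelta(days=d.weekday())

# 2015-01-01 is a Thursday, so its weekly bucket is keyed 2014-12-29,
# which is why the bucket from "last year" appears in the response above.
print(week_start(date(2015, 1, 1)))
```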

I'd love to vote for this issue. It's severely prohibiting me from attaining my use case of date_histograms. I'm doing date_histogram as a sub-aggregation to a terms aggregation, so I'm getting one histogram per term bucket. If I can't define the interval dynamically per-bucket, then I have to do hundreds of queries instead of just 1.

At the moment, I don't care much about cost to server-side computations. The client can't be expected to perform hundreds of separate queries in my real-time interactive application. So having this feature is critical.

I've attempted to work around it using groovy scripts to calculate the interval, which works, but once again I'm blocked, as I can't refer to the result of my script inside the "interval" argument of the date_histogram.

Is there another workaround that doesn't multiply queries? Or can this feature be reconsidered? Thanks!

+1 on the original use case of dynamic range histogram buckets for numeric (non-datetime fields). This is a pretty common use case for commerce applications.

Stalled by #12316. It is still possible to implement this feature by doing two requests as described above.

+1 for a workaround that doesn't use a two-step request to Elasticsearch. Auto-calculating the ranges looks like a great idea, although I can see that implementing this can be tricky given the number of things that we should consider.

Copying my comments from a duplicate issue:

This would be a boon for Kibana because it would allow us to easily auto scale the histogram visualization and protect users from accidentally creating too many buckets, crashing their browser in the process.

We could accomplish this by issuing two requests from the client, the first grabbing the min/max values for the given field. But since the aggregation may have a search query and/or filters associated with it we don't know how performant this would be. I imagine it would be more efficient to implement this option inside elasticsearch so the calculations can be done in a single request.
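For readable charts, the client-side calculation also wants to round the raw interval up to a "nice" number rather than use (max − min) / buckets directly. A hypothetical sketch of that rounding step (the function name and the 1/2/5 ladder are illustrative choices, not taken from Kibana's source):

```python
import math

def nice_interval(min_value, max_value, target_buckets):
    """Round (max - min) / target_buckets up to the nearest 1, 2, or 5
    times a power of ten, keeping bucket boundaries human-readable.
    Assumes max_value > min_value."""
    raw = (max_value - min_value) / target_buckets
    power = 10 ** math.floor(math.log10(raw))
    for mult in (1, 2, 5, 10):
        if mult * power >= raw:
            return mult * power

# Prices 0..480 with ~5 buckets -> a tidy $100 interval instead of $96.
print(nice_interval(0, 480, 5))
```

The resulting interval yields at most target_buckets buckets, which is exactly the "protect users from too many buckets" property mentioned above.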

This would be similar to MongoDB $bucketAuto aggregation. I would love to see something similar in ES.

This issue was discussed today, here is a brief summary: one way to implement it would be to request the min/max values first like @Bargs mentioned. Doing it in Elasticsearch would have little benefit compared to doing it on the client-side. Another way to implement it would be for each shard to adapt the interval on the fly and then at reducing time we would pick the larger interval. However the latter would require us to add ways to reuse buckets, which we do not have today, but may be required by other aggs such as clustering (including percentiles). The recommended way to work around the lack of support for dynamic intervals for now is to issue two requests to Elasticsearch: one to figure out the min/max values of the field, and another one with the histogram aggregation.
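The second approach (per-shard adaptive intervals, coarsened at reduce time) can be sketched as follows. This is a toy model, not Elasticsearch code, and it assumes every shard's interval evenly divides the coarsest one, so finer buckets fold cleanly into larger ones:

```python
from collections import defaultdict

def reduce_histograms(shard_results):
    """shard_results: list of (interval, {bucket_key: doc_count}) pairs,
    one per shard. Pick the largest interval and re-key every finer
    bucket into it, summing doc counts. Assumes each shard interval
    divides the largest interval."""
    target = max(interval for interval, _ in shard_results)
    merged = defaultdict(int)
    for _, buckets in shard_results:
        for key, count in buckets.items():
            merged[key // target * target] += count
    return target, dict(merged)

# Shard A bucketed at interval 10, shard B at 20; the reduce keeps 20.
print(reduce_histograms([(10, {0: 3, 10: 2, 20: 5}), (20, {0: 4, 20: 1})]))
```

The tricky part the comment alludes to is exactly this re-keying: real aggregations would need infrastructure to merge and reuse buckets, which did not exist at the time.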

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

I believe this issue should still be open. #28993 only supports date histograms. There needs to be a similar PR for non-date, regular histograms.

@melissachang you are absolutely right that we need to support numeric histograms, but I've created a separate issue for that, #31828. We can track that work using this new issue. Thank you!

Just for the sake of completeness, here's a link for Sturges' rule, which is a sound approach to determining bin counts based merely on sample count, assuming a distribution reasonably close to normal.
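Sturges' rule sets the bin count at k = ⌈log2 n⌉ + 1 for n samples, so it needs only the document count, which a single request already returns. A one-function sketch:

```python
import math

def sturges_bins(n):
    """Sturges' rule: suggested histogram bin count for n samples,
    assuming a roughly normal distribution. Requires n >= 1."""
    return int(math.ceil(math.log2(n))) + 1

# 1000 samples -> ceil(log2(1000)) + 1 = 11 bins
print(sturges_bins(1000))
```

The bucket count could then be combined with the min/max from the first request to derive the interval, as in the two-query workaround above.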

@monfera did you figure out a way to use Sturges' rule to determine the number of buckets? Someone else asked this question a few years ago as well: https://discuss.elastic.co/t/histogram-and-sturges-formula/14133

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

I am not sure if I got it right, but here https://github.com/findinpath/elastic-price-range-aggregation/blob/master/src/test/kotlin/com/findinpath/PriceRangeAggregationTest.kt#L119 you can find an implementation of the two-query approach suggested.

+1 for "Support dynamic interval and fixed buckets for histogram aggregation".
