Elasticsearch: min_doc_count=0 doesn't work with a date_histogram with a filter

Created on 22 Jan 2014 · 17 Comments · Source: elastic/elasticsearch

I'm trying to create a date_histogram for recent events in which days with no events are still shown.

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-01-10"
          }
        }
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "format": "yyyy-MM-dd",
            "interval": "1d"
          }
        }
      }
    }
  }
}

I get a response like this:

"aggregations":  {
  "events_last_week": {
    "doc_count": 33861,
    "events_last_week_histogram": [
      {
        "key_as_string": "2014-01-10",
        "key": 1389744000000,
        "doc_count": 2120
      }, {
        "key_as_string": "2014-01-16",
        "key": 1389830400000,
        "doc_count": 3823
      }, {
        "key_as_string": "2014-01-17",
        "key": 1389916800000,
        "doc_count": 27918
      }
    ]
  }
}

The empty days are not returned. If I construct the query without the filter, the empty days are returned correctly.

There is also an issue even when the empty days are returned correctly without the filter. If, for example, today is "2014-01-22", and the latest timestamp in my data is "2014-01-17", then the 5 days between these two dates are not returned as empty buckets, though all the empty buckets prior to "2014-01-17" are returned correctly.

>bug v1.0.0.RC2 v1.1.0 v2.0.0-beta1

cmaitchison

Most helpful comment

For anyone who arrived at this thread via Google: hard ranges are supported via the extended_bounds parameter. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html

deanchen on 28 Apr 2015

👍4

All 17 comments

@cmaitchison

I can't really reproduce it: I ran the same queries as you and got the right responses. Which ES version are you working with? We introduced min_doc_count in 1.0.0.RC1.

There is also an issue even when the empty days are returned correctly without the filter. If, for example, today is "2014-01-22", and the latest timestamp in my data is "2014-01-17", then the 5 days between these two dates are not returned as empty buckets, though all the empty buckets prior to "2014-01-17" are returned correctly.

The gaps that are filled are based on the dates in the documents you're aggregating. The first histogram bucket is based on the earliest date in the document set and the last bucket on the latest date in the set; we then fill in all gaps between these two buckets.

We could consider adding a "range" setting to the histograms that would let you define the value range (or date range, in the case of date_histogram) over which the buckets are created. In your case, that would mean that if you define a range of the form "range": { "to" : "now" } along with "min_doc_count" : 0, we'd return all the empty buckets up to now (beyond the dates in the document set).

uboness on 22 Jan 2014
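
Editor's sketch: the gap-filling rule described in the comment above can be illustrated in a few lines of Python. This is only a model of the described behaviour, not Elasticsearch's actual code; the function name `fill_between_observed` and the sample counts are hypothetical.

```python
def fill_between_observed(counts, interval_ms):
    """Zero-fill buckets only between the smallest and largest observed keys.

    counts maps a bucket key (epoch millis, aligned to the interval)
    to the number of documents that fell into that bucket.
    """
    if not counts:
        # No documents means no min/max to anchor on, so nothing is filled.
        return []
    first, last = min(counts), max(counts)
    return [
        {"key": key, "doc_count": counts.get(key, 0)}
        for key in range(first, last + interval_ms, interval_ms)
    ]


DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical data: documents on day 0 and day 2, nothing on day 1.
observed = {0 * DAY_MS: 5, 2 * DAY_MS: 7}
buckets = fill_between_observed(observed, DAY_MS)
```

Under this model the empty day in the middle is returned with a count of zero, but nothing is produced before the earliest or after the latest document, and an empty document set yields no buckets at all.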

@cmaitchison scratch that... I finally managed to reproduce it (it happens when you have a single shard)... will work on a fix

uboness on 22 Jan 2014

Wow, nice find! I would never have thought to mention that.

cmaitchison on 22 Jan 2014

Also related to this issue: I've found that min_doc_count=0 does not work if _all_ of the buckets would be empty after applying the filter. I can reproduce this on an index with 2 shards.

{
  "aggs": {
    "filtered_events": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "from": 1390267500000,
                "to":   1390267560000
              }
            }
          }
        ]
      },
      "aggs": {
        "filtered_events_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "interval": "1s"
          }
        }
      }
    }
  }
}

The above query should return 60 results, 1 for each second in the minute. If any events are found in that minute then 60 results are returned. If no events are found in that minute then 0 results are returned, when you would expect 60 empty buckets.

My use case is zooming in on a series on a chart. The zero value results are very helpful to know where to plot the zeros on the x-axis.

cmaitchison on 23 Jan 2014

🎉1 😄1

Another related issue I am finding is that sometimes the intervals do not go back far enough.

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "from": 1390267432894,
                "to": 1390267547037
              }
            }
          }
        ]
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "interval": "second"
          }
        }
      }
    }
  }
}

returns exactly

{
  "aggregations": {
    "events_last_week": {
      "doc_count": 1099,
      "events_last_week_histogram": [
        {
          "key": 1390267526000,
          "doc_count": 12
        },
        {
          "key": 1390267527000,
          "doc_count": 0
        },
        {
          "key": 1390267528000,
          "doc_count": 29
        },
        {
          "key": 1390267529000,
          "doc_count": 32
        },
        {
          "key": 1390267530000,
          "doc_count": 58
        },
        {
          "key": 1390267531000,
          "doc_count": 64
        },
        {
          "key": 1390267532000,
          "doc_count": 35
        },
        {
          "key": 1390267533000,
          "doc_count": 36
        },
        {
          "key": 1390267534000,
          "doc_count": 43
        },
        {
          "key": 1390267535000,
          "doc_count": 52
        },
        {
          "key": 1390267536000,
          "doc_count": 58
        },
        {
          "key": 1390267537000,
          "doc_count": 62
        },
        {
          "key": 1390267538000,
          "doc_count": 76
        },
        {
          "key": 1390267539000,
          "doc_count": 70
        },
        {
          "key": 1390267540000,
          "doc_count": 53
        },
        {
          "key": 1390267541000,
          "doc_count": 72
        },
        {
          "key": 1390267542000,
          "doc_count": 81
        },
        {
          "key": 1390267543000,
          "doc_count": 48
        },
        {
          "key": 1390267544000,
          "doc_count": 88
        },
        {
          "key": 1390267545000,
          "doc_count": 45
        },
        {
          "key": 1390267546000,
          "doc_count": 83
        },
        {
          "key": 1390267547000,
          "doc_count": 2
        }
      ]
    }
  }
}

But it is missing all of the empty buckets between 1390267432894 and 1390267526000. Again, this is with a 2-shard index on 1.0.0.RC1.

cmaitchison on 23 Jan 2014

@cmaitchison as I mentioned above, the histogram operates on the dataset and extracts its min/max from the documents (the earliest/latest). There is no direct relation between the filter aggregation and the histogram aggregation (aggregations are unaware of other aggregations in their hierarchy). We could potentially add a range feature to histogram, but if we do, it will have to be post 1.0.

In the first example you gave, there are no documents in that minute, so there are no buckets (we can't determine the min/max values). For the second example, it might be that the first document in the doc set has a later timestamp than the "from" value in the filter.

uboness on 23 Jan 2014

Thanks, @uboness, for your help and excellent explanation. range on histogram is definitely a feature I would use. For now I can fill in the gaps on the client-side. Thanks again.

cmaitchison on 23 Jan 2014
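
Editor's sketch: one way the client-side gap filling mentioned above might look in Python. It assumes fixed-width intervals with epoch-aligned bucket keys (as in the per-second example earlier in the thread) and does not handle calendar intervals such as months, or time zones. The function name `fill_requested_range` is hypothetical.

```python
def fill_requested_range(buckets, from_ms, to_ms, interval_ms):
    """Return one bucket per interval covering [from_ms, to_ms], zero-filling gaps."""
    counts = {b["key"]: b["doc_count"] for b in buckets}
    # Round both bounds down to the start of the bucket that contains them.
    start = from_ms - from_ms % interval_ms
    end = to_ms - to_ms % interval_ms
    return [
        {"key": key, "doc_count": counts.get(key, 0)}
        for key in range(start, end + interval_ms, interval_ms)
    ]


# Bounds taken from the range filter in the per-second query above.
# Only the first and last returned buckets are reproduced, for brevity.
returned = [
    {"key": 1390267526000, "doc_count": 12},
    {"key": 1390267547000, "doc_count": 2},
]
filled = fill_requested_range(returned, 1390267432894, 1390267547037, 1000)
```

With these bounds the filled series starts at key 1390267432000 and ends at 1390267547000, so a chart gets a zero-valued point for every second that the response left out.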

@cmaitchison no worries... thank you for the bug report! important one!

uboness on 23 Jan 2014

I'm interested in hard range boundaries (returning empty buckets to fill gaps between from and to in the case of missing documents) as well. Is there an issue tracking this, or shall I raise one?

erikvanzijst on 28 Jan 2014

deanchen on 28 Apr 2015

👍4

I'm now experiencing the same issue as reported, running ES 1.6.0.

histogram = {
  intervals: {
    date_histogram: {
      field: 'called_at',
      interval: 'day',
      order: { _key: "asc" },
      min_doc_count: 0 # doesn't appear to have any impact on the final result.
    },
    aggs: stats
  }
}

taf2 on 18 Jun 2015

It looks like, when nesting a date_histogram within a terms aggregation, there is no way for min_doc_count to auto-fill the zero results.

aggs: {
  groups: {
    terms: {
      min_doc_count: 0,
      script: '...'
    },
    aggs: {
      intervals: {
        date_histogram: {
          field: 'called_at',
          interval: 'day',
          order: { _key: "asc" },
          min_doc_count: 0 # doesn't appear to have any impact on the final result.
        },
        aggs: stats
      }
    }
  }
}

taf2 on 18 Jun 2015

@taf2 please could you open an issue with a complete recreation which explains the problem?

clintongormley on 18 Jun 2015

Is this bug still there? I am trying to do the same exact thing as the OP right now.

quillan86 on 8 Apr 2020

👍1

me too! :)

vicapow on 21 Apr 2020

👍1

Same here. :)

mashahabi15 on 12 Aug 2020

👍1

Hi, I ran into the same issue, but it can be worked around by adding an extended_bounds object to the date_histogram aggregation, something like this:

{"extended_bounds":{"min":"+timeInit+","max":"+timeFin+"}}, where timeInit and timeFin are the same bounds specified in the range filter, in milliseconds.

I hope this can help somebody.
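
Editor's sketch: a minimal Python helper that builds a request body combining the range filter from earlier in the thread with extended_bounds set to the same limits. The helper name `histogram_with_bounds` and the aggregation names are illustrative. The `interval` and `from`/`to` keys mirror the queries in this thread; newer Elasticsearch releases may expect different parameter names, so treat this as an assumption to check against the documentation for your version.

```python
import json


def histogram_with_bounds(field, interval, from_ms, to_ms):
    """Build an aggregation body whose histogram spans the whole filtered range."""
    return {
        "aggs": {
            "filtered_events": {
                "filter": {"range": {field: {"from": from_ms, "to": to_ms}}},
                "aggs": {
                    "filtered_events_histogram": {
                        "date_histogram": {
                            "field": field,
                            "interval": interval,
                            "min_doc_count": 0,
                            # Reuse the filter limits so that empty buckets are
                            # requested for the whole range, not only between
                            # the earliest and latest matching documents.
                            "extended_bounds": {"min": from_ms, "max": to_ms},
                        }
                    }
                },
            }
        }
    }


body = histogram_with_bounds("@timestamp", "1s", 1390267500000, 1390267560000)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. Keeping the filter and the bounds in a single function avoids the two ranges drifting apart.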