Elasticsearch: Add "max_doc_count" to bucket aggregations

Created on 23 Jul 2016 · 4 Comments · Source: elastic/elasticsearch

The idea is to provide the opposite of min_doc_count (called min_count in some places) for bucket aggregations. The reasoning is likewise the opposite: to help find _missing_ data.

The simplest example is a consistency check: when you also use another data store, you often want to turn around and verify that you have all of the data. Looking for _missing_ data is hard otherwise, because you need a bucket selector, which in turn means you need a script.

GET /test/_search
{
  "size": 0,
  "aggs": {
    "find_missing_ids": {
      "histogram": {
        "field": "numeric_id",
        "interval": 1,
        "min_doc_count": 0
      },
      "aggs": {
        "max_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "_count"
            },
            "script": {
              "inline": "count == 0"
            }
          }
        }
      }
    }
  }
}

This does mean that there is a workaround to the feature, but it's much more verbose than "max_doc_count": 0 and inherently slower. Note: "min_doc_count": 0 was included just to be explicit -- that is currently the default.
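
As a rough illustration of how the verbosity could be hidden on the client side until something like "max_doc_count" exists, here is a minimal Python sketch that wraps any bucket aggregation with the bucket_selector workaround from the request above. The helper name with_max_doc_count and the sub-aggregation name max_doc_count_filter are made up for this example, and the script follows the syntax used in the request above, which may need adjusting for other Elasticsearch versions.

```python
import copy


def with_max_doc_count(bucket_agg, max_count):
    """Return a copy of bucket_agg that keeps only buckets with at most max_count docs.

    Hypothetical helper: it emulates the proposed "max_doc_count" option by
    attaching the bucket_selector workaround as a sub-aggregation.
    """
    if max_count < 0:
        raise ValueError("max_count must be >= 0")
    wrapped = copy.deepcopy(bucket_agg)  # leave the caller's dict untouched
    sub_aggs = wrapped.setdefault("aggs", {})
    sub_aggs["max_doc_count_filter"] = {  # name chosen for this sketch
        "bucket_selector": {
            "buckets_path": {"count": "_count"},
            # Same script style as the request above; syntax may vary by version.
            "script": {"inline": "count <= %d" % max_count},
        }
    }
    return wrapped


# Rebuild the "find missing ids" request from the issue.
body = {
    "size": 0,
    "aggs": {
        "find_missing_ids": with_max_doc_count(
            {"histogram": {"field": "numeric_id", "interval": 1, "min_doc_count": 0}},
            max_count=0,
        )
    },
}
```

This still runs a script on the server, so it only addresses the verbosity, not the performance concern.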

:AnalyticAggregations >enhancement discuss

pickypg

All 4 comments

> This does mean that there is a workaround to the feature, but it's much more verbose than "max_doc_count": 0 and inherently slower.

Sure, but how frequent is this use case? It sounds like a one off check that you might run once in a while. I don't think we should add extra options for infrequent use cases, especially when the same result can already be achieved.

clintongormley on 27 Jul 2016

> I don't think we should add extra options for infrequent use cases, especially when the same result can already be achieved.

It's a fair point, but I always find it weird when we offer a "min" without a "max", or vice versa. I suspect there are other use cases beyond verification, but I'm okay with just leaving this with the workaround above for now. If anyone drives by this with a different use case, then please feel free to add it here.

pickypg on 27 Jul 2016

Here's a use case: I have documents that should appear in pairs, and I want to find the ones that don't. Being able to use a "max_doc_count" of 1 would give me just what I need, with no additional processing required.

Scripting it might work in your sandbox, and might even work securely in ES5, but not everyone is there and it shouldn't be an excuse to ignore features (especially ones comparable to existing features).

webmstr on 21 Mar 2017

👍3
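
Where scripting is not an option, one possible stopgap is to filter the buckets on the client. The sketch below assumes the usual aggregation response layout (aggregations, then the aggregation name, then a buckets list whose entries carry key and doc_count); the function name buckets_at_most and the sample response are invented for illustration. It only sees buckets the server actually returned, so a terms aggregation with a limited size could hide low-count buckets.

```python
def buckets_at_most(response, agg_name, max_count):
    """Return keys of buckets in `agg_name` holding at most `max_count` documents."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [b["key"] for b in buckets if b["doc_count"] <= max_count]


# Made-up response: documents are expected in pairs, "b" is missing its partner.
sample_response = {
    "aggregations": {
        "pairs": {
            "buckets": [
                {"key": "a", "doc_count": 2},
                {"key": "b", "doc_count": 1},
                {"key": "c", "doc_count": 2},
            ]
        }
    }
}

unpaired = buckets_at_most(sample_response, "pairs", max_count=1)
print(unpaired)  # ['b']
```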

Another example: We'd like to find all 1-hit sessions on our web server so we'd like to have a sessionId bucket and filter where count(*) = 1. In SQL this would be something like:

SELECT sessionId, COUNT(*) FROM docs GROUP BY sessionId HAVING COUNT(*) = 1

cawoodm on 21 Jan 2019

👍3
