The idea is to provide the opposite of min_doc_count (called min_count in some places) for bucketed aggregations. The reasoning is similarly opposite: to help find _missing_ data.
The simplest example is a consistency check: when using another data store, you often want to turn around and verify that you have all of the data. Looking for _missing_ data is hard otherwise because you need to use a bucket selector, which means you need to use a script.
GET /test/_search
{
  "size": 0,
  "aggs": {
    "find_missing_ids": {
      "histogram": {
        "field": "numeric_id",
        "interval": 1,
        "min_doc_count": 0
      },
      "aggs": {
        "max_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "_count"
            },
            "script": {
              "inline": "count == 0"
            }
          }
        }
      }
    }
  }
}
This does mean that there is a workaround for the feature, but it's much more verbose than "max_doc_count": 0 and inherently slower. Note: "min_doc_count": 0 was included just to be explicit -- that is currently the default.
> This does mean that there is a workaround to the feature, but it's much more verbose than "max_doc_count": 0 and inherently slower.
Sure, but how frequent is this use case? It sounds like a one off check that you might run once in a while. I don't think we should add extra options for infrequent use cases, especially when the same result can already be achieved.
> I don't think we should add extra options for infrequent use cases, especially when the same result can already be achieved.
It's a fair point, but I always find it weird when we offer a "min" without a "max", or vice versa. I suspect there are other use cases beyond verification, but I'm okay with just leaving this with the workaround above for now. If anyone drives by this with a different use case, then please feel free to add it here.
Here's a use case: I have documents that should appear in pairs. I want to find when they don't appear in pairs. So, to be able to use "max_doc_count" of 1 would give me just what I need with no additional processing required.
Scripting it might work in your sandbox, and might even work securely in ES5, but not everyone is there and it shouldn't be an excuse to ignore features (especially ones comparable to existing features).
Another example: We'd like to find all 1-hit sessions on our web server so we'd like to have a sessionId bucket and filter where count(*) = 1. In SQL this would be something like:
SELECT sessionId, COUNT(*) FROM docs GROUP BY sessionId HAVING COUNT(*) = 1
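For reference, here is a rough sketch of how that HAVING clause could be approximated today with the bucket_selector workaround from earlier in this thread, swapping the histogram for a terms aggregation. Everything specific here is an assumption for illustration: the request body would go to the _search endpoint of whatever index holds the hits, sessionId is assumed to be a non-analyzed (keyword-style) field, the aggregation names are made up, and the "size" of 10000 is an arbitrary cap. The selector only filters the buckets the terms aggregation actually returns, so sessions beyond that cap would be missed. The script is written the same way as the example above; under Painless it would likely need to be "params.count == 1", and newer Elasticsearch versions spell the "inline" key as "source".

```json
{
  "size": 0,
  "aggs": {
    "sessions": {
      "terms": {
        "field": "sessionId",
        "size": 10000
      },
      "aggs": {
        "single_hit_only": {
          "bucket_selector": {
            "buckets_path": {
              "count": "_count"
            },
            "script": {
              "inline": "count == 1"
            }
          }
        }
      }
    }
  }
}
```

With a "max_doc_count": 1 option on the terms aggregation, the whole inner "aggs" block would presumably go away, which is the verbosity being argued about here.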