Elasticsearch: doc_count metric agg

Created on 8 Apr 2016 · 15 Comments · Source: elastic/elasticsearch

This might sound entirely trivial, but this would be a real boon for consistency in our aggregation parsing code. doc_count is currently a property of each bucket in a bucket agg, which makes it an exception when walking the aggregation tree.

If we could get a dedicated doc_count metric agg we could handle it the same way we handle every other metric agg, even if its only purpose is to duplicate the doc_count property of the bucket.
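To make the inconsistency concrete, here is a minimal client-side sketch of walking an aggregation response. It is not Kibana's actual parsing code: the function name `iter_metrics`, the sample response, and the agg names `by_i` and `avg_bytes` are hypothetical, and it assumes `buckets` is a list (keyed bucket aggs return an object instead).

```python
from typing import Any, Dict, Iterator, Tuple


def iter_metrics(bucket: Dict[str, Any], path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for one bucket of an aggregation response."""
    # Special case: doc_count is a bare number on the bucket, not a named agg.
    if "doc_count" in bucket:
        yield (f"{path}doc_count", bucket["doc_count"])
    for name, body in bucket.items():
        if not isinstance(body, dict):
            continue  # skips "key", "doc_count", and other scalar properties
        if "buckets" in body:
            # Bucket agg: recurse into each child bucket (assumes a list).
            for child in body["buckets"]:
                yield from iter_metrics(child, f"{path}{name}[{child.get('key')}].")
        elif "value" in body:
            # Single-value metric agg such as avg or sum.
            yield (f"{path}{name}", body["value"])


# Hypothetical response fragment, hand-written for illustration.
response_aggs = {
    "by_i": {
        "buckets": [
            {"key": 1, "doc_count": 200, "avg_bytes": {"value": 12.5}},
            {"key": 2, "doc_count": 50, "avg_bytes": {"value": 7.0}},
        ]
    }
}
metrics = list(iter_metrics(response_aggs))
```

With a dedicated doc_count metric agg, the first `if` branch could go away and every number would come through the `"value"` branch.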

:AnalyticAggregations >feature Analytics help wanted

All 15 comments

@kimchy here's that issue we talked about

Thanks @rashidkpc. @clintongormley, @colings86: is this something we can look into? It doesn't sound terribly tricky, and it would help a lot with a potential improvement to Visualize in Kibana.

Aren't there concerns about increasing the size of the response with duplicate data? For instance, a terms>doc_count aggregation would produce twice as much JSON as the terms aggregation alone, even though it doesn't add any information.

It will increase it, but on the other hand it will mean much simpler and more generic code for handling things in tools like Kibana. To me it is a question of consistency. We can reduce the size heavily with compression, for example, or with CBOR (or SMILE if we are lucky).

I understand the consistency argument. I just want to ensure it will be useful in the long term and that we won't go back to having special handling of the doc_count property in 6 months because the extra ease of consuming the response is not considered worth the increased size of the response (either because of network traffic or load on the parser on the client side).

Just so we're on the same page: for now, I'm not suggesting removing the doc_count property, just adding the count aggregation. I remember that we wanted to reduce the size of the response for something as common as doc count. I think we can rethink whether the property is still needed once we have the count agg?

@rashidkpc Are you looking for the response format of doc_count to be the same as the other aggregations or do you need the request format to be the same as well? If it's just the response format, could we not just change the format for how we output doc count so instead of:

"doc_count": 200

We output:

"doc_count": {
  "value": 200
}

This would still be done in the bucket itself rather than by adding a new aggregation type for it.

If you need it to be the same on the request side too, obviously this would not work.
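Until the server output changes, the proposed response shape can be approximated on the client. The sketch below is only an illustration of that idea; `wrap_doc_count` is a hypothetical helper, not part of any Elasticsearch client, and it assumes the bare `doc_count` property is always an integer.

```python
from typing import Any


def wrap_doc_count(node: Any) -> Any:
    """Return a copy of a response where each bare doc_count becomes {"value": n}."""
    if isinstance(node, list):
        return [wrap_doc_count(item) for item in node]
    if not isinstance(node, dict):
        return node
    out = {}
    for key, value in node.items():
        if key == "doc_count" and isinstance(value, int):
            # Bare bucket property: wrap it so it looks like a metric agg.
            out[key] = {"value": value}
        else:
            # Anything else (including an agg that happens to be named
            # doc_count, which is an object) is processed recursively.
            out[key] = wrap_doc_count(value)
    return out


# Hypothetical bucket list, hand-written for illustration.
original = {"buckets": [{"key": "a", "doc_count": 200}]}
wrapped = wrap_doc_count(original)
```

This only covers the response side; the count still cannot be named or requested like a metric agg.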

@colings86 We need it to function like a real metric agg. We need it to be nameable, but we also need to be able to use it at the top of the aggregation tree, e.g., outside of a bucket agg.

The other option is to use a filter agg and just not stick anything under it. Is there a performance drawback there?

GET /usagov*/_search
{
 "size": 0,
 "aggs": {
  "doc_count": {
   "filter": {"match_all": {}}
  }
 }
}

In this case we would end up with the property containing the value being called doc_count, but we already need different handling for most metric aggs, so it's not a big deal.
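A minimal sketch of how a client might build that request and read the count back. The helper names and the agg name `total_docs` are hypothetical, and the sample response is hand-written with an illustrative count, not real output.

```python
from typing import Any, Dict


def top_level_count_request(agg_name: str = "total_docs") -> Dict[str, Any]:
    """Build a search body whose only agg is an empty match_all filter agg."""
    return {"size": 0, "aggs": {agg_name: {"filter": {"match_all": {}}}}}


def read_filter_count(response: Dict[str, Any], agg_name: str = "total_docs") -> int:
    """A filter agg is a single bucket, so the count sits in its doc_count property."""
    return response["aggregations"][agg_name]["doc_count"]


body = top_level_count_request()

# Hand-written response fragment; 1234 is an illustrative number.
sample_response = {"aggregations": {"total_docs": {"doc_count": 1234}}}
count = read_filter_count(sample_response)
```

The value still arrives under `doc_count` rather than `value`, which is the extra handling mentioned above.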

Another possible alternative that will not require additional memory on the shards is to use the bucket_script pipeline aggregation. An example of this would be:

GET test/_search
{
  "size": 0, 
  "aggs": {
    "terms": {
      "terms": {
        "field": "i",
        "size": 10
      },
      "aggs": {
        "doc_count_agg": {
          "bucket_script": {
            "buckets_path": "_count",
            "script": {
              "inline": "_value",
              "lang": "expression"
            }
          }
        }
      }
    }
  }
}

Note that I have named the bucket_script aggregation doc_count_agg to avoid clashing with doc_count due to https://github.com/elastic/elasticsearch/issues/17652

Given that, would it make sense to add the syntactic sugar to just expose that bucket script as a dedicated doc_count agg?

We don't currently have the ability to add the syntactic sugar for this but we could add a rewrite phase to the AggregatorBuilders so we can rewrite aggregations like this. We would have to do this differently for the top-level doc_count than sub agg doc_counts since bucket_script only works on buckets and not at the top-level. For the top-level we'll need to implement it as a MatchAllDocs filter agg instead.
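The rewrite described here can be sketched on the client side. This is not the AggregatorBuilders rewrite phase itself: the `{"doc_count": {}}` request syntax and the `rewrite_aggs` function are hypothetical. The bucket_script body is copied from the example earlier in the thread. The sketch also assumes every nested doc_count sits directly under a multi-bucket agg, since bucket_script only works on buckets.

```python
import copy
from typing import Any, Dict

# Top-level replacement: a filter agg that matches all documents.
MATCH_ALL_FILTER = {"filter": {"match_all": {}}}

# Sub-agg replacement: the bucket_script from the example in this thread.
COUNT_BUCKET_SCRIPT = {
    "bucket_script": {
        "buckets_path": "_count",
        "script": {"inline": "_value", "lang": "expression"},
    }
}


def rewrite_aggs(aggs: Dict[str, Any], top_level: bool = True) -> Dict[str, Any]:
    """Expand the hypothetical {"doc_count": {}} sugar into existing agg types."""
    rewritten = {}
    for name, body in aggs.items():
        if "doc_count" in body:
            template = MATCH_ALL_FILTER if top_level else COUNT_BUCKET_SCRIPT
            rewritten[name] = copy.deepcopy(template)
            continue
        new_body = dict(body)  # shallow copy so the caller's request is untouched
        if "aggs" in body:
            new_body["aggs"] = rewrite_aggs(body["aggs"], top_level=False)
        rewritten[name] = new_body
    return rewritten


request_aggs = {
    "total": {"doc_count": {}},
    "by_i": {
        "terms": {"field": "i", "size": 10},
        "aggs": {"doc_count_agg": {"doc_count": {}}},
    },
}
expanded = rewrite_aggs(request_aggs)
```

The nested agg is named `doc_count_agg` rather than `doc_count` for the same reason given above (issue 17652).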

I opened https://github.com/elastic/elasticsearch/issues/17676 for the syntactic sugar feature

+1 from my side.
As a side note, value_count could potentially be optimized to return the doc count by using a shortcut such as value_count("_id") (which currently doesn't work since _id does not support field data) or value_count(1). This is basically similar to COUNT(1) or COUNT(*) in SQL.
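The reason the shortcut needs a field like `_id` is that value_count counts field values rather than documents. The following is a plain-Python simulation of that difference over hypothetical in-memory documents; it does not call Elasticsearch, and both function names merely mirror the concepts.

```python
from typing import Any, Dict, List

# Hypothetical documents: "tags" is multi-valued and missing from one doc.
docs: List[Dict[str, Any]] = [
    {"_id": "1", "tags": ["a", "b"]},
    {"_id": "2", "tags": ["a", "c"]},
    {"_id": "3"},
]


def doc_count(documents: List[Dict[str, Any]]) -> int:
    """Number of documents, like COUNT(*) in SQL."""
    return len(documents)


def value_count(documents: List[Dict[str, Any]], field: str) -> int:
    """Number of values found in `field` across all documents."""
    total = 0
    for doc in documents:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]  # treat a scalar as a single value
        total += len(values)
    return total


n_docs = doc_count(docs)
n_tag_values = value_count(docs, "tags")
n_id_values = value_count(docs, "_id")
```

Counting a field that every document has exactly once, such as `_id`, gives the same number as the document count, while a multi-valued or sparse field such as `tags` does not.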

@elastic/es-search-aggs

I'm going to close this, since it doesn't seem to have gained much traction in the last four years. Technically I think it would be pretty simple to build, but from a maintenance and prioritization perspective it doesn't seem like a good use of time.

If folks feel this would be useful, please leave a comment and we can always re-open! :)
