Elasticsearch: Support for precise cardinality aggregation

Created on 10 Jan 2016 · 16 comments · Source: elastic/elasticsearch

While the current cardinality aggregation is fast and supports very large datasets, it is not precise. There are many use cases where approximation is not acceptable and where the cardinality is sufficiently low (and known) to allow a simple deterministic algorithm to produce an exact result with reasonable time and memory use.

It would be very useful if Elasticsearch supported a secondary, simple algorithm that provides precise cardinality calculation, and let users decide which one to use depending on their needs.

:AnalyticAggregations feedback_needed

Most helpful comment

@clintongormley two issues

  1. No matter how high I set it, it was not 100% precise for me (the difference was 1 document), even when I set it higher than the actual cardinality.
  2. When I use it as a nested aggregation and set a fairly high threshold (because a few buckets in the parent agg are much bigger than the others), and the parent aggregation generates a fair number of buckets, I end up with the memory circuit breaker kicking in. So in the general case, when nesting etc., it can end up more memory intensive (because the threshold is fixed to a high number regardless of the size of each bucket) than a very trivial algorithm of accumulating unique values.

All 16 comments

@roytmana what about using the precision_threshold for these low cardinality fields? See https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-cardinality-aggregation.html#_precision_control
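For context, a minimal sketch of that suggestion, with made-up index and field names and assuming the pre-8.x Python client that accepts a raw request body. The cardinality aggregation takes a precision_threshold parameter; counts below the threshold are expected to be close to exact, but this is not a guarantee.

from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(
    index="my-index",          # hypothetical index name
    body={
        "size": 0,
        "aggs": {
            "distinct_values": {
                "cardinality": {
                    "field": "my_field",            # hypothetical field name
                    # 40000 is the documented maximum; counts below this
                    # should be close to exact, but are still approximate.
                    "precision_threshold": 40000
                }
            }
        }
    },
)
print(resp["aggregations"]["distinct_values"]["value"])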

@clintongormley two issues

  1. No matter how high I set it, it was not 100% precise for me (the difference was 1 document), even when I set it higher than the actual cardinality.
  2. When I use it as a nested aggregation and set a fairly high threshold (because a few buckets in the parent agg are much bigger than the others), and the parent aggregation generates a fair number of buckets, I end up with the memory circuit breaker kicking in. So in the general case, when nesting etc., it can end up more memory intensive (because the threshold is fixed to a high number regardless of the size of each bucket) than a very trivial algorithm of accumulating unique values.

Accurate cardinalities can't be reasonably computed in a distributed index, as that would require forwarding all unique values to the coordinating node. Since we don't want to have features that only work in a single-shard setup, I'm afraid this means we can't support this feature.

Just to check that this is not a bug in the current impl: how many unique values does your field have?

when nesting etc., it can end up more memory intensive (because the threshold is fixed to a high number regardless of the size of each bucket) than a very trivial algorithm of accumulating unique values

I'm interested in exploring this problem however. I opened #15892.

@jpountz I understand, but would like to make another plea :-) since my customers are rather unhappy about it.

Elasticsearch does forward shard search results and aggregations to the coordinating node for merging, so why not the unique values? If a user needs a precise count and that's the only way to get it, he/she would find the cost in processing/memory/networking acceptable and would make sure the cardinality is within a reasonable range before using this algorithm. As it stands, there is no way to do it at all, no matter what the cost.

Even if you make the existing algorithm more memory efficient, it still does not guarantee exactness, even with thresholds that appear to be higher than the actual cardinality. Is that the correct understanding (it is what I seem to observe, anyway)?

I know it is probably not practical to do but here is a silly question anyway:
In a single-node, multi-shard scenario, some algorithms, such as this one or a guaranteed result for the terms agg when sorted by count, could be entirely practical and fairly easy to implement. Would Elasticsearch consider supporting some better algorithms/guarantees for the single-node, multi-shard scenario? This scenario covers, in very many cases, the traditional data mart solution, where the amount of data is sizeable but not huge and fits a powerful multicore server just fine; yet typical data mart users need exact values in the majority of cases, so approximating algorithms, or a value missing from a terms aggregation because of a shard-count "conflict", are not an option.

Even if you make the existing algorithm more memory efficient, it still does not guarantee exactness, even with thresholds that appear to be higher than the actual cardinality

True. Actually, even below the threshold the count could be wrong; it's just much less likely, and the error should be low.

Would Elasticsearch consider supporting some better algorithms/guarantees for the single-node, multi-shard scenario?

We have a policy that all features that we implement need to work and scale in a distributed environment, so I'm afraid we would not want to do this.

For the record, it is possible to compute exact counts on the client side by initiating a scroll and maintaining a set of all unique field values that have been seen.
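A minimal sketch of that client-side approach, assuming the Python client and hypothetical index/field names: scroll over the matching documents and accumulate the field values in a set.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Pull only the field we care about and deduplicate on the client.
unique_values = set()
for hit in helpers.scan(
    es,
    index="my-index",                      # hypothetical index name
    query={
        "_source": ["my_field"],           # hypothetical field name
        "query": {"match_all": {}},
    },
):
    value = hit["_source"].get("my_field")
    if value is not None:
        unique_values.add(value)

print(len(unique_values))  # exact count, at the cost of streaming every document

This is exact, but every matching document travels over the wire, so it only makes sense when the result set is small enough to handle on the client.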

If I had to resort to that, I would rather run a terms aggregation on that field and then count the entries; at least that plays well as a sub-aggregation. It is ironic that Elasticsearch would support that but not an exact cardinality :-) I guess I can look into pipeline aggregations and see if I can replace the terms output with just the count.
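A sketch of that terms-based workaround, again with hypothetical names: request the terms with a size larger than the expected cardinality, count the buckets on the client, and check sum_other_doc_count to make sure nothing was cut off.

from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(
    index="my-index",                      # hypothetical index name
    body={
        "size": 0,
        "aggs": {
            "distinct_values": {
                "terms": {
                    "field": "my_field",   # hypothetical field name
                    "size": 100000         # must exceed the true cardinality
                }
            }
        }
    },
)
agg = resp["aggregations"]["distinct_values"]
if agg["sum_other_doc_count"] > 0:
    raise RuntimeError("terms size too small; some values were not returned")
print(len(agg["buckets"]))  # exact cardinality of my_field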

We have a policy that all features that we implement need to work and scale in a distributed environment, so I'm afraid we would not want to do this.

Closing

I have the same issue and would like to ask how the cardinality aggregation's precision_threshold works. Are you using a Bloom filter (or one of its variations)? Does precision_threshold stand for its size?

_Edit._ Here's an article on HyperLogLog++, which is used in ES. But I still don't understand what precision_threshold is. The article says "... for precision 14 and use LinearCounting to the left ...", so it's on the order of tens, not thousands. Should it be set to the maximal expected value of the aggregation?

@polkovnikov-ph the internal HLL++ precision parameter is a bit opaque, so we exposed something slightly more meaningful instead. Here is how we translate from precision_threshold to HLL++'s precision, if you are interested: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/aggregations/metrics/cardinality/HyperLogLogPlusPlus.java#L67

@jpountz what about making the precision_threshold max value configurable in a future ES release?

@jpountz We currently have the same issue, with one little caveat: we want to compute the cardinality of a field whose values are guaranteed to be on the same shard, because routing is based on it. More precisely, we want to compute the cardinality for each parent id.

I got a workaround for this case:

{
  "byFullListScripting": {
    "terms": {
      "field": "groupId",
      "shard_size": 2147483647,
      "size": 2147483647
    },
    "aggs": {
      "cntScripting": {
        "scripted_metric": {
          "map_script": "targetId='u'+doc['cntTargetId']; if (_agg[targetId] == null) { _agg[targetId] = 1}",
          "reduce_script": "map=[:]; for (a in _aggs){ map.putAll(a) }; return map.size()"
        }
      }
    }
  }
}

Some comments:

  1. Inline scripting must be enabled in the config file, or use a file script instead.
  2. If "cntTargetId" is a string rather than a number, it is not necessary to prepend 'u' to the id; the prefix just tells the script engine that our key is a map key rather than an array index.
  3. Unfortunately ES does not support ordering by a scripted sub-aggregation, so you need to sort the returned bucket list yourself.
  4. To get a precise aggregation you have to set shard_size and size to Integer.MAX_VALUE (2147483647) so that all the data is retrieved.
  5. It is a really expensive operation; use it at your own risk.

Same here. Without an alternative for exact cardinality, I would need to implement this as a terms aggregation and then count the results. Of course, only for fields that I know have a relatively small cardinality.

Just bumped into this. I wish I could get an exact count for situations where I know the cardinality is low. The workaround is to implement custom code that fetches the results and counts them.
While I understand that this can't scale, it seems the only way to get an exact number is to pass all those ids from ES to the application, on top of the work already happening inside ES just to get the count; that is not good either.

