Elasticsearch: Explore option of supporting more flexible search types

Created on 17 Jul 2015  ·  5 Comments  ·  Source: elastic/elasticsearch

Today we have query_then_fetch and query_and_fetch. This limits the types of search functionality we can support. For instance, if you want to auto-adjust the bucket interval so that your documents fit neatly into 10 buckets, you first need to determine the min and max values in order to calculate the correct interval (e.g. see https://github.com/elastic/elasticsearch/issues/9572 and https://github.com/elastic/elasticsearch/issues/9531).

This requires two round trips:

  • first determine the min/max values
  • calculate the required interval
  • do a second trip to bucket documents per interval
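A minimal sketch of these two round trips, assuming each shard is simulated as a plain Python list of numeric field values. The function names (`shard_min_max`, `shard_histogram`, `auto_histogram`) are hypothetical and are not part of any Elasticsearch API; the sketch only illustrates why the coordinating node needs the first trip's result before it can issue the second.

```python
def shard_min_max(values):
    """Round trip 1, run on each shard: report the local min and max."""
    return min(values), max(values)


def shard_histogram(values, lo, interval, num_buckets):
    """Round trip 2, run on each shard: count documents per bucket."""
    counts = [0] * num_buckets
    for v in values:
        # Clamp so the global max lands in the last bucket instead of overflowing.
        idx = min(int((v - lo) / interval), num_buckets - 1)
        counts[idx] += 1
    return counts


def auto_histogram(shards, num_buckets=10):
    """Coordinating-node logic (hypothetical): two trips over all shards."""
    # Round trip 1: gather per-shard min/max, then reduce to global values.
    ranges = [shard_min_max(s) for s in shards if s]
    lo = min(r[0] for r in ranges)
    hi = max(r[1] for r in ranges)

    # Calculate the interval that spreads [lo, hi] over num_buckets buckets.
    # Fall back to 1.0 when every value is identical.
    interval = (hi - lo) / num_buckets or 1.0

    # Round trip 2: each shard buckets its documents with the shared interval,
    # and the coordinating node sums the per-shard counts.
    merged = [0] * num_buckets
    for s in shards:
        for i, c in enumerate(shard_histogram(s, lo, interval, num_buckets)):
            merged[i] += c
    return lo, interval, merged


if __name__ == "__main__":
    shards = [[0, 5, 10], [20, 50], [99, 100]]
    print(auto_histogram(shards))
```

The interval cannot be known until every shard has reported its range, which is why this does not fit into a single query phase.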

Or to improve term count accuracy in a terms agg, you could:

  • retrieve e.g. the top 20 terms from each shard
  • choose the top 10 overall
  • do a second trip (if needed) to get accurate counts for all terms
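The count-accuracy round could be sketched as follows, again assuming shards are simulated as Python lists of terms and that all helper names are hypothetical. Note that this only corrects the counts of the terms chosen after the first trip; it does not guarantee that those terms are the true overall top terms.

```python
from collections import Counter


def shard_top_terms(shard, shard_size):
    """Trip 1, run on each shard: local top `shard_size` terms with local counts."""
    return Counter(shard).most_common(shard_size)


def shard_exact_counts(shard, terms):
    """Trip 2, run on each shard: exact local counts for the requested terms."""
    counts = Counter(shard)
    return {t: counts[t] for t in terms}


def two_phase_terms(shards, size=10, shard_size=20):
    """Coordinating-node logic (hypothetical sketch)."""
    # Trip 1: merge the truncated per-shard lists. These counts may be too low,
    # because a term can fall outside the top `shard_size` on some shards.
    approx = Counter()
    for shard in shards:
        for term, count in shard_top_terms(shard, shard_size):
            approx[term] += count

    # Choose the overall top `size` terms from the approximate counts.
    candidates = [t for t, _ in approx.most_common(size)]

    # Trip 2: ask every shard for exact counts of just those terms.
    exact = Counter()
    for shard in shards:
        exact.update(shard_exact_counts(shard, candidates))
    return exact.most_common()


if __name__ == "__main__":
    shard_a = ["a"] * 5 + ["b"] * 6 + ["c"] * 1
    shard_b = ["a"] * 5 + ["c"] * 4 + ["b"] * 3
    print(two_phase_terms([shard_a, shard_b], size=2, shard_size=2))
```

In the usage example, "b" is missing from the second shard's truncated list, so the first trip undercounts it; the second trip recovers the three documents that were missed.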

Or to guarantee that you get the top 10 terms overall:

  • first trip retrieves the top 20 terms per shard
  • calculate the overall top 10
  • take the doc count of the 10th term -> 10th_count
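  • second trip retrieves, from each shard, all terms whose doc count on that shard is at least 10th_count / num_shards (any term with an overall count of at least 10th_count must reach this threshold on at least one shard)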
  • third trip calculates accurate counts for all the terms returned by the second trip
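The three trips above can be sketched like this. It is an illustration under stated assumptions, not how Elasticsearch implements anything: shards are simulated as Python lists of terms, a `Counter` stands in for each shard's term index, and `guaranteed_top_terms` is a hypothetical name.

```python
from collections import Counter


def guaranteed_top_terms(shards, size=10, shard_size=20):
    """Coordinating-node logic (hypothetical sketch) for the three-trip scheme."""
    # Stand-in for each shard's term index: exact local counts per term.
    local = [Counter(s) for s in shards]

    # Trip 1: each shard returns its local top `shard_size` terms.
    approx = Counter()
    for counts in local:
        approx.update(dict(counts.most_common(shard_size)))
    top = approx.most_common(size)
    if not top:
        return []

    # Merged counts are lower bounds on the true counts, so the count of the
    # last merged term is a lower bound on the true count of the size-th term.
    nth_count = top[-1][1]
    threshold = nth_count / len(shards)

    # Trip 2: each shard returns every term whose local count reaches the
    # threshold. A term with an overall count >= nth_count must reach
    # nth_count / num_shards on at least one shard.
    candidates = set()
    for counts in local:
        candidates.update(t for t, c in counts.items() if c >= threshold)

    # Trip 3: exact counts for all candidates, summed across shards.
    exact = Counter()
    for counts in local:
        exact.update({t: counts[t] for t in candidates})
    return exact.most_common(size)


if __name__ == "__main__":
    shards = [
        ["a"] * 10 + ["b"] * 9 + ["x"] * 8,
        ["a"] * 10 + ["c"] * 9 + ["x"] * 8,
        ["a"] * 10 + ["d"] * 9 + ["x"] * 8,
    ]
    print(guaranteed_top_terms(shards, size=2, shard_size=2))
```

In the usage example, "x" is third on every shard and never appears in the first trip, yet it is the second most frequent term overall. The threshold in the second trip is what brings it back as a candidate.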

Multiple search phases would also help with clustering algorithms.

:Analytics/Aggregations :Search/Search >enhancement Meta Analytics Search high hanging fruit

All 5 comments

https://github.com/elastic/elasticsearch/issues/10217 will be required before we do this: decisions on how many phases are required will need to be made on the coordinating node, so the query needs to be parsed there first.

Also, this could get very complex. Term count accuracy would require re-running the parent aggregations in the accuracy round to get the right context (the right documents) for the terms aggregation to work on. The sub-aggregations would also have to run in the accuracy round (and not in the initial round) to get the right values. This gets even more complex if multiple terms aggregations are nested, all with accuracy set to true.

We're seeing the same problem mentioned in https://github.com/elastic/elasticsearch/issues/1305 that was closed since facets were deprecated, and we're using terms aggregations. We have a pretty complex setup with multiple shards and replicas per index, and the field being aggregated is a nested document.

When we do the terms aggregation we often see buckets with wrong counts, or even no buckets returned at all. If we change the terms aggregation to a filter aggregation looking for a specific value in the nested document that should result in a bucket, we get hits returned. Note that we're not looking for "top X" buckets, just returning all buckets and trying to get an accurate count.

I believe our queries were fine up until a couple of weeks ago, so perhaps there's a shard/routing/etc. setting that causes this to happen? Otherwise, please add my +1 to the request for a parameter to force accurate results, even though execution would be slower.

@clintongormley do you think this could now be closed since we have the composite aggregation?

@colings86 these changes are all about the top-n results, which you can't get with the composite agg without retrieving all results. I think these requests are still valid.

@elastic/es-search-aggs
