Elasticsearch: Explore option of supporting more flexible search types

Created on 17 Jul 2015 · 5 Comments · Source: elastic/elasticsearch

Today we have only the query_then_fetch and query_and_fetch search types. This limits the kinds of search functionality we can support. For instance, if you want to auto-adjust the bucket interval so that your documents fit neatly into 10 buckets, you first need to determine the min and max values in order to calculate the correct interval (e.g. see https://github.com/elastic/elasticsearch/issues/9572 and https://github.com/elastic/elasticsearch/issues/9531).

This requires two round trips:

first determine the min/max values
calculate the required interval
do a second trip to bucket documents per interval
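
The two trips above can be sketched in plain Python. This is only an illustration, not Elasticsearch code: each shard is modelled as an in-memory list of numeric values, and the function names (`phase_one_min_max`, `choose_interval`, `phase_two_bucket`) are hypothetical.

```python
from collections import Counter

def phase_one_min_max(shards):
    """First trip: each shard reports its local min/max; the coordinator merges them."""
    local = [(min(s), max(s)) for s in shards if s]
    return min(lo for lo, _ in local), max(hi for _, hi in local)

def choose_interval(lo, hi, target_buckets=10):
    """Coordinator step: pick an interval so the value range spans target_buckets."""
    span = hi - lo
    return span / target_buckets if span > 0 else 1.0

def phase_two_bucket(shards, lo, interval, target_buckets=10):
    """Second trip: every shard buckets its values using the shared interval."""
    counts = Counter()
    for shard in shards:
        for value in shard:
            # Clamp so the global max lands in the last bucket, not an 11th one.
            idx = min(int((value - lo) // interval), target_buckets - 1)
            counts[idx] += 1
    return counts

shards = [[0, 5, 10], [50, 99, 100]]  # made-up data for two shards
lo, hi = phase_one_min_max(shards)
interval = choose_interval(lo, hi)
buckets = phase_two_bucket(shards, lo, interval)
```

The point of the sketch is that the interval cannot be chosen until the merged min/max is known on the coordinating node, which is why a single query phase is not enough.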

Or to improve term count accuracy in a terms agg, you could:

retrieve, e.g., the top 20 terms from each shard
choose the top 10 overall
do a second trip (if needed) to get accurate counts for all terms
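
A minimal sketch of that accuracy round, again with hypothetical names and with each shard modelled as a plain dict of term to local doc count (an assumption for illustration, not how shards actually store terms):

```python
from collections import Counter

def shard_top_terms(shard_counts, shard_size):
    """What one shard returns in the first trip: only its local top terms."""
    return dict(Counter(shard_counts).most_common(shard_size))

def pick_candidates(shards, size, shard_size):
    """Coordinator: merge the per-shard tops and keep the overall top `size`."""
    merged = Counter()
    for shard in shards:
        merged.update(shard_top_terms(shard, shard_size))
    return merged.most_common(size)  # counts here may be too low

def exact_counts(shards, terms):
    """Second trip: ask every shard for its count of each chosen term."""
    return {t: sum(shard.get(t, 0) for shard in shards) for t in terms}

# Made-up data: "c" is cut off on the first shard when shard_size is 2.
shards = [{"a": 10, "b": 9, "c": 8}, {"a": 10, "c": 12, "b": 1}]
approx = pick_candidates(shards, size=2, shard_size=2)
exact = exact_counts(shards, [term for term, _ in approx])
```

In this example the first trip reports `c` with a count of 12, while the second trip returns its true count of 20. Note that this only corrects the counts of the terms already chosen; it does not guarantee that the chosen terms are the true top terms.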

Or to guarantee that you get the top 10 terms overall:

first trip retrieves the top 20 terms per shard
calculate the overall top 10
take the doc count of the 10th term -> 10th_count
second trip retrieves all terms that have at least 10th_count / num_shards
third trip calculates accurate counts for all the terms returned by the second trip
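
The three trips can be sketched as follows, under the same assumptions as before (shards as plain dicts of term to local count, hypothetical function name, `shard_size >= size`). The threshold works because the 10th-highest count from the first trip can only underestimate the true 10th-highest count, so any term in the true top 10 must reach at least `10th_count / num_shards` on at least one shard.

```python
from collections import Counter

def guaranteed_top_terms(shards, size=10, shard_size=20):
    # Trip 1: merge each shard's local top terms and take the overall top `size`.
    merged = Counter()
    for shard in shards:
        merged.update(dict(Counter(shard).most_common(shard_size)))
    top = merged.most_common(size)
    if not top:
        return []
    nth_count = top[-1][1]  # doc count of the last term that made the cut
    threshold = nth_count / len(shards)

    # Trip 2: each shard returns every term whose local count meets the threshold.
    candidates = set()
    for shard in shards:
        candidates.update(t for t, c in shard.items() if c >= threshold)

    # Trip 3: fetch accurate counts for all candidates, then rank them.
    exact = Counter({t: sum(s.get(t, 0) for s in shards) for t in candidates})
    return exact.most_common(size)

# Made-up data: "x" is never a local winner, yet it is the overall top term.
shards = [{"a": 5, "x": 4}, {"b": 5, "x": 4}, {"c": 5, "x": 4}]
result = guaranteed_top_terms(shards, size=1, shard_size=1)
```

With `size=1` and `shard_size=1`, the first trip alone would never see `x`, but the threshold in the second trip pulls it in and the third trip ranks it first with a count of 12.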

Multiple search phases would also help with clustering algorithms.

:Analytics/Aggregations :Search/Search >enhancement Meta Analytics Search high hanging fruit

clintongormley

👍5

Most helpful comment

We're seeing the same problem mentioned in https://github.com/elastic/elasticsearch/issues/1305 that was closed since facets were deprecated, and we're using terms aggregations. We have a pretty complex setup with multiple shards and replicas per index, and the field being aggregated is a nested document.

When we do the terms aggregation we often see buckets with wrong counts, or even no buckets returned at all. If we change the terms aggregation to a filter aggregation looking for a specific value in the nested document that should result in a bucket, we get hits returned. Note that we're not looking for "top X" buckets, just returning all buckets and trying to get an accurate count.

I believe our queries were fine up until a couple of weeks ago, so perhaps there's a shard/routing/etc. setting that causes this to happen? Otherwise, please add my +1 to the request for a parameter to force accurate results, even though execution would be slower.

brettlyman on 6 Jun 2016

👍10

All 5 comments

https://github.com/elastic/elasticsearch/issues/10217 will be required before we do this: decisions on how many phases are required will need to be made on the coordinating node, so the query needs to be parsed there first.

Also, this could get very complex: term count accuracy would require re-running the parent aggregations to get the right context (the right documents) for the terms aggregation to work on in the accuracy round. It would also require running the sub-aggregations in the accuracy round (and not in the initial round) to get the right values for the sub-aggregations. This gets even more complex if multiple terms aggregations are nested, all with accuracy set to true.

colings86 on 24 Jul 2015

@clintongormley do you think this could now be closed since we have the composite aggregation?

colings86 on 13 Mar 2018

@colings86 these changes are all about the top-n results, which you can't get with the composite agg without retrieving all results. I think these requests are still valid.

clintongormley on 13 Mar 2018

@elastic/es-search-aggs

colings86 on 13 Mar 2018
