Elasticsearch: Make `dfs_query_then_fetch` the default search type

Created on 19 Sep 2016 · 5 Comments · Source: elastic/elasticsearch

The fact that statistics are not distributed by default causes some confusion; see e.g. #20552. Maybe we should make `dfs_query_then_fetch` the default search type and still give users the ability to go back to `query_then_fetch` if they know they don't need a round trip for distributed statistics.
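To illustrate why per-shard statistics can be confusing, here is a minimal sketch that compares the inverse document frequency (IDF) of one term computed per shard against the IDF computed from statistics summed across shards. The shard sizes and document frequencies are made-up numbers, and the BM25-style IDF formula is chosen only for illustration; it is not meant to reproduce what any particular Elasticsearch version computes.

```python
import math


def idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style inverse document frequency (illustrative choice of formula)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical shards: name -> (documents in shard, documents containing the term)
shards = {"shard_0": (1000, 2), "shard_1": (1000, 200)}

# query_then_fetch: every shard scores with its own local statistics,
# so the same term gets a different weight depending on the shard.
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# dfs_query_then_fetch: statistics are summed across shards first,
# so every shard scores the term with the same weight.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_df, total_docs)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.2f}")
print(f"global idf = {global_idf:.2f}")
```

With skewed data like this, a document on the shard where the term is rare gets a much higher term weight than an otherwise identical document on the other shard, which is the kind of surprise the extra DFS round trip removes.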

The impact should be negligible; see e.g. issue #7630, where `search_type=count` and `size=0` were reported to be equally fast even though `size=0` did one more round trip. (Things have improved since then: there is no longer a round trip for stored fields when `size=0`.)


All 5 comments

The count/size:0 round trip is quite different from a DFS round trip. Imagine a query with thousands of terms that need to go out to 100 shards... that could be very heavy, no?

To nuance the question a bit - these days we can detect in advance that _score is not needed (for example we sort by time) and keep on doing what we do today. So the question becomes whether we should default to always collect term stats if relevant and let people approve the current approximation of per-shard stats explicitly.
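A rough sketch of that idea, written from the client side: decide whether a search body needs relevance scores, and only request the DFS round trip when it does. The `needs_scores` and `search_params` helpers are hypothetical names, and the heuristic is deliberately simplified (it ignores, for example, aggregations or scripts that read `_score`); the only things taken from Elasticsearch are the `search_type` request parameter, its two values, and the `sort` and `track_scores` body fields.

```python
from urllib.parse import urlencode


def needs_scores(body: dict) -> bool:
    """Hypothetical heuristic: does this search body rely on relevance scores?"""
    if body.get("track_scores"):
        return True
    sort = body.get("sort")
    if not sort:
        return True  # no explicit sort: hits are ranked by _score
    if not isinstance(sort, list):
        sort = [sort]
    for clause in sort:
        # A sort clause is either a field name or a {field: options} mapping.
        field = clause if isinstance(clause, str) else next(iter(clause))
        if field == "_score":
            return True
    return False


def search_params(body: dict) -> str:
    """Build the query string, asking for distributed stats only when scores matter."""
    search_type = "dfs_query_then_fetch" if needs_scores(body) else "query_then_fetch"
    return urlencode({"search_type": search_type})


print(search_params({"query": {"match": {"title": "foo"}}}))
print(search_params({"query": {"match_all": {}}, "sort": [{"timestamp": "desc"}]}))
```

The heuristic errs on the side of collecting term statistics: anything it cannot prove to be score-free gets the DFS phase.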

> Imagine a query with thousands of terms that need to go out to 100 shards... that could be very heavy, no?

I don't think this would be an issue in most cases: you would need many rare scoring terms, with a query that looks something like `rare_term_1 OR ... OR rare_term_100`. (If the terms were frequent, the bottleneck would shift from looking up terms to reading postings lists, so the DFS phase would not be an issue.)

The most costly bit of the DFS phase is the lookup of term statistics for each individual term that occurs in the query and is used for scoring. Disk access should not be an issue, since we use the same shard copy for the DFS phase and for executing the query: data that we used in the DFS phase will be in the FS cache when searching. However, it is true that we will need to look up each term twice in the FST+BlockTree. I'm not sure this would be an issue in practice, but if we think it might be, we could save the TermState of each query term in the DFS phase so that each term would only be looked up once overall.
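The saving from keeping the per-term state can be sketched with a toy model that simply counts terms-dictionary lookups. Everything here is a stand-in: `ToyTermsDictionary`, `dfs_phase`, and `query_phase` are hypothetical names, and the saved "state" is just a document frequency, whereas a real Lucene TermState holds more than that.

```python
class ToyTermsDictionary:
    """Stand-in for a shard's terms dictionary that counts lookups."""

    def __init__(self, doc_freqs):
        self._doc_freqs = doc_freqs
        self.lookups = 0

    def lookup(self, term):
        self.lookups += 1
        return self._doc_freqs.get(term, 0)


def dfs_phase(terms_dict, terms):
    # Look up every scoring term once and return the results for later reuse.
    return {term: terms_dict.lookup(term) for term in terms}


def query_phase(terms_dict, terms, saved_states=None):
    # Reuse the saved state when available, otherwise look the term up again.
    saved_states = saved_states or {}
    return {
        term: saved_states[term] if term in saved_states else terms_dict.lookup(term)
        for term in terms
    }


terms = ["rare_term_1", "rare_term_2", "rare_term_3"]
doc_freqs = {"rare_term_1": 3, "rare_term_2": 1}

# Without reuse: each term is looked up in both phases.
without_reuse = ToyTermsDictionary(doc_freqs)
dfs_phase(without_reuse, terms)
query_phase(without_reuse, terms)

# With reuse: the query phase consumes what the DFS phase already found.
with_reuse = ToyTermsDictionary(doc_freqs)
saved = dfs_phase(with_reuse, terms)
query_phase(with_reuse, terms, saved)

print(without_reuse.lookups, with_reuse.lookups)
```

In this model the number of lookups drops from two per term to one per term, which is the effect described above; whether that matters in practice is exactly what the benchmarking would need to show.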

Discussed in FixitFriday: let's do more benchmarking, especially with terms-dictionary-intensive queries, before making a decision.

@elastic/es-search-aggs
