Elasticsearch: Can we use top hits optimizations when sorting by a field?

Created on 31 Dec 2018  路  15Comments  路  Source: elastic/elasticsearch

Currently optimizations for collection of top documents are only applicable when sorting by score or by a sort order that is congruent with the index-time sort. Given that fields are most of time both indexed and doc-valued, maybe we could optimize sorting in an even greater number of cases.

For instance imagine that a user searches for top documents matching query Q sorted by field @timestamp of type date in descending order. Maybe we could internally rewrite the query under the hood to SHOULD(LongDistanceFeatureQuery(field=@timestamp, origin=Long.MAX_VALUE)) FILTER(Q) and prepend _score to the sort: [ "_score", { "@timestamp": { "order": "desc" } } ].

There are a number of details to sort out, like we could only do that it the field is indexed and we would need to rewrite the produced TopFieldDocs so that they don't include a sort value for the score and maybe take the maximum value of the field rather than Long.MAX_VALUE to avoid defeating the optimization because most scores would be equal after casting to a float, but overall it looks to me like it could help optimize almost all sorted queries without requiring index-time decisions?

:SearcSearch >enhancement

All 15 comments

Pinging @elastic/es-search

@jpountz Do we want to expose LongDistanceFeatureQuery in elasticsearch and create a corresponding QueryBuilder for it? For example :

{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "long_distance_rank_feature": {
            "field": "timestamp",
            "boost": 2,
            "pivot" : 10,
            "origin" : 1550175141321
          }
        }
      ]
    }
  }
}

I think we should (#33382). But for this particular issue the idea is to rewrite the search request _under the hood_ to something that returns the same hits but runs faster.

@jpountz thanks for reminding about #33382.

But for this particular issue the idea is to rewrite the search request _under the hood_ to something that returns the same hits but runs faster.

Right, I was trying to do rewrite in SearchSourceBuilder::rewrite method, and noticed first we will need a QueryBuilder for LongDistanceFeatureQuery.

final long origin = (sort.order() == SortOrder.DESC) ? Long.MAX_VALUE : Long.MIN_VALUE;
final long pivotDistance = 5;  // ? not sure what to choose for pivot distance
Query query = LongPoint.newDistanceFeatureQuery(fieldName, 1, origin, pivotDistance);

BoolQueryBuilder rewritten = new BoolQueryBuilder();
rewritten.should();  // need a QueryBuilder for DistanceFeatureQuery to insert it here
rewritten.filter(queryBuilder); // filter for original query

I guess the plan is first to implement #33382

do rewrite in SearchSourceBuilder::rewrite

In my opinion one challenge with this optimization is that we also need to modify the produced FieldDoc.fields to remove the first element. I suspect implementing this optimization in SearchSourceBuilder#rewrite is going to make it challenging to also modify the produced TopFieldDocs when appropriate. It might have downsides that I haven't though about, but my initial idea was to do it in QueryPhase#execute by modifying the query before execute runs, and modifying the produced top docs before they are set on the search context.

not sure what to choose for pivot distance

The tricky bit with origin and pivotDistance is that we should pick values that make it unlikely that different long values map to the same float score. We'd probably need to use either the min or max value on the shard as the origin, and something in the order of (max-min)/2 for the pivot value.

I guess the plan is first to implement #33382

That works for me if that makes things easier.

I was linked to this issue by very helpful people at the elasticon ama booth.

Do I understand correctly that this enhancement would mean I could avoid the big sort-all-matching-documents operation when I want to sort by the value from a single field?

That would be a massive performance improvement for me!

@dvisztempacct You got it right.

@jpountz thank you! I'm very much looking forward to this!

@jpountz I have implemented a draft PR for the query conversion, but encountered the following problem:

Setting origin to Long.MAX_VALUE or Long.MIN_VALUE, when documents' fields have small long values produces very tiny scores that they become undifferentiable:

For example, if my index has 3 docs with values: 111, 555, 999, I will get these results:
origin: 9223372036854775807
pivot: 444
As score= pivot / (pivot + |origin - docValue|), and since origin is huge, scores become super tiny - close to 0 and undifferentiable.

"max_score": 4.8138576E-17,
    "hits": [
      {
        "_index": "index1",
        "_type": "_doc",
        "_id": "1",
        "_score": 4.8138576E-17,
        "_source": {
          "my_text": "text1",
          "my_long": 555,
          "my_int": 555
        }
      },
      {
        "_index": "index1",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.8138576E-17,
        "_source": {
          "my_text": "text1",
          "my_long": 999,
          "my_int": 999
        }
      },
      {
        "_index": "index1",
        "_type": "_doc",
        "_id": "3",
        "_score": 4.8138576E-17,
        "_source": {
          "my_text": "text1",
          "my_long": 111,
          "my_int": 111
        }
      }
    ]

We need to set origin differently. Maximum value for the index on all shards? @jpountz what do you think?

Also I am not sure that it is right that pivot values are different on different shards.

I had mentioned this issue in the description of this issue:

There are a number of details to sort out, like [...] maybe tak[ing] the maximum value of the field rather than Long.MAX_VALUE to avoid defeating the optimization because most scores would be equal after casting to a float

We need to set origin differently. Maximum value for the index on all shards? @jpountz what do you think?

+1 to try it out. For what it's worth, not all shards need to have the same values for origin and pivot since we won't return the produced scores to the coordinating node anyway.

Hi all :)

I see that PR #39770 has been merged!

Any idea when / which release this will be a part of?

But I notice that it was merged into a feature branch, and it does not appear that that branch has been merged into master.

What does the roadmap look like for this?

Thanks!

@dvisztempacct Hi Don, we don't know yet when it will be released. The idea looks promising but it also has some worst-case scenarios that makes sorting perform slower than a naive sort, for instance if you search for the most recent documents and the index is ordered by increasing timestamp (either explicitly via index sorting, or implicitly because index order correlates with timestamp - which is typical when indexing logs). We have a couple ideas to work around this such as shuffling the order of segments at collection time or disabling the optimization based on heuristics, that we need to explore further.

@jpountz Thanks for the update :)

In the meantime, is there another way to accomplish this kind of optimization? Right now for a large number of matching documents, my relational databases are much faster options than ES for some of my queries.

Any updates? :D

Hi Don, sorry that it took so long but we made good progress over the past month even though some bits are still missing. We're now focusing on a change in Lucene to allow sharing the minimum score between leaf collectors but that's not a simple one so we don't know when it will be ready yet. We'll update this issue as soon as we have something in Lucene. Stay tuned ;)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

clintongormley picture clintongormley  路  3Comments

rjernst picture rjernst  路  3Comments

ppf2 picture ppf2  路  3Comments

dadoonet picture dadoonet  路  3Comments

malpani picture malpani  路  3Comments