Elasticsearch: Indexing: index-time sorting

Created on 3 Jul 2014 · 26Comments · Source: elastic/elasticsearch

Lucene has index-time sorting and early query termination capabilities. It would be nice to integrate it into Elasticsearch in order to speed up queries whose sort order matches the index order. This presentation contains some information about these features and their implementation in Lucene.

:CorInfrCore >feature

Source

jpountz

Most helpful comment

We are slowly making progress: index sorting is going to be better integrated into Lucene in the upcoming 6.2 release https://issues.apache.org/jira/browse/LUCENE-6766 so when 6.2 is out we can start working on the integration. This will take time so don't expect it to be implemented and automatically terminate queries early etc. soon, but there is definitely progress being made.

jpountz on 19 Jul 2016

👍4 🎉1

All 26 comments

I think this is an awesome feature for specific problems. Yet I think we have to make sure that this never ever corrupts an index during a merge. IMO we should restrict this to numeric fields and allow only asc / desc and that field is automatically marked as using on-disk docvalues to make sure we don't fail the merge due to OOM.

s1monw on 4 Jul 2014

❤2

why limit to numeric? whether a field is numeric or not is completely unrelated to this feature: sorry that just doesn't make sense. Since lucene 4.8 you can supply any arbitrary Sort here.

rmuir on 4 Jul 2014

I am super concerned about memory consumption so I should have rather said restricted to on-disk DV. I don't think it's unrelated

s1monw on 4 Jul 2014

my question was with regards to string fields though. I think those are ok too (as well as multiple-level sorts or whatever lucene has). If we want to limit it to DV to prevent traps in merging thats fine, but I think string is just as good as numeric.

Separately, the real issue here is how well early termination will work with situations like aggregations. I don't yet fully understand how this is working, but it looks like it simply plugs into collector. so you would get wrong aggregation counts with early termination if we aren't careful.

rmuir on 4 Jul 2014

+1 on allowing string doc values and multi-level sorts. We don't need to have them from the first iteration but having them eventually would be nice. Then we could early-terminate on any prefix on the sort that is configured at index time.

Even without aggregations, simple things like the total number of hits could only be approximated. I guess there would need to be a per-query flag to turn early termination on.

jpountz on 4 Jul 2014

This sounds like the PageRank use case, but there is another as well.

Sorting can be very useful for compression, to allow "clumps" of documents to be near each other (thus producing smaller deltas in postings). For example, let's say we have songs lyrics as our documents, and we sort by genre. If we restrict the search to a specific genre, we will quickly narrow into a small portion of the postings list, with small deltas. Also, if we do a search across all genres, but for a term which is highly correlated with a genre (e.g. "cowboy" for "Country"), then again we will have good compression and search speed for that term.

My point is just that you don't need early termination for this to be useful. It can help speed up queries if you just understand your data and the common types of queries you run.

rjernst on 8 Jul 2014

Algolia amazing speed is achieved through index-time ordering, and they do support typo-tolerance as well.
I think it's safe to say that this can very well be a major feature.

ofavre on 22 Aug 2014

Looking forward for this feature.

Vineeth-Mohan on 31 May 2015

Comment copied over from #8873:

Documents of different types tend to have different fields, which can lead to problems of sparsity when documents are stored in the order that they were indexed. If we sort documents by type, then missing values (eg in doc values or norms) can be represented very efficiently.

On merge, documents should be sorted by type (assuming there is more than one type). Additionally, for certain use cases, it can make sense to apply a secondary sort (eg on timestamp). This secondary level should be customizable per index, and should probably be dynamically updatable (although it would only have effect on later merges).

Relates to #8870

clintongormley on 6 Jun 2015

We probably want to have https://issues.apache.org/jira/browse/LUCENE-6766 first.

jpountz on 4 Sep 2015

@jpountz would this help in the typical ELK case where one indexes logs (older first, then newer ones) and typically short in the opposite order - first new ones, then old ones?

In other words, with a typical ELK case where logs are being indexed would the natural indexing order already result in ideally sorted index, or would one have to run something (whatever LUCENE-6766 and this ES issue expose) to inverse the sort order of such an index?

otisg on 5 Sep 2015

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

In my opinion, this would be more useful for pure search use-cases that only need to fetch the top docs and have a low to moderate ingestion rate.

jpountz on 5 Sep 2015

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

I suspect it'd be faster to answer "find me the last 10 things that match this query" which is pretty useful. You'd have to not care about the count of total matches and I'm not sure if that outweighs the index time cost. I don't know this feature well so I can speak to what that index time cost amounts to.

nik9000 on 7 Sep 2015

After reading the description I though it would be pretty easy to add the feature but then I realized that it would break the blockjoin/nested support. Seems like we would need another sorting merge policy which takes the nested document into account...

jimczi on 7 Dec 2015

@ofavre - I am curious how this can speed up algolia. They offer various sorting options and one of them looks like TF-IDF. AFAIK , this technique can only speed up sorting based on a single factor and not something dynamic.

Vineeth-Mohan on 11 Mar 2016

Any plans/updates on this issue? Using index time sorting would have huge performance impacts on queries like "give me 10 items with the highest rating" when having millions of them. Seems like Algolia is using exactly this approach: https://www.algolia.com/doc/faq/index-configuration/how-to-sort-the-results-with-a-specific-attribute

JanJakes on 19 Jul 2016

jpountz on 19 Jul 2016

👍4 🎉1

Hey, guys, is there any progress? Currently I need index-time sorting in my product, and I wonder if I can wait util you guys implement this feature or I have to do it myself first.

weiqiyiji on 17 Jan 2017

There has been good progress on the Lucene side, including a change @jimczi just pushed to Lucene 6.x (https://issues.apache.org/jira/browse/LUCENE-7579) for its future 6.5.0 release, to give a big boost to indexing performance with index-time sorting. You can see the impact of that change in the nightly sparse Lucene benchmarks: it's annotation V in this chart: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput

But we still have plenty of work to do to expose sorted indices in ES.

mikemccand on 17 Jan 2017

@mikemccand Ok, I'll watch this issue first. Thanks!

weiqiyiji on 17 Jan 2017

@mikemccand any branch working on this so we can take a look at it ?

makeyang on 27 Mar 2017

There is no branch yet.

jpountz on 27 Mar 2017

sorry to bother to propose some thoughts on this issue:

add mapping parameters: index_sort with values[no, asc, desc] for datatype[long,integer,short,byte,double,float] and index_sort_order with no-negative int value if and only if fields setting index_sort
these 2 parameters can't be modified once it is set for a index
index_sort_order is used to decide order of multi fields and 1 stands for the first sequence to decide sort
it should not allow null value for fields with mapping parameter index_sort set not no for easy to implement
it should not allow indices with no-zero-doc to add fields with index_sort parameter for easy to implement

makeyang on 31 Mar 2017

@makeyang thanks.
I am planning to add the index sorting spec in the index settings rather than on the index mapping since you can define an index sort based on multiple fields.
Anyway I am currently iterating on this feature and I'll add updates (and link to the PR when I open it) to this issue so stay tuned !

jimczi on 31 Mar 2017

❤3

Index time sorting has landed in master:
https://github.com/elastic/elasticsearch/pull/24055

There are still missing pieces that we need to add in order to make index sorting useful: