Elasticsearch: Indexing: index-time sorting

Created on 3 Jul 2014  路  26Comments  路  Source: elastic/elasticsearch

Lucene has index-time sorting and early query termination capabilities. It would be nice to integrate it into Elasticsearch in order to speed up queries whose sort order matches the index order. This presentation contains some information about these features and their implementation in Lucene.

:CorInfrCore >feature

Most helpful comment

We are slowly making progress: index sorting is going to be better integrated into Lucene in the upcoming 6.2 release https://issues.apache.org/jira/browse/LUCENE-6766 so when 6.2 is out we can start working on the integration. This will take time so don't expect it to be implemented and automatically terminate queries early etc. soon, but there is definitely progress being made.

All 26 comments

I think this is an awesome feature for specific problems. Yet I think we have to make sure that this never ever corrupts an index during a merge. IMO we should restrict this to numeric fields and allow only asc / desc and that field is automatically marked as using on-disk docvalues to make sure we don't fail the merge due to OOM.

why limit to numeric? whether a field is numeric or not is completely unrelated to this feature: sorry that just doesn't make sense. Since lucene 4.8 you can supply any arbitrary Sort here.

I am super concerned about memory consumption so I should have rather said restricted to on-disk DV. I don't think it's unrelated

my question was with regards to string fields though. I think those are ok too (as well as multiple-level sorts or whatever lucene has). If we want to limit it to DV to prevent traps in merging thats fine, but I think string is just as good as numeric.

Separately, the real issue here is how well early termination will work with situations like aggregations. I don't yet fully understand how this is working, but it looks like it simply plugs into collector. so you would get wrong aggregation counts with early termination if we aren't careful.

+1 on allowing string doc values and multi-level sorts. We don't need to have them from the first iteration but having them eventually would be nice. Then we could early-terminate on any prefix on the sort that is configured at index time.

Even without aggregations, simple things like the total number of hits could only be approximated. I guess there would need to be a per-query flag to turn early termination on.

This sounds like the PageRank use case, but there is another as well.

Sorting can be very useful for compression, to allow "clumps" of documents to be near each other (thus producing smaller deltas in postings). For example, let's say we have songs lyrics as our documents, and we sort by genre. If we restrict the search to a specific genre, we will quickly narrow into a small portion of the postings list, with small deltas. Also, if we do a search across all genres, but for a term which is highly correlated with a genre (e.g. "cowboy" for "Country"), then again we will have good compression and search speed for that term.

My point is just that you don't need early termination for this to be useful. It can help speed up queries if you just understand your data and the common types of queries you run.

Algolia amazing speed is achieved through index-time ordering, and they do support typo-tolerance as well.
I think it's safe to say that this can very well be a major feature.

Looking forward for this feature.

Comment copied over from #8873:

Documents of different types tend to have different fields, which can lead to problems of sparsity when documents are stored in the order that they were indexed. If we sort documents by type, then missing values (eg in doc values or norms) can be represented very efficiently.

On merge, documents should be sorted by type (assuming there is more than one type). Additionally, for certain use cases, it can make sense to apply a secondary sort (eg on timestamp). This secondary level should be customizable per index, and should probably be dynamically updatable (although it would only have effect on later merges).

Relates to #8870

We probably want to have https://issues.apache.org/jira/browse/LUCENE-6766 first.

@jpountz would this help in the typical ELK case where one indexes logs (older first, then newer ones) and typically short in the opposite order - first new ones, then old ones?

In other words, with a typical ELK case where logs are being indexed would the natural indexing order already result in ideally sorted index, or would one have to run something (whatever LUCENE-6766 and this ES issue expose) to inverse the sort order of such an index?

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

In my opinion, this would be more useful for pure search use-cases that only need to fetch the top docs and have a low to moderate ingestion rate.

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

I suspect it'd be faster to answer "find me the last 10 things that match this query" which is pretty useful. You'd have to not care about the count of total matches and I'm not sure if that outweighs the index time cost. I don't know this feature well so I can speak to what that index time cost amounts to.

After reading the description I though it would be pretty easy to add the feature but then I realized that it would break the blockjoin/nested support. Seems like we would need another sorting merge policy which takes the nested document into account...

@ofavre - I am curious how this can speed up algolia. They offer various sorting options and one of them looks like TF-IDF. AFAIK , this technique can only speed up sorting based on a single factor and not something dynamic.

Any plans/updates on this issue? Using index time sorting would have huge performance impacts on queries like "give me 10 items with the highest rating" when having millions of them. Seems like Algolia is using exactly this approach: https://www.algolia.com/doc/faq/index-configuration/how-to-sort-the-results-with-a-specific-attribute

We are slowly making progress: index sorting is going to be better integrated into Lucene in the upcoming 6.2 release https://issues.apache.org/jira/browse/LUCENE-6766 so when 6.2 is out we can start working on the integration. This will take time so don't expect it to be implemented and automatically terminate queries early etc. soon, but there is definitely progress being made.

Hey, guys, is there any progress? Currently I need index-time sorting in my product, and I wonder if I can wait util you guys implement this feature or I have to do it myself first.

There has been good progress on the Lucene side, including a change @jimczi just pushed to Lucene 6.x (https://issues.apache.org/jira/browse/LUCENE-7579) for its future 6.5.0 release, to give a big boost to indexing performance with index-time sorting. You can see the impact of that change in the nightly sparse Lucene benchmarks: it's annotation V in this chart: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput

But we still have plenty of work to do to expose sorted indices in ES.

@mikemccand Ok, I'll watch this issue first. Thanks!

@mikemccand any branch working on this so we can take a look at it ?

There is no branch yet.

sorry to bother to propose some thoughts on this issue:

  1. add mapping parameters: index_sort with values[no, asc, desc] for datatype[long,integer,short,byte,double,float] and index_sort_order with no-negative int value if and only if fields setting index_sort
  2. these 2 parameters can't be modified once it is set for a index
  3. index_sort_order is used to decide order of multi fields and 1 stands for the first sequence to decide sort
  4. it should not allow null value for fields with mapping parameter index_sort set not no for easy to implement
  5. it should not allow indices with no-zero-doc to add fields with index_sort parameter for easy to implement

@makeyang thanks.
I am planning to add the index sorting spec in the index settings rather than on the index mapping since you can define an index sort based on multiple fields.
Anyway I am currently iterating on this feature and I'll add updates (and link to the PR when I open it) to this issue so stay tuned !

Index time sorting has landed in master:
https://github.com/elastic/elasticsearch/pull/24055

There are still missing pieces that we need to add in order to make index sorting useful:

  • Documentation regarding when to (not) use index sorting and how

  • Early termination for queries that rely on the index sort

  • Faster sorted scroll queries when the query sort matches the index sort.

Index sorting is now fully supported in master.
The documentation for this feature can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html

We'll continue to add enhancements and features based on sorted indices but this issue can be closed.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

makeyang picture makeyang  路  3Comments

jpountz picture jpountz  路  3Comments

dadoonet picture dadoonet  路  3Comments

matthughes picture matthughes  路  3Comments

DhairyashilBhosale picture DhairyashilBhosale  路  3Comments