Elasticsearch: Avoid the fetch phase when retrieving only _id

Created on 17 Mar 2016 · 11 comments · Source: elastic/elasticsearch

Is it possible to avoid the fetch phase for a search and return only document IDs? Is _id available at the end of the query phase such that fetch would be redundant when all I want is the _id? Can this be done through the current API?

Ultimately, I'm hoping to retrieve doc IDs from a search much faster than what I've seen so far. I've tried all documented ways in all sorts of permutations to get better performance, and I've found no satisfactory results. The best I've achieved was only a 25% speed improvement by, in parallel, querying each of 5 shards individually. An acceptable speed would be 90% faster. It would help a lot to understand whether this is reasonable and why if it is not. It's very difficult to understand why I can be given a) the first 100 results, b) the total count and c) have them sorted so quickly, but retrieving the results is very slow.

Labels: :Search/Search, discuss

All 11 comments

Also, is there any possibility of improving performance for this (IDs only) scenario by developing a plugin? Are there any other options, documented or not that can reduce overhead?

Just to stress the importance of this, it would be crucial to our implementation and likely a deciding factor for our adoption of Elastic to replace our current massive persistence layer.

How many ids are you retrieving per request? If few then I am surprised that the fetch phase is taking so long, if many then I'm afraid elasticsearch is not the right tool for the job: this is something that regular databases are better at.

Returning few IDs is very fast. Returning 10k and up is slow. I'd like to understand why. Can you explain this? Also, I'd like to explore options for getting better performance. Could you provide some guidance or ideas on where to look developing performance improvements, e.g. plugin for Elastic, use Lucene directly? Why not try a query only (no fetch) search type?

I'd like to understand why.

The search phase fetches Lucene's doc IDs (integers), not Elasticsearch's IDs (strings). The fetch phase then looks up those doc IDs using Lucene's stored fields mechanism. Stored fields are stored together in compressed chunks, and since _source is a stored field, you have to decompress a lot of _source just to get to the _id field. Because the storage is chunked, you also end up decompressing stored fields for documents you didn't hit.

Aggregations are fast because they use doc values, which is a non-chunked columnar structure. It is compressed, but using numeric tricks rather than a general-purpose compression algorithm. If you can retool your work as an aggregation by pushing the interesting work down to Elasticsearch, then it can be orders of magnitude faster.
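To make this concrete, here is a minimal sketch of the aggregation approach (the index name `my-index` and the fields `doc_id` and `status` are hypothetical placeholders, not from this thread). The idea is to keep a copy of the ID in a field backed by doc values and read the IDs back as terms-aggregation bucket keys, with `"size": 0` suppressing hits so the fetch phase never runs:

```
GET my-index/_search
{
  "size": 0,
  "query": { "match": { "status": "active" } },
  "aggs": {
    "matching_ids": {
      "terms": { "field": "doc_id", "size": 10000 }
    }
  }
}
```

One tradeoff to note: a terms aggregation holds its buckets in memory, so a large `size` increases heap usage, and results are deduplicated and ordered by count rather than returned as a plain hit list.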

That's a great explanation. Thank you so much for that and the idea. I will try it immediately.

From looking at the Lucene/Elasticsearch code, I had worried that the intermediate results from the query phase would not be usable. This comment appears in every(?) implementation of IndexReader in Lucene:

"For efficiency, in this API documents are often referred to via *document numbers*, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions."

But given this comment, I wonder if there is or could be an implementation of IndexReader that returns a usable ID.

This is exciting. Using the aggregation method, I was able to get back 10K IDs in 16ms. Via scroll, the same results took ~6000ms. Can you help me understand what costs or tradeoffs are made by using this method, e.g., is memory usage much greater, or performance degradation non-linear?

@jimferenczi I think I remember you did something about this?

@gcampbell-epiq in the upcoming 5.0 release you can disable stored-fields retrieval. This should speed up the search if you only need doc-value or fieldcache fields. For instance, if you want to retrieve the _uid field you can do:

```
GET _search
{
    "stored_fields": "_none_",
    "docvalue_fields": ["_uid"]
}
```

This will retrieve the _uid values from fielddata (the _uid field doesn't have doc values), so the first query will be slow while the fielddata is built on the heap, but subsequent searches should be much faster than a regular fetch.

Can someone explain how to use aggregations to return document ids only and avoid the slow fetching?

Can someone explain how to use aggregations to return document ids only and avoid the slow fetching?

Do what @jimczi suggests above: disable stored_fields and fetch only fields backed by doc values. Your best bet is to use this only with fields that have doc values, such as keyword fields or numbers.

I suggest storing the ID in a separate field and fetching it via doc values, which should be fast enough.
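A sketch of that suggestion, with hypothetical index, type, and field names (the pre-5.0 mapping syntax with an explicit type name is assumed, matching the era of this thread): index a copy of the ID into a keyword field, which has doc values enabled by default, then retrieve it with docvalue_fields while disabling stored fields:

```
PUT my-index
{
  "mappings": {
    "my_type": {
      "properties": {
        "doc_id": { "type": "keyword" }
      }
    }
  }
}

GET my-index/_search
{
  "stored_fields": "_none_",
  "docvalue_fields": ["doc_id"]
}
```

Because keyword fields carry doc values by default, this avoids the fielddata build on the first query that the _uid approach above incurs.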

