Is it possible to avoid the fetch phase for a search and return only document IDs? Is _id available at the end of the query phase such that fetch would be redundant when all I want is the _id? Can this be done through the current API?
Ultimately, I'm hoping to retrieve doc IDs from a search much faster than what I've seen so far. I've tried every documented approach in all sorts of permutations to get better performance, and found no satisfactory results. The best I've achieved was a 25% speed improvement by querying each of 5 shards individually, in parallel. An acceptable speed would be 90% faster. It would help a lot to understand whether that is reasonable, and why not if it isn't. It's very difficult to understand why I can be given a) the first 100 results, b) the total count and c) have them sorted so quickly, yet retrieving the full set of IDs is so slow.
Also, is there any possibility of improving performance for this (IDs only) scenario by developing a plugin? Are there any other options, documented or not that can reduce overhead?
Just to stress the importance of this, it would be crucial to our implementation and likely a deciding factor for our adoption of Elastic to replace our current massive persistence layer.
How many IDs are you retrieving per request? If few, then I am surprised that the fetch phase is taking so long; if many, then I'm afraid Elasticsearch is not the right tool for the job: this is something that regular databases are better at.
Returning a few IDs is very fast. Returning 10k and up is slow. I'd like to understand why; can you explain this? Also, I'd like to explore options for getting better performance. Could you provide some guidance or ideas on where to look for developing performance improvements, e.g. a plugin for Elasticsearch, or using Lucene directly? Why not try a query-only (no fetch) search type?
> I'd like to understand why.
The query phase fetches Lucene's doc ids (integers), not Elasticsearch's ids (strings). The fetch phase looks up those doc ids using Lucene's stored fields mechanism. Stored fields are stored together in compressed chunks. Since _source is a stored field, you have to decompress a lot of _source to get to the id field. Because it is chunked, you also have to decompress stored fields for docs you didn't hit.
Aggregations are fast because they use doc values, which are a non-chunked columnar structure. They are compressed, but using numeric tricks rather than a general-purpose compression algorithm. If you can retool your work as an aggregation by pushing the interesting work into Elasticsearch, then your thing can be orders of magnitude faster.
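As a rough sketch of what that retooling could look like (assuming you index a keyword copy of the ID, here called my_id, which has doc values; the index name, query and field name are placeholders), you can collect the IDs with a terms aggregation instead of hits:

```
GET my_index/_search
{
  "size": 0,
  "query": {
    "term": { "status": "active" }
  },
  "aggs": {
    "ids": {
      "terms": { "field": "my_id", "size": 10000 }
    }
  }
}
```

With "size": 0 no hits are fetched at all, so the fetch phase never touches the compressed stored-field chunks; the IDs come straight out of the doc values of my_id. Keep in mind that terms aggregations build their buckets in memory on each shard, so very large bucket sizes trade heap for the speedup.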
That's a great explanation. Thank you so much for that and the idea. I will try it immediately.
From looking at the Lucene/Elasticsearch code, I had worried that the intermediate results from the query phase would not be usable. This comment appears in seemingly every implementation of IndexReader in Lucene:
> For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
But given this comment, I wonder if there is or could be an implementation of IndexReader that returns a usable ID.
This is exciting. Using the aggregation method, I was able to get back 10k IDs in 16ms; via scroll, the same results took ~6000ms. Can you help me understand what costs or tradeoffs this method incurs, e.g. is memory usage much greater, or does performance degrade non-linearly?
@jimferenczi I think I remember you did something about this?
@gcampbell-epiq in the upcoming 5.0 you can disable stored fields retrieval. This should speed up the search if you only need docvalue (or fieldcache) fields. For instance, if you want to retrieve the _uid field you can do:

```
GET _search
{
  "stored_fields": "_none_",
  "docvalue_fields": ["_uid"]
}
```

This will retrieve the _uid field from the fielddata (this field doesn't have doc values), so it will be slow on the first query, which needs to build the fielddata on the heap, but after that subsequent searches should be much faster than a regular one.
Can someone explain how to use aggregations to return document ids only and avoid the slow fetching?
> Can someone explain how to use aggregations to return document ids only and avoid the slow fetching?
Do what @jimczi suggests above: disable stored_fields and fetch only fields with doc values. Your best bet is to use this only with fields that have doc values, like keyword fields or numbers.
I suggest using another field to store the uid and fetching it with doc values, which should be fast enough.
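A minimal sketch of that approach, assuming you can add a keyword field (here my_id) that duplicates the document ID at index time; keyword fields have doc values by default, and the exact mapping syntax depends on your Elasticsearch version (older versions also require a mapping type):

```
PUT my_index
{
  "mappings": {
    "properties": {
      "my_id": { "type": "keyword" }
    }
  }
}

GET my_index/_search
{
  "stored_fields": "_none_",
  "docvalue_fields": ["my_id"],
  "query": {
    "term": { "status": "active" }
  }
}
```

The values then come back under each hit's fields section rather than _source, so the fetch phase never has to decompress the stored-field chunks.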