Elasticsearch-dsl-py: How to get all results back from search?

Created on 15 Sep 2017 · 20 Comments · Source: elastic/elasticsearch-dsl-py

After executing a search, the Search hits.total is over 9000. However, when I check the length of hits.hits it is only 10:

>>> from elasticsearch import Elasticsearch
>>> from elasticsearch_dsl import Search
>>> client = Elasticsearch(['http://nightly.apinf.io:14002'])
>>> search = Search(using=client)
>>> results = search.execute()
>>> results.hits.total
9611
>>> len(results.hits.hits)
10

How do I get back all 9611 search results?

brylie

😄5 👍4

Most helpful comment

@abhimanyu3 sorry for the late response. By default, scan returns results unordered. To have it respect the sort, call s = s.params(preserve_order=True) before calling s.scan() (see [0] for details).

0 - https://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan

HonzaKral on 14 Nov 2018

👍6 🚀1 🎉1
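
A minimal sketch of the ordered scan described in this comment, assuming `search` is an elasticsearch_dsl Search already bound to a client. The helper name `scan_sorted` and the default sort field are illustrative and not part of the library. The elasticsearch-py documentation warns that preserve_order can be very expensive, so it is probably worth enabling only when the order really matters.

```python
def scan_sorted(search, sort_field="@timestamp"):
    """Return a generator over all hits of `search`, ordered by `sort_field`.

    Hypothetical helper: only sort(), params() and scan() are library calls.
    """
    # Search objects are immutable: sort() and params() each return a new copy.
    ordered = search.sort(sort_field).params(preserve_order=True)
    # scan() goes through the scroll API and lazily yields every matching hit.
    return ordered.scan()
```

Usage would look like `for hit in scan_sorted(search): ...`.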

All 20 comments

You need to specify the range of results, e.g.:

total = search.count()
search = search[0:total]
results = search.execute()

njoannin on 15 Sep 2017

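Thanks @njoannin. I must admit that the syntax here is not intuitive, particularly where I change the search reference using indexing brackets:

# Import the client and the Search class
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# Create Elasticsearch client
client = Elasticsearch(['http://nightly.apinf.io:14002'])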

# Create search instance
search = Search(using=client)

# Count search results
total = search.count()

# What does this step do?
search = search[0:total]

# Get all search results
results = search.execute()

brylie on 18 Sep 2017

@brylie

It is very similar to what it would do on a list: it slices.
By default, the search only returns 10 results. If you want all the results, you need to request them, and the simplest way is to slice from zero to the total number of hits (i.e. search.count()).

If you only want the first 100 hits, you would slice it like this: search[0:100].
If you only wanted the hits from 100 to 200, you would slice it like this: search[100:200].

Does that clarify it?

njoannin on 18 Sep 2017

Yeah, I understand that bracket notation slices, that part makes sense. Perhaps I am just new to Python, but I was confused by the reassignment (not the slicing syntax):

# Reassigning the search object
search = search[0:total]

Is search still the same search as before this line, or have I lost something (methods, aggregations, filters, etc.)?

brylie on 18 Sep 2017

Yes, this library is designed that way: you cannot modify an object in place, you need to reassign it.

s1 = Search()
s1[0:100]  # This does not change s1
r1 = s1.execute()  # This only returns up to 10 hits

s2 = Search()
s2 = s2[0:100]  # This reassigns s2 to return up to 100 hits
r2 = s2.execute()  # This returns up to 100 hits (if there are 100 hits in the results)

Have a look at the documentation:

With the exception of the aggregations functionality this means that the Search object is immutable - all changes to the object will result in a copy being created which contains the changes.

njoannin on 18 Sep 2017
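
To make the reassignment less mysterious, here is a toy illustration of the copy-on-modify pattern. `ToySearch` is a made-up stand-in written for this example; it is not the library's code and only mimics how slicing hands back a modified copy. In the real library the copy keeps the query, filters and other settings of the original, so nothing is lost by reassigning.

```python
import copy


class ToySearch:
    """Toy stand-in (NOT the library's code) for a copy-on-modify search object."""

    def __init__(self):
        self.start = 0
        self.size = 10  # Elasticsearch returns 10 hits by default

    def __getitem__(self, window):
        # Slicing builds a modified copy and leaves the original untouched.
        clone = copy.copy(self)
        clone.start = window.start or 0
        clone.size = window.stop - clone.start
        return clone


s1 = ToySearch()
s1[0:100]        # result discarded, so s1 still asks for 10 hits
s2 = s1[0:100]   # keep the returned copy to get the 100-hit window
print(s1.size, s2.size)  # prints: 10 100
```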

Also, if you want to retrieve all the hits, there is a specialized method called scan that iterates over all the hits, not just the top N. That is the recommended way if you want to pull out more than, let's say, a few hundred or a few thousand documents.

HonzaKral on 18 Sep 2017

👍4
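
A small sketch of how scan can be consumed, assuming `search` is an elasticsearch_dsl Search bound to a client. The helper `take_hits` and its `limit` parameter are hypothetical names introduced for illustration; only `search.scan()` comes from the library.

```python
from itertools import islice


def take_hits(search, limit=None):
    """Collect hits from search.scan(), stopping after `limit` hits if given."""
    # scan() is lazy: hits are pulled from the cluster in batches as we iterate.
    # islice with limit=None simply consumes the whole generator.
    return list(islice(search.scan(), limit))
```

Calling `take_hits(search)` would collect everything, while `take_hits(search, 1000)` stops early, which can be handy for a quick look at the data.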

If you want to retrieve all the hits, there is a specialized method called scan that will iterate over all the hits

@HonzaKral thanks for that tip.

My main goals now are twofold:

explore/visualize certain metrics (e.g. with DataFrame.plot(), Bokeh, or some other approach)
load the results into a Pandas dataframe (for statistical analysis and ML classification)

Are there any tutorials that you are aware of for using Elasticsearch DSL in a data exploration/analysis workflow?

brylie on 19 Sep 2017

Not any tutorials, but just running a query with aggregations and then loading all the results into python/pandas via scan is a popular pattern which should help.

HonzaKral on 27 Oct 2017

👍1
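
The scan-into-pandas pattern mentioned here could be sketched roughly as follows. The helper name `hits_to_records` is an assumption made for this example; it relies only on `hit.to_dict()` and `hit.meta.id`, which elasticsearch_dsl hits provide.

```python
def hits_to_records(hits):
    """Turn hits into plain dicts, one per document, keeping the document id."""
    records = []
    for hit in hits:
        # Copy the source fields so the hit object itself is not modified.
        row = dict(hit.to_dict())
        # Document metadata (id, index, score) lives under hit.meta.
        row["_id"] = hit.meta.id
        records.append(row)
    return records
```

Assuming pandas is installed, `pandas.DataFrame(hits_to_records(search.scan()))` would then build the dataframe from the list of dicts.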

Is anyone else getting the error below?
name 'Search' is not defined

abhimanyu3 on 29 Aug 2018

@abhimanyu3 this just means that you haven't imported Search into your code with from elasticsearch_dsl import Search. This is a generic Python question; please do not comment on unrelated issues for elasticsearch_dsl. If you find an issue, feel free to open a new one. Thank you!

HonzaKral on 29 Aug 2018

👍1

@HonzaKral thanks a lot for your help. It worked like a charm. I apologize for the inconvenience caused by me.

abhimanyu3 on 29 Aug 2018

Is anyone else having issues with @timestamp? My results do not come back in sorted order, and the milliseconds are missing from the column (they show as 000Z). Is it common for @timestamp not to come back sorted? I can do the sorting in pandas, but I have millions of records, so I am wondering whether something can be done while downloading the data.

abhimanyu3 on 30 Aug 2018

# Create Elasticsearch client
client = Elasticsearch(['http://nightly.apinf.io:14002'])

# Create search instance
search = Search(using=client)

# Count search results
total = search.count()

# What does this step do?
search = search[0:total]

# Get all search results
results = search.execute()

This snippet no longer works.

$ pip freeze | grep dsl
elasticsearch-dsl==6.2.1

no-member: Instance of 'Request' has no 'execute' member

Update - pylint is full of crap. It must be a native extension.

If pylint throws this error, it is wrong.

DarrienG on 13 Nov 2018

@DarrienG that snippet should work, but it is not ideal if total is a large number: anything larger than 10k would be flat out refused by Elasticsearch, and even smaller numbers might cause performance issues. If total is in the thousands or more, I'd recommend using the scan method instead. See [0] for justification.

0 - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

HonzaKral on 14 Nov 2018

If we want to return all results (say >10k), is it better to use [0:total] or params(preserve_order=True).scan()?

zhanwenchen on 7 Feb 2020

If you have more than 10k hits, then .scan() is your only option.

HonzaKral on 7 Feb 2020
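
The rule of thumb from the last few comments could be sketched as below. The helper `fetch_all` and the constant name are hypothetical; the value 10,000 is Elasticsearch's default for the index.max_result_window setting, and `search` is assumed to be an elasticsearch_dsl Search bound to a client.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch's default index.max_result_window


def fetch_all(search, window=MAX_RESULT_WINDOW):
    """Return all hits as a list, choosing slice or scan by result count."""
    total = search.count()
    if total <= window:
        # Small enough for one ordinary request: from=0, size=total.
        return list(search[0:total].execute())
    # Past the window, from/size paging is refused, so fall back to scan().
    return list(search.scan())
```

Note that scan returns hits unordered unless preserve_order is set, so the two branches may differ in ordering.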

Thanks @njoannin
Awesome!! This helped us a lot.
s.scan() --> works for me on the overall index search, but not on the executed query search.

s[0:total] --> this is helpful for retrieving the documents of the executed search.

kumar8055 on 19 Aug 2020

You need to specify the range of results, e.g.:
total = search.count()
search = search[0:total]
results = search.execute()

This requires two calls to ES instead of one.
So I guess

search = search[0:max_int]

or something like it should be a solution that is twice as fast?

huntekah on 5 May 2021

Hi mate, @HonzaKral is right. In this case, to be crystal clear, your code will be something like this:

results = search.scan()

This returns a generator that can be used in a for loop, or you can turn it into a list with a list comprehension like this:

results = [e for e in search.scan()]

Hope this helps somebody, regards.