Elasticsearch-dsl-py: How to get all results back from search?

Created on 15 Sep 2017 · 20 Comments · Source: elastic/elasticsearch-dsl-py

After executing a search, the Search hits.total is over 9000. However, when I check the length of hits.hits it is only 10:

>>> from elasticsearch import Elasticsearch
>>> from elasticsearch_dsl import Search
>>> client = Elasticsearch(['http://nightly.apinf.io:14002'])
>>> search = Search(using=client)
>>> results = search.execute()
>>> results.hits.total
9611
>>> len(results.hits.hits)
10

How do I get back all 9611 search results?

All 20 comments

You need to specify the range of results, e.g.:

total = search.count()
search = search[0:total]
results = search.execute()

Thanks @njoannin. I must admit that the syntax here is not intuitive. Particularly where I change the search reference using indexing brackets:

# Create Elasticsearch client
client = Elasticsearch(['http://nightly.apinf.io:14002'])

# Create search instance
search = Search(using=client)

# Count search results
total = search.count()

# What does this step do?
search = search[0:total]

# Get all search results
results = search.execute()

@brylie

It is very similar to what it would do on a list: it slices.
By default, a search returns only 10 results. If you want all the results, you need to request them, and the simplest way is to slice from zero to the total number of hits (i.e. search.count()).

If you only want the first 100 hits, you would slice it like this: search[0:100].
If you only wanted the hits from 100 to 200, you would slice it like this: search[100:200].

Does that clarify it?
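
To make the slicing explanation above concrete, here is a rough sketch of how a slice can be read as paging parameters. The helper name slice_to_from_size is hypothetical and not part of elasticsearch-dsl; it only illustrates the assumed mapping of a slice onto the from and size values of a search request.

```python
def slice_to_from_size(window: slice) -> dict:
    """Translate a Python slice into Elasticsearch-style paging parameters.

    Hypothetical helper for illustration only, not a library function.
    """
    start = window.start or 0
    if window.stop is None:
        raise ValueError("an open-ended slice has no fixed size")
    # "from" is the offset of the first hit, "size" is how many hits to return.
    return {"from": start, "size": max(0, window.stop - start)}


print(slice_to_from_size(slice(0, 100)))    # {'from': 0, 'size': 100}
print(slice_to_from_size(slice(100, 200)))  # {'from': 100, 'size': 100}
```

Read this way, search[100:200] asks for 100 hits starting at offset 100, which is why the slice end is exclusive, just as with a list.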

Yeah, I understand that bracket notation slices, that part makes sense. Perhaps I am just new to Python, but I was confused by the reassignment (not the slicing syntax):

# Reassigning the search object
search = search[0:total]

Is search still the same search as before this line, or have I lost something (methods, aggregations, filters, etc.)?

Yes, this library is designed that way: you cannot modify an object in place, you need to reassign it.

s1 = Search()
s1[0:100]  # This does not change s1
r1 = s1.execute()  # This only returns up to 10 hits

s2 = Search()
s2 = s2[0:100]  # This reassigns s2 to return up to 100 hits
r2 = s2.execute()  # This returns up to 100 hits (if there are 100 hits in the results)

Have a look at the documentation:

With the exception of the aggregations functionality this means that the Search object is immutable - all changes to the object will result in a copy being created which contains the changes.
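
The copy-on-change behaviour described in that quote can be sketched with a toy class. MiniSearch below is a hypothetical stand-in written for this illustration, not the real Search class; it only mimics the idea that slicing returns a modified copy and leaves the original untouched.

```python
import copy


class MiniSearch:
    """Hypothetical stand-in for Search; not the real class."""

    def __init__(self):
        # Paging options; empty means "use the server default" (10 hits).
        self._extra = {}

    def __getitem__(self, window):
        # Work on a copy so the original object is never modified.
        clone = copy.deepcopy(self)
        start = window.start or 0
        clone._extra = {"from": start, "size": window.stop - start}
        return clone


s1 = MiniSearch()
s1[0:100]        # the copy is discarded, s1 is unchanged
s2 = s1[0:100]   # keep the copy by assigning it

print(s1._extra)  # {}
print(s2._extra)  # {'from': 0, 'size': 100}
```

Nothing is lost by reassigning: the copy carries over everything the original had, plus the change.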

Also, if you want to retrieve all the hits, there is a specialized method called scan that will iterate over all the hits, not just the top N. That is the recommended way if you want to retrieve more than, let's say, a few hundred or a few thousand documents.

If you want to retrieve all the hits, there is a specialized method called scan that will iterate over all the hits

@HonzaKral thanks for that tip.

My main goals now are twofold:

  • explore/visualize certain metrics (e.g. with DataFrame.plot(), Bokeh, or some other approach)
  • load the results into a Pandas dataframe (for statistical analysis and ML classification)

Are there any tutorials that you are aware of for using Elasticsearch DSL in a data exploration/analysis workflow?

Not any tutorials, but just running a query with aggregations and then loading all the results into Python/pandas via scan is a popular pattern which should help.
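
A minimal sketch of that pattern, under stated assumptions: hits_to_rows is a hypothetical helper, and FakeHit is a stand-in used only so the example runs without a cluster. It assumes each hit exposes to_dict() and meta.id, as the hits returned by elasticsearch-dsl do.

```python
from types import SimpleNamespace


def hits_to_rows(hits):
    """Flatten hit-like objects into plain dicts, one per document."""
    rows = []
    for hit in hits:
        row = hit.to_dict()       # document fields
        row["_id"] = hit.meta.id  # keep the document id alongside the fields
        rows.append(row)
    return rows


class FakeHit:
    """Hypothetical stand-in for a search hit, for demonstration only."""

    def __init__(self, doc_id, source):
        self.meta = SimpleNamespace(id=doc_id)
        self._source = source

    def to_dict(self):
        return dict(self._source)


# A generator standing in for search.scan().
fake_scan = (FakeHit(str(i), {"status": 200 + i}) for i in range(3))
rows = hits_to_rows(fake_scan)
print(rows[0])  # {'status': 200, '_id': '0'}
```

With the real library this would be called as hits_to_rows(search.scan()), and the resulting list of dicts can be passed to pandas.DataFrame(rows).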

Is anyone else getting the error below?
name 'Search' is not defined

@abhimanyu3 this just means that you haven't imported Search into your code by specifying from elasticsearch_dsl import Search. This is a generic Python question; please do not comment on unrelated issues for elasticsearch_dsl. If you find an issue, feel free to open a new one. Thank you!

@HonzaKral thanks a lot for your help. It worked like a charm. I apologize for the inconvenience caused by me.

Is anyone having issues with @timestamp? My issue is that the results do not come back in sorted order, and the milliseconds are missing from the column (it shows 000Z). Is it common not to get @timestamp in sorted order? I can do the sorting in pandas, but I have millions of records, so I am wondering whether we can do something while downloading the data.

# Create Elasticsearch client
client = Elasticsearch(['http://nightly.apinf.io:14002'])

# Create search instance
search = Search(using=client)

# Count search results
total = search.count()

# What does this step do?
search = search[0:total]

# Get all search results
results = search.execute()

This snippet no longer works.

$ pip freeze | grep dsl
elasticsearch-dsl==6.2.1

no-member: Instance of 'Request' has no 'execute' member

Update - pylint is full of crap. It must be a native extension.

If pylint throws this error, it is wrong.

@abhimanyu3 sorry for the late response. By default, scan returns hits unordered. To have it respect the sort, you need to do s = s.params(preserve_order=True) before calling s.scan() (see [0] for details).

0 - https://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan
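
The call order described above can be sketched as follows. scan_in_order is a hypothetical helper, and FakeSearch is a stand-in that only records which methods were called, so the example runs without a cluster; sort, params and scan are the Search methods the sketch relies on.

```python
def scan_in_order(search, sort_field):
    """Sort first, then ask scan to preserve that order."""
    return search.sort(sort_field).params(preserve_order=True).scan()


class FakeSearch:
    """Hypothetical stand-in that records the calls; not the real Search."""

    def __init__(self, calls=()):
        self.calls = tuple(calls)

    def sort(self, *keys):
        return FakeSearch(self.calls + (("sort", keys),))

    def params(self, **kwargs):
        return FakeSearch(self.calls + (("params", kwargs),))

    def scan(self):
        # The real scan yields hits; this one yields the recorded calls.
        yield from self.calls


calls = list(scan_in_order(FakeSearch(), "@timestamp"))
print(calls)  # [('sort', ('@timestamp',)), ('params', {'preserve_order': True})]
```

With a real Search object, the same helper would return a generator of hits sorted by the given field.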

@DarrienG that snippet should work, but it is not ideal if total is a large number: anything larger than 10k (the default index.max_result_window) would be flat out refused by Elasticsearch, and even smaller numbers might cause performance issues. If total is in the thousands or more, I'd recommend using the scan method instead. See [0] for justification.

0 - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
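
One way to act on that advice is to pick the strategy from the hit count. This is only a sketch: fetch_all and its slice_limit threshold of 1,000 are assumptions made for illustration, and FakeSearch is an in-memory stand-in so the example runs without a cluster.

```python
def fetch_all(search, slice_limit=1_000):
    """Return every hit, slicing for small result sets and scanning otherwise.

    slice_limit is an assumed threshold; the hard ceiling for slicing is the
    10k result window mentioned above.
    """
    total = search.count()
    if total <= slice_limit:
        return list(search[0:total].execute())
    return list(search.scan())


class FakeSearch:
    """Hypothetical in-memory stand-in for Search, used only to run the sketch."""

    def __init__(self, docs, log, window=slice(0, 10)):
        self.docs, self.log, self.window = docs, log, window

    def count(self):
        return len(self.docs)

    def __getitem__(self, window):
        return FakeSearch(self.docs, self.log, window)

    def execute(self):
        self.log.append("execute")
        return self.docs[self.window]

    def scan(self):
        self.log.append("scan")
        yield from self.docs


log = []
small = fetch_all(FakeSearch(list(range(50)), log))
large = fetch_all(FakeSearch(list(range(5000)), log))
print(len(small), len(large), log)  # 50 5000 ['execute', 'scan']
```

Note that the count-then-slice branch still costs two requests, while the scan branch pages through the results with the scroll API.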

If we want to return all (say >10k), is it better to use [0:total] vs params(preserve_order=True).scan()?

If you have more than 10k hits, then .scan() is your only option.

Thanks @njoannin
Awesome!! This helped us a lot.
s.scan() --> works for me on the overall index search, but not on the executed query search.

s[0:total] --> This is helpful for retrieving the documents of the executed search.

You need to specify the range of results, e.g.:

total = search.count()
search = search[0:total]
results = search.execute()

This requires two calls to ES instead of one.
So I guess

search = search[0:max_int]

or something like it should be roughly twice as fast?

Hi mate, @HonzaKral is right. In this case, to be crystal clear, your code will be something like this:
results = search.scan()
This returns a generator that can be used in a for loop, or you can turn it into a list using a list comprehension like this:
results = [e for e in search.scan()]
Hope this helps somebody. Regards.
