I want to paginate through the search results, not iterate over them the way scan from elasticsearch.helpers does.
My current workaround is something like this:
search = Search()
...
# construct your search query
...
result = search.params(search_type='scan', scroll='1m').execute()
scroll_id = result.scroll_id
# Now start using scroll_id to do the pagination,
# but I have to use Elasticsearch.scroll which returns dictionaries not a Result object
client = connections.get_connection()
while data_to_paginate:
    result = Response(client.scroll(scroll_id, scroll='1m'))
There should probably be a helper function that abstracts at least the following part:
client = connections.get_connection()
result = Response(client.scroll(scroll_id, scroll='1m'))
Maybe even getting the scroll_id from the result. Basically the user probably shouldn't be getting a client and manually constructing a Response object.
@HonzaKral what do you think? If we agree on the interface I could implement that since I am probably going to do that for my project.
There is already support for pagination:
s = Search()
s = s.query().filter().....
s = s[10:20]
s.execute()
will work for pagination. Slicing the Search object adds from/size args to the search body, which should help. The only thing I am unsure of is whether slicing should also execute the search (currently it doesn't because of aggregations).
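For page-based access, the slice bounds can be derived from a page number. This is a minimal sketch, assuming 1-based page numbers; `page_slice` is a hypothetical helper and not part of elasticsearch_dsl:

```python
def page_slice(page, per_page=10):
    """Return the slice covering a 1-based page of `per_page` hits."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return slice(start, start + per_page)


# Assumed usage with a Search object `s`:
#   s = s[page_slice(2, 10)]   # equivalent to s[10:20]
#   response = s.execute()
```

Because the helper only builds a slice object, it works on any sequence and leaves the decision of when to call execute with the caller.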
Using scroll for pagination is not ideal because it has non-trivial overhead on the Elasticsearch side. That said, there is a .scan method on the Search object that can already do this if you need it.
What do you think?
Slicing is great, but it uses from/size, which means that deep pagination is very costly.
I need to paginate through pretty much my entire index, which is why I need to use scan. I also don't care about scoring the data, so this is the ideal case for scan.
I think your current design, where you only communicate with the server on .execute, is great. I can't figure out a good equivalent for scan, though: the first time I execute with search_type='scan' it only returns the scroll_id, so I need a second call. To make it similar, I could add .scroll to the Search class so one could construct
s = Search()
s = s.scroll(scroll_id='some_id')
s.execute()
And this would return one page from the scanning session with corresponding scroll_id.
Have you tried just calling the scan method on the Search object? It returns a generator that iterates over all the documents matching the query, completely hiding the scroll_id mechanics.
That works for iterating over whole data. But I want to have pagination, because I'm exposing my elasticsearch data over a REST API and I want to send data over to my API users chunk by chunk (so they don't have to send a request per object and I can't send them all the data at once). The scan helper isn't suited very well for my use case.
Well, in that case I am afraid you have to replicate a lot of the functionality and use the scroll api directly using a scroll_id that you store somewhere.
You can always take the result of the scroll and feed the output to Response (response = Response(es.scroll(scroll_id=XYZ, scroll='1m'))). You will then be able to access the new scroll_id as response._scroll_id, all the hits will still be available as usual - as response.hits or you can just iterate over the response.
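The low-level half of that workaround could be wrapped roughly as follows. This is only a sketch: `fetch_scroll_page` is a hypothetical name, `client` is assumed to be an elasticsearch-py client whose `scroll` call returns a plain dictionary, and wrapping the hits in `Response` is left to the caller.

```python
def fetch_scroll_page(client, scroll_id, scroll="1m"):
    """Fetch one page of an already opened scroll.

    Returns (hits, next_scroll_id). An empty hits list means the
    scroll has no more results.
    """
    raw = client.scroll(scroll_id=scroll_id, scroll=scroll)
    hits = raw["hits"]["hits"]
    # The scroll id may change between calls, so hand back the latest one
    # and fall back to the old id if the response does not carry a new one.
    return hits, raw.get("_scroll_id", scroll_id)
```

A REST endpoint could return `next_scroll_id` to the API user as an opaque cursor and accept it back on the next request.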
ok, cool. So you think this is a very weird use case and it shouldn't get into elasticsearch_dsl?
Let's leave this ticket here and if it attracts other people we will add a helper. So far I feel that the workaround is very simple so there is no need.
I also don't think this is a very common use case - either people care about score and need to use normal pagination, or they just want to grab all the data, in which case scan is ideal for them.
sounds good
Thanks, closing. If any more people find this and would find it useful, please leave a comment.
@HonzaKral I find it very useful, but it would be nice to have a mention of the "scroll" attached to it in the documentation.
Thanks, makes perfect sense. I added a note in d01854824348bf46aed80acaa470cc0d96f89007
Found this thread after elasticsearch's suggestion to look at scroll.
Question is, how can I scan() over all results in a DocType?
Normally I do:
docs = DocType.search().filter().execute()
for d in docs:
    username, domain = d.email.split('@')
Now that I'm doing:
docs = DocType.search().filter().scan()
for d in docs:
    username, domain = d.email.split('@')
once it processes 4,700 or so docs, the next iteration of the loop says:
user, domain = email.split('@')
ValueError: need more than 1 value to unpack
Which I assume is because scan() is getting the next chunk of data. So then I thought I would do a while len(docs) > 0: and pop from the list, but a generator object has no len.
Thanks!
+1 for a more default ES-DSL implementation for pagination :)
I am sorry, I missed this comment. @brizzbane the error you posted is in your data - you have an email field that doesn't contain any @ signs. It has nothing to do with elasticsearch or the library.
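A defensive way to split such values, so that one malformed document does not abort the whole scan, might look like this. It is a sketch only; `split_email` is a hypothetical helper, and returning None for bad values is an assumption about how you want to handle them:

```python
def split_email(email):
    """Split 'user@domain' into (user, domain); return None if there is no '@'."""
    # partition splits on the first '@' and never raises on missing separators.
    user, sep, domain = email.partition("@")
    if not sep:
        return None
    return user, domain
```

Inside the loop you could then skip or log documents for which `split_email(d.email)` returns None.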
@0x11 I will be happy to help, but I am not sure what is needed - you can easily paginate using standard slicing on a search object, and tools like the Django paginator work seamlessly with it too. The only exception is that you have to call execute if you also want aggregations (https://github.com/HonzaKral/es-django-example/blob/master/qa/views.py#L29-L36).
Is there anything in particular that you are missing? Thanks!
@HonzaKral Actually maybe a quick question if you'll permit me.
Just to confirm, can you do slicing on scan()? Something along the lines of: search[start:end].scan()
There really isn't any slicing on scan - the only purpose of the API is when you want _all_ the results, slicing then makes little sense.
@HonzaKral My use case is that I'm trying to grab more results than the default ES index.max_result_window, but I don't want ALL results either.
So I figured I could use scan to grab all results, but then slice down to exactly how many I want. Say I have 50,000 results that match the query, but I only want the first 30,000; doing search[0:30000].scan() would fulfill my need.
If slicing doesn't work, does that mean I need to iterate over all the hits that scan() returns and stop once I've hit the Xth result?
Yes, iterate and then stop, don't forget to call clear_scroll after you stop to free up some resources.
@HonzaKral Quick follow up question with the following
results = []
for hit in search.scan():
    while incrementer <= requested_records:
        results.append(hit)
        incrementer += 1
I'm trying to create a list of X requested records. I seem to be having an issue where I continuously add the same record to my list, even if I change my code to:
results.append(copy.deepcopy(hit))
I'm probably doing something very silly late at night, but I figured I'd check with you to make sure it's not something funny with scan().
Thanks!
EDIT: Actually, doing a deepcopy on the hit gives me a recursion depth error. I'm actually doing:
result.append(copy.deepcopy(hit._d_))
and still getting the same record over and over.
Hi, the problem is in your code, you want to specify:
results = []
for position, hit in enumerate(search.scan()):
    if position == requested_records:
        break
    results.append(hit)
In your nested loops the inner while cycle was actually just working with the same hit object over and over.
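The same "iterate and stop" idea can also be written with itertools.islice from the standard library. A small sketch; `take` is a hypothetical helper name:

```python
from itertools import islice


def take(iterable, limit):
    """Return at most `limit` items from any iterable as a list."""
    # islice stops pulling from the underlying iterator once `limit` is reached.
    return list(islice(iterable, limit))


# Assumed usage with a Search object `search`:
#   results = take(search.scan(), requested_records)
```

Because islice stops consuming the generator as soon as the limit is reached, no further scroll requests are triggered by this loop once enough hits have been collected.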
@HonzaKral Embarrassing but clearly I just needed a second set of eyes. Thanks for the help :)
@HonzaKral Not trying to necro this too much, but is clear_scroll documented anywhere in the docs?
It is documented as part of the low level API - http://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.clear_scroll
The problem, however, is that when using the scan method you won't have access to the scroll_id. The only solution, I am afraid, is to copy/paste the implementation of elasticsearch.helpers.scan and call clear_scroll there once enough hits have been returned.
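A stripped-down version of that idea might look like the sketch below. It is not the actual elasticsearch.helpers.scan code: `scan_limited` is a hypothetical name, `client` is assumed to be an elasticsearch-py client returning plain dictionaries, and the initial search is assumed to already return the first batch of hits (which is not the case with the old search_type='scan').

```python
def scan_limited(client, limit, scroll="5m", **search_kwargs):
    """Yield at most `limit` raw hits from a scroll, then release the scroll."""
    # Assumption: the first search response already contains a batch of hits.
    resp = client.search(scroll=scroll, **search_kwargs)
    scroll_id = resp.get("_scroll_id")
    yielded = 0
    try:
        while yielded < limit:
            hits = resp["hits"]["hits"]
            if not hits:
                break  # scroll exhausted
            for hit in hits:
                yield hit
                yielded += 1
                if yielded >= limit:
                    return
            resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Runs on exhaustion, on reaching the limit, and when the
        # generator is closed early by the caller.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

Extra keyword arguments such as index, body, or size are passed straight through to `client.search`, so something like `scan_limited(client, 30000, index="users", size=1000)` would be one way to call it.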
@HonzaKral commenting on the old issue.
I want to use the ES scroll API to download all the records as CSV.
I want to dump user records from ES to a CSV file based on the filters applied. If no filters are applied, that means taking the whole user dump (user count = 10 lakh, i.e. 1 million).
I have checked the Search().scan() API.
My question is: I want to dump the data in sorted order, but the documentation says "Note that in this case the results won't be sorted." I don't understand why the sort order is not maintained; I am assuming s.scan() fetches the data in chunks from Elasticsearch.
@sbnajardhane you can always pass in the preserve_order attribute to the underlying helper by calling
s = s.params(preserve_order=True) before calling s.scan() - that way the sorting will still be applied even when scanning
Thanks @HonzaKral
It worked. (y)