Elasticsearch-dsl-py: Using the scan method: how to add preserve_order=True?

Created on 12 May 2015 · 7 Comments · Source: elastic/elasticsearch-dsl-py

I need to scan through all the documents in a DocType, while preserving the order in which they were added to it.

I logically turned to the scan method, but that does not preserve the order.
The elasticsearch-py scan helper method can take a preserve_order argument, but I can't figure out how to use it with elasticsearch-dsl-py.

Is it possible?

All 7 comments

You can do it with the params method: anything passed there will be used as kwargs when calling the underlying method (client.search or, in this case, helpers.scan):

from elasticsearch_dsl import Search
s = Search().params(preserve_order=True)
for d in s.scan():
    print(d)

Hope this helps.

I missed that one... Thanks!

@HonzaKral what if I want the sorting to be based on a particular field? Is there a way to do that?
Say the scan method returns 500 documents, each with a field that stores a date value. I want the documents to be returned in ascending order of that date field.

@jaisharma639 regular sorting should work for you then: Search().sort('my_field').params(preserve_order=True).scan()

Thanks for the quick reply. I tried it out and it worked. It definitely should be included in the documentation http://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan.

Also, is preserve_order mandatory when using the sort parameter? What happens if I use sort alone? I tried it out and it returned the documents in what looked like random order.

Yes, without preserve_order the scan is sorted by _doc (see [0]), which is much more efficient. If you want the scan to actually apply the sort specified in the search body, you need to pass the preserve_order parameter; otherwise the sort gets overwritten, as the documentation for that parameter suggests.

0 - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#search-request-scroll
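
To make the overwrite concrete, here is a rough sketch that runs without a cluster. It is not the library's code: effective_body is a hypothetical stand-in that only models the behaviour described above, where the requested sort is replaced by _doc unless preserve_order is set.

```python
import copy


def effective_body(body, preserve_order=False):
    """Hypothetical model of the request body a scan-style helper would send."""
    # Work on a copy so the caller's search body is left untouched.
    body = copy.deepcopy(body) if body else {}
    if not preserve_order:
        # Without preserve_order, any requested sort is replaced by _doc.
        body["sort"] = "_doc"
    return body


requested = {"query": {"match_all": {}}, "sort": ["my_field"]}

print(effective_body(requested))                       # sort replaced by _doc
print(effective_body(requested, preserve_order=True))  # requested sort kept
```

Under this model, Search().sort('my_field').scan() on its own would scroll in _doc order, which would explain why the results looked unsorted.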

This worked for me (but I'm not sure whether it is clumsy or an anti-pattern). My use case is to (1) output all hits after searching and filtering, and (2) do a value_counts (like the one in pandas) of the hits on a keyword field (the field 'my_doc_type' has only two possible values, 'my_doc_type_1' and 'my_doc_type_2').

from elasticsearch_dsl import A
from my_own_es_dsl_doc_wrapper import MyDocument # MyDocument inherits from elasticsearch_dsl.Document

def filter_query(query_filter, query):
    '''
    :param query: The elasticsearch-dsl query that's built and returned.
    '''
    field_value_dict = {
        query_filter.field: query_filter.value
    }
    return query.filter('term', **field_value_dict)

DOC_TYPE_FIELD_NAME = 'my_doc_type'
MY_QUERY = 'lol'
MY_FILTERS = []

my_document_search_object = MyDocument.search()

agg = A('terms', field=DOC_TYPE_FIELD_NAME)
my_document_search_object.aggs.bucket(DOC_TYPE_FIELD_NAME, agg)

my_document_search_object = my_document_search_object.query('multi_match', query=MY_QUERY)
while MY_FILTERS:
    query_filter = MY_FILTERS.pop()
    my_document_search_object = filter_query(query_filter, my_document_search_object)

response = my_document_search_object.execute()

flattened_buckets = {entry['key']: entry['doc_count'] for entry in response.aggs.my_doc_type.buckets}

print('There are {} my_doc_type_1 and {} my_doc_type_2 docs'.format(flattened_buckets['my_doc_type_1'], flattened_buckets['my_doc_type_2']))

print('Here are all the hits:', list(my_document_search_object.params(preserve_order=True).scan()))
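
One caveat with the bucket lookup above: by default a terms aggregation only returns buckets for values that match at least one document, so indexing flattened_buckets with a missing key raises KeyError when a filter excludes one of the two types. A minimal sketch of a more forgiving version follows; value_counts and sample_buckets are made-up names for illustration, and the buckets are plain dicts shaped like the aggregation buckets used above.

```python
def value_counts(buckets, expected_keys=()):
    """Flatten terms-aggregation buckets into {key: doc_count}."""
    # Start every expected key at zero so absent buckets still show up.
    counts = {key: 0 for key in expected_keys}
    for bucket in buckets:
        counts[bucket["key"]] = bucket["doc_count"]
    return counts


# Sample data: only one of the two document types matched the search.
sample_buckets = [{"key": "my_doc_type_1", "doc_count": 3}]

counts = value_counts(sample_buckets, ("my_doc_type_1", "my_doc_type_2"))
print("There are {} my_doc_type_1 and {} my_doc_type_2 docs".format(
    counts["my_doc_type_1"], counts["my_doc_type_2"]))
```

With the real response, the first argument would be response.aggs.my_doc_type.buckets from the code above.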