I need to scan through all the documents in a DocType, while preserving the order in which they were added to it.
I logically turned to the scan method, but that does not preserve the order.
The elasticsearch-py scan helper method can take a preserve_order argument, but I can't figure out how to use it with elasticsearch-dsl-py.
Is it possible?
You can do it with the params method: anything passed there is used as kwargs when calling the underlying method (client.search or, in this case, helpers.scan):
s = Search().params(preserve_order=True)
for d in s.scan():
    print(d)
Hope this helps.
I missed that one... Thanks!
@HonzaKral what if I want the sorting to be based on a particular field? Is there a way to do that?
Say the scan method returns 500 documents, each with a field that stores a date value. I want the documents to be returned in ascending order of that date field.
@jaisharma639 regular sorting should work for you then: Search().sort('my_field').params(preserve_order=True).scan()
Thanks for the quick reply. I tried it out and it worked. It definitely should be included in the documentation http://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan.
I'm also wondering whether preserve_order is mandatory when using sort. What happens if I use sort alone? I tried it out and it returns documents in seemingly random order.
Yes, without preserve_order the scan is sorted by _doc (see [0]), which is much more efficient. If you want the search to actually be sorted as specified in its body, you need to pass in the preserve_order parameter; otherwise sort gets overwritten, as the documentation for that parameter suggests.
[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#search-request-scroll
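To make the override concrete, here is a simplified model of the behaviour described above. It is only a sketch, not the helper's actual source; effective_scan_body is a hypothetical name introduced for illustration, and the body is a plain dict rather than a real Search object.

```python
def effective_scan_body(body, preserve_order=False):
    """Return the body a scan-style helper would send (simplified model).

    Hypothetical helper for illustration only: without preserve_order,
    any sort in the body is replaced by "_doc"; with it, the body is
    sent unchanged.
    """
    if preserve_order:
        return body
    # Copy first so the caller's dict is not modified.
    overridden = dict(body) if body else {}
    overridden["sort"] = "_doc"
    return overridden


body = {"query": {"match_all": {}}, "sort": ["my_field"]}

# Sort alone: the requested sort is silently replaced.
print(effective_scan_body(body))

# Sort plus preserve_order=True: the requested sort survives.
print(effective_scan_body(body, preserve_order=True))
```

This is why sort on its own appears to have no effect during a scan: the sort in the body only reaches Elasticsearch when preserve_order=True is passed along with it.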
This worked for me (but I'm not sure if this is clumsy or an anti-pattern). My use case is 1. Output all hits after search and filtering, and 2. do a value_counts (like that in pandas) of hits on a keyword (field 'my_doc_type' only has two possible values, 'my_doc_type_1' and 'my_doc_type_2').
from elasticsearch_dsl import A
from my_own_es_dsl_doc_wrapper import MyDocument  # MyDocument inherits from elasticsearch_dsl.Document

def filter_query(query_filter, query):
    '''
    :param query: The elasticsearch-dsl query that's built and returned.
    '''
    field_value_dict = {
        query_filter.field: query_filter.value
    }
    return query.filter('term', **field_value_dict)

DOC_TYPE_FIELD_NAME = 'my_doc_type'
MY_QUERY = 'lol'
MY_FILTERS = []

my_document_search_object = MyDocument.search()
agg = A('terms', field=DOC_TYPE_FIELD_NAME)
my_document_search_object.aggs.bucket(DOC_TYPE_FIELD_NAME, agg)
my_document_search_object = my_document_search_object.query('multi_match', query=MY_QUERY)
while MY_FILTERS:
    query_filter = MY_FILTERS.pop()
    my_document_search_object = filter_query(query_filter, my_document_search_object)

response = my_document_search_object.execute()
flattened_buckets = {entry['key']: entry['doc_count'] for entry in response.aggs.my_doc_type.buckets}
print('There are {} my_doc_type_1 and {} my_doc_type_2 docs'.format(flattened_buckets['my_doc_type_1'], flattened_buckets['my_doc_type_2']))
print('Here are all the hits:', list(my_document_search_object.params(preserve_order=True).scan()))
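One thing to watch in the value_counts step: by default a terms aggregation only returns buckets for values that actually occur in the matched documents, so looking up a fixed key in flattened_buckets can raise KeyError when one type has no hits. A minimal sketch of a safer version, using plain dicts in place of the response buckets (count_by_key and the sample data are hypothetical, for illustration only):

```python
def count_by_key(buckets, expected_keys):
    """Flatten terms-aggregation buckets into {key: doc_count}.

    Keys listed in expected_keys but absent from the buckets are
    reported with a count of 0 instead of raising KeyError.
    """
    counts = {key: 0 for key in expected_keys}
    for entry in buckets:
        counts[entry["key"]] = entry["doc_count"]
    return counts


# Sample buckets where only one of the two types matched (made-up data).
sample_buckets = [{"key": "my_doc_type_1", "doc_count": 3}]
counts = count_by_key(sample_buckets, ["my_doc_type_1", "my_doc_type_2"])
print("There are {} my_doc_type_1 and {} my_doc_type_2 docs".format(
    counts["my_doc_type_1"], counts["my_doc_type_2"]))
```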