Elasticsearch-dsl-py: Scan method takes forever to get 200000 records

Created on 28 Aug 2016 · 2 comments · Source: elastic/elasticsearch-dsl-py

I am using this code to get the results of a query:

s = Search(using=es, index='project', doc_type='docs')
s = s.fields(['text'])
txt = [h.text[0] for h in s.scan()]

But it takes forever to fetch the dataset this way when I have over 1 million records. What is a better way to fetch such a large dataset using elasticsearch-dsl?

scottydelta

Most helpful comment

scan is definitely the best way to get 200k records out of Elasticsearch. If it's taking a long time, that is most likely because a small size (the number of hits fetched per batch) is being used. To adjust that you can do:

s = Search(using=es, index='project', doc_type='docs')
s = s.fields(['text'])
txt = [h.text[0] for h in s.params(size=1000).scan()]

Note that this can also be solved by upgrading elasticsearch-py to the newest version, where the default is set to a more reasonable 1000.

More details can be found here: https://github.com/elastic/elasticsearch-py/issues/397

HonzaKral on 29 Aug 2016
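
To illustrate why the size parameter matters so much here, the following is a minimal sketch that only counts how many batches a scan has to fetch for a given size. It does not talk to Elasticsearch. The function name scroll_requests is hypothetical, and the small batch size of 10 is an assumption chosen for comparison, not a value confirmed in this thread.

```python
import math


def scroll_requests(total_hits: int, size: int) -> int:
    """Return how many batches are needed to page through total_hits
    when each batch returns at most `size` hits."""
    if size <= 0:
        raise ValueError("size must be positive")
    return math.ceil(total_hits / size)


# Compare an assumed small batch size (10) with the size suggested above (1000)
# for the 200,000 records mentioned in the issue title.
for size in (10, 1000):
    print(f"size={size}: {scroll_requests(200_000, size)} batches")
```

Under these assumptions, the larger size cuts the number of round trips by a factor of 100, which is why raising it tends to dominate any other tuning for a full scan.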

@HonzaKral I am still getting very poor performance. I have just 2 GB of data and it takes 12.5 seconds for 55000 records. I am going to close this issue, as the size parameter does make it faster than what I was doing earlier. I couldn't find the size parameter in any of the docs.

scottydelta on 30 Aug 2016
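
For reference, the figures reported above work out as follows. This is a small arithmetic sketch only; the helper names are hypothetical, and the projection to 1 million records assumes throughput stays constant as the result set grows, which the thread does not confirm.

```python
def records_per_second(records: int, seconds: float) -> float:
    """Throughput implied by fetching `records` hits in `seconds`."""
    return records / seconds


def estimated_seconds(total_records: int, rate: float) -> float:
    """Projected duration at a constant rate (an assumption, not a measurement)."""
    return total_records / rate


rate = records_per_second(55_000, 12.5)   # figures reported in the comment above
projected = estimated_seconds(1_000_000, rate)
print(f"{rate:.0f} records/s, about {projected:.0f} s for 1 million records")
```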