I am using this code to get the results of a query:
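from elasticsearch_dsl import Search

# es is assumed to be an already-configured elasticsearch.Elasticsearch client
s = Search(using=es, index='project', doc_type='docs')
s = s.fields(['text'])
txt = [h.text[0] for h in s.scan()]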
but it takes forever to get the dataset this way when I have over 1 million records. What is a better way to fetch such a large dataset using elasticsearch-dsl?
scan is definitely the best way to get 200k records out of elasticsearch. If it's taking a long time, it's most likely because a small size is being used for each batch. To adjust that you can do:
s = Search(using=es, index='project', doc_type='docs')
s = s.fields(['text'])
# params(size=1000) raises the batch size that scan() requests
txt = [h.text[0] for h in s.params(size=1000).scan()]
Note that this can also be solved by upgrading elasticsearch-py to the newest version, since there the default is set to a more reasonable 1000.
More details can be found here: https://github.com/elastic/elasticsearch-py/issues/397
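To illustrate why the batch size matters, here is a rough sketch that does not talk to elasticsearch at all. It only models a scroll as a series of round trips over an in-memory list and counts them. The helpers scroll_pages and count_round_trips are hypothetical names made up for this example, and the small size of 10 is an arbitrary value picked for comparison, not a claim about any library default.

```python
def scroll_pages(records, size):
    """Yield successive batches of at most `size` records.

    Each yielded batch stands in for one scroll round trip.
    """
    for start in range(0, len(records), size):
        yield records[start:start + size]


def count_round_trips(records, size):
    """Count how many batches are needed to read every record."""
    return sum(1 for _ in scroll_pages(records, size))


# Stand-in for 55,000 hits; real hits would come from the cluster.
records = list(range(55000))

small = count_round_trips(records, 10)    # arbitrary small batch size
large = count_round_trips(records, 1000)  # the size suggested above

print(small, large)
```

Each round trip carries a fixed overhead, so fewer and larger batches usually finish sooner. Actual timings depend on the cluster, the network and the document size, so treat this only as a way to reason about the request count.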
@HonzaKral I am still getting very poor performance. I have only 2 GB of data, and it takes 12.5 seconds to fetch 55,000 records. I am going to close this issue, since the size parameter makes it faster than what I was doing earlier. I couldn't find the size parameter in any of the docs, though.