Elasticsearch-dsl-py: Bulk Indexing of DocTypes? Documentation, or Feature.

Created on 13 May 2015 · 14 Comments · Source: elastic/elasticsearch-dsl-py

I asked in the #elasticsearch channel on Freenode and was told that helpers.bulk from the elasticsearch module can be used to index documents in bulk.

Can this be used in conjunction with DocTypes? (Basically, instead of .save(), something like .queue_bulk() followed by .save_bulk().)

Or is this already possible by passing the DocType to helpers.bulk? I searched the documentation but couldn't find anything related to bulk indexing.

Thanks!

All 14 comments

Ah, good point. Currently it isn't possible to tie these two together directly; what you'd have to do is:

bulk(es, ({'_index': getattr(d.meta, 'index', d._doc_type.index), '_type': d._doc_type.name, '_source': d.to_dict()} for d in MY_DOCS))

I will think about it and probably add a parameter to to_dict on the DocType to produce the full dict, including the metadata that could then be passed to bulk.

Does that make sense?
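
Spelled out with its import, the one-liner above might look like the following sketch. The names `manual_action` and `index_documents` are made up for illustration, `es` is assumed to be an `elasticsearch.Elasticsearch` client, and `_type` only applies to Elasticsearch versions that still have mapping types.

```python
def manual_action(doc):
    """Build one bulk action for a DocType instance, as in the one-liner above."""
    return {
        # A per-document index (doc.meta.index) wins over the DocType default.
        "_index": getattr(doc.meta, "index", doc._doc_type.index),
        "_type": doc._doc_type.name,
        "_source": doc.to_dict(),
    }


def index_documents(es, docs):
    """Send all documents through helpers.bulk (hypothetical wrapper)."""
    # Imported here so manual_action can be tried without the client installed.
    from elasticsearch.helpers import bulk

    return bulk(es, (manual_action(d) for d in docs))
```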

Yea it does, thanks!

Implemented DocType.to_dict(include_metadata=True) which will include all the metadata from the document in the format that bulk expects.

This is really nice! (almost as nice as being able to pass an Index instance an iterable of DocTypes for bulk indexing ;P Though one would have to find a way to specify the _op_type)

Thanks for your hard work and looking forward to the next release!

@0x64746b agreed. For now I want to make everything work with strings, with elasticsearch-py being kept unaware of the dsl library. Later we can figure out what kind of convenient code paths could be added for more flexibility; an Index.bulk method or a DocType.bulk classmethod come to mind in this example.

Implemented DocType.to_dict(include_metadata=True)

Ftr: it's DocType.to_dict(include_meta=True)
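
With that parameter, feeding documents to the bulk helper could look roughly like this. It is a minimal sketch: `doc_actions` and `bulk_index` are hypothetical names, `es` is assumed to be an `elasticsearch.Elasticsearch` client, and `docs` is any iterable of DocType instances.

```python
def doc_actions(docs):
    """Yield each document as a dict that already carries its metadata."""
    for doc in docs:
        yield doc.to_dict(include_meta=True)


def bulk_index(es, docs):
    """Index the documents through helpers.bulk (hypothetical wrapper)."""
    from elasticsearch.helpers import bulk  # lazy import: needs elasticsearch-py

    # bulk() sends the actions in chunks and returns (success_count, errors).
    return bulk(es, doc_actions(docs))
```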

Hey sorry for bringing this issue up again.

My question: with the new DocType.to_dict(include_meta=True) I was able to figure out the syntax for indexing, but how can I update (or, specifically, upsert)?

I'm not finding much online. I wish the Read the Docs pages had examples.

If the document has an id (doc.meta.id), it will replace the current document in Elasticsearch, so it automatically performs an update. If it's not in Elasticsearch yet, it will be inserted. That is how the index operation, which is the default operation for bulk, behaves.

If you want anything else, like partial updates or upserts (beyond the behavior described above), you need to specify it manually, since the document object cannot really help you there. I'd recommend creating a method on the DocType subclass you are using to produce the correct operation.

Awesome, thanks. I also noticed I had missed this: "If you wish to perform other operations, like delete or update, use the _op_type field in your actions (_op_type defaults to index)."

thanks for the super quick response!
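
One way to follow the suggestion above is a small method that returns an update action with `doc_as_upsert`, so the document is partially updated if it exists and created otherwise. This is only a sketch: `UpsertMixin` and `to_upsert_action` are invented names, not part of elasticsearch-dsl, and the method assumes the document already has an id set.

```python
class UpsertMixin:
    """Mix into a DocType subclass to get bulk upsert actions (hypothetical)."""

    def to_upsert_action(self):
        # Start from the metadata-carrying dict and reshape it for an update op.
        action = self.to_dict(include_meta=True)
        source = action.pop("_source")
        action["_op_type"] = "update"   # the default would be "index"
        action["doc"] = source          # fields to merge into the stored document
        action["doc_as_upsert"] = True  # insert `doc` if nothing is stored yet
        return action
```

A class declared as `class Post(UpsertMixin, DocType)` could then be sent with `bulk(es, (p.to_upsert_action() for p in posts))`, where `Post` and `posts` are placeholders.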

Is it possible to have a script (to increment a counter) directly in the DocType class? This would be really awesome. That way the majority of the logic could happen directly in Python; then, at index time, if the DocType's some_value is False, one script gets run; otherwise, for some_other_value, a different script is run.

This would be super hard to do safely in a generic fashion. What you can do, however, is have simple methods on the DocType class that either return data in the format for the bulk helper or call the update API directly to do what you wish.

Does that make sense?
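
For the counter case, a helper that returns a scripted update action is one possibility. The sketch below makes several assumptions not stated in this thread: the field is called `counter`, the cluster accepts Painless scripts in the `source`/`params` form (older versions use a different script syntax), and `counter_action` is a made-up name.

```python
def counter_action(index, doc_id, by=1):
    """Build a bulk action that increments `counter` on one document."""
    return {
        "_op_type": "update",
        "_index": index,
        "_id": doc_id,
        "script": {
            "lang": "painless",
            "source": "ctx._source.counter += params.by",
            "params": {"by": by},
        },
        # Used as the initial document when the id does not exist yet.
        "upsert": {"counter": by},
    }
```

A method on the DocType subclass could choose between two such builders depending on the document's own fields before handing the result to the bulk helper.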

I am sorry if this question is too naive, but is there a way to create documents from Django models?

@gladsonvm sure, just produce a dictionary representing your model and insert it into elasticsearch. You can also use the persistence layer to make it cleaner.

I have an example project using django where you can see one of the ways it can be done - https://github.com/HonzaKral/es-django-example
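
In its simplest form, "produce a dictionary representing your model" could be sketched as below. `model_action` and its `fields` argument are hypothetical, and the function relies only on attribute access, so it is not tied to any particular Django API beyond the `pk` attribute.

```python
def model_action(obj, index, fields):
    """Turn a model instance into a bulk index action (hypothetical helper)."""
    return {
        "_index": index,
        "_id": obj.pk,  # Django models expose the primary key as .pk
        "_source": {name: getattr(obj, name) for name in fields},
    }
```

Something like `bulk(es, (model_action(a, "articles", ["title", "body"]) for a in Article.objects.iterator()))` would then index a whole table, where `Article` stands for any model.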

Thanks for the reply :). I did it with a PostgreSQL query directly: the query was executed and its output written to a file. That file was then edited with Python, using fileinput to insert the index and doc_type information before each line. Then I used curl -s -XPOST localhost:9200/_bulk --data-binary "@docs.json"; echo to send all of it to Elasticsearch. But from the example project it seems this can be achieved more easily.
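
The file that the _bulk endpoint expects alternates an action line with a document line, each a single JSON object terminated by a newline. Below is a small sketch of generating it in Python rather than post-editing query output. `write_bulk_file` and its arguments are hypothetical, and `doc_type` only matters on Elasticsearch versions that still use mapping types.

```python
import json


def write_bulk_file(rows, out, index, doc_type=None):
    """Write rows (dicts) to `out` in the newline-delimited _bulk format."""
    meta = {"_index": index}
    if doc_type is not None:
        meta["_type"] = doc_type
    for row in rows:
        out.write(json.dumps({"index": meta}) + "\n")  # action line
        out.write(json.dumps(row) + "\n")              # document line
```

Called with a file opened as `open("docs.json", "w")`, the result can be posted with the curl command above.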
