Hi,
This is a documentation request, probably for a tips and tricks section. Duplicate documents can sometimes end up in an index, and documentation about how to detect and avoid them would be helpful.
I used the document content (fields) to generate a hash and stored it with the document, then used the computed hash to detect duplicates and avoid adding them again.
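For illustration, here is a minimal sketch of that hashing approach, assuming the elasticsearch-py 8.x client, a node on localhost, and a hypothetical index name `my-index`. As a variation on storing the hash alongside the document, it uses the content hash as the `_id` itself, so the `create` operation rejects duplicates for us:

```python
import hashlib
import json

from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")  # assumed local node

def index_unless_duplicate(doc: dict) -> bool:
    # Serialize the content fields in a stable order so identical
    # content always produces the same hash.
    content = json.dumps(doc, sort_keys=True)
    doc_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

    try:
        # "create" fails with a 409 conflict if a document with this
        # _id already exists, so a duplicate is never added twice.
        es.create(index="my-index", id=doc_hash, document=doc)
        return True
    except ConflictError:
        return False  # duplicate detected, skipped
```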
Hi @abibell
Sorry it has taken a while to get to this. It sounds like you're using a good approach. At the end of the day, there is only one unique field in Elasticsearch: the _id. Content deduplication is a big and complex subject, and probably worthy of several blog posts.
I would also like to see a documented strategy for avoiding duplicate data. Is there somewhere else we should be submitting a request to create documentation on this topic?
For my use case, each of my JSON documents carries its own unique ID in the _source field. With a GET (search) request on my indices I can retrieve the _id Elasticsearch associates with a document, then issue a PUT request against that _id to overwrite it, or create a new doc if no match is found.
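A rough sketch of that flow, again assuming the elasticsearch-py 8.x client and a hypothetical index `my-index` whose documents carry a unique `doc_id` keyword field in _source:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

def overwrite_or_create(doc: dict) -> None:
    # Find the _id Elasticsearch associates with this document's own ID.
    hits = es.search(
        index="my-index",
        query={"term": {"doc_id": doc["doc_id"]}},
        size=1,
    )["hits"]["hits"]

    if hits:
        # Indexing against the existing _id overwrites the stored document.
        es.index(index="my-index", id=hits[0]["_id"], document=doc)
    else:
        # No match: index as a new document (Elasticsearch assigns the _id).
        es.index(index="my-index", document=doc)
```

Note that an index request with an explicit _id already has create-or-overwrite semantics, so the search step is only needed when the Elasticsearch _id differs from the ID stored in _source, as in the use case above.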
Cheers
I have written a blog post on how to detect and remove duplicate documents here: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/