Hi,
This is a documentation request, probably for a tips and tricks section. Duplicate documents can sometimes end up in an index, and documentation about how to detect and avoid them would be helpful.
I used the document content (fields) to generate a hash and stored it with the document, then used the computed hash to detect duplicates and avoid adding them again.
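For illustration, here is a minimal sketch of that hashing approach, assuming the elasticsearch-py 8.x client, a node on localhost, and a hypothetical index name `my-index`. As a variation on storing the hash alongside the document, it uses the content hash as the `_id` itself, so the `create` operation rejects duplicates for us:

```python
import hashlib
import json

from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")  # assumed local node

def index_unless_duplicate(doc: dict) -> bool:
    # Serialize the content fields in a stable order so identical
    # content always produces the same hash.
    content = json.dumps(doc, sort_keys=True)
    doc_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

    try:
        # "create" fails with a 409 conflict if a document with this
        # _id already exists, so a duplicate is never added twice.
        es.create(index="my-index", id=doc_hash, document=doc)
        return True
    except ConflictError:
        return False  # duplicate detected, skipped
```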
Hi @abibell
Sorry it has taken a while to get to this. It sounds like you're using a good approach. At the end of the day, there is only one unique field in Elasticsearch: the _id. Content deduplication is a big and complex subject, and probably worthy of several blog posts.
I would also like to see a documented strategy for avoiding duplicate data. Is there somewhere else we should be submitting a request to create documentation on this topic?
For my use case, each of my JSON documents carries its own unique ID in the _source field. With a GET (search) request on my indices I can retrieve the _id Elasticsearch associates with a document, then issue a PUT request against that _id to overwrite it, or create a new doc if no match is found.
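A rough sketch of that flow, again assuming the elasticsearch-py 8.x client and a hypothetical index `my-index` whose documents carry a unique `doc_id` keyword field in _source:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

def overwrite_or_create(doc: dict) -> None:
    # Find the _id Elasticsearch associates with this document's own ID.
    hits = es.search(
        index="my-index",
        query={"term": {"doc_id": doc["doc_id"]}},
        size=1,
    )["hits"]["hits"]

    if hits:
        # Indexing against the existing _id overwrites the stored document.
        es.index(index="my-index", id=hits[0]["_id"], document=doc)
    else:
        # No match: index as a new document (Elasticsearch assigns the _id).
        es.index(index="my-index", document=doc)
```

Note that an index request with an explicit _id already has create-or-overwrite semantics, so the search step is only needed when the Elasticsearch _id differs from the ID stored in _source, as in the use case above.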
Cheers
I have written a blog post on how to detect and remove duplicate documents here: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/