Elasticsearch: Document how to detect and avoid duplicate data?

Created on 1 May 2014 · 4 Comments · Source: elastic/elasticsearch

Hi,

This is a documentation request, probably for a tips-and-tricks section. Sometimes an index ends up containing duplicate documents. It would help to have documentation covering:

  1. How do we detect duplicates (pre and post index)
  2. How to avoid/remove them?
  3. Is it possible to have a document in more than one type? {"title": "Doc title", "_type": ["type1", "type2"]}

I used the document content (fields) to generate a hash and stored it with the document. I then used the computed hash to detect duplicates and avoid adding them again.
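A minimal sketch of this hash-based approach, run before indexing. The function names, the `content_hash` field name, and the choice of SHA-256 are assumptions for illustration, not something the original poster specified:

```python
import hashlib
import json


def content_hash(doc, fields):
    """Return a SHA-256 hex digest over the chosen fields of a document."""
    # Serialize with sorted keys and fixed separators so that key order
    # and whitespace cannot change the digest.
    subset = {name: doc.get(name) for name in fields}
    canonical = json.dumps(
        subset, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(docs, fields):
    """Yield each distinct document once, with its hash stored alongside it."""
    seen = set()
    for doc in docs:
        digest = content_hash(doc, fields)
        if digest in seen:
            continue  # same content was already seen, so skip it
        seen.add(digest)
        # "content_hash" is a hypothetical field name for the stored hash.
        yield {**doc, "content_hash": digest}
```

Which fields go into the hash decides what counts as a duplicate: leaving out volatile fields such as timestamps makes two otherwise identical documents collide, which is usually the intent.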

All 4 comments

Hi @abibell

Sorry it has taken a while to get to this. It sounds like you're using a good approach. At the end of the day, there is only one unique field in Elasticsearch: the _id. Content deduplication is a big and complex subject, and probably worthy of several blog posts.
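Since the `_id` is the only unique field, one way to lean on it is to derive the `_id` from the content itself and send documents with the bulk API's `create` action, which is rejected when a document with that `_id` already exists. The sketch below only builds the request body; the helper name and the index name in the usage note are hypothetical, and hashing the whole document is an assumption:

```python
import hashlib
import json


def bulk_create_lines(docs, index):
    """Build an NDJSON bulk body that uses a content hash as each `_id`."""
    lines = []
    for doc in docs:
        canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        doc_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        # "create" fails for an `_id` that already exists, whereas
        # "index" would silently overwrite the stored document.
        lines.append(json.dumps({"create": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"
```

Calling `bulk_create_lines(docs, "articles")` returns a string that could be posted to the `_bulk` endpoint; a repeated document produces the same `_id` and would be reported as a conflict rather than stored twice.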

I would also like to see a documented strategy recommending what to do to avoid duplicate data. Is there somewhere else we should be submitting a request to create documentation on this topic?

For my use case, each of my JSON documents carries its own unique ID as a field in _source. I run a GET request against my indices to retrieve the _id that Elasticsearch has associated with the document, then send a PUT request to that _id to overwrite it, or create a new doc if none is found.
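A rough sketch of that lookup-then-overwrite flow, building the two HTTP requests without sending them. The base URL, the helper names, the use of a `term` query, and the typeless `_doc` endpoint (which belongs to newer Elasticsearch versions than this thread's original date) are all assumptions:

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node


def find_request(index, field, value):
    """Search request that looks a document up by its own unique ID field."""
    body = {"query": {"term": {field: value}}, "size": 1}
    return Request(
        f"{BASE_URL}/{quote(index)}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_id(search_response):
    """Return the `_id` of the first hit, or None when nothing matched."""
    hits = search_response.get("hits", {}).get("hits", [])
    return hits[0]["_id"] if hits else None


def put_request(index, es_id, doc):
    """PUT request that overwrites (or creates) the document at `es_id`."""
    return Request(
        f"{BASE_URL}/{quote(index)}/_doc/{quote(es_id, safe='')}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

The lookup and the write are two separate requests, so two writers handling the same document at the same time can both miss the lookup and each create a copy. Search is also near-real-time, so a document indexed moments ago may not be found until the index has refreshed.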

Cheers

I have written a blog post on how to detect and remove duplicate documents here: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
