Hello,
I migrated from Elasticsearch 2.3.4 to 5.1.1 by following the migration guide.
The migration went perfectly well, and I updated my mapping to use the new keyword type instead of the old not_analyzed string type.
I then wanted to reindex all my indices and ran into the following two problems:
My reindex task suddenly stops, leaving many of my documents unreindexed because of an error on a single document.
The improvement I recommend is to make the reindex processing more robust: if some documents fail, it should continue normally with all the other documents present in the index.
Related to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it has finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.
In my case, I was not able to correct the data in my old indices, so in the end I decided not to reindex them...
Regards,
Hi @cnico
Sorry to hear about your troubles with reindexing.
The improvement I recommend is to make the reindex processing more robust: if some documents fail, it should continue normally with all the other documents present in the index.
This is problematic because reindexing might target billions of documents, all of which might have errors. We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.
Related to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it has finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.
This can be done today by running the reindex job with ?wait_for_completion=false. You get back a task ID which can be passed to the task API to get the job status. The final status is stored in the .tasks index and will remain there until you delete it.
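For example, a minimal sketch of those calls (the index names and the task ID shown here are placeholders):

POST _reindex?wait_for_completion=false
{
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index" }
}

The response contains a task ID, e.g. {"task": "oTUltX4IQMOUUVeiohTt8A:12345"}, which you can poll while the job runs:

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345

and, once the job has finished, look up in the .tasks index:

GET .tasks/task/oTUltX4IQMOUUVeiohTt8A:12345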
Hi @clintongormley,
For the first point, I disagree with you, because I think it is pointless to have a system that, if an error happens, simply leaves the task partly completed and partly incomplete, i.e. in an unknown state.
Even if an index contains billions of documents, the reindex could offer several error-handling strategies, with the choice left to the user, who knows their data and what they prefer to happen in case of error.
The strategies could be:
I hope you will reconsider your point of view in order to improve robustness.
Hi @clintongormley
I agree with @cnico. It would be great to have a param like conflicts which allows me to ignore errors. My use case is that I want to reindex one bucket, but the API fails every time with the error "Can't get text on a START_OBJECT at 1:251".
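(For reference, the existing conflicts parameter looks roughly like the sketch below, but as far as I can tell it only lets reindex proceed past version conflicts, not past mapping or parsing errors like the one above; the index names are placeholders.)

POST _reindex
{
  "conflicts": "proceed",
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index" }
}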
I agree with @cnico as well here -- reindex is really cool, but a huge pain to use if even a single error happens. When reindexing billions of documents, a single error forces you to start all over again (assuming the error is transient), and it pushes us to use external ETL to reindex.
Hi @clintongormley,
I also agree with @cnico. The reindex API is really useful and works great until there is one error. We need to index billions of documents and it fails every time because of a few bad documents.
I don't see why reindex doesn't simply log the error like any other index request.
I also hope that you will reconsider your point of view.
We need to index billions of documents and it fails every time because of a few bad documents.
I don't see why reindex doesn't simply log the error like any other index request.
What if all of your documents have errors? Where would we log billions of errors?
The only thing we could do is to count errors, and abort once a certain number of errors have occurred. @nik9000 what do you think?
The only thing we could do is to count errors, and abort once a certain number of errors have occurred. @nik9000 what do you think?
This was in the first version of reindex that I had actually. We decided the complexity wasn't worth it at the time I believe. But we can still do it if we want.
If we did this, what HTTP code would we return if we have errors but not enough to abort? Traditionally we've returned 200 in those cases. There is a 207 Multi-Status HTTP response, but I think it is pretty tied up with WebDAV, so maybe it is trouble. Not sure!
What do you think about a force option as a first step to just ignore all errors and do what is possible?
A force option would require logging the errors instead of returning them. We could count them but that is it. I don't particularly like that idea.
I agree with you that this isn't a good option. I don't really know how the SDKs interact with the server, but for the REST API maybe HTTP chunked transfer would be an idea, although you would have to keep the connection open until the reindex is finished. Then you don't need to store the logs on the server and can transfer them directly to the client.
I think 200 is OK. We do what the user asks, i.e. ignore errors, and so complete successfully.
HTTP chunked transfer
Elasticsearch is kind of built around request/response, sadly, and it'd be a huge change to make chunked-transfer-style things work. Relative to counting errors, that is a moonshot.
A workaround for the problem could be using "ignore_malformed": true for the fields with the bad data.
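For example, roughly like this (the index, type, and field names are placeholders; note that ignore_malformed only applies to certain field types such as numbers and dates, so it won't swallow every kind of parsing error):

PUT my-index
{
  "mappings": {
    "my_type": {
      "properties": {
        "some_date": {
          "type": "date",
          "ignore_malformed": true
        }
      }
    }
  }
}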
+1 for the fault tolerant reindexing
@nik9000 I would suggest the following:
Guys I need this, +1 for @ZombieSmurf... it's such an easy solution
Temporary fix:
Before reindexing, manually create the destination index as follows:
PUT dest-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  }
}
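The reindex itself is then the usual call (the source index name here is a placeholder):

POST _reindex
{
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index" }
}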
How do I retrieve the results of a reindex when the initial API call timed out while waiting for completion? As long as the reindex was running I could see the status at GET _tasks?detailed=true&actions=*reindex, but once the reindex was finished I could no longer see anything in the tasks API or in the .tasks index mentioned here. I was only able to tell that something was wrong because the document counts were wrong in the end. (ES 5 used here)
A short update: we are intending to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infra would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes - Any progress on this? We have version 6.2.3. When using Curator for a reindex, Curator seems to ignore documents which don't match the new mapping. For example, if a date is malformed, it'll just drop the document from the reindex with no error message.
@rahst12 I'm afraid my previous statement still holds:
We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes any updates on this? Would it be possible for an experienced ES developer to mentor that implementation?
@thePanz sorry for the late response, I was out and catching up. We're always ready to guide external contributions. I have to warn you though that this will not be a simple one.
A short update: we are intending to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infra would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes - Is there any update on this? Or is there any way we can see logs while reindexing is in progress and stops at an error? (It would at the very least be useful to identify the document which caused the error during reindexing.)
@PraneetKhandelwal the cause of errors should be returned in the failures field of the response/reindexing result - see here.
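As an illustration only (the values here are made up, and the exact wrapping depends on whether you read the direct response or the stored task result), a failure entry looks roughly like:

"failures": [
  {
    "index": "dest-index",
    "type": "my_type",
    "id": "1",
    "cause": {
      "type": "mapper_parsing_exception",
      "reason": "failed to parse [some_date]"
    },
    "status": 400
  }
]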
Pinging @elastic/es-distributed