Hello,
I migrated from Elasticsearch 2.3.4 to 5.1.1 by following the migration guide.
The migration went perfectly well, and I updated my mapping to use the new keyword type instead of the old not_analyzed string type.
I then wanted to reindex all my indices and ran into the following two problems:
My reindex task suddenly stops, leaving many of my documents unreindexed because of an error on a single document.
The improvement I recommend is to make the reindex processing more robust: if some documents fail, it should continue normally with all the other documents present in the index.
Related to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it has finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.
In my case, I was not able to correct the data in my old indices, so in the end I decided not to reindex them...
Regards,
Hi @cnico
Sorry to hear about your troubles with reindexing.
The improvement I recommend is to make the reindex processing more robust: if some documents fail, it should continue normally with all the other documents present in the index.
This is problematic because reindexing might target billions of documents, all of which might have errors. We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.
Related to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it has finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.
This can be done today by running the reindex job with ?wait_for_completion=false. You get back a task ID which can be passed to the task API to get the job status. The final status is stored in the .tasks index and will remain there until you delete it.
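For example, a minimal sketch of those calls (the index names and the task ID shown here are placeholders):

POST _reindex?wait_for_completion=false
{
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index" }
}

The response contains a task ID, e.g. {"task": "oTUltX4IQMOUUVeiohTt8A:12345"}, which you can poll while the job runs:

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345

and, once the job has finished, look up in the .tasks index:

GET .tasks/task/oTUltX4IQMOUUVeiohTt8A:12345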
Hi @clintongormley,
For the first point, I disagree with you, because I think it is pointless to have a system that, if an error happens, simply leaves the task partly completed and partly incomplete, i.e. in an unknown state.
Even if an index contains billions of documents, the reindex could offer several error-handling strategies, with the choice left to the user, who knows their data and what they prefer to happen in case of error.
The strategies could be:
I hope you will reconsider your point of view in order to improve robustness.
Hi @clintongormley
I agree with @cnico. It would be great to have a param like conflicts which allows me to ignore errors. My use case is that I want to reindex one bucket, but the API fails every time with the error "Can't get text on a START_OBJECT at 1:251".
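(For reference, the existing conflicts parameter looks roughly like the sketch below, but as far as I can tell it only lets reindex proceed past version conflicts, not past mapping or parsing errors like the one above; the index names are placeholders.)

POST _reindex
{
  "conflicts": "proceed",
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index" }
}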
I agree with @cnico as well here -- reindex is really cool, but a huge pain to use if even a single error happens. When reindexing billions of documents, a single error forces you to start all over again (assuming the error is transient), and it pushes us to use external ETL to reindex.
Hi @clintongormley,
I also agree with @cnico. The reindex API is really useful and works great until there is one error. We need to index billions of documents and it fails every time because of a few bad documents.
I don't see why reindex doesn't simply log the error like any other index request.
I also hope that you will reconsider your point of view.
We need to index billions of documents and it fails every time because of a few bad documents.
I don't see why reindex doesn't simply log the error like any other index request.
What if all of your documents have errors? Where would we log billions of errors?
The only thing we could do is to count errors, and abort once a certain number of errors have occurred. @nik9000 what do you think?
The only thing we could do is to count errors, and abort once a certain number of errors have occurred. @nik9000 what do you think?
This was in the first version of reindex that I had actually. We decided the complexity wasn't worth it at the time I believe. But we can still do it if we want.
If we did this, what HTTP code would we return if we have errors but not enough to abort? Traditionally we've returned 200 in those cases. There is a 207 Multi-Status HTTP response, but I think it is pretty tied up with WebDAV, so maybe it is trouble. Not sure!
What do you think about a force option as a first step to just ignore all errors and do what is possible?
A force option would require logging the errors instead of returning them. We could count them but that is it. I don't particularly like that idea.
I agree with you that this isn't a good option. I don't really know how the SDKs interact with the server, but for the REST API maybe HTTP chunked transfer would be an idea, although you would have to keep the connection open until the reindex is finished. Then you don't need to store the logs on the server and can transfer them directly to the client.
I think 200 is OK. We do what the user asks, i.e. ignore errors, and so complete successfully.
HTTP chunked transfer
Elasticsearch is kind of built around request/response, sadly, and it'd be a huge change to make chunked-transfer-style things work. Relative to counting errors, that is a moonshot.
A workaround for the problem could be using "ignore_malformed": true for the fields with the bad data.
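For example, roughly like this (the index, type, and field names are placeholders; note that ignore_malformed only applies to certain field types such as numbers and dates, so it won't swallow every kind of parsing error):

PUT my-index
{
  "mappings": {
    "my_type": {
      "properties": {
        "some_date": {
          "type": "date",
          "ignore_malformed": true
        }
      }
    }
  }
}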
+1 for the fault tolerant reindexing
@nik9000 I would suggest the following:
Guys I need this, +1 for @ZombieSmurf... it's such an easy solution
Temporary fix:
Before reindexing, manually create the destination index as follows:
PUT dest-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  }
}
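The reindex itself is then the usual call (the source index name here is a placeholder):

POST _reindex
{
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index" }
}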
How do I retrieve the results of a reindex when the initial API call timed out while waiting for completion? As long as the reindex was running I could see the status at GET _tasks?detailed=true&actions=*reindex, but once the reindex was finished I could no longer see anything in the tasks API or in the .tasks index mentioned here. I was only able to tell that something was wrong because the document counts were wrong in the end. (ES 5 used here)
A short update: we are intending to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infra would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes - Any progress on this? We have version 6.2.3. When using Curator for a reindex, Curator seems to ignore documents which don't match the new mapping. For example, if a date is malformed, it'll just drop the document from the reindex with no error message.
@rahst12 I'm afraid my previous statement still holds:
We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes any updates on this? Would it be possible for an experienced ES developer to mentor that implementation?
@thePanz sorry for the late response, I was out and catching up. We're always ready to guide external contributions. I have to warn you though that this will not be a simple one.
A short update: we are intending to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infra would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes - Is there any update on this? Or is there any way we can see logs while reindexing is in progress and stops at an error? (It would at the very least be useful to identify the document which caused the error during reindexing.)
@PraneetKhandelwal the cause of errors should be returned in the failures field of the response/reindexing result - see here.
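As an illustration only (the values here are made up, and the exact wrapping depends on whether you read the direct response or the stored task result), a failure entry looks roughly like:

"failures": [
  {
    "index": "dest-index",
    "type": "my_type",
    "id": "1",
    "cause": {
      "type": "mapper_parsing_exception",
      "reason": "failed to parse [some_date]"
    },
    "status": 400
  }
]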
Pinging @elastic/es-distributed