Aws-sdk-java: Add Default Retry on InternalError during CompleteMultipartUpload

Created on 16 Oct 2015 · 16 Comments · Source: aws/aws-sdk-java

We've been experiencing repeated issues on a system storing files in S3 which are reported up the chain as follows:

We encountered an internal error. Please try again. (Service: null; Status Code: 0; Error Code: InternalError; Request ID: [snip]).

I've turned on bucket logging for the relevant bucket, and I've tracked the source down to:

In all cases, it's InternalErrors on the Complete call for uploads that have been designated as multipart. We're not "hand-rolling" the multipart upload call: we're just creating a PutObject request, calling "upload" on a transfer manager, then calling "WaitForCompletion" on the returned upload object. We are passing a custom implementation of RetryCondition to our S3 client, which neatly handles most errors. However, having traced through the SDK code, it looks like RetryConditions only apply when a non-OK response comes back to the actual HTTP client. In the case of CompleteMultipartUpload, as per the docs and the logs, what comes back on an "InternalError" is _not_ an HTTP failure, so none of the RetryConditions we've defined apply.
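To illustrate why a status-only check misses this case, here is a minimal sketch that inspects the response body as well as the HTTP status. `EmbeddedS3Error` and its methods are hypothetical names invented for this example and are not part of the AWS SDK; the sketch assumes the error body is XML with an `Error` root element containing a `Code` child.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not an AWS SDK class. */
final class EmbeddedS3Error {

    private EmbeddedS3Error() {}

    /**
     * Returns the error code if the body's root element is Error,
     * or empty if the body looks like a normal result document.
     */
    static Optional<String> errorCode(String xmlBody) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations as a basic precaution against XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBody.getBytes(StandardCharsets.UTF_8)));
            if (!"Error".equals(doc.getDocumentElement().getTagName())) {
                return Optional.empty();
            }
            NodeList codes = doc.getDocumentElement().getElementsByTagName("Code");
            return codes.getLength() == 0
                    ? Optional.of("Unknown")
                    : Optional.of(codes.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response body as XML", e);
        }
    }

    /** A status-only check would treat 200 as success; this also looks inside the body. */
    static boolean shouldRetry(int httpStatus, String xmlBody) {
        if (httpStatus >= 500) {
            return true;
        }
        return httpStatus == 200
                && errorCode(xmlBody).map("InternalError"::equals).orElse(false);
    }
}
```

With a check like this, a 200 response whose body carries an InternalError would be classified as retryable, while a 200 with a normal result body would not.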

I've had a look through the SDK and I'm reasonably confident that the "Complete" call made during a default upload isn't wrapped in anything that would retry this final request if it fails. It would be ideal if this could be added - we can't put any handling for it into our code as it is now because the multipart upload is being done "under the hood", so when our code is informed of the failure there's no obvious way to extract the details of the specific failed call and force a retry of it.

If we were "manually" creating our multipart uploads when the files were of sufficient size, I believe we could easily add error handling for this to our code. However, we're currently using the Async methods available to launch plenty of things in parallel, so I believe we'd need to re-implement a significant amount of the code around this functionality to get this working (though if anyone has suggestions on how we could easily wrap this in some retry logic, that would be welcome). To me, it looks like it would be a lot better if the calls to CompleteMultipartUpload within the SDK had some wrapping around them so that completion would be retried on a failure, as this appears to be the recommendation within the error message.
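As one possible answer to the request for retry wrapping, here is a minimal sketch of a generic retry helper with exponential backoff. Everything in it (`RetryingCall`, `StorageError`, the parameters) is hypothetical and invented for this example. In real use the supplier would presumably wrap the "upload" plus "WaitForCompletion" calls, and the predicate would inspect the error code on the SDK's exception. Note that this retries the whole upload, including every part, rather than just the final Complete call.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical stand-in for an SDK exception that carries an error code. */
class StorageError extends RuntimeException {
    private final String errorCode;

    StorageError(String errorCode) {
        super("Storage call failed: " + errorCode);
        this.errorCode = errorCode;
    }

    String errorCode() {
        return errorCode;
    }
}

/** Hypothetical retry helper; not part of the AWS SDK. */
final class RetryingCall {

    private RetryingCall() {}

    /**
     * Runs the call, retrying while the predicate accepts the failure,
     * up to maxAttempts attempts in total.
     */
    static <T> T withRetries(Supplier<T> call, Predicate<RuntimeException> retryable,
                             int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e; // out of attempts, or not a failure we should retry
                }
                // Exponential backoff: base, 2x base, 4x base, ...
                sleep(baseDelayMillis << (attempt - 1));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while backing off", ie);
        }
    }
}
```

A caller might pass a predicate such as `e -> e instanceof StorageError && "InternalError".equals(((StorageError) e).errorCode())` so that only internal errors are retried and everything else fails fast.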


groja

Most helpful comment

This issue still persists. I am not sure why it is closed.
We are experiencing repeated S3 internal errors on multipart copies using the Java SDK:

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 200; Error Code: InternalError; Request ID: XXXXXX), S3 Extended Request ID: UInkHYZA7xLaZUsBF5QVYYxVR3FddMypElQWvNJWzKXI1zJT+mVbuQTWKrVNXGumVK96qMPkBVM=
    at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1866)
    at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)
    at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)
    at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:133)
    at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:44)
We should reopen this issue.

dkc-bitsian on 13 Dec 2018


All 16 comments

Thanks for the bug report. As you said, the SDK doesn't retry when the response status is 200 but the payload contains error information. I will reach out to the S3 team to understand in which cases this error might occur and to see if it can be handled from the SDK side. By any chance, have you logged the request IDs when receiving the errors? Can you share the request ID and host ID received as part of the response?

manikandanrs on 16 Oct 2015

Hi there

Sure, F996BFC9F5E8BC30 is a recent one.

Thanks,
James

groja on 16 Oct 2015

Do you have the host ID as well?

The error response from the server would look something like this:

<Error>
  <Code>InternalError</Code>
  <Message>We encountered an internal error. Please try again.</Message>
  <RequestId>656c76696e6727732072657175657374</RequestId>
  <HostId>Uuag1LuByRx9e6j5Onimru9pO4ZVKnJ2Qz7/C1NPcfTWAtRPfTaOFg==</HostId>
</Error>

manikandanrs on 16 Oct 2015

Hi there

No I'm afraid not (unless you can tell me how to get that out of the S3 logs) - part of the reason this is a problem is that this is all happening internally within the SDK, and by the time it comes up to us it's too late to handle the specific error, and the actual XML response here doesn't come out.

I'd also re-emphasise that my issue here is not with the fact that the API threw an error - this happens - but with the fact that the way the SDK is currently set up means that I can't easily implement sensible defensive coding against an error like this when using the default upload methods. Focusing on the specific error would seem to be treating the symptom rather than the cause.

Thanks,
James

groja on 20 Oct 2015

Hi there

I was wondering if any further followup has been done on this?

We've been keeping track of re-occurrences of the same issue, and I have something interesting to add. We hit another of these failures today (request ID C10C9A694BB35B8C). When the same upload was retried about half an hour later, it failed because the file that was supposed to be completed by that request already existed in the bucket when the process reached it (our process is set to fail by default if it looks like it's trying to duplicate work). When I checked manually, that file was indeed there, despite the failure recorded for its "complete multipart upload" request. Is it possible that some sort of timeout is occurring, so that an error is reported but the process continues and completes in the background?

Thanks,
James

groja on 28 Oct 2015

You can get the request ID and the host ID from the error response. You would be receiving an instance of AmazonS3Exception in this case, and AmazonS3Exception has getter methods that you can use to print the information.
I will need to talk to the S3 team to get to the root of the issue where the S3 object happens to be present when you try to re-upload. To do that, I will need the request ID and host ID from the failure responses.
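When changing production code to call those getters (in the v1 SDK, `getRequestId()` and `getExtendedRequestId()`) isn't practical, the IDs may still be recoverable from an exception message that has already been logged. The sketch below assumes the message format seen in the traces in this thread, which may differ between SDK versions. `S3RequestIds` is a hypothetical helper, and it assumes the "S3 Extended Request ID" corresponds to what is called the host ID here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log-scraping helper; not part of the AWS SDK. */
final class S3RequestIds {

    // "Request ID: <id>" that is not part of "S3 Extended Request ID: <id>".
    private static final Pattern REQUEST_ID =
            Pattern.compile("(?<!Extended )Request ID: ([^;)\\s]+)");

    // The extended ID runs up to the next whitespace character.
    private static final Pattern EXTENDED_ID =
            Pattern.compile("S3 Extended Request ID: (\\S+)");

    private S3RequestIds() {}

    static Optional<String> requestId(String loggedMessage) {
        return firstGroup(REQUEST_ID, loggedMessage);
    }

    static Optional<String> extendedRequestId(String loggedMessage) {
        return firstGroup(EXTENDED_ID, loggedMessage);
    }

    private static Optional<String> firstGroup(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Messages that only carry a plain request ID, such as the one quoted at the top of this issue, would yield an empty result for the extended ID.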

manikandanrs on 28 Oct 2015

Hi there

Unfortunately, this is only happening with a single bucket on our production system, so I can't easily poke in code to get additional debug logging out in the immediate term. Can I check again - is there any way to get the HostID out of the S3 bucket logs?

Thanks,
James

groja on 30 Oct 2015

I have been speaking with the S3 team regarding this issue. Unfortunately, it doesn't look like S3 prints the host ID in its bucket logs. Here is the log format:

http://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

The S3 team has acknowledged that there is a case in which the object appears even though they send an internal error to the requester. They have also said it is safe to retry when an InternalError failure is received for CompleteMultipartUpload. The retry should recreate the object and delete the individual parts involved in the multipart upload.
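For a hand-rolled multipart upload, that guidance could be applied by retrying only the completion step while reusing the same upload ID and part ETags. The following is a minimal sketch against a hypothetical `MultipartApi` interface invented for this example (it is not the AWS SDK's API), with backoff omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal view of a multipart API; not an AWS SDK interface. */
interface MultipartApi {
    String initiate(String key);                                      // returns an upload ID
    String uploadPart(String uploadId, int partNumber, byte[] data);  // returns the part's ETag
    void complete(String uploadId, List<String> partETags);
}

/** Hypothetical exception standing in for an InternalError response. */
class InternalErrorException extends RuntimeException {
    InternalErrorException(String message) {
        super(message);
    }
}

/** Uploads each part once, then retries only the completion call. */
final class MultipartUploader {

    private MultipartUploader() {}

    static void upload(MultipartApi api, String key, List<byte[]> parts,
                       int maxCompleteAttempts) {
        String uploadId = api.initiate(key);

        // Parts are sent exactly once; their ETags are kept for completion.
        List<String> etags = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            etags.add(api.uploadPart(uploadId, i + 1, parts.get(i)));
        }

        // Only the final call is retried, with the same upload ID and ETags.
        for (int attempt = 1; ; attempt++) {
            try {
                api.complete(uploadId, etags);
                return;
            } catch (InternalErrorException e) {
                if (attempt >= maxCompleteAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

The point of the design is that a failed completion costs one extra small request rather than a re-upload of every part.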

manikandanrs on 30 Oct 2015

Also, it would be good if I could get a reproducible test case for this. I am trying to simulate it on my side.

Could you give some metadata about the file that you are trying to upload?

manikandanrs on 30 Oct 2015

Hi there

I'm glad the S3 team have confirmed that retrying the "CompleteMultipartUpload" request is the correct thing to do if it fails, but as stated in the original issue I raised, there's currently no way that I'm aware of to make the SDK retry _just_ this final request if it has "intelligently" selected multipart upload due to the size of the content. This is the whole crux of the issue, and if retrying _is_ the correct thing to do in this case, then I think a change to the SDK should be made to accommodate this, whether it be a default assumption of retrying "CompleteMultipartUpload" requests that fail with an internal error, or a new option added to ClientConfiguration that allows you to set this behaviour.

I don't have a reproducible test case - retrying with exactly the same content after cleaning up the allegedly failed copy from the bucket usually works fine.

Thanks,
James

groja on 2 Nov 2015

You are right. I am going to look into how we can support this in TransferManager.

manikandanrs on 3 Nov 2015

We have fixed this issue. The fix will be available as part of our next release.

manikandanrs on 2 Dec 2015

Hi @manikandanrs,

What was the issue here? I'm still seeing this with one of my buckets, and it doesn't go away after retrying either. It happens on a different object:

Reason: Failed to list any object from counter_service/dt=2017-04-27/hr=12/part-02567- AWS Error: Unable to parse ExceptionName: InternalError Message: We encountered an internal error. Please try again.
[20170427-130442] Retrying..

Failed to list any object from counter_service/dt=2017-04-27/hr=12/part-02020- AWS Error: Unable to parse ExceptionName: InternalError Message: We encountered an internal error. Please try again.