Terraform: S3 backend does not retry through server errors

Created on 12 Jul 2017 · 12 comments · Source: hashicorp/terraform

Amazon S3 specifies its availability at three nines (99.9%), which means that roughly one of every thousand requests is expected to fail. Currently, when this happens, Terraform immediately halts and fails.

Our configuration makes extensive use of the terraform_remote_state data source, which means that we're making hundreds of requests to S3 per plan. Currently, we're seeing about half of our plans fail due to a random S3 error, which lines up with what we'd expect probabilistically: at a 0.1% per-request failure rate, the chance that at least one of N requests fails is 1 − 0.999^N, which passes 50% at roughly 700 requests.

This could be fixed by retrying with backoff when Terraform receives an error response from S3. We're currently working around it by retrying the entire plan on failure, but this is obviously much slower than retrying individual requests.

EDIT: We're experiencing this issue with Terraform 0.9.7.
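
As a rough sketch of the retry-with-backoff approach described above (the type and function names here are hypothetical, not Terraform's actual backend code):

```go
// Sketch only: not Terraform's actual backend code.
package s3retry

import (
	"fmt"
	"time"
)

// getState stands in for whatever call actually fetches the state object
// from S3 and may fail transiently (HTTP 500s, connection timeouts, etc.).
type getState func() ([]byte, error)

// retryableGet calls fn up to maxAttempts times, doubling the delay after
// each failure, and returns the last error if every attempt fails.
func retryableGet(fn getState, maxAttempts int, baseDelay time.Duration) ([]byte, error) {
	var lastErr error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		data, err := fn()
		if err == nil {
			return data, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2 // exponential backoff
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %v", maxAttempts, lastErr)
}
```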

backend/s3 enhancement

All 12 comments

@ThePletch I have faced the issue you describe, but I also face the error below from time to time. I am using Terraform 0.9.6:

Failed to load backend: Error reading state: RequestError: send request failed
caused by: Get https://<bucket>.s3.amazonaws.com/?prefix=env%3A%2F: dial tcp 52.216.160.59:443: getsockopt: connection timed out

@ThePletch Below is the internal server error I get from time to time.

Error loading state: InternalError: We encountered an internal error. Please try again.
    status code: 500, request id: <request_id>, host id: <host_id>

@bclodius Yes, these are (as confirmed by an AWS rep) expected on about 0.1% of all S3 requests. I've seen it manifest as 500 errors and timeouts. I've requested a less opaque error message (503 seems sensible), but regardless, the solution seems to be retries.

@ThePletch I will work on getting a TRACE output of the error. From there I'll see if I can put together a PR that adds retries to the backend.

I have had similar issues with the s3 backend

Hi @ThePletch, and everyone! Sorry for this limitation.

Implementing retries here seems reasonable to me. We are generally cautious about retrying _reads_ because, in situations where the backend service is _really_ unreachable (e.g. there is no Internet connectivity at all, or auth credentials have expired), we want to give that feedback to the user as quickly as possible.

However, it seems reasonable here to retry for a few seconds before ultimately giving up, which ensures that if there is a persistent error we can still give feedback to the user relatively quickly. From the errors above it looks like these failures are not distinguishable from general outages, but if there are any specialized errors that are _unambiguously_ transient (that is, AWS has documented that they will only be returned in such situations), then we can afford to be more liberal in retrying those.
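
For reference, the AWS SDK for Go exposes the HTTP status code on its errors, so a retry loop could plausibly limit itself to responses that look transient (5xx, transport failures like the timeout above) while failing fast on everything else. The helper below is only an illustration of that idea, not the change that was eventually merged:

```go
// Illustrative only: a rough classifier for "probably transient" AWS errors.
package s3retry

import (
	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isTransient reports whether an error from the AWS SDK looks like a
// server-side or network problem worth retrying. Anything else (such as
// an auth failure) should fail fast so the user gets feedback quickly.
func isTransient(err error) bool {
	if reqErr, ok := err.(awserr.RequestFailure); ok {
		// 5xx responses, including the InternalError/500 shown above,
		// are plausibly transient; 4xx responses generally are not.
		return reqErr.StatusCode() >= 500
	}
	if awsErr, ok := err.(awserr.Error); ok {
		// "RequestError" wraps low-level transport failures such as the
		// connection timeout reported earlier in this thread.
		return awsErr.Code() == "RequestError"
	}
	return false
}
```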

@bclodius, if you're willing to work on this that'd be much appreciated!

@apparentlymart I have run into similar issues when writing to the S3 remote state.

For what it's worth, this issue would be drastically mitigated (adds an additional three nines) by just retrying a single time after, say, a 0.1 second delay, and reporting the result of that retry if it fails again. This shouldn't provide a noticeable delay if the error is anything other than transient, but it reduces the odds of a plan failing when making 1000 reads from 63.2% to 0.1%.
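
Those figures check out, assuming each request fails independently at a 0.1% rate; a quick way to reproduce them (the function name here is just for illustration):

```go
// Quick check of the figures above, assuming each S3 request fails
// independently with probability 0.001 (three nines of availability).
package s3retry

import (
	"fmt"
	"math"
)

func printFailureOdds() {
	const p = 0.001    // per-request failure rate
	const reads = 1000 // remote-state reads per plan

	noRetry := 1 - math.Pow(1-p, reads)    // ≈ 0.632 → 63.2%
	oneRetry := 1 - math.Pow(1-p*p, reads) // ≈ 0.001 → 0.1%

	fmt.Printf("no retries: %.1f%% of plans hit at least one failure\n", noRetry*100)
	fmt.Printf("one retry:  %.2f%% of plans hit at least one failure\n", oneRetry*100)
}
```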

@jbardin I think we can close this issue now that retries are merged

Thanks @bclodius, closed via #16243

sure would be nice if I could catch and retry somehow

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
