Packer: 1.2.5 not retrying on 503

Created on 24 Jul 2018 · 20 Comments · Source: hashicorp/packer

12:52:45 + export AWS_MAX_ATTEMPTS=10,
12:52:45 + AWS_MAX_ATTEMPTS=10,
12:52:45 + export AWS_POLL_DELAY_SECONDS=30
12:52:45 + AWS_POLL_DELAY_SECONDS=30
12:52:45 + packer build build.json
12:52:45 amazon-ebs output will be in this color.
12:52:45 
12:52:52 Build 'amazon-ebs' errored: RequestLimitExceeded: Request limit exceeded.
12:52:52    status code: 503, request id: 3158b371-7f06-4a1a-b8ec-6112d211beef
12:52:52 
12:52:52 ==> Some builds didn't complete successfully and had errors:
12:52:52 --> amazon-ebs: RequestLimitExceeded: Request limit exceeded.
12:52:52    status code: 503, request id: 3158b371-7f06-4a1a-b8ec-6112d211beef
12:52:52 
12:52:52 ==> Builds finished but no artifacts were created.

As you can see above, within 7 seconds of starting, packer exited reporting a 503. Since RequestLimitExceeded is a retry-able error according to AWS, packer should be applying the retry logic to this.

Does the existing retry logic only apply when packer is waiting for AMIs to be ready?

bug builder/amazon

All 20 comments

Thanks for opening this. I've seen one other AWS waiter-related bug come in already for v. 1.2.5 so I'm hoping they're related. Can you rerun the build with PACKER_LOG=1 set so that I can see the debug logs?

I can try - unfortunately it's not reproducible on demand. I'll update this comment with logs when I have them. Please hold... :)

UPDATE: Annoyingly, I can't reproduce this since it depends on the load on our AWS account, which is outside the control of the job. I've left PACKER_LOG set in the Jenkins job, and it'll alert me when it fails next. I'll add another comment with the logs when I have them.

Did you leave out any lines in your log? Looking at the log you've provided I'm boggled as to where this query even happened; I'd have expected to see several more output lines even if you didn't have the PACKER_LOG flag set.

Hmm. I wonder if the timeout is when we're actually obtaining the session. I don't think there are retries wrapping those, and we don't have any error wrapping there to explain where in the code it'd be.

Apologies for the delay in replying. I didn't leave out any lines in my original copy-paste. Because the RequestLimitExceeded state is account-wide for a very short period (as explained to us by our AWS TAM), it's impossible for us to predict when this is going to happen. In our internal software, all AWS calls are wrapped in retries for this reason.

We haven't seen this specific issue again; however, since this is unpredictable, that doesn't mean much.

If I may suggest, perhaps something in the core AWS calling code checking for 503 and auto-retrying might be advisable.

If I may suggest, perhaps something in the core AWS calling code checking for 503 and auto-retrying might be advisable.

That won't solve this particular problem for you. Our "core aws calling code" is the Amazon-provided AWS SDK, which already treats 503 as retryable, which is why it was a) confusing and b) worthwhile to actually pinpoint where the issue is specifically occurring. My current best guess is that this happened because you managed to exceed your rate limit at the exact moment Packer was first generating the client and instantiating a session, which, looking at the SDK, does _not_ appear to be wrapped in any kind of retry. So I can wrap that in a retry and see if I can trigger AWS's rate-limiting to test it. I'll keep you posted.
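To illustrate (this is not Packer's actual code), a minimal sketch of such a wrapper using the aws-sdk-go v1 session API; the helper name, attempt count, and backoff are all placeholders:

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSessionWithRetry (hypothetical helper) retries session creation a few
// times instead of failing immediately when AWS answers with a throttling
// error such as RequestLimitExceeded.
func newSessionWithRetry(cfg *aws.Config, attempts int) (*session.Session, error) {
	var err error
	for i := 0; i < attempts; i++ {
		sess, serr := session.NewSession(cfg)
		if serr == nil {
			return sess, nil
		}
		err = serr
		// Only keep retrying on throttling errors; give up on anything else.
		if aerr, ok := serr.(awserr.Error); !ok || aerr.Code() != "RequestLimitExceeded" {
			return nil, serr
		}
		time.Sleep(time.Duration(i+1) * 2 * time.Second)
	}
	return nil, err
}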

I think I am seeing these when waiting for the password to become available in 1.2.5. Packer log (debug was not enabled when it happened, sorry):

amazon-ebs: Error waiting for password: Error retrieving auto-generated instance password: RequestLimitExceeded: Request limit exceeded.
15:35:17 ==> amazon-ebs:    status code: 503, request id: 7ad4fd1f-4e87-4082-8ad8-77d38f14fe7c

Now that _is_ weird. That request is part of the AWS sdk and is wrapped in a check for whether the error is retryable. I wonder if this is an upstream bug. I'll see if I can find more info on what the SDK considers retryable.

Okay, dug around in the AWS code. Looks like the default is only a max of 3 retries. I can try bumping this when we generate the client and we can see if that helps. @akdor1154 does this happen frequently enough that you feel confident you'd be able to say whether such a change works for you?

Unfortunately I've only seen it once in a period of about 4 builds per day for two weeks. I assume you mean 3 retries per instance of "is the password there yet?", not 3 checks for "is the password there yet?".

Correct. I think the issue occurs when you're being throttled and the retry gets a response that you're above your Request Limit three times in a row.

I'm a little torn on this one. We can add some logic into Packer to sleep and try again if the error is that you're being throttled, but ultimately this is a build environment problem; if you're being throttled enough that it's breaking Packer, then making Packer more resilient to throttling just masks the problem.

Fair enough - my org is using AWS quite heavily, so maybe this is a case where we can request a limit increase.

To chime in here, as I've been running into "Request limit exceeded" errors more often lately, especially when waiting for an AMI to become available, I would suggest simply increasing the number of retries when initializing the AWS client. The SDKs use exponential backoff internally, so the time between retries doubles with every third retry (default algorithm 2**(retry * 0.3), which can be changed). I've achieved good results with setting the number of retries to 20, which eliminated "Request limit exceeded" errors entirely in all my Ruby code.

In Ruby, I initialize the client this way:

@ec2_client = Aws::EC2::Client.new(
    region:      region,
    credentials: Aws::SharedCredentials.new(profile_name: profile),
    logger:      (debug ? @logger : nil),
    retry_limit: 20
)

Read more about exponential backoff here: https://docs.aws.amazon.com/general/latest/gr/api-retries.html

Sorry, wrong formula. It's actually 2**retry * 0.3 (no parentheses), which gives [0.3, 0.6, 1.2, 2.4, ...].
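For anyone following along, here's a throwaway snippet (not SDK code) that just prints the series that corrected Ruby formula produces; note the Go SDK's own retryer uses a different algorithm, as the next comment points out:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Delay before each retry using 2**retry * 0.3:
	// 0.3s, 0.6s, 1.2s, 2.4s, 4.8s, ...
	for retry := 0; retry < 5; retry++ {
		fmt.Printf("retry %d: %.1fs\n", retry, 0.3*math.Pow(2, float64(retry)))
	}
}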

That seems like a pretty straightforward change. I can do that.

If any of y'all want to try out the attached PR and verify that it doesn't cause any other issues for you, I'd appreciate it.

Just took a look at the AWS SDK for Go source code, and it seems that the default formula is different from the one used by the Ruby SDK. Please look up aws/client/default_retryer.go in the Go SDK for details.

Yeah, I read that the other day. The important part for us is that I think the default only sets a max of 3 retries. My PR ups this to 20.
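For anyone curious what that kind of change looks like, here is a rough sketch using the aws-sdk-go v1 API (this is not the actual PR; the helper name and region value are placeholders):

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newEC2Client is a hypothetical helper that raises MaxRetries above the
// SDK default of 3, so throttled requests are retried (with backoff) more
// times before RequestLimitExceeded bubbles up to the caller.
func newEC2Client(region string) *ec2.EC2 {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String(region), // placeholder; Packer reads this from the template
		MaxRetries: aws.Int(20),
	}))
	return ec2.New(sess)
}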

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
