Packer: 1.2.5 not retrying on 503

Created on 24 Jul 2018 · 20 Comments · Source: hashicorp/packer

12:52:45 + export AWS_MAX_ATTEMPTS=10,
12:52:45 + AWS_MAX_ATTEMPTS=10,
12:52:45 + export AWS_POLL_DELAY_SECONDS=30
12:52:45 + AWS_POLL_DELAY_SECONDS=30
12:52:45 + packer build build.json
12:52:45 amazon-ebs output will be in this color.
12:52:45 
12:52:52 Build 'amazon-ebs' errored: RequestLimitExceeded: Request limit exceeded.
12:52:52    status code: 503, request id: 3158b371-7f06-4a1a-b8ec-6112d211beef
12:52:52 
12:52:52 ==> Some builds didn't complete successfully and had errors:
12:52:52 --> amazon-ebs: RequestLimitExceeded: Request limit exceeded.
12:52:52    status code: 503, request id: 3158b371-7f06-4a1a-b8ec-6112d211beef
12:52:52 
12:52:52 ==> Builds finished but no artifacts were created.

As you can see above, within 7 seconds of starting, packer exited reporting a 503. Since RequestLimitExceeded is a retry-able error according to AWS, packer should be applying the retry logic to this.

Does the existing retry logic only apply when packer is waiting for AMIs to be ready?

bug builder/amazon

All 20 comments

Thanks for opening this. I've seen one other AWS waiter-related bug come in already for v. 1.2.5 so I'm hoping they're related. Can you rerun the build with PACKER_LOG=1 set so that I can see the debug logs?

I can try - unfortunately it's not reproducible on demand. I'll update this comment with logs when I have them. Please hold... :)

UPDATE: Annoyingly, I can't reproduce this since it depends on the load on our AWS account, which is outside the control of the job. I've left PACKER_LOG set in the Jenkins job, and it'll alert me when it fails next. I'll add another comment with the logs when I have them.

Did you leave out any lines in your log? Looking at the log you've provided I'm boggled as to where this query even happened; I'd have expected to see several more output lines even if you didn't have the PACKER_LOG flag set.

Hmm. I wonder if the timeout is when we're actually obtaining the session. I don't think there are retries wrapping those, and we don't have any error wrapping there to explain where in the code it'd be.

Apologies for the delay in replying. I didn't leave out any lines in my original copy-paste. Because the RequestLimitExceeded state is account-wide for a very short period (as explained to us by our AWS TAM), it's impossible for us to predict when this is going to happen. In our internal software, all AWS calls are wrapped in retries for this reason.

We haven't seen this specific issue again; however, since this is unpredictable, that doesn't mean much.

If I may suggest, perhaps something in the core AWS calling code checking for 503 and auto-retrying might be advisable.

If I may suggest, perhaps something in the core AWS calling code checking for 503 and auto-retrying might be advisable.

That won't solve this particular problem for you. Our "core aws calling code" is the Amazon-provided AWS SDK, which already treats 503 as retryable, which is why it was a) confusing and b) worthwhile to actually pinpoint where the issue is specifically occurring. My current best guess is that this happened because you managed to exceed your rate limit at the exact moment Packer was first generating the client and instantiating a session, which, looking at the SDK, does _not_ appear to be wrapped in any kind of retry. So I can wrap that in a retry and see if I can trigger AWS's rate-limiting to test it. I'll keep you posted.
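To illustrate (this is not Packer's actual code), a minimal sketch of such a wrapper using the aws-sdk-go v1 session API; the helper name, attempt count, and backoff are all placeholders:

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newSessionWithRetry (hypothetical helper) retries session creation a few
// times instead of failing immediately when AWS answers with a throttling
// error such as RequestLimitExceeded.
func newSessionWithRetry(cfg *aws.Config, attempts int) (*session.Session, error) {
	var err error
	for i := 0; i < attempts; i++ {
		sess, serr := session.NewSession(cfg)
		if serr == nil {
			return sess, nil
		}
		err = serr
		// Only keep retrying on throttling errors; give up on anything else.
		if aerr, ok := serr.(awserr.Error); !ok || aerr.Code() != "RequestLimitExceeded" {
			return nil, serr
		}
		time.Sleep(time.Duration(i+1) * 2 * time.Second)
	}
	return nil, err
}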

I think I am seeing these when waiting for the password to become available in 1.2.5. Packer log (debug was not enabled when it happened, sorry):

amazon-ebs: Error waiting for password: Error retrieving auto-generated instance password: RequestLimitExceeded: Request limit exceeded.
15:35:17 ==> amazon-ebs:    status code: 503, request id: 7ad4fd1f-4e87-4082-8ad8-77d38f14fe7c

Now that _is_ weird. That request is part of the AWS sdk and is wrapped in a check for whether the error is retryable. I wonder if this is an upstream bug. I'll see if I can find more info on what the SDK considers retryable.

Okay, dug around in the AWS code. Looks like the default is only a max of 3 retries. I can try bumping this when we generate the client and we can see if that helps. @akdor1154 does this happen frequently enough that you feel confident you'd be able to say whether such a change works for you?

Unfortunately I've only seen it once in a period of about 4 builds per day for two weeks. I assume you mean 3 retries per instance of "is the password there yet?", not 3 checks for "is the password there yet?".

Correct. I think the issue occurs when you're being throttled and the retry gets a response that you're above your Request Limit three times in a row.

I'm a little torn on this one. We can add some logic into Packer to sleep and try again if the error is that you're being throttled, but ultimately this is a build environment problem; if you're being throttled enough that it's breaking Packer, then making Packer more resilient to throttling just masks the problem.

Fair enough - my org is using AWS quite heavily, so maybe this is a case where we can request a limit increase.

To chime in here, as I've been running into "Request limit exceeded" errors more often lately, especially when waiting for an AMI to become available, I would suggest simply increasing the number of retries when initializing the AWS client. The SDKs use exponential backoff internally, so the time between retries doubles with every third retry (default algorithm 2**(retry * 0.3), which can be changed). I've achieved good results with setting the number of retries to 20, which eliminated "Request limit exceeded" errors entirely in all my Ruby code.

In Ruby, I initialize the client this way:

@ec2_client = Aws::EC2::Client.new(
    region:      region,
    credentials: Aws::SharedCredentials.new(profile_name: profile),
    logger:      (debug ? @logger : nil),
    retry_limit: 20
)

Read more about exponential backoff here: https://docs.aws.amazon.com/general/latest/gr/api-retries.html

Sorry, wrong formula. It's actually 2**retry * 0.3 (no parentheses), which gives [0.3, 0.6, 1.2, 2.4, ...].
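For anyone following along, here's a throwaway snippet (not SDK code) that just prints the series that corrected Ruby formula produces; note the Go SDK's own retryer uses a different algorithm, as the next comment points out:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Delay before each retry using 2**retry * 0.3:
	// 0.3s, 0.6s, 1.2s, 2.4s, 4.8s, ...
	for retry := 0; retry < 5; retry++ {
		fmt.Printf("retry %d: %.1fs\n", retry, 0.3*math.Pow(2, float64(retry)))
	}
}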

That seems like a pretty straightforward change. I can do that.

If any of y'all want to try out the attached PR and verify that it doesn't cause any other issues for you, I'd appreciate it.

Just took a look at the AWS SDK for Go source code, and it seems that the default formula is different from the one used by the Ruby SDK. Please look up aws/client/default_retryer.go in the Go SDK for details.

Yeah, I read that the other day. The important part for us is that I think the default only sets a max of 3 retries. My PR ups this to 20.
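For anyone curious what that kind of change looks like, here is a rough sketch using the aws-sdk-go v1 API (this is not the actual PR; the helper name and region value are placeholders):

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newEC2Client is a hypothetical helper that raises MaxRetries above the
// SDK default of 3, so throttled requests are retried (with backoff) more
// times before RequestLimitExceeded bubbles up to the caller.
func newEC2Client(region string) *ec2.EC2 {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String(region), // placeholder; Packer reads this from the template
		MaxRetries: aws.Int(20),
	}))
	return ec2.New(sess)
}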

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
