Packer: Amazon Builder not retrying after RequestLimitExceeded - throttling issue

Created on 1 Jun 2018  ·  15 Comments  ·  Source: hashicorp/packer

Packer version: 1.2.3
Platform: Linux
Builder: amazon-ebs

08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Waiting for state to become: available  
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Using 30s as polling delay (change with AWS_POLL_DELAY_SECONDS)  
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)  
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Error on AMIStateRefresh: RequestLimitExceeded: Request limit exceeded.  
08:29:33 2018/06/01 07:29:30 packer:    status code: 503, request id: 2c9c6471-16fc-45c4-9361-d92b6c5a78a1  

From my examination of the code (I'm not a Go programmer), it appears that the WaitForState function relies on isTransientNetworkError to decide whether to retry, and a 503 isn't classed as a retryable error.
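
To illustrate what I mean (this is just a sketch, not Packer's actual code; the helper name and the use of awserr are my assumptions), a check along these lines would treat RequestLimitExceeded as retryable alongside transient network errors:

```go
package retry

import (
	"net"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isRetryableError is a hypothetical check: in addition to transient
// network errors (timeouts), it treats AWS throttling responses such as
// RequestLimitExceeded as retryable, regardless of the HTTP status code.
func isRetryableError(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		switch awsErr.Code() {
		case "RequestLimitExceeded", "Throttling", "ThrottlingException":
			return true
		}
	}
	if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
		return true
	}
	return false
}
```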

I can't tell whether the Go SDK is retrying, but I suspect not, given how quickly the error is returned.

I have opened a support case with Amazon to get more information on the specific request and I will post anything significant that I get back from them to this issue.

We do have multiple Packer builds running at the same time, but that isn't the issue I'm concerned about; the issue is that the builder isn't retrying after being throttled.

It is of course possible that I've misunderstood something in the Go code. If that's the case I'd appreciate being pointed at my misunderstanding. Thanks.

bug builder/amazon

All 15 comments

Throttling shouldn't have status code 503. See https://docs.aws.amazon.com/health/latest/APIReference/CommonErrors.html

I appreciate your point; however, the status code is clearly 503 and the error is RequestLimitExceeded, which a lot of Amazon's documentation describes as a throttling error. So, pragmatically, one would expect it to be handled like any other throttling error.

Could you add some more lines of the log so we know which API call this is for?

"Error on AMIStateRefresh" narrows it down precisely - that message only appears once in the codebase.

I have asked Amazon for more info on the specific request ID, and if I get anything I'll post it here.

No it doesn't. There are multiple places where we are waiting for AMIs to become available.

Information from Amazon:


Regarding the specific request provided in the case:
request id: 2c9c6471-16fc-45c4-9361-d92b6c5a78a1

this was a DescribeImages API call
by userArn [snipped]
from sourceIP [snipped]

I see also that for the same time-frame you got other API calls throttled, most of them Describe actions (e.g. for DescribeInstances, DescribeInstanceAttribute, DescribeSecurityGroups, etc.)

That said, since you are using a programmatic approach to API calls, I would recommend implementing exponential back-off to retry calls and make sure they succeed.

Reference documentation on this can be found here:
https://docs.aws.amazon.com/general/latest/gr/api-retries.html
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html
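
For reference, a minimal Go sketch of exponential back-off with full jitter along those lines (the helper name, attempt count, and delay values here are illustrative assumptions, not what Packer actually ships):

```go
package retry

import (
	"math/rand"
	"time"
)

// withBackoff retries op with exponential back-off and full jitter, along
// the lines of the AWS guidance linked above. The attempt count and delay
// caps are arbitrary example values.
func withBackoff(attempts int, base, maxDelay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		// Cap the exponential delay, then sleep a random duration in [0, delay].
		delay := base << uint(i)
		if delay > maxDelay {
			delay = maxDelay
		}
		time.Sleep(time.Duration(rand.Int63n(int64(delay) + 1)))
	}
	return err
}
```

In practice you would only retry when a throttling check like the one sketched earlier returns true, rather than retrying every error.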


I've checked the codebase again and that message is only used in that one call, but I think I may have misunderstood what you were asking. Here it is in the context of the full output:

08:27:54 2018/06/01 07:27:53 ui:     amazon-ebs: Complete!
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] 19043 bytes written for 'stdout'
08:27:54 2018/06/01 07:27:53 [INFO] 0 bytes written for 'stderr'
08:27:54 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] 19043 bytes written for 'stdout'
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] 0 bytes written for 'stderr'
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [DEBUG] Opening new ssh session
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [DEBUG] starting remote command: rm -f /tmp/script_4177.sh
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] (telemetry) ending shell
08:27:54 2018/06/01 07:27:53 ui: ==> amazon-ebs: Stopping the source instance...
08:27:54 2018/06/01 07:27:53 ui:     amazon-ebs: Stopping instance, attempt 1
08:27:54 2018/06/01 07:27:53 ui: ==> amazon-ebs: Waiting for the instance to stop...
08:28:26 2018/06/01 07:28:23 ui: ==> amazon-ebs: Creating the AMI: mot2-api_java_mot_test-ops-OPSTEAM-1750-update-packer/40-79081
08:28:26 2018/06/01 07:28:24 ui:     amazon-ebs: AMI: ami-7db7b204
08:28:26 2018/06/01 07:28:24 ui: ==> amazon-ebs: Waiting for AMI to become ready...
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Waiting for state to become: available
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Using 30s as polling delay (change with AWS_POLL_DELAY_SECONDS)
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Error on AMIStateRefresh: RequestLimitExceeded: Request limit exceeded.
08:29:33 2018/06/01 07:29:30 packer:    status code: 503, request id: 2c9c6471-16fc-45c4-9361-d92b6c5a78a1
08:29:33 2018/06/01 07:29:30 ui error: ==> amazon-ebs: Error waiting for AMI: RequestLimitExceeded: Request limit exceeded.
08:29:33 2018/06/01 07:29:30 ui: ==> amazon-ebs: Terminating the source AWS instance...
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Waiting for state to become: terminated
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Using 30s as polling delay (change with AWS_POLL_DELAY_SECONDS)
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
08:30:06 2018/06/01 07:30:01 ui: ==> amazon-ebs: Cleaning up any extra volumes...
08:30:06 2018/06/01 07:30:01 ui: ==> amazon-ebs: No volumes to clean up, skipping
08:30:06 2018/06/01 07:30:01 ui: ==> amazon-ebs: Deleting temporary security group...

Oh, I thought #6177 was released, but apparently it is awaiting 1.3.0, which is the next release. It should handle this correctly.

That's great, thanks! Is there an estimated release date? (So I can shove the ticket on our backlog with a "revisit after" date)

Unfortunately no, maybe @SwampDragons or @mwhooker have one.

We just put out 1.2.4 and normally we wait ~6 weeks between releases, so I would guess mid-July at the earliest. That said, I'm working on #6177 now, so if you want to follow that ticket I can build you a binary when it's done and you can test it out in the next week or so.

@CarlInglisBJSS thanks for the report. Hopefully #6177 will solve this, and it would be really useful if you could help us test it when it's ready. In the meantime, I would suggest increasing the AWS_POLL_DELAY_SECONDS env var as a workaround.

I'm watching that issue and will happily help test it when it's ready.

I'll try the AWS_POLL_DELAY_SECONDS change - thanks.

Our existing AWS_POLL_DELAY_SECONDS was 30 seconds. I have added a random 0-9 seconds to that for each invocation to stagger the simultaneous calls. This, coupled with a random 1-10 second delay at the start of each invocation, appears (based on one set of 15 simultaneous starts) to have given us a workaround.

EDIT: Now up to 3 different sets of 15 simultaneous starts of the script - looks like this is working! Thanks.

This was fixed by #6332

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
