Quite a few issues have been raised around this (#6526, #6569). I have tried some of the workarounds mentioned, but with no luck. Moreover, I thought these had been fixed in later releases.
I'm using Packer v1.4.3, building an AMI (Windows 2016) from an encrypted source, with the target encrypted using a KMS alias. The build fails 80-90% of the time.
Sometimes it fails/times out while waiting for the AMI to become available, and other times it fails while stopping the source instance (Error stopping instance: RequestExpired: Request has expired.).
Generally we see the failures when the end-to-end build takes over 1 hour.
I've attached the Packer logs and Packer template:
gist/c4add22c69e2a94a39389bb25b7b2753
@SwampDragons
Hi, thanks for reaching out.
Do you mind putting those templates and logs into a gist as requested so that folks don't need to download and open the text files locally?
Specifically what workarounds have you tried? The env vars?
Looking at your logs, it seems to me like your credentials are expiring. Are you using STS credentials that are only valid for 1 hour?
I'll try to take a look at whether Packer can try to automatically refresh its credentials in long-running builds, but in the meantime checking to see whether you still have issues with an IAM user or longer-lived credentials would be a good step to verify that this is the issue.
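If it helps, a quick way to confirm whether the build environment is resolving to an IAM user or to short-lived assumed-role credentials is something like:
# Show which identity (IAM user or assumed role) the current credentials belong to
aws sts get-caller-identity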
@SwampDragons - Apologies, I will push the logs into a gist. Yes, correct, the env var workarounds were tried: AWS_MAX_ATTEMPTS set to 300/400/500.
We are using an instance profile; I will try increasing the duration and see whether that makes any difference. Thanks for the speedy response.
@SwampDragons - So I've increased the maximum CLI/API session duration for the instance profile role to 2 hours, and unfortunately it has made no difference. If the stretch from stopping the source instance to waiting for the AMI to become ready exceeds ~30-35 minutes it times out; anything below that works.
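For reference, the change was along these lines (the role name here is just a placeholder):
# Raise the role's maximum session duration to 2 hours
aws iam update-role --role-name PackerBuildRole --max-session-duration 7200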
I increased the timeouts to the values below:
AWS_MAX_ATTEMPTS=700
AWS_POLL_DELAY_SECONDS=5
It now waits for nearly 1 hour before raising the exception. The funny thing is that the AMI shows as available on AWS within 30 minutes, so it feels like the code that checks whether the AMI is available has failed for some reason; there is nothing in the Packer logs though.
2019/10/31 19:58:28 [INFO] (telemetry) ending powershell
==> iress-cis-encrypted-win-2016-iis-app-base: Stopping the source instance...
    iress-cis-encrypted-win-2016-iis-app-base: Stopping instance
==> iress-cis-encrypted-win-2016-iis-app-base: Waiting for the instance to stop...
==> iress-cis-encrypted-win-2016-iis-app-base: Creating AMI iress-cis-encrypted-win-2016-iis-app-base_2019-10-31_1918 from instance i-09f48564a1619988a
    iress-cis-encrypted-win-2016-iis-app-base: AMI: ami-0dd47f65331bf847e
==> iress-cis-encrypted-win-2016-iis-app-base: Waiting for AMI to become ready...
2019/10/31 21:00:27 packer: 2019/10/31 21:00:27 Error waiting for AMI: ResourceNotReady: exceeded wait attempts
2019/10/31 21:00:28 packer: 2019/10/31 21:00:28 Unable to determine reason waiting for AMI failed: RequestExpired: Request has expired.
2019/10/31 21:00:28 packer: status code: 400, request id: b42ebb88-c4f9-42e3-a8f6-12d480707cbe
I think this may be a permissions issue: see #8323
I'm going to tinker and figure out what permissions are necessary for this encryption to succeed.
@SwampDragons - I doubt it is a permissions issue, because it works intermittently; basically, if the AMI build takes over 1 hour it fails. Is there a way to increase the session expiry?
Off the top of my head I don't know, but I can investigate.
Thanks
@SwampDragons, I'm having the same issue; I saw it on Packer v1.4.3 and still observe it on v1.4.5.
(aws-cli/1.16.274 Python/2.7.5 Linux/3.10.0-957.1.3.el7.x86_64 botocore/1.13.10)
"variables":
{ "ebs_kms_key_id_us-west-2" : "arn:aws:kms:us-west-2:[[aws-account-id]]:key/exxxx-xxx-470d-b8bd-13ab1bxxxxxxxx"},
"builders":[{
"region" : us-west-2,
"kms_key_id" : "{{user ebs_kms_key_id_us-west-2
}}",
"region_kms_key_ids" : {
"us-west-2" : "{{user ebs_kms_key_id_us-west-2
}}",
"us-west-1" : "",
"us-east-1" : ""
},
"ami_regions" : [
"us-west-1",
"us-west-2",
"us-east-1"
],
]}
Copying the same AMI to other regions works fine with the default KMS key, since I leave the value blank. I have tried the key ID ("us-west-2" : "exxxx-xxx-470d-b8bd-13ab1bxxxxxxxx") instead of the key ARN. I have also tried the settings below, as you indicated in a different ticket:
export AWS_POLL_DELAY_SECONDS=10
export AWS_MAX_ATTEMPTS=400
but this didn't help.
Packer v1.2.3 worked fine with regular aws configure credentials, but I have since changed the setup to use an AWS profile instead of the plain AWS credentials.
I hope that it helps to troubleshoot the issue.
@WeekendsBull To clarify: you are currently able to copy using the default KMS key but not a custom key?
I did some research on this and found a couple of things. First:
It now waits for nearly 1 hour before raising the exception. The funny thing is that the AMI shows as available on AWS within 30 minutes
I can reproduce this, but it's worth mentioning that when I use the AWS cli to query whether the AMI is "available", the state is still pending for that full hour, and only switches to available at the same time as Packer discovers it is available. I think this is a fluke with the AWS API being eventually consistent, not with the Packer code.
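For reference, I was polling the image state with something like this (the AMI ID and region are placeholders):
# Ask EC2 directly what state it reports for the AMI
aws ec2 describe-images --image-ids ami-0123456789abcdef0 --region us-west-2 --query 'Images[0].State' --output text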
Second, I've found reports of people hitting that RequestExpired error when the time zones on their instance and their querying computer don't match, though I doubt this is your issue. I am going to try to reproduce using an instance profile.
@SwampDragons, yes correct.
I have just confirmed that applying the AWS default KMS key, by setting region_kms_key_ids as below,
"region_kms_key_ids" : {
"us-west-2" : "",
"us-west-1" : "",
"us-east-1" : ""
},
got me the AMIs.
--> amzn-centos7: AMIs were created:
us-east-1: ami-03b6d67bd8b5033d6
us-west-1: ami-0c641df0b52ba233a
us-west-2: ami-0ae74d1069f8f3e4d
@SwampDragons you might be correct. I have some Python code that copies AMIs across regions, and its waiter never worked because it exceeded the maximum attempts after copying the AMI to a different account; boto3 fixed that issue after 1.10.5 (early this week, though I see the latest is 1.10.10). I just want to make sure this is not caused by Packer, so that I can work with AWS to address the issue.
Thanks @SwampDragons. Yes, you are correct; the source instance and the querying computer are in the same region and the same AWS account, so the time zones match.
So to be clear, the scenario is: if the Packer build takes over 1 hour and the AMI is still in the pending state on AWS, it fails. If the AMI becomes available within that 1 hour, everything is fine.
@WeekendsBull check that your custom key policy contains
"Action": [
"kms:ReEncrypt*",
"kms:GenerateDataKey*"
],
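If it helps, you can dump the key policy to check for those with something like this (the key ID is a placeholder; "default" is the only policy name KMS supports):
# Print the key policy so you can verify the ReEncrypt/GenerateDataKey statements are present
aws kms get-key-policy --key-id 1234abcd-12ab-34cd-56ef-1234567890ab --policy-name default --output text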
@SwampDragons
I'm still seeing the same issue when I enable us-west-2 with our own custom KMS key.
I just did another build using Packer v1.2.3, and it seems to work fine; I am able to get the encrypted AMI with our own KMS key.
I will check the KMS IAM permissions further to see if I am missing anything.
@nbshetty I don't think this has to do with instance profile session length; even when I manually revoke instance profile sessions in the middle of a build, the packer build successfully copies the ami.
Are you role chaining? From https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-iam-instance-profile.html:
Using the credentials for one role to assume a different role is called role chaining. When you use role chaining, your new credentials are limited to a maximum duration of one hour.
If you're using STS credentials which you request right before the build, it could be that the credential duration-seconds is set to an hour, which I believe is the default, but you can request it to be higher.
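For example, something along these lines requests a two-hour session, assuming the role's maximum session duration allows it (the role ARN is a placeholder):
# Request STS credentials that last 2 hours instead of the default 1 hour
aws sts assume-role --role-arn arn:aws:iam::123456789012:role/PackerBuildRole --role-session-name packer-build --duration-seconds 7200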
@WeekendsBull I can't reproduce so I am going to need a fully functioning repro case. Can you please open a separate issue? I don't think yours is the same as @nbshetty's
Just confirming that I am able to reproduce the RequestExpired: Request has expired behavior by manually revoking my STS credentials during a build, or by requesting STS credentials with such a short expiry that they run out during the build.
@SwampDragons - Thanks for your diagnosis, much appreciated. Yes, it feels like role chaining could be the culprit here. Let me figure out how I can increase the default timeout on this. Will keep you updated.
@SwampDragons mine was solved; it required the source account's KMS key ID to be added to the assume role and to the key policy of the target account. I don't think I need to open a separate issue since it's working fine now. I found the error in CloudTrail: "errorCode": "AccessDenied", "errorMessage": "User: arn:aws:sts::[[aws account id]]:assumed-role/[[role-name]]/[[aws account id]] is not authorized to perform: kms:CreateGrant on resource: arn:aws:kms:[[aws region]]:[[aws account id]]:key/[[aws key id]]". Thank you always.
Good to know that CloudTrail at least gives a more useful error than their API.
I recently updated the docs to warn that kms permissions may be part of the issue, so hopefully future users may have an easier time finding these bugs.
@SwampDragons 3 days ago it started to work fine for 4 different Linux images (CentOS 7, Amazon Linux, Amazon ECS-optimized, and an Elastic AMI we build from the Elastic enterprise public images) from Packer; unfortunately, it is now occurring again.
errored: Error waiting for instance (i-xxxxxxxx) to become ready: ResourceNotReady: failed waiting for successful resource state.
The issue is that I spotted some AMI builds that were not completely done. What I'm saying is that there are pre-conversion AMIs with names like fauAn8A or QOiGEpv which didn't get converted after the KMS encryption was applied.
@WeekendsBull Those AMI names are intermediary AMIs that are copied to be encrypted and then deleted. They should be getting deleted when the copy fails. I'll look into why they aren't.
@WeekendsBull what version of Packer are you on again?
@SwampDragons I'm currently working with v1.4.5 and have tested from v1.4.3 to v1.4.5.
I created 20 different AMIs today from 4 different OSes and spotted only two affected, so I would say it is intermittent. After I sent an email to AWS support, those two AMIs that didn't get encrypted are gone and no longer showing. I will post here if I hear anything valid from them.
They don't always vanish immediately from your dashboard because of Amazon's eventual consistency, but if they're still there after a few minutes it's an issue. If you can confirm the AMIs are really being abandoned instead of cleaned up, I'll try to repro, but it sounds like it may just be because you checked so soon.
@SwampDragons - Eventually I managed to run Packer using the EC2 instance profile (AssumeRoleProvider), which means we get the benefit of credentials not expiring, and I can confirm that the issue is resolved. The reason it took me a while to rerun this with the EC2 instance profile is that Packer does not seem to recognise the environment variable AWS_PROFILE; in the end I set the profile property in the Packer template to the needed profile. I will raise that as a separate issue. I'm happy to close this issue now, and thanks for all your assistance, much appreciated.
Good to know, thanks. Definitely open an issue for the AWS_PROFILE behavior.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.