Packer 1.4.1: Amazon-EBS builder fails while copying encrypted AMI with a meaningless error message if the provided KMS key is invalid

Created on 22 May 2019  ·  23 comments  ·  Source: hashicorp/packer

After updating from 1.4.0 to 1.4.1, the Amazon-EBS builder fails reproducibly when it tries to copy/encrypt the resulting AMI:

amazon-ebs: Creating encrypted copy in build region: us-east-1
amazon-ebs: Waiting for all copies to complete...
amazon-ebs: 1 error(s) occurred:
==> amazon-ebs:
==> amazon-ebs: * Error waiting for AMI (ami-078d087247324ef86) in region (us-east-1): ResourceNotReady: failed waiting for successful resource state

Downgrading back to 1.4.0 solves the problem.

builder/amazon regression

All 23 comments

Can you please share

  • Debug log output from PACKER_LOG=1 packer build template.json.
    Please paste this in a gist https://gist.github.com
  • The _simplest example template and scripts_ needed to reproduce the bug.
    Include these in your gist https://gist.github.com

Template: https://gist.github.com/dhs-rec/7084aa659a3fb77572299d02a209b5ff
Log: https://gist.github.com/dhs-rec/34f393ed7ec1fe0d25531cfe446e6007

It might be related to the kms_key_id setting: if I omit it and fall back to the default key, it works. However, the same setting worked fine with 1.4.0.

Hey @SwampDragons / @dhs-rec,

same issue with Packer 1.4.2

Looks like the kms_key_id field has not been working since Packer versions > 1.4.0.
You need to provide the region_kms_key_ids + ami_regions + region options to make it work.

Broken config

Works for Packer <= 1.4.0

AWS_DEFAULT_REGION=eu-central-1

"encrypt_boot": true,
"kms_key_id": "{{ user `kms_key` }}"

same error for

"encrypt_boot": true,
"region": "eu-central-1",
"ami_regions": [ "eu-central-1" ],
"kms_key_id": "{{ user `kms_key` }}",

All will fail. Also the State Reason of the AMI will be Copy image failed with an internal error.

==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI dI0wyoo from instance i-0e6626ac4c2dd7464
    amazon-ebs: AMI: ami-0412ca842b62553fa
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Copying/Encrypting AMI (ami-0412ca842b62553fa) to other regions...
    amazon-ebs: Copying to: eu-central-1
    amazon-ebs: Waiting for all copies to complete...
==> amazon-ebs: 1 error(s) occurred:
==> amazon-ebs: 
==> amazon-ebs: * Error waiting for AMI (ami-0cb2b7d3600f4e652) in region (eu-central-1): ResourceNotReady: failed waiting for successful resource state

Broken and buggy

AWS_DEFAULT_REGION=eu-central-1

"encrypt_boot": true,
"kms_key_id": "{{ user `kms_key` }}",
"ami_regions": [ "eu-central-1" ]

Here you will end up with 2 broken images.

==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI eSPnMS3 from instance i-0d9b3a802d9d71ce0
    amazon-ebs: AMI: ami-0841f5532986aaad7
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Copying/Encrypting AMI (ami-0841f5532986aaad7) to other regions...
    amazon-ebs: Copying to: eu-central-1
    amazon-ebs: Copying to: eu-central-1
    amazon-ebs: Waiting for all copies to complete...
==> amazon-ebs: 2 error(s) occurred:
==> amazon-ebs: 
==> amazon-ebs: * Error waiting for AMI (ami-0a4bcac577bdd2204) in region (eu-central-1): ResourceNotReady: failed waiting for successful resource state
==> amazon-ebs: * Error waiting for AMI (ami-0a8513872d4bdfeb8) in region (eu-central-1): ResourceNotReady: failed waiting for successful resource state

Working but buggy

AWS_DEFAULT_REGION=eu-central-1

"encrypt_boot": true,
"ami_regions": [ "eu-central-1" ],
"region_kms_key_ids":
{
  "eu-central-1": "{{ user `kms_key` }}"
}

The AMI is copied to the region twice. Only one of them will be tagged.

==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI H2K6CvC from instance i-02adcfbd74204ac86
    amazon-ebs: AMI: ami-0fa8c2bc9e5b1aece
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Copying/Encrypting AMI (ami-0fa8c2bc9e5b1aece) to other regions...
    amazon-ebs: Copying to: eu-central-1
    amazon-ebs: Copying to: eu-central-1
    amazon-ebs: Waiting for all copies to complete...
==> amazon-ebs: Modifying attributes on AMI (ami-078a0c880621cb546)...

Working

AWS_DEFAULT_REGION=eu-central-1

"encrypt_boot": true,
"region": "eu-central-1",
"ami_regions": [ "eu-central-1" ],
"region_kms_key_ids":
{
  "eu-central-1": "{{ user `kms_key` }}"
}

Everything is fine, AMIs are tagged, works as expected.

==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI 5uzeVLD from instance i-0495201b131b230ce
    amazon-ebs: AMI: ami-0235001035f9a07d0
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Copying/Encrypting AMI (ami-0235001035f9a07d0) to other regions...
    amazon-ebs: Copying to: eu-central-1
    amazon-ebs: Waiting for all copies to complete...
==> amazon-ebs: Modifying attributes on AMI (ami-0c88880007d74e1e5)...

Thanks for the detailed report and output on this. I'll look into it before the next release.

The fix timmjd recommended did not work for me. Even though I supplied a custom KMS key, packer used the default key.

When I use the following template snippet (the original fix referenced in issue #7673), packer ignores the customized region KMS key mapping and uses the default EBS KMS key. I would consider this a bug.

"encrypt_boot": true,
"region": "eu-central-1",
"ami_regions": [ "eu-central-1" ],
"region_kms_key_ids":
{
"eu-central-1": "{{ user kms_key }}"
}

From PR #7870:

When I used the provided artifact, I changed my template back to the original format (which worked in 1.4.0), and packer failed with the same "waiting for AMI" error.

"region": "{{user aws_region}}",
"encrypt_boot": "true",
"kms_key_id": "{{user shared_services_account_kms_key_alias}}"

@AndrewCi I can't reproduce. :( When I provide my key id, the region copy succeeds, no duplicates, using the non-default key provided. I don't know what's going on for you here. Is there any more information you can provide? Full template/full debug logs, for example?

Clarification: I can't reproduce based on the config I copied above. I think I know what's up with the validation for the workaround you were using when the above wasn't working.

@SwampDragons not sure if this is relevant but the build region and copy region are the same in my case. Also, I'm using the ARN in the alias format.

Even setting the "regions" list, this works for me:

{
  "builders": [
    {
      "type": "amazon-ebs",
      "force_deregister": true,
      "ssh_username": "ubuntu",
      "ami_name": "Test AMI",
      "instance_type": "t2.micro",
      "source_ami_filter": {
              "filters": {
                "virtualization-type": "hvm",
                "name": "ubuntu/images/*ubuntu-xenial-16.04-amd64-server-*",
                "root-device-type": "ebs"
              },
              "owners": ["099720109477"],
              "most_recent": true
      },

      "region": "us-east-1",
      "encrypt_boot": true,
      "kms_key_id": "arn:aws:kms:us-east-1:{{ aws_user }}:key/{{ key_UUID}}",
      "ami_regions": ["us-east-1"]
    }
  ]
}

Hmm. I'll try using the ID instead of the alias. Can you provide the exact link to the binary as well? I'll test again later tonight.

Make sure the version you see in your logs is 1.4.3-dev, too. The number of times I've loaded up the wrong binary... 🤦‍♀


I'm working on Ubuntu 18.04. I verified I was using the correct binary. See below for logs:

 [cleaned] Stopping instance
==> [cleaned] Waiting for the instance to stop...
==> [cleaned] Enabling Enhanced Networking (ENA)...
==> [cleaned] Creating AMI OZ76Squ from instance [instance_id]
 [cleaned] AMI: [ami_id]
==> [cleaned] Waiting for AMI to become ready...
==> [cleaned] Copying/Encrypting AMI ([ami_id]) to other regions...
 [cleaned] Copying to: us-east-2
 [cleaned] Waiting for all copies to complete...
==> [cleaned] 1 error(s) occurred:
==> [cleaned]
==> [cleaned] * Error waiting for AMI ([ami_id]) in region (us-east-2): ResourceNotReady: failed waiting for successful resource state
==> [cleaned] Deregistering the AMI and deleting unencrypted temporary AMIs and snapshots
==> [cleaned] Deregistered AMI id: [ami_id]
==> [cleaned] Deleted snapshot: snap-06b88b1957ac2241d
==> [cleaned] Deregistering the AMI and deleting associated snapshots because of cancellation, or error...
==> [cleaned] Terminating the source AWS instance...
==> [cleaned] Cleaning up any extra volumes...
==> [cleaned] No volumes to clean up, skipping
Build [cleaned] errored: 1 error(s) occurred:

  • Error waiting for AMI ([ami_id]) in region (us-east-2): ResourceNotReady: failed waiting for successful resource state

==> Some builds didn't complete successfully and had errors:
--> [cleaned] 1 error(s) occurred:

  • Error waiting for AMI ([ami_id]) in region (us-east-2): ResourceNotReady: failed waiting for successful resource state

==> Builds finished but no artifacts were created.

And the template (note - the same template works in 1.4.0):

"region": "{{user aws_region}}",
"encrypt_boot": "true",
"kms_key_id": "{{user kms_key_id}}",

UPDATE:
Strange behavior. The template below fails in 1.4.3 but works (with the default KMS key bug) in 1.4.2.

        "encrypt_boot": "true",
         "ami_regions": [ "{{user `aws_region`}}" ],
        "region_kms_key_ids":
        {
        "us-east-2": "{{ user `kms_key_id` }}"
       }

Something is definitely up with the encryption portion of the EBS builder.

Thanks for your patience on this back and forth. I've tried everything I can think of, here. I've tested with the exact format you're giving me in your template samples, making sure I'm interpolating the same variables as you. I've also tested with those vars hardcoded. I've tried with the kms Id, ARN, and alias. I cannot reproduce this. It works for me every time.

The only time I've managed to produce the ResourceNotReady: failed waiting for successful resource state error was when I provided a kms_key_id that was for a different region than the one I said to use it for. E.g.

      "region_kms_key_ids": {
          "us-east-2": "{{ user `kms_key_id` }}"
        }

where the value set in kms_key_id was actually a valid key, but for us-east-1.

I was also able to produce the error by adding a typo to the kms key variable.

So... Is this possibly the issue? I'm not trying to cop out and say "user error" here but I'm well and truly stumped, and this is literally the only way I've been able to produce this error in a good five hours of testing. Another thought that I haven't yet tested is that maybe something about IAM user permissions changed since 1.4.0 and your role is more restrictive than mine? I've been casting back through my memory and I don't _think_ this is the case unless you're using spot fleets, which shouldn't affect encryption at all. As you can see, I'm at the "grasping at straws" phase of debugging.

Last question: You're using amazon-ebs and not the amazon-ebssurrogate builder or a different one? I want to make sure I've not been barking up the wrong tree this whole time.

Do you have any potential explanation of what may cause the use of the default key instead of the user supplied key in my last example?

I'm not sure I understand. There was a logic error in 1.4.2, which I resolved in the patch you tested, where the default key was always being used in the build region instead of the provided kms key.

Are you saying that in the 1.4.3-dev patch you're still seeing the default key being used? I had understood that you were seeing the same failed waiting for successful resource state error for both of your configurations with the 1.4.3-dev build.


Yes. Apologies for the confusion. When I run the 1.4.3-dev build with the key mapping, it runs and does not give me an error. But it uses the default key (which is a bug, since I supplied a custom key). See below for the template I used on 1.4.3-dev.

"encrypt_boot": "true",
"ami_regions": [ "{{user aws_region}}" ],
"region_kms_key_ids":
{
"us-east-2": "{{ user kms_key_id }}"
}

I have no idea how that's happening. I can't reproduce that behavior.

Hi SwampDragons -

Thank you so much for spending time on this. I finally figured it out - it was a combination of user error and a potential "bug". Long story short, I had a hard-coded value in one of my terraform deployment scripts, and the EC2 IAM role I was using had an old KMS key ID in its policy. Therefore, the EC2 instance I was running packer from did not have the proper IAM permissions for the KMS key I was referencing in my packer template.

While this was definitely my error, I do believe there is something to be gained from this exercise. The behavior of packer silently ignoring the specified custom key and using the region's default EBS KMS key because it could not access the custom key seems improper. I believe packer should error out if it can't access the key referenced in the template rather than automatically switching to the default key. Additionally, it would help if packer could exit with an error referencing IAM permission issues with the provided key instead of ResourceNotReady: failed waiting for successful resource state.

Would it be possible to add a validation check before the template executes that verifies the provided custom KMS key can be accessed?
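
For illustration only, here is a rough sketch of what such a pre-flight check might look like, assuming the AWS SDK for Go's kms.DescribeKey call (the region and key alias below are placeholders, not values from my template). A check like this would only confirm that the key exists and that the caller can at least describe it; it would not prove that EC2 can actually use the key for the copy.

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/kms"
    )

    func main() {
        // Placeholder region; this would come from the template's "region" setting.
        sess := session.Must(session.NewSession(&aws.Config{
            Region: aws.String("us-east-2"),
        }))
        svc := kms.New(sess)

        // DescribeKey fails immediately (NotFoundException / access denied) if the
        // key does not exist or the caller cannot even describe it, which would be
        // a much clearer failure than EC2's eventual ResourceNotReady.
        out, err := svc.DescribeKey(&kms.DescribeKeyInput{
            KeyId: aws.String("alias/my-packer-key"), // accepts a key id, ARN, or alias
        })
        if err != nil {
            log.Fatalf("KMS key check failed: %v", err)
        }
        fmt.Println("KMS key resolves to:", aws.StringValue(out.KeyMetadata.Arn))
    }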

Curious on your thoughts. Again, apologies for the user error here.

-AC

Hmm. I'll investigate and see if we can do some kind of quick query that checks the validity of the kms key so we at least fail early. The problem is, I don't think we'll ever get a useful error message from Amazon. My gut says they deliberately don't say the key is invalid for some kind of security-through-obscurity reason, since it seems like they should be failing with a useful message in these situations.

I can't find anything online or in the AWS docs that suggests there's a way to validate kms keys before using them. As far as I can tell, you just have to use them, wait, and check the error when it eventually fails. I don't think this is something we can catch in the prevalidate stage.

From the SDK docs:

    // AWS parses KmsKeyId asynchronously, meaning that the action you call may
    // appear to complete even though you provided an invalid identifier. This action
    // will eventually report failure.
    //
    // The specified CMK must exist in the region that the snapshot is being copied
    // to.

I also checked the SDK code to see whether there was a better error message that we could bubble up to make life easier on you if it does fail with this message. No dice. There's nothing getting returned to make it clear what's going on here. I think we're out of luck on making this a user-friendly experience. :(

If I come across something that I think could improve things, I'll reopen.

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
