Before the AWS SDK had viable wait methods (i.e. to wait for resources to become ready), Packer implemented its own version; see builder/amazon/common/state.go.
We routinely get issues reporting that we're not waiting long enough. I've slowly been replacing the homegrown waiters with the SDK waiters, which generally seems to solve the problem. I think we should remove the homegrown waiters for good and replace them with the SDK waiters.
There is a backwards-compatibility (BC) aspect to consider:
The common/state.go waiters use the env vars AWS_POLL_DELAY_SECONDS and AWS_TIMEOUT_SECONDS to control how long to wait.
The SDK waiters implement defaults for the above values per resource. For example, the waiter for instances to become running looks like this:
func (c *EC2) WaitUntilInstanceRunningWithContext(ctx aws.Context, input *DescribeInstancesInput, opts ...request.WaiterOption) error {
    w := request.Waiter{
        Name:        "WaitUntilInstanceRunning",
        MaxAttempts: 40,
        Delay:       request.ConstantWaiterDelay(15 * time.Second),
        Acceptors: []request.WaiterAcceptor{
            {
                State:   request.SuccessWaiterState,
                Matcher: request.PathAllWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                Expected: "running",
            },
            {
                State:   request.FailureWaiterState,
                Matcher: request.PathAnyWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                Expected: "shutting-down",
            },
            {
                State:   request.FailureWaiterState,
                Matcher: request.PathAnyWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                Expected: "terminated",
            },
            {
                State:   request.FailureWaiterState,
                Matcher: request.PathAnyWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                Expected: "stopping",
            },
            {
                State:    request.RetryWaiterState,
                Matcher:  request.ErrorWaiterMatch,
                Expected: "InvalidInstanceID.NotFound",
            },
        },
        Logger: c.Config.Logger,
        NewRequest: func(opts []request.Option) (*request.Request, error) {
            var inCpy *DescribeInstancesInput
            if input != nil {
                tmp := *input
                inCpy = &tmp
            }
            req, _ := c.DescribeInstancesRequest(inCpy)
            req.SetContext(ctx)
            req.ApplyOptions(opts...)
            return req, nil
        },
    }
    w.ApplyOptions(opts...)
    return w.WaitWithContext(ctx)
}
As you can see, it has the delay hardcoded to 15s and the number of attempts set to 40. Our homegrown waiters use a 5-minute absolute timeout and a 2-second delay.
I believe Packer has a different use case than the SDK defaults. We don't care so much about liveness; instead we care deeply about the request eventually succeeding. I believe we should use the SDK waiters, with the max attempts set to 10x the SDK default.
We must also allow the user to override these values in case the defaults are unacceptable. Ideally we could reuse the existing env vars in a way that makes sense, but we may have to come up with an alternate configuration syntax.
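For illustration, mapping the existing env vars onto SDK waiter options could look roughly like this (a sketch only, not the final implementation; the waiterOptions helper name, the 2s fallback, and the timeout-to-attempts conversion are assumptions):

import (
    "os"
    "strconv"
    "time"

    "github.com/aws/aws-sdk-go/aws/request"
)

// waiterOptions translates Packer's existing env vars into SDK waiter
// options; when nothing is set, the SDK's per-waiter defaults apply.
func waiterOptions() []request.WaiterOption {
    delay := 2 * time.Second // fallback, used only for the conversion below
    var opts []request.WaiterOption
    if v := os.Getenv("AWS_POLL_DELAY_SECONDS"); v != "" {
        if s, err := strconv.Atoi(v); err == nil {
            delay = time.Duration(s) * time.Second
            opts = append(opts, request.WithWaiterDelay(request.ConstantWaiterDelay(delay)))
        }
    }
    if v := os.Getenv("AWS_TIMEOUT_SECONDS"); v != "" {
        if s, err := strconv.Atoi(v); err == nil {
            // The SDK waiters think in attempts rather than absolute time,
            // so convert the timeout into a max-attempts count.
            attempts := int(time.Duration(s) * time.Second / delay)
            opts = append(opts, request.WithWaiterMaxAttempts(attempts))
        }
    }
    return opts
}

Any SDK waiter call could then accept these, e.g. ec2conn.WaitUntilInstanceRunningWithContext(ctx, input, waiterOptions()...).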
Looking at #6330, I find it odd that the default waiters use a constant delay. I wonder if we should add an option to use exponential backoff...
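The SDK's request.WaiterDelay is just a func(attempt int) time.Duration, so an exponential variant could be dropped in where request.ConstantWaiterDelay is used today. A minimal sketch, assuming a doubling schedule with a cap (the function name and parameters are made up):

import (
    "time"

    "github.com/aws/aws-sdk-go/aws/request"
)

// exponentialWaiterDelay doubles the wait on every attempt, starting at base
// and never exceeding max.
func exponentialWaiterDelay(base, max time.Duration) request.WaiterDelay {
    return func(attempt int) time.Duration {
        d := base << uint(attempt)
        if d > max || d <= 0 { // guard against overflow on large attempt counts
            return max
        }
        return d
    }
}

// e.g. request.WithWaiterDelay(exponentialWaiterDelay(2*time.Second, time.Minute))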
I like the approach of using the SDK waiters for this solution - any ETA on having 1.3.0 ready?
It'll probably be at least a month -- I'm working on implementing the waiters right now in https://github.com/hashicorp/packer/pull/6332 and will hopefully be done with testing today. I can build you a binary if you'd like to try it out.
@mwhooker I'm not sure points 2 and 4 are necessary in the way I've implemented it; I have used Packer's current env vars instead of overriding them with anything, so the default is already set by those env vars (point 2) and it should already be documented because nothing's changed (point 4)
Could I please point out that the issue reported in #6305, which ends up being tracked here, is about the timeout happening when we have specifically requested that no timeout should apply via disable_stop_instance.
Although the log output would suggest that a timeout is set:
2018/06/26 10:59:45 ui: ==> amazon-ebs: Automatic instance stop disabled. Please stop instance manually.
2018/06/26 10:59:45 ui: ==> amazon-ebs: Waiting for the instance to stop...
2018/06/26 10:59:45 packer: 2018/06/26 10:59:45 Waiting for state to become: stopped
2018/06/26 10:59:45 packer: 2018/06/26 10:59:45 Using 2s as polling delay (change with AWS_POLL_DELAY_SECONDS)
2018/06/26 10:59:45 packer: 2018/06/26 10:59:45 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
Using Packer 1.0.2, Packer waits indefinitely, as we would want it to.
I have tried using Packer versions 1.2.2 and 1.2.4 and both time out, which is not the behaviour we require if we specify disable_stop_instance.
@nstreet disable_stop_instance just prevents the instance from being stopped when the build has finished running; it doesn't affect timeouts.
@nstreet Looking closer at this, you've found a bug unrelated to the timeouts -- the issue here is that the waiter should be located inside an if statement that it's actually outside of. I can fix that for the next release.
Okay, final comment @nstreet -- I looked into this and it turns out it's a flag for a very very specific use case; for that use case it works fine, but it does not do what you want it to do. I've made a PR updating the docs to make them clearer.
@SwampDragons any chance I could get a build with this fix in it to try out?
@PCSshell sure -- what's your host OS?
@SwampDragons Ubuntu 16.04
@SwampDragons Apparently I am not doing something correctly. Below are the commands I am running to set the timeout settings. In the log it is telling me that the values to override are not set.
$ export AWS_POLL_DELAY_SECONDS=10
$ export AWS_MAX_ATTEMPTS=300
$ export AWS_TIMEOUT_SECONDS=3000
$ export PACKER_LOG=1
$ packer build 2012R2/Server2012R2.json
==> amazon-ebs: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
2018/07/11 09:12:43 ui: ==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Terminating the source AWS instance...
2018/07/11 09:12:44 packer: 2018/07/11 09:12:44 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
$ echo $AWS_MAX_ATTEMPTS
300
$ echo $AWS_POLL_DELAY_SECONDS
10
$ echo $AWS_TIMEOUT_SECONDS
3000
Oh weird. And of course I merged before I saw your comment. I'll take a look at this.
ugh, found it. I made some last-minute changes to make testing easier, and... broke the functionality. 😅
packer.zip
I am still seeing the same behavior. The --version number I am seeing now is [INFO] Packer version: 1.3.0-dev
The same behavior as the previous build I gave you? I just tested again locally, and I only get the "No AWS timeout and polling overrides have been set" warning if I have not set the environment variables. If I set the environment variables as you have above, I get a different (correct) warning -- "WARNING: AWS_MAX_ATTEMPTS and AWS_TIMEOUT_SECONDS are both set...." etc.
I was bouncing back and forth between several issues yesterday so I guess it's possible I built from the wrong commit (I can't test the binary right now since I'm on a darwin machine)... just in case, here's a new build of the master branch.
packer.zip
Looks like it is recognizing the environment variables now, but does not seem to respect what I am entering for them.
export AWS_POLL_DELAY_SECONDS=30
export AWS_MAX_ATTEMPTS=3000
export AWS_TIMEOUT_SECONDS=3000
2018/07/12 13:47:41 ui: ==> amazon-ebs: Automatic instance stop disabled. Please stop instance manually.
==> amazon-ebs: Automatic instance stop disabled. Please stop instance manually.
2018/07/12 13:47:41 ui: ==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
2018/07/12 13:57:38 ui error: ==> amazon-ebs: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
==> amazon-ebs: Terminating the source AWS instance...
2018/07/12 13:57:38 ui: ==> amazon-ebs: Terminating the source AWS instance...
2018/07/12 13:57:38 packer: 2018/07/12 13:57:38 WARNING: AWS_MAX_ATTEMPTS and AWS_TIMEOUT_SECONDS are both set. Packer will be using AWS_MAX_ATTEMPTS and discarding AWS_TIMEOUT_SECONDS. If you have not set AWS_POLL_DELAY_SECONDS, Packer will default to a 2 second poll delay.
2018/07/12 13:58:39 ui: ==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: Cleaning up any extra volumes...
2018/07/12 13:58:39 ui: ==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: No volumes to clean up, skipping
2018/07/12 13:58:39 ui: ==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary security group...
2018/07/12 13:58:40 ui: ==> amazon-ebs: Deleting temporary keypair...
==> amazon-ebs: Deleting temporary keypair...
2018/07/12 13:58:40 [INFO] (telemetry) ending amazon-ebs
2018/07/12 13:58:40 ui error: Build 'amazon-ebs' errored: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
Build 'amazon-ebs' errored: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
Okay, I dug around some more -- it looks like you've found a bug where this code was not governed by our environment variables even before I swapped out the waiters in the attached PR; it just used the waiter's default. I missed it when swapping out our custom code for the AWS waiters, because this particular wait had already been swapped out.
https://github.com/hashicorp/packer/blob/master/builder/amazon/common/step_stop_ebs_instance.go#L81
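In other words, the fix is to make that wait honor the same overrides as every other waiter. Roughly, reusing the hypothetical waiterOptions helper sketched earlier (the actual PR may differ):

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go/service/ec2"
)

// waitForStop blocks until the instance reaches the "stopped" state, passing
// the env-var-derived options instead of relying on the SDK's hard-coded
// defaults.
func waitForStop(ctx context.Context, ec2conn *ec2.EC2, instanceID *string) error {
    err := ec2conn.WaitUntilInstanceStoppedWithContext(ctx,
        &ec2.DescribeInstancesInput{InstanceIds: []*string{instanceID}},
        waiterOptions()...)
    if err != nil {
        return fmt.Errorf("Error waiting for instance to stop: %s", err)
    }
    return nil
}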
Third time's the charm?
packer.zip
@SwampDragons - Working as expected now. Thanks for all the assistance!
No problem -- thanks for your patience as I looked into this.
@SwampDragons I have tried both Packer v1.3 and v1.2.3, but I wasn't able to get rid of the error saying unexpected state 'failed', wanted target 'available'.
I'm currently building 4 different AMIs from Jenkins slaves for 4 different AWS accounts, and three of them work fine, meaning Packer builds AMIs on those three accounts without this error. Only one account gives me the error, even though it's on the same Packer version (v1.2.3) and runs the same Ansible (v2.4.1) Packer code.
I wonder if this might be related to IAM role permissions. I will update once I've compared the IAM permissions and found out more.
==> aws: Copying AMI: us-west-2(ami-0a30229bcbd197499)
==> aws: Encrypting with KMS Key ID: arn:aws:kms:us-west-2:8xxxxxxxxxxxx:key/be91ec52-a7ed-4b6b-xxxx-d6594f71axxxxx
==> aws: Waiting for AMI copy to become ready...
==> aws: Error waiting for AMI Copy: unexpected state 'failed', wanted target 'available'
==> aws: Deregistering the AMI because cancellation or error...
==> aws: Terminating the source AWS instance...
==> aws: Cleaning up any extra volumes...
==> aws: No volumes to clean up, skipping
==> aws: Deleting temporary security group...
==> aws: Deleting temporary keypair...
Build 'aws' errored: Error waiting for AMI Copy: unexpected state 'failed', wanted target 'available'
==> Some builds didn't complete successfully and had errors:
--> aws: Error waiting for AMI Copy: unexpected state 'failed', wanted target 'available'
==> Builds finished but no artifacts were created.
On the Jenkins slave machine which doesn't work, I have run pip list to make sure all pip-installed components (awscli, pip, ansible, requests, etc.) are the same as on the others. The version discrepancies are minimal and the installs are almost identical.
I don't understand why the other three accounts build fine without the issue, but one of the 4 accounts shows the error on Packer 1.2.3, 1.2.4 or 1.2.5 (above) and Packer 1.3.0 (below).
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Detected home directory from env var: /home/jenkins-slave
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Plugin address: unix /tmp/packer-plugin482621616
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Waiting for connection...
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Serving a plugin connection...
2018/08/29 01:23:06 Loading provisioner: ansible-local
2018/08/29 01:23:06 Plugin could not be found. Checking same directory as executable.
2018/08/29 01:23:06 Current exe path: /usr/bin/packer
2018/08/29 01:23:06 Creating plugin client for path: /usr/bin/packer
2018/08/29 01:23:06 Starting plugin: /usr/bin/packer []string{"/usr/bin/packer", "plugin", "packer-provisioner-ansible-local"}
2018/08/29 01:23:06 Waiting for RPC address for: /usr/bin/packer
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 [INFO] Packer version: 1.3.0-dev
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Packer Target OS/Arch: linux amd64
==> aws: Stopping the source instance...
2018/08/29 01:35:16 ui: ==> aws: Stopping the source instance...
aws: Stopping instance, attempt 1
2018/08/29 01:35:16 ui: aws: Stopping instance, attempt 1
==> aws: Waiting for the instance to stop...
2018/08/29 01:35:16 ui: ==> aws: Waiting for the instance to stop...
2018/08/29 01:35:16 packer: 2018/08/29 01:35:16 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
2018/08/29 01:35:47 ui: ==> aws: Creating the AMI: sss-centos7-test
==> aws: Creating the AMI: sss-centos7-test
2018/08/29 01:35:47 ui: aws: AMI: ami-06059b63f16b53d61
==> aws: Waiting for AMI to become ready...
2018/08/29 01:35:47 ui: ==>aws: Waiting for AMI to become ready..
2018/08/29 01:35:47 packer: 2018/08/29 01:35:47 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
2018/08/29 01:38:33 ui: ==> aws: Creating Encrypted AMI Copy
==> aws: Creating Encrypted AMI Copy
2018/08/29 01:38:33 ui: ==> aws: Copying AMI: us-west-2(ami-06059b63f16b53d61)
==> aws: Copying AMI: us-west-2(ami-06059b63f16b53d61)
2018/08/29 01:38:33 ui: ==> aws: Encrypting with KMS Key ID: arn:aws:kms:us-west-2:xxxxxxxxx:key/be91ec52-a7ed-4b6b-b8cd-d6594f71a044
==> aws: Encrypting with KMS Key ID: arn:aws:kms:us-west-2:xxxxxxxxxxx:key/be91ec52-a7ed-4b6b-b8cd-d6594f71a044
2018/08/29 01:38:33 ui: ==> aws: Waiting for AMI copy to become ready...
==>aws: Waiting for AMI copy to become ready...
2018/08/29 01:38:33 packer: 2018/08/29 01:38:33 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
2018/08/29 01:48:23 ui error: ==> aws: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
2018/08/29 01:48:23 ui: ==> aws: Deregistering the AMI because cancellation or error...
==> aws: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
==> aws: Deregistering the AMI because cancellation or error...
2018/08/29 01:48:23 ui: ==> aws: Terminating the source AWS instance...
==> aws: Terminating the source AWS instance...
2018/08/29 01:48:38 ui: ==> aws: Deleting temporary security group...
2018/08/29 01:48:39 ui: ==> aws: Deleting temporary keypair...
==> aws: Deleting temporary keypair...
2018/08/29 01:48:39 [INFO] (telemetry) ending amazon-ebs
2018/08/29 01:48:39 ui error: Build 'aws' errored: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
2018/08/29 01:48:39 Builds completed. Waiting on interrupt barrier...
2018/08/29 01:48:39 machine readable: error-count []string{"1"}
2018/08/29 01:48:39 ui error:
==> Some builds didn't complete successfully and had errors:
2018/08/29 01:48:39 machine readable: aws,error []string{"Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts"}
2018/08/29 01:48:39 ui error: --> aws-ext-il2: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
2018/08/29 01:48:39 ui:
==> Builds finished but no artifacts were created.
Build 'aws' errored: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
==> Some builds didn't complete successfully and had errors:
--> aws: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
==> Builds finished but no artifacts were created.
2018/08/29 01:48:39 [INFO] (telemetry) Finalizing.
2018/08/29 01:48:39 waiting for all plugin processes to complete...
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
script returned exit code 1
Any suggestions?
I'll try to look at this today.
@SwampDragons Thanks for your attention and I added some logs at https://github.com/WeekendsBull/packerbuild-error/tree/master/packer_debug
I put up both CentOS and Windows 2012R2 build logs, with ones that work and ones that fail. I have only provided the Packer v1.2.3 logs; would you like me to provide the v1.3.0-dev ones as well?
I also put up an Excel file showing what I have installed on each Jenkins slave instance, but the installed versions are almost identical.
Here is a copy of the CentOS template I use: https://github.com/WeekendsBull/packerbuild-error/blob/master/centos7build.json (each build name is associated with an account).
@WeekendsBull I figured this out and responded to you in the mailing list, but I'll leave a note here in case future users run into the same problem. You need to increase the AWS timeout by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS.
For example:
export AWS_POLL_DELAY_SECONDS=5
export AWS_MAX_ATTEMPTS=300
packer build example_template.json
or simply:
AWS_POLL_DELAY_SECONDS=5 AWS_MAX_ATTEMPTS=300 packer build example_template.json
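With those values the waiter polls every 5 seconds up to 300 times, i.e. it keeps trying for roughly 25 minutes before giving up.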
Using Packer version 1.4.2 and still hitting this issue, in spite of setting AWS_POLL_DELAY_SECONDS=5 AWS_MAX_ATTEMPTS=300 while doing a packer build. Here is my base.json...
{
  "variables": {
    "region": "us-east-1",
    "ssh_username": "ec2-user",
    "instance_type": "t2.micro",
    "vpc_id": "vpc-xxxxx",
    "subnet_id": "subnet-YYYY"
  },
  "builders": [
    {
      "type": "amazon-ebs",
      "ami_name": "eks-ami-sample-{{timestamp}}",
      "region": "{{user `region` }}",
      "vpc_id": "{{user `vpc_id`}}",
      "subnet_id": "{{user `subnet_id`}}",
      "instance_type": "{{user `instance_type` }}",
      "ssh_username": "{{user `ssh_username`}}",
      "associate_public_ip_address": false,
      "source_ami_filter": {
        "filters": {
          "virtualization-type": "hvm",
          "source": "amazon/amazon-eks-node-1.13-*",
          "root-device-type": "ebs",
          "architecture": "x86_64"
        },
        "owners": [
          "602401143452"
        ],
        "most_recent": true
      }
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "./scripts/touch_it.sh"
    }
  ]
}
This issue is a year old -- if you're still experiencing problems then please open a new ticket with full steps to reproduce and full debug logs.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.