This is Packer version 1.4.2 on Ubuntu 19.04.
Basically, snapshots take a really long time, and it seems like whatever Packer uses internally to keep track of when a snapshot completes isn't working. I've got a build right now that's been sitting for ~15 minutes saying it's waiting for the snapshot to finish, but the snapshot has been done for a while now.
Eventually, Packer times out, and then I get errors like this on cleanup:
==> amazon-ebssurrogate: Removing snapshots since we cancelled or halted...
==> amazon-ebssurrogate: Error: RequestExpired: Request has expired.
==> amazon-ebssurrogate: status code: 400, request id: 0c8cb2f6-050c-4ba3-9da6-477494fca5f0
==> amazon-ebssurrogate: Terminating the source AWS instance...
==> amazon-ebssurrogate: Error terminating instance, may still be around: RequestExpired: Request has expired.
==> amazon-ebssurrogate: status code: 400, request id: f45bea2f-9167-4990-8dbf-6f4b7b22acd2
==> amazon-ebssurrogate: Cleaning up any extra volumes...
==> amazon-ebssurrogate: Error describing volumes: RequestExpired: Request has expired.
==> amazon-ebssurrogate: status code: 400, request id: 6b5507b8-baa1-4fae-8244-d2f3580c8dd2
==> amazon-ebssurrogate: Deleting temporary security group...
==> amazon-ebssurrogate: Error cleaning up security group. Please delete the group manually: sg-03a47a3f737fbc2a6
Build 'amazon-ebssurrogate' errored: 1 error occurred:
* RequestCanceled: waiter context canceled
caused by: context canceled
I'm not sure which request has expired, but I'd definitely prefer that Packer retry operations like this when they fail.
Can you share the full logs, or at least the portion of the logs where the error originally appears in the snapshotting step?
The problem is that Packer doesn't see an error in the snapshot step; it just never sees the snapshot complete and eventually times out the operation. The snapshot itself completes successfully, but Packer never notices, so there's no error to show in the snapshot step because no error actually happens.
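For context on what that wait boils down to: the builder repeatedly describes the snapshot and checks its state until it reports `completed`, and if the allotted attempts run out first it gives up even though the snapshot may still finish fine on the AWS side. This is a minimal sketch of that kind of polling loop using aws-sdk-go v1, not Packer's actual code; the snapshot ID, attempt count, and delay are placeholders.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	conn := ec2.New(session.Must(session.NewSession()))
	snapshotID := "snap-0123456789abcdef0" // placeholder ID

	// Poll DescribeSnapshots until the snapshot's State is "completed" or we
	// run out of attempts. If the attempts run out first, the build fails even
	// though the snapshot itself may complete successfully moments later.
	const maxAttempts = 40
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := conn.DescribeSnapshots(&ec2.DescribeSnapshotsInput{
			SnapshotIds: []*string{aws.String(snapshotID)},
		})
		if err != nil {
			log.Fatalf("describe snapshots: %v", err)
		}
		if len(out.Snapshots) > 0 && aws.StringValue(out.Snapshots[0].State) == "completed" {
			fmt.Println("snapshot completed")
			return
		}
		time.Sleep(15 * time.Second)
	}
	log.Fatal("gave up waiting for snapshot to complete")
}
```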
Are you using an IAM user, or some kind of credentials with a timed expiry? I've seen this kind of problem when the Packer user's credentials expire in the middle of the build.
Ahhh yes, this is probably it. We're using aws-vault to coordinate multiple profiles, accounts, and assumed roles, and it probably has a credential expiry mechanism built in. Once I've got things on #7968 sorted, I'm sure I'll be able to figure this out.
OK, it's not aws-vault. I've manually extended the credential lifetime to 1h, but Packer 1.4.3 dies after 21 minutes with:
==> amazon-ebssurrogate: Creating snapshot of EBS Volume vol-08b7c17eff33f905f...
==> amazon-ebssurrogate: 1 error occurred:
==> amazon-ebssurrogate: * ResourceNotReady: exceeded wait attempts
==> amazon-ebssurrogate:
==> amazon-ebssurrogate:
==> amazon-ebssurrogate: Removing snapshots since we cancelled or halted...
==> amazon-ebssurrogate: Error: RequestExpired: Request has expired.
==> amazon-ebssurrogate: status code: 400, request id: 01e33829-afd6-423f-bc00-1e514d51f7b8
==> amazon-ebssurrogate: Terminating the source AWS instance...
==> amazon-ebssurrogate: Error terminating instance, may still be around: RequestExpired: Request has expired.
==> amazon-ebssurrogate: status code: 400, request id: 34d1d614-3704-4350-a47b-743eaaa5f1da
==> amazon-ebssurrogate: Cleaning up any extra volumes...
==> amazon-ebssurrogate: Error describing volumes: RequestExpired: Request has expired.
==> amazon-ebssurrogate: status code: 400, request id: cc8118d1-5956-4088-8031-380dd7e73466
==> amazon-ebssurrogate: Deleting temporary security group...
==> amazon-ebssurrogate: Error cleaning up security group. Please delete the group manually: sg-0bc18001e820fdba8
Build 'amazon-ebssurrogate' errored: 1 error occurred:
* ResourceNotReady: exceeded wait attempts
The snapshot definitely completed before this timeout message, too. It looks like Packer is using some sort of AWS request object that's expiring, so it never sees the snapshot complete.
Ah, but THAT error message is due to a configurable wait! You can tweak how long the AWS SDK will poll before giving up with the environment variables AWS_MAX_ATTEMPTS and AWS_POLL_DELAY_SECONDS. They multiply together to give how long the waiter will run before timing out. For example, AWS_POLL_DELAY_SECONDS=10 and AWS_MAX_ATTEMPTS=600 means "try every 10 seconds, up to 600 tries", so the waiter will poll for 6000 seconds (roughly 1 hour 40 minutes) before timing out. You can tweak those variables until you get a long enough wait.
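Those two env vars map onto the SDK waiter's delay and max-attempts settings. As a rough illustration of the equivalent override when calling the waiter directly (a sketch using aws-sdk-go v1, with a placeholder snapshot ID; the waiter options are real SDK functions, but the surrounding program is illustrative, not Packer's code):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	conn := ec2.New(session.Must(session.NewSession()))

	// Roughly equivalent to AWS_POLL_DELAY_SECONDS=10 and AWS_MAX_ATTEMPTS=600:
	// poll every 10 seconds, up to 600 attempts, i.e. 6000 seconds (~1h40m)
	// before the waiter gives up with "ResourceNotReady: exceeded wait attempts".
	err := conn.WaitUntilSnapshotCompletedWithContext(context.Background(),
		&ec2.DescribeSnapshotsInput{
			SnapshotIds: []*string{aws.String("snap-0123456789abcdef0")}, // placeholder ID
		},
		request.WithWaiterDelay(request.ConstantWaiterDelay(10*time.Second)),
		request.WithWaiterMaxAttempts(600),
	)
	if err != nil {
		log.Fatalf("snapshot wait failed: %v", err)
	}
	log.Println("snapshot completed")
}
```

In practice you just set the env vars in the environment where you run `packer build`, e.g. `AWS_POLL_DELAY_SECONDS=10 AWS_MAX_ATTEMPTS=600 packer build ...`.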
Ahhh perfect, that seems to have worked correctly. Thanks!
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.