We are using Packer for jobs as part of an automatic, serverless CI pipeline. (In particular, it is called manually from an Azure Pipeline to handle multiple GPU builds, with a default of 0 GPU instances.) So a CI trigger hits, a Pipeline task launches a Packer job, and Packer either succeeds, fails, or the Pipeline task hits a timeout and ends. The problem is, the next morning we have a whole bunch of idling Packer builder GPU instances. That gets prohibitively expensive really quickly.
Ideally:
-- A fleet of GPU Packer Builders wouldn't just be sitting there, idling, the next day <-- bug?
-- The builder could be configured to auto-terminate itself after a timeout
-- The builder would have an AWS launch hook for external monitoring in case the builder gets into a bad state
There may be best practices here that we're missing, but the current buggy behavior and the lack of easy defense layers have us scratching our heads about what the typical setup is for automatic, "set-and-forget" production use.
Any guidance or best practices here would be appreciated.
Hard to repro, sample builder:
{
  "type": "amazon-ebs",
  "ami_name": "zzz-{{user `version` | clean_resource_name}}",
  "source_ami_filter": {
    "filters": {
      "image-id": "{{user `zzz_ami`}}"
    },
    "most_recent": true,
    "owners": "zzz"
  },
  "force_deregister": true,
  "force_delete_snapshot": true,
  "shutdown_behavior": "terminate",
  "ssh_username": "ubuntu",
  "launch_block_device_mappings": [
    {
      "device_name": "/dev/sda1",
      "volume_size": 150,
      "volume_type": "gp2",
      "delete_on_termination": true
    }
  ],
  "tags": {
    "created": "{{isotime | clean_resource_name}}"
  },
  "instance_type": "p3.2xlarge",
  "region": "us-east-1",
  "ebs_optimized": true,
  "ami_description": "zzzz"
}
Packer version: 1.4.3
Simplified Packer template: see above
Operating system and environment details: Azure Pipelines Linux tasks call Packer to run AWS Ubuntu 18.04 GPU jobs, which themselves may spawn more of the same.
Log fragments: can update the next time we have such a box.
Hi, thanks for reaching out.
Packer _is_ supposed to clean up instances after it runs, so if that isn't the behavior you're seeing, that's very bad. I definitely need those logs in order to take a closer look, since nothing in your template looks problematic to me. The only time I've seen this kind of behavior before is when people use AWS credentials with a short expiry, and the credentials expire before the build ends, leaving Packer unable to clean up the instances because it lacks the permission to do so.
For what it's worth, you can set timeouts on specific long-running provisioners: https://www.packer.io/docs/templates/provisioners.html#timeout
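As a rough sketch of what that looks like in a JSON template (the script name and duration below are placeholders, not taken from your setup):

```json
"provisioners": [
  {
    "type": "shell",
    "script": "./provision-gpu.sh",
    "timeout": "45m"
  }
]
```

If a provisioner exceeds its timeout, the build fails and Packer runs its normal cleanup.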
When a build gets cancelled by CI, it should still clean up the running instances unless you're setting on-error=abort (docs: https://www.packer.io/docs/commands/build.html#on-error-cleanup).
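That's a flag on the build invocation, and cleanup is the default, so something in your CI scripts would have to be overriding it. For example (template.json standing in for your template):

```sh
# Default: on error or interrupt, Packer tears down the instance, volumes, temporary keypair, etc.
packer build -on-error=cleanup template.json

# Only this setting leaves everything untouched for debugging
packer build -on-error=abort template.json
```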
Please update with logs when you get the chance, and let me know if I'm onto something with your credential situation, and we can move from there :)
Thank you!
-- RE: creds, not in this case; we've seen those issues before :(
-- RE: on-error=abort, nope, same as above ^^^ unless there are some env vars to hunt down somehow?
I'll watch for logs; it may be a few days to a week to both replicate and get extractable logs (settings fiddling between incidents). I just added more precise job tagging across CI-stage AMIs to make correlation easier.
Just a quick update:
-- Still happening, so it's been a lot of whack-a-mole (20 GPUs today 😅)
-- We're adding some external Packer builder monitoring & killers in our CI system (GitHub Actions / Azure Pipelines), which is more manual & ugly than we'd hoped; see the sketch after this list
-- This makes me increasingly think that Packer should support some defense-in-depth for cleanup, e.g., explicit timeout support ("builder_timeout_minutes": 60, "builder_timeout_action": "stop"), and enforce it both inside the image and outside
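Roughly, the external killer is a scheduled CI step along the lines of the sketch below. The 2-hour cutoff and the reliance on the default "Packer Builder" Name tag are illustrative; a real version would also filter on job-specific correlation tags:

```sh
#!/usr/bin/env bash
# Illustrative reaper: terminate Packer builder instances that have been running
# longer than a cutoff. The tag filter and cutoff are examples, not exact config.
set -euo pipefail

REGION="us-east-1"
CUTOFF=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S)

# Packer's amazon-ebs builder names its temporary instances "Packer Builder" by default.
IDS=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=tag:Name,Values=Packer Builder" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[?LaunchTime<='${CUTOFF}'].InstanceId" \
  --output text)

if [ -n "$IDS" ]; then
  # Word-splitting of the ID list is intentional here.
  aws ec2 terminate-instances --region "$REGION" --instance-ids $IDS
fi
```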
The new boxes now have correlation IDs, so we should be able to get some reports. My guess is something like: "CI times out the task, the task dies and kills the Packer runner, the Packer runner dies before it can externally clean up, the Packer builder doesn't realize it's a zombie, and the AWS bill explodes."
@lmeyerov: for timeouts, you can set a timeout per provisioner step: https://www.packer.io/docs/templates/provisioners.html#timeout
Yeah, we don't have a broad overall build timeout, but you can set that provisioner timeout for each step, which will handle the problem if your build is becoming a zombie because the provisioning process never completes. Many of the other build steps already have built-in timeouts. As for an "inside the image" timeout, I'm not clear on what that would look like. We don't manage the VM lifecycle from inside the image; we manage it using API calls from the build host. And having an inner script shut down the VM would require us to make assumptions about your operating system, which Packer tries very hard not to do. I suppose you could launch such a script as your first provisioner if that's something you want, but it probably wouldn't handle deleting volumes associated with the instance.
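For illustration, if you did want to experiment with that, a first-provisioner dead-man's switch might look like the sketch below (the durations and the script name are placeholders). Since your template sets "shutdown_behavior": "terminate", an in-instance shutdown should at least terminate the instance even if the Packer process on the build host has died, though resources Packer created around it (temporary keypair, security group, possibly volumes) would still be leaked:

```json
"provisioners": [
  {
    "type": "shell",
    "inline": [
      "sudo shutdown -h +120 'packer dead-man timeout reached'"
    ]
  },
  {
    "type": "shell",
    "script": "./your-real-build.sh",
    "timeout": "90m"
  }
]
```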
If you get a chance, try to get a list of the running Packer processes (`ps aux | grep packer` would get me the relevant info) on the build host for one of your zombie instances. I'd be interested to see what, if any, Packer processes are still active after the build has been terminated.
Again, we can't really look into this until you can give us logs.
Any news on whether you can get hold of those logs? Otherwise, we can close.
It's still my guess that your Packer build is being OOM killed or something similar, and it's not something we'd be able to recover from. Otherwise, I'd expect to see this kind of thing happening far more often.
@SwampDragons Thanks for checking in!
So we worked around it by running a Packer-killer (well, stopper!) on job failures
... so now we have a few failed-to-terminate Packer instances from every day piling up in our AWS account :)
But unfortunately, I just checked and failed to log into one of the stopped, historic, failed Packer AWS instances, as the keypair name is an autogenerated throwaway "packer_5da3fe...". I can temporarily set our CI to use ssh_password so we can access future ones, and revisit next week. Would you recommend tweaking any other settings?
Hm, to make the box accessible, I tried setting:
"builders": [
{
"ssh_password": "#####",
=>
==> amazon-ebs: Waiting for SSH to become available...
2019/10/18 05:05:29 packer: 2019/10/18 05:05:29 [INFO] Waiting for SSH, up to timeout: 5m0s
2019/10/18 05:05:44 packer: 2019/10/18 05:05:44 [DEBUG] TCP connection to SSH ip/port failed: dial tcp 35.174.12.204:22: i/o timeout
2019/10/18 05:05:50 packer: 2019/10/18 05:05:50 [DEBUG] TCP connection to SSH ip/port failed: dial tcp 35.174.12.204:22: connect: connection refused
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [INFO] Attempting SSH connection to 35.174.12.204:22...
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] Config to &ssh.Config{SSHConfig:(*ssh.ClientConfig)(0xc00036aa90), Connection:(func() (net.Conn, error))(0x1176ed0), Pty:false, DisableAgentForwarding:false, HandshakeTimeout:0, UseSftp:false, KeepAliveInterval:5000000000, Timeout:0}...
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] reconnecting to TCP connection for SSH
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] handshaking with SSH
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] SSH handshake err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none], no supported methods remain
Is there a recommended way to get logs off packer builders?
Just give me those logs you're showing there. We don't leave logs on the remote VMs.
I don't need you to change the SSH key or anything to gain special post-build access to the machines. I just want to see the logs output by the `packer build` call on the machine running Packer.
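If it helps, those are captured by turning on Packer's debug logging on the machine that invokes Packer, e.g. (the template name here is just a placeholder):

```sh
# Run on the CI agent that invokes Packer, not on the builder instance
PACKER_LOG=1 PACKER_LOG_PATH="packer-build.log" packer build template.json
```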
Closing since we've never gotten more information on this; if we can get hold of Packer logs and more information about the machine running the Packer builds, we can reopen then.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.