We are using Packer for jobs as part of an automatic, serverless CI pipeline. (In particular, it is called manually from an Azure Pipeline to handle multiple GPU builds, with a default of 0 GPU instances.) So a CI trigger hits, a Pipeline task launches a Packer job, and Packer either succeeds, fails, or the Pipeline task hits a timeout and ends. The problem is, the next morning we have a whole bunch of idling Packer builder GPU instances. That gets prohibitively expensive really quickly.
Ideally:
-- A fleet of GPU Packer Builders wouldn't just be sitting there, idling, the next day <-- bug?
-- The builder could be configured to auto-terminate itself after a timeout
-- The builder would have an AWS launch hook for external monitoring in case the builder gets into a bad state
There may be best practices here that we're missing, but the current buggy behavior and the lack of easy defense layers have us scratching our heads about what the typical setup is for automatic, "set-and-forget" production use.
Any guidance or best practices here would be appreciated.
Hard to repro, sample builder:
{
  "type": "amazon-ebs",
  "ami_name": "zzz-{{user `version` | clean_resource_name}}",
  "source_ami_filter": {
    "filters": {
      "image-id": "{{user `zzz_ami`}}"
    },
    "most_recent": true,
    "owners": "zzz"
  },
  "force_deregister": true,
  "force_delete_snapshot": true,
  "shutdown_behavior": "terminate",
  "ssh_username": "ubuntu",
  "launch_block_device_mappings": [
    {
      "device_name": "/dev/sda1",
      "volume_size": 150,
      "volume_type": "gp2",
      "delete_on_termination": true
    }
  ],
  "tags": {
    "created": "{{isotime | clean_resource_name}}"
  },
  "instance_type": "p3.2xlarge",
  "region": "us-east-1",
  "ebs_optimized": true,
  "ami_description": "zzzz"
}
Packer version: 1.4.3
Simplified Packer template: see above
Operating system and environment details: Azure Pipelines Linux tasks call Packer to run AWS Ubuntu 18.04 GPU jobs, which themselves may spawn more of the same.
Log fragments: can update the next time we have such a box.
Hi, thanks for reaching out.
Packer _is_ supposed to clean up instances after it runs, so if that isn't the behavior you're seeing, that's very bad. I definitely need those logs in order to take a closer look, since nothing in your template looks problematic to me. The only time I've seen this kind of behavior before is when people use AWS credentials with a short expiry, and the credentials expire before the build ends, leaving Packer unable to clean up the instances because it lacks the permission to do so.
For what it's worth, you can set timeouts on specific long-running provisioners: https://www.packer.io/docs/templates/provisioners.html#timeout
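As a rough sketch of what that looks like in a JSON template (the script name and duration below are placeholders, not taken from your setup):

```json
"provisioners": [
  {
    "type": "shell",
    "script": "./provision-gpu.sh",
    "timeout": "45m"
  }
]
```

If a provisioner exceeds its timeout, the build fails and Packer runs its normal cleanup.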
When a build gets cancelled by CI, it should still clean up the running instances unless you're setting on-error=abort (docs: https://www.packer.io/docs/commands/build.html#on-error-cleanup).
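That's a flag on the build invocation, and cleanup is the default, so something in your CI scripts would have to be overriding it. For example (template.json standing in for your template):

```sh
# Default: on error or interrupt, Packer tears down the instance, volumes, temporary keypair, etc.
packer build -on-error=cleanup template.json

# Only this setting leaves everything untouched for debugging
packer build -on-error=abort template.json
```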
Please update with logs when you get the chance, and let me know if I'm onto something with your credential situation, and we can move from there :)
Thank you!
-- RE: creds, not in this case; we've seen those issues before :(
-- RE: on-error=abort, nope, same as above ^^^ unless there are some env vars to hunt down somehow?
I'll watch for logs; it may be a few days to a week to both replicate and get extractable logs (settings fiddling between incidents). I just added more precise job tagging across CI-stage AMIs to make correlation easier.
Just a quick update:
-- Still happening, so it's been a lot of whack-a-mole (20 GPUs today 😅)
-- We're adding some external Packer builder monitoring & killers in our CI system (GitHub Actions / Azure Pipelines), which is more manual & ugly than we'd hoped; see the sketch after this list
-- This makes me increasingly think that Packer should support some defense-in-depth for cleanup, e.g., explicit timeout support ("builder_timeout_minutes": 60, "builder_timeout_action": "stop"), and enforce it both inside the image and outside
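Roughly, the external killer is a scheduled CI step along the lines of the sketch below. The 2-hour cutoff and the reliance on the default "Packer Builder" Name tag are illustrative; a real version would also filter on job-specific correlation tags:

```sh
#!/usr/bin/env bash
# Illustrative reaper: terminate Packer builder instances that have been running
# longer than a cutoff. The tag filter and cutoff are examples, not exact config.
set -euo pipefail

REGION="us-east-1"
CUTOFF=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S)

# Packer's amazon-ebs builder names its temporary instances "Packer Builder" by default.
IDS=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=tag:Name,Values=Packer Builder" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[?LaunchTime<='${CUTOFF}'].InstanceId" \
  --output text)

if [ -n "$IDS" ]; then
  # Word-splitting of the ID list is intentional here.
  aws ec2 terminate-instances --region "$REGION" --instance-ids $IDS
fi
```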
The new boxes now have correlation IDs, so we should be able to get some reports. My guess is something like: "CI times out the task, the task dies and kills the Packer runner, the Packer runner dies before it can externally clean up, the Packer builder doesn't realize it's a zombie, and the AWS bill explodes."
@lmeyerov: for timeouts, you can set a timeout per provisioner step: https://www.packer.io/docs/templates/provisioners.html#timeout
Yeah, we don't have a broad overall build timeout, but you can set that provisioner timeout for each step, which will handle the problem if your build is becoming a zombie because the provisioning process never completes. Many of the other build steps already have built-in timeouts. As for an "inside the image" timeout, I'm not clear on what that would look like. We don't manage the VM lifecycle from inside the image; we manage it using API calls from the build host. And having an inner script shut down the VM would require us to make assumptions about your operating system, which Packer tries very hard not to do. I suppose you could launch such a script as your first provisioner if that's something you want, but it probably wouldn't handle deleting volumes associated with the instance.
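For illustration, if you did want to experiment with that, a first-provisioner dead-man's switch might look like the sketch below (the durations and the script name are placeholders). Since your template sets "shutdown_behavior": "terminate", an in-instance shutdown should at least terminate the instance even if the Packer process on the build host has died, though resources Packer created around it (temporary keypair, security group, possibly volumes) would still be leaked:

```json
"provisioners": [
  {
    "type": "shell",
    "inline": [
      "sudo shutdown -h +120 'packer dead-man timeout reached'"
    ]
  },
  {
    "type": "shell",
    "script": "./your-real-build.sh",
    "timeout": "90m"
  }
]
```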
If you get a chance, try to get a list of the running Packer processes (`ps aux | grep packer` would get me the relevant info) on the build host for one of your zombie instances. I'd be interested to see what, if any, Packer processes are still active after the build has been terminated.
Again, we can't really look into this until you can give us logs.
Any news on whether you can get hold of those logs? Otherwise, we can close.
It's still my guess that your Packer build is being OOM killed or something similar, and it's not something we'd be able to recover from. Otherwise, I'd expect to see this kind of thing happening far more often.
@SwampDragons Thanks for checking in!
So we worked around it by running a Packer-killer (well, stopper!) on job failures
... so now we have a few failed-to-terminate Packer instances from every day piling up in our AWS account :)
But unfortunately, I just checked and failed to log into one of the stopped, historic, failed Packer AWS instances, as the keypair name is an autogenerated throwaway "packer_5da3fe...". I can temporarily set our CI to use ssh_password so we can access future ones, and revisit next week. Would you recommend tweaking any other settings?
Hm, to make the box accessible, I tried setting:
"builders": [
{
"ssh_password": "#####",
=>
==> amazon-ebs: Waiting for SSH to become available...
2019/10/18 05:05:29 packer: 2019/10/18 05:05:29 [INFO] Waiting for SSH, up to timeout: 5m0s
2019/10/18 05:05:44 packer: 2019/10/18 05:05:44 [DEBUG] TCP connection to SSH ip/port failed: dial tcp 35.174.12.204:22: i/o timeout
2019/10/18 05:05:50 packer: 2019/10/18 05:05:50 [DEBUG] TCP connection to SSH ip/port failed: dial tcp 35.174.12.204:22: connect: connection refused
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [INFO] Attempting SSH connection to 35.174.12.204:22...
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] Config to &ssh.Config{SSHConfig:(*ssh.ClientConfig)(0xc00036aa90), Connection:(func() (net.Conn, error))(0x1176ed0), Pty:false, DisableAgentForwarding:false, HandshakeTimeout:0, UseSftp:false, KeepAliveInterval:5000000000, Timeout:0}...
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] reconnecting to TCP connection for SSH
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] handshaking with SSH
2019/10/18 05:05:55 packer: 2019/10/18 05:05:55 [DEBUG] SSH handshake err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none], no supported methods remain
Is there a recommended way to get logs off packer builders?
Just give me those logs you're showing there. We don't leave logs on the remote VMs.
I don't need you to change the SSH key or anything to gain special post-build access to the machines. I just want to see the logs output by the `packer build` call on the machine running Packer.
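If it helps, those are captured by turning on Packer's debug logging on the machine that invokes Packer, e.g. (the template name here is just a placeholder):

```sh
# Run on the CI agent that invokes Packer, not on the builder instance
PACKER_LOG=1 PACKER_LOG_PATH="packer-build.log" packer build template.json
```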
Closing since we've never gotten more information on this; if we can get hold of Packer logs and more information about the machine running the Packer builds, we can reopen then.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.