Terraform-provider-google: google_compute_instance cannot properly use the create_before_destroy lifecycle

Created on 2 Nov 2018  ·  6 Comments  ·  Source: hashicorp/terraform-provider-google


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.11.10

Tested with both google and google-beta provider versions:

* provider.google: version = "~> 1.17"
* provider.google-beta: version = "~> 1.19"

Affected Resource(s)

  • google_compute_instance

Terraform Configuration Files

resource "google_compute_instance" "my_vm" {
  depends_on                  = ["google_project_service.compute"]
  project                     = "${module.project.id}"
  name                        = "my-vm"
  machine_type                = "${var.vm_machine_type}"
  zone                        = "${var.google_region}-a"
  allow_stopping_for_update   = true
  network_interface {
    subnetwork  = "${element(module.project.subnetwork_uris, 0)}"
    access_config {
      // Ephemeral IP
    }
  }
  boot_disk {
    device_name = "my-vm"
    initialize_params {
      type  = "${var.vm_boot_disk_type}"
      image = "..."
      size  = "${var.vm_boot_disk_size_gb}"
    }
  }
  metadata {
    startup-script = "${file("${var.vm_startup_script_path}")}"
  }
  lifecycle {
    create_before_destroy = true
  }
}

Debug Output

Pretty sure the error alone should be enough:

Error: Error applying plan:

1 error(s) occurred:

* google_compute_instance.my_vm: 1 error(s) occurred:

2018-11-01T17:13:57.624-0600 [DEBUG] plugin.terraform-provider-google_v1.17.1_x4: 2018/11/01 17:13:57 [ERR] plugin: plugin server: accept unix /tmp/plugin104039794: use of closed network connection
* google_compute_instance.my_vm: Error creating instance: googleapi: Error 409: The resource 'projects/.../zones/.../instances/my-vm' already exists, alreadyExists

Expected Behavior

We should be able to create a new instance of the same name in the same project/zone before destroying the previous one.

Actual Behavior

I don't think create_before_destroy = true can work when a VM is being replaced in the same project/zone with the same name, unless I'm missing some other clever way to do this beyond the lifecycle block.

Steps to Reproduce

With the above google_compute_instance definition:

  1. terraform apply ...
  2. terraform taint google_compute_instance.my_vm
  3. terraform apply ...

Because the lifecycle rule creates the replacement instance before destroying the old one, you'll hit the error above.

Important Factoids

I realize this issue may simply be considered "won't fix" because of GCP's lack of support for renaming an existing instance, but I figured I'd bring it up nonetheless in case there are ideas for solutions or workarounds. The name_prefix option is a clear workaround, and what Martin suggested on the linked Terraform ticket, but it can present complications in automated tooling around the VM: automated tasks interacting with this VM would need to be made aware of the new name after the replacement in order to continue interacting with it successfully. Thanks.
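For reference, a minimal sketch of that unique-name workaround in 0.11 syntax, assuming the random provider is available (the keeper value, image, and naming below are illustrative, not from this thread). A fresh suffix lets create_before_destroy succeed, but only when the replacement is driven by a keeper change (tainting the instance alone still reuses the old name), and tooling still has to track the changing name:

resource "random_id" "vm_suffix" {
  byte_length = 2

  # Changing any keeper forces a new suffix, and therefore a new VM name.
  keepers = {
    boot_disk_size = "${var.vm_boot_disk_size_gb}"
  }
}

resource "google_compute_instance" "my_vm" {
  # Unique per replacement (e.g. "my-vm-0a1b"), so the 409 conflict is avoided.
  name         = "my-vm-${random_id.vm_suffix.hex}"
  machine_type = "${var.vm_machine_type}"
  zone         = "${var.google_region}-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    network = "default"
  }

  lifecycle {
    create_before_destroy = true
  }
}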


All 6 comments

I vote for adding this functionality with the following algorithm:

1) Create VM with name=$name.tmp-suffix
2) Destroy VM name=$name
3) Create VM name=$name
4) Destroy VM name=$name.tmp-suffix

As a result, you have a VM with the desired name, and there's always at least one running VM present.
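For illustration only, the proposed sequence as hypothetical gcloud commands (instance names and zone are placeholders; GCP instance names can't contain dots, so a dash-suffixed temporary name stands in for $name.tmp-suffix):

# 1) Create the temporary replacement VM
gcloud compute instances create my-vm-tmp --zone us-central1-a

# 2) Destroy the old VM
gcloud compute instances delete my-vm --zone us-central1-a --quiet

# 3) Recreate the VM under the desired name
gcloud compute instances create my-vm --zone us-central1-a

# 4) Destroy the temporary VM
gcloud compute instances delete my-vm-tmp --zone us-central1-a --quiet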

@Chupaka I thought through a number of possible options like this too, but none seem to always provide access to a running VM by name. For instance, between steps 2 and 3, only $name.tmp-suffix is accessible, not $name. The inability to rename an instance and the uniqueness constraint on name within GCP may simply mean there is no actual solution here for zero downtime when replacing an instance by exact name.

@rockholla is correct - it is not possible to replace an instance with zero downtime. From our perspective, instances are primitives - they don't support high-level operations like "replace without name change". That functionality is provided by the more complex and advanced resources built on top of the instance, like the load balancer that you can find in examples/ in this repo. :)

From our perspective, this is acceptable, because we provide other methods to get highly reliable constantly-named resources. For instance, it's my opinion that if ~100% uptime is important to you, it's not good to depend on a single instance - GCP is good, but sometimes lightning strikes or there are earthquakes or fiber cuts. :)

I'll be closing this, since it's not feasible.

@ndmckinley Sorry for the necrobump, but I keep coming back to this page in my research.

Perhaps I'm missing something, but I don't see a way to use google_compute_instance in the way you describe. Under the current google provider, how do I leverage a load balancer to do zero-downtime updates?

If name were optional, or we could instead provide a name prefix, then create-before-destroy could be implemented, and my load balancer would never be without a backend. But under current conditions, with a required and uniqueness-constrained name attribute, I don't see how this could be done.

Is there something infeasible about such a change? If not, I'll open a new feature request. For the moment though, I assume that I'm simply ill-educated.

That's completely correct - it's not possible to use a single google_compute_instance to accomplish this. You could use a managed instance group instead, which is what I meant by "use the load balancer in examples/" (it uses a managed instance group). My phrasing wasn't perfect and sorry for the confusion, thanks for asking for clarification.

The easiest way to accomplish this is to use an instance group manager and to set update_strategy to ROLLING_UPDATE on it. That's the supported way to accomplish zero downtime updates.
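For concreteness, a minimal sketch of that approach in 0.11 syntax, assuming the google-beta provider (update_strategy = "ROLLING_UPDATE" and rolling_update_policy were beta-only at the time); the names, image, zone, and sizing values here are placeholders:

resource "google_compute_instance_template" "my_vm" {
  provider     = "google-beta"
  name_prefix  = "my-vm-"
  machine_type = "${var.vm_machine_type}"

  disk {
    source_image = "debian-cloud/debian-9"
  }

  network_interface {
    network = "default"
  }

  # Templates are immutable, so replacements must be created before destroy.
  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_instance_group_manager" "my_vm" {
  provider           = "google-beta"
  name               = "my-vm-mig"
  base_instance_name = "my-vm"
  zone               = "${var.google_region}-a"
  target_size        = 2
  instance_template  = "${google_compute_instance_template.my_vm.self_link}"

  update_strategy = "ROLLING_UPDATE"

  # Replace instances one at a time, keeping the group at full capacity.
  rolling_update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 1
    max_unavailable_fixed = 0
    min_ready_sec         = 60
  }
}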

If you _needed_ to do instance-level manipulations (generally not advisable - that gets towards the "pets" side of the cattle-vs-pets dichotomy), you _can_ - the primitives are there for you to accomplish it. You could have N copies of a google_compute_instance resource with independent counts, each of which depends on the previous one, so you could guarantee that (N-1) of them would be up at any given time, and then put them all behind the same load balancer (so that the down instances don't receive traffic). You'd do this by adding them all to an unmanaged instance group, as sketched below.
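A rough sketch of that pattern with N=2, in 0.11 syntax (resource names, image, machine type, and zone are placeholders, not from this thread); the depends_on edge is what orders operations so the two instances aren't worked on simultaneously:

resource "google_compute_instance" "web_a" {
  name         = "web-a"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    network = "default"
  }
}

resource "google_compute_instance" "web_b" {
  # Ordered after web_a so operations on the two instances are not interleaved.
  depends_on = ["google_compute_instance.web_a"]

  name         = "web-b"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    network = "default"
  }
}

# Unmanaged instance group containing both, to put behind a load balancer.
resource "google_compute_instance_group" "pool" {
  name      = "web-pool"
  zone      = "us-central1-a"
  instances = [
    "${google_compute_instance.web_a.self_link}",
    "${google_compute_instance.web_b.self_link}",
  ]
}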

There are some challenges. For one thing, there'll be a short while where both versions are serving from behind a stateless, non-"sticky" load balancer. For another, you'll need to size your instance resources so that traffic can be served by (N-1)/N of them - and that means either a lot of wasted capacity (if you only have 2 instance resources, you need twice as many instances as are needed to serve traffic so you don't go down during an update), or else a very complex config (if you have 10 instance resources, you've written a lot of needless boilerplate, though you only need to be over-sized by about 11%).

Even though it would work, it's not a good idea! Needlessly complex. I really recommend using managed instance groups - it's what we use at Google when we implement things on GCP that need to stay up. GKE's clusters, for instance, are implemented via managed instance groups - you can see the created groups in the cloud console. Let Google handle the complicated part for you. :)

Did that make sense?

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!
