Terraform: Error when instance changed that has EBS volume attached

Created on 7 Aug 2015  ·  82 Comments  ·  Source: hashicorp/terraform

This is the specific error I get from terraform:

aws_volume_attachment.admin_rundeck: Destroying...
aws_volume_attachment.admin_rundeck: Error: 1 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
Error applying plan:

3 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
* aws_instance.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.
* aws_volume_attachment.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

We are building out some infrastructure in EC2 using terraform (v0.6.0). I'm currently working out our persistent storage setup. The strategy I'm planning is to have the root volume of every instance be ephemeral, and to move all persistent data to a separate EBS volume (one persistent volume per instance). We want this to be as automated as possible of course.

Here is a relevant excerpt from our terraform config:

resource "aws_instance" "admin_rundeck" {
  ami = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"
  subnet_id = "${aws_subnet.admin_private.id}"
  vpc_security_group_ids = ["${aws_security_group.base.id}", "${aws_security_group.admin_rundeck.id}"]
  key_name = "Administration"

  root_block_device {
    delete_on_termination = false
  }

  tags {
    Name = "admin-rundeck-01"
    Role = "rundeck"
    Application = "rundeck"
    Project = "Administration"
  }
}

resource "aws_ebs_volume" "admin_rundeck" {
  size = 500
  availability_zone = "${var.default_aws_az}"
  snapshot_id = "snap-66fc2258"
  tags = {
    Name = "Rundeck Data Volume"
  }
}

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id = "${aws_ebs_volume.admin_rundeck.id}"

  depends_on = ["aws_route53_record.admin_rundeck"]

  connection {
    host = "admin-rundeck-01.<domain name>"
    bastion_host = "${aws_instance.admin_jumpbox.public_ip}"
    timeout = "1m"
    key_file = "~/.ssh/admin.pem"
    user = "ubuntu"
  }

  provisioner "remote-exec" {
    script = "mount.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo mkdir -m 2775 /data/rundeck",
      "sudo mkdir /data/rundeck/data /data/rundeck/projects && sudo chown -R rundeck:rundeck /data/rundeck",
      "sudo service rundeckd restart"
    ]
  }
}

And mount.sh:

#!/bin/bash

while [ ! -e /dev/xvdf ]; do sleep 1; done

fstab_string='/dev/xvdf /data ext4 defaults,nofail,nobootwait 0 2'
if ! grep -q -F "$fstab_string" /etc/fstab; then
  echo "$fstab_string" | sudo tee -a /etc/fstab
fi

sudo mkdir -p /data && sudo mount -t ext4 /dev/xvdf /data

As you can see, this:

  • Provisions an instance to run Rundeck (http://rundeck.org/)
  • Provisions an EBS volume based off of a snapshot. The snapshot in this case is just an empty ext4 partition.
  • Attaches the volume to the instance
  • Mounts the volume inside the instance, and then creates some directories to store data in

This works fine the first time it's run. But any time we:

  • make a change to the instance configuration (e.g. change the value of var.aws_ami_rundeck) or
  • make a change to the provisioner config of the volume attachment resource

Terraform then tries to detach the extant volume from the instance, and this task fails every time. I believe this is because you are meant to unmount the EBS volume from inside the instance before detaching it. The problem is, I can't work out how to get Terraform to unmount the volume inside the instance _before_ trying to detach it.

It's almost like I need a provisioner to run before the resource is created, or a provisioner to run on destroy (obviously https://github.com/hashicorp/terraform/issues/386 comes to mind).

This feels like it would be a common problem for anyone working with persistent EBS volumes using terraform, but my googling hasn't really found anyone even having this problem.

Am I simply doing it wrong? I'm not worried about how I get there specifically, I just would like to be able to provision persistent EBS volumes, and then attach and detach that volume to my instances in an automated fashion.

Labels: bug, provider/aws


All 82 comments

Having the same issue here.

I'm also having this issue. I have to detach the volume manually in the AWS Console for Terraform to complete my apply operation.

I too am having this problem. Would it be enough to destroy the instance rather than trying to destroy the volume association?

We're also having the same issue.

One solution is to stop the instance that has mounted the volume before running terraform apply. From the AWS CLI documentation:
"Make sure to unmount any file systems on the device within your operating system before detaching the volume. Failure to do so results in the volume being stuck in a busy state while detaching."

This might be what we are seeing here.
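
For anyone wanting to try that out-of-band, stopping the instance before the apply looks something like this (the instance ID is a placeholder):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
terraform apply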

This bug has become quite critical to us. Is anyone looking into this currently?

Same issue here. Any update ? Thanks

One solution would be to stop the associated instance before removing the volume attachment. Perhaps this is too intrusive to do automatically, though.

same issue... and I don't think udev helps here (does udev publish an event when a device is _attempting_ to detach?)

EDIT: tried adding force_detach option... no dice

Same issue here :cry:

I guess terraform should terminate instances before removing attachments by default on a full terraform destroy?

@JesperTerkelsen As long as your application can shut down gracefully within the 20 seconds given by AWS, that makes sense.

Me too!

I also needed to persist EBS volumes between instance re-creates and experienced this problem when trying to use volume_attachments. My workaround is to drop the "aws_volume_attachment"s and have each instance use the AWS CLI at boot time to self-attach the volume it is paired with. When the instance is re-created, terraform first destroys the instance, which detaches the volume and makes it available for the next instance coming up.

In the instance user-data, include the following template script (elasticsearch_mount_vol.sh):

INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`

# wait for ebs volume to be attached
while :
do
    # self-attach ebs volume
    aws --region us-east-1 ec2 attach-volume --volume-id ${volume_id} --instance-id $INSTANCE_ID --device ${device_name}

    if lsblk | grep ${lsblk_name}; then
        echo "attached"
        break
    else
        sleep 5
    fi
done

# create fs if needed
if file -s ${device_name} | grep "${device_name}: data"; then
    echo "creating fs"
    mkfs -t ext4 ${device_name}
fi

# mount it
mkdir ${mount_point}
echo "${device_name}       ${mount_point}   ext4    defaults,nofail  0 2" >> /etc/fstab
echo "mounting"
mount -a

resource "aws_ebs_volume" "elasticsearch_master" {
    count = 3
    availability_zone = "${lookup(var.azs, count.index)}"
    size = 8
    type = "gp2"
    tags {
        Name = "elasticsearch_master_az${count.index}.${var.env_name}"
    }
}

resource "template_file" "elasticsearch_mount_vol_sh" {
    filename = "${path.module}/elasticsearch_mount_vol.sh"
    count = 3
    vars {
        volume_id = "${element(aws_ebs_volume.elasticsearch_master.*.id, count.index)}"
        lsblk_name = "xvdf"
        device_name = "/dev/xvdf"
        mount_point = "/esvolume"
    }
}

resource "aws_instance" "elasticsearch_master" {
    count = 3
    ...
    user_data = <<SCRIPT
#!/bin/bash

# Attach and Mount ES EBS volume
${element(template_file.elasticsearch_mount_vol_sh.*.rendered, count.index)}

SCRIPT
}

Same issue here - it would be nice if terraform had support for 'deprovisioners' so that we could execute some steps (such as a shutdown -h now) before machine destruction is attempted. We did find that if we did a terraform taint on the instance before terraform destroy then the destruction completes successfully, so we'll use that as a workaround for now.
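
For reference, that taint workaround is just the following (the resource address is an example from the config at the top of this issue):

terraform taint aws_instance.admin_rundeck
terraform destroy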

I have a related issue with an instance and EBS volume. I think a solution to my problem may fix this as well. With version 0.6.3, when destroying, it seems that the volume attachment is always destroyed before the instance.

consul_keys.ami: Refreshing state... (ID: consul)
aws_security_group.elb_sg: Refreshing state... (ID: sg-xxxx)
aws_ebs_volume.jenkins_master_data: Refreshing state... (ID: vol-xxxx)
aws_security_group.jenkins_sg: Refreshing state... (ID: sg-xxxx)
aws_instance.jenkins_master: Refreshing state... (ID: i-xxxx)
aws_elb.jenkins_elb: Refreshing state... (ID: jniesen-jenkins-master-elb)
aws_volume_attachment.jenkins_master_data_mount: Refreshing state... (ID: vai-xxxx)
aws_route53_record.jenkins: Refreshing state... (ID: xxxx)
aws_volume_attachment.jenkins_master_data_mount: Destroying...
aws_route53_record.jenkins: Destroying...
aws_route53_record.jenkins: Destruction complete
aws_elb.jenkins_elb: Destroying...
aws_elb.jenkins_elb: Destruction complete
Error applying plan:

1 error(s) occurred:

* aws_volume_attachment.jenkins_master_data_mount: Error waiting for Volume (vol-xxxx) to detach from Instance: i-xxxx

I thought that I could get around this by having a systemd unit stop the process using the mounted EBS volume and then unmount it whenever the instance receives a halt or shutdown. The problem is that this never happens before the EBS volume destroy is attempted. I think if the order could be forced, and I could have the instance destroyed before the volume, things would go more smoothly.

If you use 'depends_on' in the instance definition to depend on the ebs volume, then the destroy sequence will destroy the instance before trying to destroy the volume.
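
i.e. something along these lines, using the resource names from the log above (whether this actually changes the destroy order is questioned in the next comment):

resource "aws_instance" "jenkins_master" {
  # ...

  # explicit dependency on the volume, intended to push instance
  # destruction ahead of the volume/attachment teardown
  depends_on = ["aws_ebs_volume.jenkins_master_data"]
}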

The error comes when destroying the volume_attachment, which would cause the volume to just detach. I misspoke in my last paragraph. I can't make the instance depend on the attachment explicitly, because the attachment already depends on the instance implicitly since I'm referencing the instance's ID.

+1 agree with @jniesen

A persistent data disk, separate from OS/instance would be a great feature, if it worked!

Creation of related aws_ebs_volume, aws_instance and aws_volume_attachment resources work fine.

Any apply that involves the re-creation of the aws_instance hangs, as the aws_volume_attachment implicitly depends on the aws_instance it references and is destroyed first - causing the volume detach to hang.

For this to work in an elegant fashion, the VM would have to be destroyed first, to get a clean unmount.

Got the same problem. The workaround with taint + destroy works fine, thanks @jimconner

+1 to a fix. If the attached EBS volume is in use by the OS by, say, a daemon process (e.g., Docker), then some mechanism has to be provided by Terraform to allow OS-level calls for a clean service stop and umount. Some of the ideas listed herein are possible workarounds, but not tenable long-term solutions.

+1 Same problem here. Thanks for the workaround @jimconner

I'm also running into this issue. If both the aws_instance as well as the linked aws_volume_attachment are scheduled to be deleted, the instance needs to be deleted first.

See #4643 for a similar problem, and the feature request in #622 which would provide an easy fix for this.

Hey folks, thanks for the reports and apologies for the trouble.

Restating and summarizing the above thread:

This is an interesting declarative modelling problem: we separated out aws_volume_attachment as its own resource, which is a strategy we've consistently taken to declaratively model links between two resources (aws_instance and aws_ebs_volume, in this case).

Terraform's dependency graph currently includes the assumption that destroy order should be strictly reverse to that of create order.

So as you all have noted the create order (along with the equivalent AWS API calls) is:

  1. Create EC2 instance (ec2.RunInstances)
  2. Create EBS volume (ec2.CreateVolume)
  3. Attach EBS volume to instance (ec2.AttachVolume)

And the destroy order is reversed:

  1. Detach EBS volume from instance (ec2.DetachVolume)
  2. Destroy EBS volume (ec2.DeleteVolume)
  3. Destroy EC2 instance (ec2.TerminateInstances)

Generally speaking this is what you'd want, and it works well for other resources, but in this case since the volume is mounted in the instance we end up with problems calling ec2.DetachVolume.

Options for solutions include:

A. Add core support for selectively re-ordering the dependency graph, so the instance is destroyed before the attachment
B. Rework the ec2.AttachVolume handling back inside the aws_instance resource (per #622)
C. Support provisioners that run on destroy so the attachment can be unmounted as it is destroyed (#386)

So (A) would involve a bunch of tricky core work that I'm worried would be too delicate to be worthwhile, (B) loses the benefits we get from having a separate resource, so I think the best move here is going to be (C) - supporting "deprovisioners" in the config.

Does that sound like a reasonable approach?

@phinze Agree C is best so long as deprovision is flexible (e.g., stop service, umount, etc.).

@phinze Disagree with C, because freeing the volume from inside the instance may be too complicated. I think "deprovisioners" are a good idea in themselves, but they may not be an appropriate solution in this case.

I think solution A will be the most flexible long term. I can imagine similar cases might crop up in other terraform resources as well, even if adding the feature initially will be quite a lot of work.

A failure of ec2.DetachVolume followed by a failure of ec2.DeleteVolume does not feel like a critical failure, assuming the intent is to run ec2.TerminateInstances -- an action which should succeed regardless of state of the volume attachment.

Would there be any benefit to this arrangement:
• failures in DetachVolume are ignored
• failures in DeleteVolume throw an exception
• the exception is caught in TerminateInstances which causes DeleteVolume to be called a 2nd time iff TerminateInstances succeeds

C. is unworkable in many environments, as not all VMs can be contacted by direct provisioners. Eg. VMs in isolated subnets. Don't assume that Terraform can directly contact every VM it creates.

Personally I'd prefer A or B.

@phinze, that is an interesting observation. Based on that, there can be another solution which doesn't require a deep rework of the dependency and start/stop logic. It seems that all that is needed is to have the ec2.RunInstances step last in the sequence. The only change required is for the plan executor to be able to detect that an op is a no-op and not try to execute it (like starting an instance which is already running).

  1. Create EC2 instance (ec2.RunInstances)
  2. Create EBS volume (ec2.CreateVolume)
  3. Attach EBS volume to instance (ec2.AttachVolume)
  4. Run EC2 instance (since it is already running, this should be a no-op)

Then reverse of this sequence will be:

  1. Stop EC2 instance
  2. Detach volume
  3. Destroy volume
  4. Destroy EC2 instance

@redbaron: That would be quite an elegant solution indeed.

I'm running into this problem too, and I would much prefer not C. B is not great; I really like that ebs volumes are separate resources. My preference would be for A or @redbaron's suggestion.

Do you have a plan to fix this issue? For us it is blocking.

Have the same issue; at the moment we manually terminate/stop the instance before applying any changes. If we get a strategy sorted out I would be happy to send a PR.

This is a big problem to me too. What's the go?

Same problem here, we're mitigating the issue with the taint trick for the time being...

Running into this one too.

:+1: to the approach suggested by @tamsky. If Terraform's plan is to destroy all these things, then DetachVolume/DeleteVolume failures are not fatal results.

@tamsky's approach would work for our use case too where the persistent EBS volumes survive when the instance is recycled. We resort to stopping or terminating the instance manually before applying the Terraform plan.

I'm still bumping up against this. Use of an EBS volume to hold state whilst instances themselves can be recycled is a really common pattern for us.

It is interesting that I can make it go away by simply tainting the instance.

Interesting, I've upgraded to Terraform v0.6.15 and the taint-the-instance workaround no longer works. Even when the instance is marked as tainted, terraform still tries to remove the aws_volume_attachment before destroying the instance.

My thoughts on solutions: I believe it would not be OK to ignore a failure to detach a volume; if a resource fails to be deleted I would like it to error and let me know ASAP. A provisioner that allows executing code on the resource before destroying sounds neat, but I wouldn't want to set that up for every aws_instance resource that has EBS volumes attached to it. I like @phinze's option A; maybe it doesn't need to involve reworking the dependency graph, but it would be nice to at least be able to say "if I'm destroying resource A, destroy resources [b, c, d] first".

Another option that I think hasn't been suggested is adding a core retry attribute to resources. If volume_attachment resource A had this attribute set to true, it would try to detach and time out (because the aws_instance was not deleted first); the timeout would be ignored and attachment A added to a cleanup queue. After all other resources have been successfully destroyed, terraform would then go through the cleanup queue and try destroying those resources again.

So, vol A (times out, gets added to queue) -> destroy instance B (which A was attached to) -> read the queue, find A and try destroying it again (which should succeed if B was destroyed).

I agree with the view that, as voiced by @tamsky, a good way to deal with this issue is to change how aws_volume_attachment behaves during the destroy phase, _as long as doing so is the user's explicit intent_.

However, rather than configuring aws_volume_attachment to silently ignore failures when invoking the AWS DetachVolume API call during the destroy phase, I would instead suggest adding an optional attribute that prevents this API call from being invoked at all.

Given that the termination of AWS instances implies the detachment of any associated EBS volumes (subtly hinted at, I think, by @tobyclemson), and that attempting to explicitly detach volumes is the very thing preventing users affected by this issue from achieving either, it seems to me that an option to disable explicit detachment of volumes in aws_volume_attachment is a succinct and non-invasive way to unblock the termination (and re-creation) of instances.

Implementation-wise, this could be accomplished by adding a boolean attribute named explicit_detach (defaults to true) to aws_volume_attachment such that, when set to false, the destroy phase merely removes the resource from Terraform state without attempting an AWS DetachVolume API call. (It would also make sense to change the create phase to not attempt to attach V to I when V is already attached to I.)

This approach would resolve this issue without breaking workflows for users who rely on the current behavior, without the aws_volume_attachment resource needing to be aware of the context in which it is being destroyed, without silently ignoring any API call failures, without requiring any change to core, and without having to introduce de-provisioners.

As another data point, on 0.7.0, tainting a resource no longer works.

Has anyone found a workaround for .7 ?

@maxenglander if explicit_detach was set to true and the volume was not being deleted but instead reattached to another instance, what would happen? If the volume would be successfully attached to the new instance, I'm wondering why we need the attachment delete at all. To keep track of a partially detached state in the state file?

I pretty strongly feel that terraform is the wrong way to solve this problem.

I wouldn't use terraform for cfg management -- it's not built for that, you're just gonna get in to trouble. It's also not built for keeping track of the state in stateful services and editing or revising existing resources -- it's not built for that either. I would not want to run terraform to detach and reattach EBS volumes for a whole pile of reasons. I would create the EC2 instances with TF -- probably in an ASG so I can control rolling them -- and probably create the EBS volumes with TF too, so they get tagged and tracked correctly.

But anything that happens past that -- formatting, installing software, etc should really be done by something else. I guess I generally think that reusing volumes in this way is an antipattern _period_ -- I prefer to start with fresh resources and have a way of syncing / initializing data -- but if you _are_ reusing them in this way, I feel like it should be done by an offline process. Probably a script that doesn't save any kind of state but just takes arguments (whether that's instance ids, roles, whatever) and does all the detach, retries, reattachment actions. Something with this much ordering belongs in a script, not a tfstate.

There are plenty of things that you _can_ do with TF that will just be really painful and timeouty and errory and will lead you to hate your life, I think this is one of them. :)

I have to partially agree and partially disagree.

I don't think that terraform should format, install, mount, etc. volumes. I do need to have some persistent storage (and an EBS volume is perfect) for a number of instances. We're talking non-root attached EBS volumes. Terraform has the aws_volume_attachment resource to attach the volume, and it would be nice if, after a volume is attached, I could delete the instance that the volume is attached to and reuse the volume.

If I can't, I'd have to have a completely separate system (maybe cloudformation? homegrown shell scripts?) for any machines which attach to an EBS volume and that system would have to store the state of the machines, and state and information of the ebs volumes... just like terraform.

I think if the resource aws_volume_attachment exists, it needs to be fixed (you have to be able to delete the instance). If this isn't going to be fixed, I think we should remove aws_volume_attachment, so that it's obvious that you can't attach a volume and then delete an instance.

I agree with @LeslieCarr.

While I personally think that long-lived EBS volumes that move between instances are a painful, compromise solution that rarely address the root issues of data-persistence, to the extent that there are many scenarios where people choose to do this, I absolutely expect my provisioner to be capable of managing those.

@Jonnymcc In my clone, I ended up calling this skip_detach (the inverse of explicit_detach).

if explicit_detach was set to true and the volume was not being deleted but instead reattached to another instance what would happen?

When skip_detach is not set, or is set to false (the equivalent of explicit_detach being true), what happens is no different than the current behavior: AWS fails to detach the volume due to the mount, and the Terraform execution fails. When set to true, this problem goes away.

If the volume would be successfully attached to the new instance I'm wondering why we need attachment delete at all. To keep track of a partially detached state in the state file?

I honestly don't know why the current behavior is useful for anyone, but I didn't feel comfortable saying that it isn't or couldn't be for some users. So, I thought that the best solution was to add a field (skip_detach) that would not, by default, have any impact on these speculative users, and (when set to true) a lot of beneficial impact for users like me who are nettled by the current behavior.

In my imagination, there's some Terraformer out there who regularly detaches and re-attaches volumes between running instances as part of some unusual warehousing process, and who would, rightly, not want the default value of skip_detach to change the current destroy behavior of aws_volume_attachment.

A provisioner is not a manager. It's actually not reasonable to expect your provisioner to handle a bunch of ordered multiphase state changes between multiple components. Esp when persistent state and mounted volumes are involved.

And I super specifically sketched out a solution that did _not_ involve saving another state elsewhere, because I agree, that would be dumb. I would probably use tags, defined by tf, and a rolling script to resolve/pair those tags by detaching / reattaching.

I can see where you're coming from conceptually: if tf has an aws_volume_attachment, you kinda expect it to have an explicit aws_volume_detachment command. But detaching is harder and has a different set of dependencies that TF can't perform to help prevent your data from corruption and your nodes from hanging. TF can't check to make sure no processes are holding files open or writing to the volume, safely unmount it, etc.

I was asked to comment on this but I think I'm giving advice that's more abstract/best practicesy than is appropriate for a TF github ticket, so I can back out of the convo. :)

Just saying, if I was a TF maintainer, I would be cringing and deprioritizing this just thinking of all the tickets that are gonna be opened in panic and anger if I wrote this. It's just a bad way to manage state, shit's gonna get stuck and corrupted a lot without host-level visibility and safeguards and retries and exception handling, which tf can't and shouldn't try to do.

I mostly agree. We are using Chef. I'd really love to use Chef to manage AWS SG (for example) because TF is awful at that, but no progress at this time. EBS, though, I would like to create during provision. I'm probably going to rewrite my code to have Chef do that instead of TF.

I'd also often like those EBS volumes to stay around during instance destruction. There are several reasons for this, one being that TF does enjoy resource destruction quite a lot. Of course, that comes about when I need to make an infrastructure change, so if TF is meant only for initial deployments then what am I expected to use when I need to modify my infrastructure? Am I really expected to redeploy an entire set of large scientific computing instances? I would not like that, but I would dislike that activity less if I am able to reuse the multi-TB EBS volumes that I store data on.

Immutable infrastructure with separate long-lived data disks is a great design pattern, and should definitely be encouraged. Kudos to Google Borg / Pivotal BOSH for making it more widespread.

My current workaround for this issue is to use Terraform to provision all the objects:

Instance, EBS volume (set to not allow destruction), IAM profile (to allow attachment of the two)...

... and then use a cloud-init bootstrap to have the instance securely associate its (restricted through IAM policy) EBS volume. Because terraform has the overview of which instance is notionally tied to which EBS volume, it can set all the right metadata to make that relationship visible to cloud-init and userspace.
Shutdown is a no-op, and instance destruction clears the Instance <> EBS relationship anyway.

It's really sad that the EBS association still has to be handled separately. The current aws_volume_attachment is pretty much useless for any infrastructure that needs to change regularly.

The ideal solution for me would be that aws_volume_attachment only activates when the instance is in the shutdown state (for both attach/detach), although I realise that this is non-trivial given the symmetric nature of terraform's create/destroy process.

Perhaps the concept of aws_volume_attachment as a separate resource is the wrong way to think about it? Perhaps a tweak to aws_instance resource native EBS handling would be a better way to achieve this concept.
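
For comparison, the native handling that already exists on aws_instance (essentially option B / #622 from the summary above) looks like the sketch below; sizes and names are examples taken from the original config, and the tradeoff is that the volume is no longer a separate, independently managed resource:

resource "aws_instance" "admin_rundeck" {
  ami           = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"

  # EBS volume created and attached as part of the instance itself
  ebs_block_device {
    device_name           = "/dev/xvdf"
    snapshot_id           = "snap-66fc2258"
    volume_size           = 500
    volume_type           = "gp2"
    delete_on_termination = false
  }
}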

Adding our bit: we're planning to use volumes as a way to send data in through a "data transfer vm" instance and then move them to a computing cluster, thus detaching the volume from the first instance and re-attaching it to another VM. That's probably not the best "cloud-friendly" approach, but that's the easiest way to get the ball rolling. Up to now we were using the taint trick to solve this, but after 0.7 landed that doesn't work anymore (as @LeslieCarr said). We're now pretty much blocked, as we haven't found a workaround yet, apart from logging into the machine and unmounting the volume.

@maxenglander, any ETA on when your changes might land on a release? This is now critical to us!

I do agree with the group that says OS-layer actions are outside the scope of TF; this includes quiescing IO and umounting fs. I've managed to avoid (so far) the need to move EBS around but it's a tactic that I can see value in. Along that line, I've gotten a Chef cookbook at 90% which will do all the EBS deployment work (including LVM and fs) and I expect I'll use that as a base if I need to move things around. I'm only using TF to spec the ephemerals at this point.

Now what I really need is a cookbook that manages security groups, but that's a separate TF issue.

Yes, I'm also fine with TF not messing with OS actions, but here we're talking about a bug/missing feature, I believe. It would just need to behave slightly differently when calling the AWS API, not do anything OS-level.

I fully concur with those (@charity, @Gary-Armstrong) who have voiced that TF is not well-equipped to perform volume detachments because detachments have extra dependencies (such as disk mounts, running processes) which TF doesn't know about. I agree that it's generally inadvisable for TF to perform OS-level actions like quiescing IO and unmounting fs.

However, I don't think it is bad (even if it's not ideal) to allow TF to manage volume attachments explicitly (via aws_volume_attachment), and to implicitly detach volumes by destroying the instances they are attached to. I think that this approach is compatible with the view that TF shouldn't perform volume detachments: by relying on instance destruction to detach volumes, TF effectively delegates volume detachment to AWS.

I also think that the TF model of failure is perfectly well equipped to handle problems that may crop up while using this approach. For example: if, during an instance destroy phase, a disk fails to unmount, then the instance may remain running, and the volume remains attached to it. TF sees that the instance failed to be destroyed, and stops execution so that the succeeding instance is never created, and volume re-attachment is not attempted. TF reports the failure to the user, who must retry by running terraform plan and terraform apply. There isn't any forced detachment, there's no disk corruption, and no un-synced state between TF and AWS.

While this approach may be less elegant and less robust than using a CM tool like Chef to handle everything EBS-related, it is, I believe, a simple, clean, and predictable solution. For users like myself who simply aren't ready to introduce a CM tool into their operations, it is also a practical solution.

@dvianello I have no idea if/when HashiCorp would incorporate my changes, unfortunately. I haven't created a PR yet, since it's not clear what HashiCorp's stance on this issue is.

I've been using my patch for a while now, which you're free to try out (at your own risk, of course) if you need a stop-gap while we wait for an official solution. I've created a release with binaries, in case that's helpful. To use it, you must first add skip_detach = true and run plan and apply on any aws_volume_attachment for which you want to enable the new behavior _before_ trying to destroy and re-create instances.

Agree @maxenglander that TF could manage attachments as you say. Entire post is agreeable, in fact. I don't want to get off on a TF wish list, but it seems entirely reasonable to expect TF to detach and potentially preserve EBS when an instance is terminated.

I like 5c09bcc from @c4milo. I've tested it in our environment for some days now. It's the best solution for this issue. I suggest cherry-picking that one.

how is this still unsolved after a year?

Although the volume attachment resources above might not work, we have the whole thing working a slightly different way (although it's using AWS): we define an aws_instance and an aws_ebs_volume, and no attachment information; however, we tag the aws_instance with the aws_ebs_volume resource.

Then on instance bootup, we read the tag and attach and mount the disk. On instance shutdown, the reverse (although you don't need to).

It all works fine - change the details of the instance and everything detaches and reattaches as intended, in the immutable-infrastructure way.

Sure, it would be nice to have it in terraform, but you don't need it to get the basics working.
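
A minimal sketch of the boot-time half of that approach, assuming the instance is tagged with its volume ID (the tag key DataVolumeId here is made up) and has an instance profile allowing ec2:DescribeTags and ec2:AttachVolume:

#!/bin/bash
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=us-east-1

# read the paired volume id from this instance's own tags
VOLUME_ID=$(aws --region $REGION ec2 describe-tags \
  --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=DataVolumeId" \
  --query 'Tags[0].Value' --output text)

# attach the volume, wait for the device node to appear, then mount it
aws --region $REGION ec2 attach-volume --volume-id $VOLUME_ID --instance-id $INSTANCE_ID --device /dev/xvdf
while [ ! -e /dev/xvdf ]; do sleep 1; done
mount /dev/xvdf /data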

We have also tested @c4milo's commit https://github.com/hashicorp/terraform/commit/5c09bcc1debafd895423e1e2df0c5da4930468bc on our setup and have had great results in resolving our problem. We're going to keep using this patch until it hopefully gets merged.

@c4milo thank you for adding this!

I'm also hitting this issue. @c4milo: have you sent a PR with https://github.com/hashicorp/terraform/commit/5c09bcc1debafd895423e1e2df0c5da4930468bc?

I did send https://github.com/hashicorp/terraform/pull/5364 but closed it, since it isn't the ideal solution to this problem, as discussed in that thread.

This is pretty much the same as #2761, I'm sure there are other places this is being tracked too... going to close this one. (The reference here will link them, too)

@mitchellh, arguably this issue has a bigger "community" and should be considered the main point of contact to track all dependency problems which can't be expressed using the simplistic graph model TF is currently using.

#2761 is a valid issue too, but it has 5 comments and 9 subscribers - a strange choice to keep that one and close this.

I know this thread was closed in favor of #2761, but given that that issue is still open, I wanted to leave this here for anyone else still experiencing this particular issue.

I was able to set skip_destroy to true on the volume attachment to solve this issue.
Details here: https://www.terraform.io/docs/providers/aws/r/volume_attachment.html#skip_destroy

Note: in order for it to work, I had to do the following
1) set skip_destroy to true on the volume attachment
2) run terraform apply
3) make the other changes to the instance that caused it to be terminated/recreated (changing the AMI in my case)
4) run terraform apply again

Leaving this here in case anyone else finds it useful.
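
For completeness, step 1 is just the documented skip_destroy argument on the attachment (resource names here are examples):

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # don't call DetachVolume on destroy; just drop the attachment from
  # state and let termination of the instance release the volume
  skip_destroy = true
}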

I can't get the above workaround to do the trick using 0.10.6. Looks like whatever bug was being exploited to make this work got closed.

I'm still only provisioning ephemerals in TF.

In fact, I am specifying four of them for every instance, every time. I then have some ruby/chef that will determine how many are really there (0-4) and do the needful to partition, lvm stripe, then mount as a single ext4.

I still use Chef to config all EBS from creation to fs mount. Works great. EBS persist unless defined otherwise. Mentally assigning all volume management to the OS arena has gotten me where I want to be.

This is still an issue 26 months after the issue was first created.

@exolab, It is not. You need to use destroy-time provisioners in order to unmount the EBS volume.

Sorry if I am a bit daft. How so?

Is this what you are suggesting?

provisioner "remote-exec" {
    inline = ["umount -A"]

    when   = "destroy"
  }
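
For reference, a fuller sketch of that idea puts the destroy-time provisioner on the attachment itself, with a connection block so Terraform can reach the instance; the mount point and connection details are examples, and when = "destroy" requires Terraform 0.9 or later:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  connection {
    host = "admin-rundeck-01.<domain name>"
    user = "ubuntu"
  }

  # unmount the filesystem before Terraform issues DetachVolume
  provisioner "remote-exec" {
    when   = "destroy"
    inline = ["sudo umount /data"]
  }
}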

Like @mpalmer, the skip_destroy fix is not working for me either, using Terraform 0.10.6 😞

The skip_destroy fix does not work using Terraform 0.11.1 😢

+1

Still an issue (and a big issue for us) in v0.11.3

Still an issue in v0.11.4

terraform v0.11.7 -- I have the same issue with the volume attachment when running destroy;
skip_destroy = true in the volume attachment resource is not helping either - destroy keeps trying.
I went ahead and force-detached from the console - the destroy then moved forward.
Is there a default timeout for TF? The script kept running destroy, trying to detach the EBS volume, until I ctrl-C'd out of it.

On Terraform v0.11.7 I was able to get around this by creating the volume attachment with

force_detach = true

If you created it without force_detach set to true, it will still fail. I had to terminate the instance, allow the volume attachment to be edited or recreated with force_detach, and then all subsequent detaches worked for me.
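
In configuration terms that is the documented force_detach argument (names are examples; note that force-detaching a volume that is still mounted and in use risks data loss):

resource "aws_volume_attachment" "example" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.example.id}"
  volume_id   = "${aws_ebs_volume.example.id}"

  # let Terraform force the detach during destroy; AWS warns this can
  # leave the filesystem in an inconsistent state if it is still in use
  force_detach = true
}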

Using force_detach = true worked for me as well (v0.11.7).

I originally created the volume without force_detach, so I had to manually force-detach in the AWS console, then delete the volume (in Terraform) and re-create it (also in Terraform) before it worked.

Still an issue.

Is there any issue using force_detach? I'm assuming that processes could still be trying to use the volume. (?) Is there a way to stop the instance prior to detaching the volume and then terminate it?


I know this issue is closed, but as an example workaround for people finding this, I'll post what I've done. I have a volume I want to persist between machine rebuilds (it gets rebuilt from a snapshot if deleted, but is otherwise persisted). What I did was grab the old instance ID in TF, then use a local-exec provisioner (I can't use remote-exec with how direct access to the machine is gated) to call the AWS CLI and shut down the machine the volume is being detached from, before the destroy and rebuild of the machine and the volume attachment:

//data source to get previous instance id for TF workaround below
data "aws_instance" "example_previous_instance" {
  filter {
    name = "tag:Name"
    values = ["${var.example_instance_values}"]
  }
}

//volume attachment
resource "aws_volume_attachment" "example_volume_attachment" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.example_volume.id}"
  instance_id = "${aws_instance.example_instance.id}"
  //below is a workaround for TF not detaching volumes correctly on rebuilds.
  //additionally the 10 second wait is too short for detachment and force_detach is ineffective currently
  //so we're using a workaround: using the AWS CLI to gracefully shutdown the previous instance before detachment and instance destruction
  provisioner "local-exec" {
    when   = "destroy"
    command = "ENV=${var.env} aws ec2 stop-instances --instance-ids ${data.aws_instance.example_previous_instance.id}"
  }
}

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
