Terraform-provider-aws: Feature Request: Stop, don't destroy instance on user-data update

Created on 13 Jun 2017  ·  58 comments  ·  Source: hashicorp/terraform-provider-aws

_This issue was originally opened by @jwaldrip as hashicorp/terraform#1887. It was migrated here as part of the provider split. The original body of the issue is below._


When updating user-data, it's not required to destroy the instance. In fact, destroying it can be devastating to an etcd cluster. It would be acceptable for the machines to be stopped, the user data updated, and the machines started again.

Labels: enhancement, service/ec2, upstream-terraform

All 58 comments

_This comment was originally opened by @JeanMertz as https://github.com/hashicorp/terraform/issues/1887#issuecomment-100605772. It was migrated here as part of the provider split. The original comment is below._


This is interesting. I don't think Terraform currently has a stop/start cycle, only destroy or don't destroy.

I can see a lot of value for this use-case (since we're also using user-data heavily), but I can also see this becoming complicated.

If I am not mistaken, on AWS an instance without an EBS-backed storage device will simply reset its storage, so in that case the stop/start would have the same result as a destroy.

Here's the list of properties that AWS allows you to change in a stopped state:

You can modify the following attributes of an instance only when it is stopped:

  • Instance type
  • User data
  • Kernel
  • RAM disk

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-101095776. It was migrated here as part of the provider split. The original comment is below._


So this particular issue is not a problem with GCE as you can update metadata without forceNew. However there is a similar problem if you want to change zone / scopes without losing the data on your disk.

The way I go about doing this is to use a separate disk resource in any situation where the content of the disk is precious in some way. E.g. obviously for databases this is true, but typically in a node of a load balanced cluster the disk is expendable. Having a separate disk allows Terraform to recreate the instance without losing the disk (as the disk is not recreated). I believe this is equivalent to a reboot cycle.

_This comment was originally opened by @Pryz as https://github.com/hashicorp/terraform/issues/1887#issuecomment-138493574. It was migrated here as part of the provider split. The original comment is below._


What about doing nothing when we update user_data?

If you use Terraform together with a configuration management solution, user data is most of the time only used for instance bootstrapping. So a change to user data is probably meant only for new instances, not current ones.

_This comment was originally opened by @jwaldrip as https://github.com/hashicorp/terraform/issues/1887#issuecomment-138559228. It was migrated here as part of the provider split. The original comment is below._


Disagree. I use CoreOS in production and manage everything with user data. Even existing instances.

_This comment was originally opened by @Pryz as https://github.com/hashicorp/terraform/issues/1887#issuecomment-138591063. It was migrated here as part of the provider split. The original comment is below._


OK, so what about having an argument to specify the strategy (Reboot, Recreate, None)?

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-138654976. It was migrated here as part of the provider split. The original comment is below._


For the sake of comparison: for google compute instance metadata, we started off with updateable metadata (i.e. don't recreate the instance, but update the metadata so the instance can see the new values). This is actually very useful because you can do a hanging get on the metadata server, which means you can run an agent in your VM that gets notified of changes to the metadata. This allows you to control the VM (e.g. roll out new application software) by updating the metadata. I don't know if AWS lets you do that, but I assume it does as it's really useful.

However if you don't have such an agent, and the only thing you have to work with (if you want to do truly declarative node config) is the startup-script (userdata on AWS), then you want changing the startup script to recreate the instance so it will re-run the startup script. For lots of people, especially stateless servers, that is plenty good enough. The only cost is the time taken to recreate the instance (the agent is instantaneous).

So, I added a metadata_startup_script field in the google instance that is the same as metadata.startup-script except it is ForceNew. One is not allowed to specify both. The user can therefore choose whether to have the ForceNew behavior or not.

You could do the same thing for aws_instance: Have an updateable_user_data field that was not ForceNew to allow people to make changes to it without recreating the instance and run agents inside the VM to pick up these changes. And keep the current user_data field for the simple non-agent case.

_This comment was originally opened by @Pryz as https://github.com/hashicorp/terraform/issues/1887#issuecomment-138859642. It was migrated here as part of the provider split. The original comment is below._


I like the idea of having two different user_data fields since it's for two totally different ways to manage instances.

_This comment was originally opened by @FergusNelson as https://github.com/hashicorp/terraform/issues/1887#issuecomment-149802063. It was migrated here as part of the provider split. The original comment is below._


I think it would be nice to have stop and start command line options in general. The use case would be a staging or testing environment that is not needed all the time; rather than destroy and recreate the environment every time it is needed it would be nice to be able to do

terraform stop
terraform start

And have terraform start the machines in the correct order with respect to the dependency graph.

_This comment was originally opened by @c4milo as https://github.com/hashicorp/terraform/issues/1887#issuecomment-150636511. It was migrated here as part of the provider split. The original comment is below._


As far as I understand, the sole purpose of cloud-init or user-data scripts is to do early initialization of instances. From that perspective, it may not make sense to use it as a way to re-provision or re-configure instances since that's what tools like Puppet, Chef, Ansible and Salt are for. Terraform was thought out as a way of creating and destroying infrastructure resources, and resource immutability is all over the place. Perhaps @mitchellh can shed some more light here.

_This comment was originally opened by @jwaldrip as https://github.com/hashicorp/terraform/issues/1887#issuecomment-150648445. It was migrated here as part of the provider split. The original comment is below._


Sure, this is the case in some instances. But in our case, we use CoreOS and use user-data to bootstrap the machines. We NEVER use Puppet, Chef, Salt, etc. to provision the instance any further. Maybe this is bad practice, but I know that user-data is subject to change. In addition, we cannot destroy our etcd instances to provision fresh user-data, as we would risk losing quorum.

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-150649603. It was migrated here as part of the provider split. The original comment is below._


It's not bad practice, it works very well and is increasingly common.

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-150662423. It was migrated here as part of the provider split. The original comment is below._


Did you try using a separate disk resource? If you also use a statically assigned ip then I think that should be equivalent to a reboot?

_This comment was originally opened by @JamiKarvanen as https://github.com/hashicorp/terraform/issues/1887#issuecomment-152194919. It was migrated here as part of the provider split. The original comment is below._


I have the same use case as @jwaldrip. When using AWS's CloudFormation, it updates the user_data but doesn't re-create or reboot the instances. Amazon provides a cfn-hup script that detects user_data changes in instances and updates them accordingly, so it would be great to be able to only update the user_data with Terraform and let AWS handle the updating.

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-152232461. It was migrated here as part of the provider split. The original comment is below._


Upgrading to "It's not bad practice, it works very well, is increasingly common, and is actively encouraged by AWS and Google" :)

_This comment was originally opened by @manojlds as https://github.com/hashicorp/terraform/issues/1887#issuecomment-165428494. It was migrated here as part of the provider split. The original comment is below._


No movement on this?

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-168040430. It was migrated here as part of the provider split. The original comment is below._


I think it should be a fairly easy PR to add a new resource attribute for controlling userdata in a non force-new fashion. If someone were suitably motivated :)

_This comment was originally opened by @yogin as https://github.com/hashicorp/terraform/issues/1887#issuecomment-191015474. It was migrated here as part of the provider split. The original comment is below._


I'd be very interested in a way to prevent instances from being recreated when there's a change to userdata. I understand some people use it for different purposes, and I like the idea of specifying a strategy (recreate, none, ...).

In our case, we only use the userdata to bootstrap new instances, later on, chef will take over and do the rest, but occasionally the userdata template might get a small update, and if I could avoid having to recreate instances, that'd be amazing!

_This comment was originally opened by @mcortinas as https://github.com/hashicorp/terraform/issues/1887#issuecomment-198468223. It was migrated here as part of the provider split. The original comment is below._


No movement on this?

_This comment was originally opened by @modax as https://github.com/hashicorp/terraform/issues/1887#issuecomment-219944854. It was migrated here as part of the provider split. The original comment is below._


I'm no Terraform developer, but I had a quick look at the current Terraform code and it seems that the only possible solution without (big) Terraform core changes could be adding another attribute like live_user_data (since the ForceNew value cannot depend on the context, as far as I can tell). The code might still end up being messy due to two attributes modifying the same thing (with regard to diff calculation). Hence I'm not sure if the Terraform team would accept such a PR.

_This comment was originally opened by @sparkprime as https://github.com/hashicorp/terraform/issues/1887#issuecomment-220068197. It was migrated here as part of the provider split. The original comment is below._


@modax As I said further up the thread, this is exactly what the google_compute_instance does and it works well

_This comment was originally opened by @markmaas as https://github.com/hashicorp/terraform/issues/1887#issuecomment-224507714. It was migrated here as part of the provider split. The original comment is below._


Destroying and recreating instances when the user-data changes makes it so I can either:

  • Not use user-data (And therefore cloud-config)
  • Not use Terraform (Meaning I have to learn cloud-formation: ugh)

Since cloud-config is being used more and more to do the initial bootstrapping of an instance, I would think this is no longer an "enhancement" but more along the lines of "blocking".

_This comment was originally opened by @cheungpat as https://github.com/hashicorp/terraform/issues/1887#issuecomment-224561306. It was migrated here as part of the provider split. The original comment is below._


If you don’t want your instance to re-create when user-data changes, you can try including this key in ignore_changes.

_This comment was originally opened by @br0ch0n as https://github.com/hashicorp/terraform/issues/1887#issuecomment-233416285. It was migrated here as part of the provider split. The original comment is below._


I agree with @cheungpat that ignore_changes probably handles this ticket (once #5627 is done)

_This comment was originally opened by @gdubicki as https://github.com/hashicorp/terraform/issues/1887#issuecomment-263675038. It was migrated here as part of the provider split. The original comment is below._


I confirm that what @cheungpat suggests is a good workaround. If you use GCE provider, then to ignore startup script changes use this code:

lifecycle {
  ignore_changes = ["metadata_startup_script"]
}

But it would be even better if you could set it once, globally - not in each resource of google_compute_instance type. Is that possible?

UPDATE: AFAIK it's not possible and you can't even use a variable to set ignore_changes. :( See #10730.

_This comment was originally opened by @kirkmadera as https://github.com/hashicorp/terraform/issues/1887#issuecomment-264654270. It was migrated here as part of the provider split. The original comment is below._


This also applies to resizing AWS instances. We end up going into the AWS admin, stopping the instance, changing the instance type, then starting it again. Then we update terraform configs to match the new instance type and run terraform apply which syncs the terraform state to the new size.

Ideally we would just resize from Terraform. Resizing an AWS server also changes its public and private IPs. Running the resize from Terraform would allow us to immediately change DNS as well if needed, rather than waiting for us to run Terraform after the resize completes.

_This comment was originally opened by @stack72 as https://github.com/hashicorp/terraform/issues/1887#issuecomment-282127593. It was migrated here as part of the provider split. The original comment is below._


Dearest Friends, now that I have tackled the issue of changing instance_type and it will be part of terraform 0.8.8, I am going to start prototyping on this

Paul

_This comment was originally opened by @cnoffsin as https://github.com/hashicorp/terraform/issues/1887#issuecomment-294035120. It was migrated here as part of the provider split. The original comment is below._


The use case for me here is I have to make type changes in AWS to a cloudera cluster.

The engineers shut it down gracefully so I could make the change.

I want to make the type change, but LEAVE them off. So that they can bring them up in the proper order.

Any news on this feature request? I am in the same position, and this feature (stop/start when changing user-data in AWS) would simplify a lot of our hot deploy patterns.

If I understand correctly then the TL;DR so far is this:

Using ignore_changes (while good to be aware of) doesn't satisfy the use-case of having Terraform actually manage the user_data of an existing resource without recreating it. In other words, the desired workflow is to "stop, update, start" instead of "destroy, update, create".

With ignore_changes = ["user_data"], if the user_data content changes then TF just ignores it -- which would be fine for hosts that only use user_data for provisioning on first-boot. But if user_data is your mechanism for updating the config of a running host (and restarting instead of re-creating) then the changes won't get pushed.

Am I right that the main blocker is Terraform doesn't (AFAIK) currently support managing the up/down state of an EC2 instance (to achieve "stop, update, start")?

The end goal of this for me is to be able to change the ec2 instance type and have Terraform modify it instead of destroying and rebuilding it. I believe the latest version does this already actually.

I need it.

This would be nice

I'd like to toss in my vote. Please stop marking these resources as changed. If you really have to, you could resort to an attribute to toggle this behavior, but I argue that a user can deliberately taint a resource if they need it changed. As always, this is about working around humans. Even though I love the rigidity, I have found that it breaks down when work gets distributed and a few months have passed. You learn or change things and want to adjust code accordingly, probably with something 100% verified using a different node, but schedule the change for a later date. Currently our strategy has been to leave a huge comment block, but it feels messy.

FWIW, I needed something similar and ended up with this script to stop the instance and do what we needed outside of terraform. Might help someone off google ending up here for aws.

terraform apply -auto-approve \
    -var "instance_name=${INSTANCENAME}" \
    -var "environment=${DEPLOY_ENV}" \
    -var "security_group=sg-0a9cf274" \
    -var "ami_name=${AMINAME}" \
    -var "vpc_id=${VPCID}" \
    -var "keypair_name=bastion.${DEPLOY_ENV}"

....................

# Look up the instance ID by its Name tag.
INSTANCEID=$(aws ec2 describe-instances --filters "Name=tag:Name,Values=${INSTANCENAME}" \
            --query "Reservations[0].Instances[0].InstanceId" --output text | tail -1)

# With --output text, a query that matches nothing prints "None".
if [[ -z "${INSTANCEID}" || "${INSTANCEID}" == "None" ]]; then
    echo "There was a problem getting instance id from AWS. Exiting."
    exit 1
fi

aws ec2 stop-instances --instance-ids "${INSTANCEID}"

Not sure how my comment adds to the discussion, but I stumbled upon this issue after being hit with a user_data change. It seems it somehow changed "from the outside" (?). Maybe AWS did something overnight? No clue.

I neither use nor rely on user_data, and I do not have it declared under my aws_instance resources. I had everything working fine (terraform plan and remote state). I haven't done any updates to Terraform or the provider. After running terraform plan today, most of my EC2 instances are marked to be recreated due to a user_data change. Even a few EC2 instances that are stopped 🤷‍♂️

I ended up using ignore_changes, but remain clueless as to what happened.

@rmldsky that sounds terrifying. please file a separate issue/bug, and reference this one.

Please note that there are currently 46 votes for this issue on https://github.com/hashicorp/terraform/issues/1887

Would be useful. Do we have any up-to-date information on this one?

How is this still not a thing? This is one of the very few places where CloudFormation is vastly superior to Terraform. A user data change does NOT require destroying an instance, and Terraform should be updated to err on the side of non-destruction.

I have the same issue here - use case:
We have a number of systems that are stopped/started on an automated schedule (based on tag), and we use the <persist>true</persist> tag to allow userdata to run on each of those boots to perform various tasks. If we need to add, remove or adjust those tasks, then a userdata update is needed. Currently, if we do this via Terraform (without manual interaction), the instance is marked for destruction as noted above.

The manual workaround is to update the userdata yourself outside of Terraform; on the next apply/plan the state will then match, requiring no destruction. This is tedious and time-consuming across many instances, especially if you make use of the template provider to fill in values in the userdata (we do), since you must also script those values to be filled in, or manually adjust them, before applying to each instance.

I would think this should work similar to the change instance type as mentioned above - stop instance (if running), change userdata, start instance (if it was running prior to change).
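The stop, update, start cycle described above can be sketched with the AWS CLI. This is only an illustration under stated assumptions, not how the provider behaves: run, update_user_data and the DRY_RUN variable are hypothetical helpers, and the user data file is assumed to be base64-encoded already, as in the AWS documentation's example for modify-instance-attribute.

```shell
#!/bin/sh
# Sketch only: replace an instance's user data, restarting it only if it
# was running beforehand. Helper names and DRY_RUN are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: update_user_data INSTANCE_ID PRIOR_STATE BASE64_FILE
# PRIOR_STATE is the instance state before the change, e.g. obtained with:
#   aws ec2 describe-instances --instance-ids "$id" \
#       --query "Reservations[0].Instances[0].State.Name" --output text
update_user_data() {
    instance_id=$1
    prior_state=$2
    user_data_file=$3

    if [ "$prior_state" = "running" ]; then
        run aws ec2 stop-instances --instance-ids "$instance_id" || return 1
        run aws ec2 wait instance-stopped --instance-ids "$instance_id" || return 1
    fi

    # User data can only be modified while the instance is stopped.
    run aws ec2 modify-instance-attribute --instance-id "$instance_id" \
        --attribute userData --value "file://$user_data_file" || return 1

    # Only restart instances that were running before the change.
    if [ "$prior_state" = "running" ]; then
        run aws ec2 start-instances --instance-ids "$instance_id" || return 1
    fi
}
```

For example, DRY_RUN=1 update_user_data i-0123456789abcdef0 running user-data.b64 prints the four commands without calling AWS. Passing the prior state in explicitly keeps an instance that was already stopped from being started as a side effect.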

Any updates on this? Is there anything we could do to move forward with this issue?

Why can't you just use the option lifecycle { ignore_changes = [ what-to-ignore ] }?
https://www.terraform.io/docs/configuration/resources.html#ignore_changes

@Andor because we want to apply the user data updates. The update does not need to destroy the instance.

@et304383 are you sure it will work? As far as I know, userdata is used only for instance initialisation and nothing more.

Bottom line is that updating user data is supported without destroying the instance (it simply has to be stopped during the update). Terraform should support this like CloudFormation does.

I agree with @et304383. Userdata is evaluated by cloud-init on _every_ boot and it is possible to update it when the host is stopped. It's rare that it's useful beyond the _first_ boot but there are some use-cases (eg vyos runtime config).

@boweeb that's not true. It only runs on first launch. It will run if you create a new instance via an AMI for Linux, and only for Windows if you schedule it to run prior to creating the AMI:

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-windows-user-data.html#user-data-execution

@et304383 thanks for bringing that up, this should be clarified. You are correct: Windows does not run cloud-init on every boot. This is an implementation-specific issue.

On most Linux AMIs at least (RedHat/CentOS and Amazon Linux confirmed), cloud-init is configured to run on every boot and evaluate what steps it needs to run. For example, runcmd runs on first boot only and bootcmd runs (somewhat early in the boot process) on _every_ boot.

https://cloudinit.readthedocs.io/en/latest/topics/examples.html#run-commands-on-first-boot
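
To make the runcmd/bootcmd distinction concrete, here is a small sketch that writes a cloud-config document from a shell function. The function name, log paths and echoed messages are placeholders; the behaviour noted in the comments is the one described above and depends on how cloud-init is configured in the image.

```shell
#!/bin/sh
# Sketch: emit a cloud-config document contrasting bootcmd and runcmd.
# make_cloud_config and the log file names are hypothetical.
make_cloud_config() {
    cat <<'EOF'
#cloud-config
# bootcmd entries run early in the boot process, on every boot.
bootcmd:
  - echo "every boot" >> /var/log/boot-marker.log
# runcmd entries run once, on the first boot of the instance.
runcmd:
  - echo "first boot only" >> /var/log/first-boot-marker.log
EOF
}
```

Calling make_cloud_config > user-data.yaml produces a file that could be passed as user data. After a stop/start cycle, only the bootcmd marker would be expected to gain a second line.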

Yup. The focus here is that user data only runs once. Updating it on an existing instance does not trigger its execution.

Summary/Notes from above -

  • AWS API supports updating userdata (when instance is in a stopped state)
  • Other AWS instance updates support stopping, updating and starting instances, example: instance type
  • Instance userdata can be dynamic using data from state/tfvars and TF template functionality. This makes manually updating via API/GUI much more difficult as number of instances grow.
  • Both Windows and Linux are capable of being configured to run userdata on every launch or when scheduled. There are complexities and work required but it is possible for both.
  • There are various deployment styles and some include using userdata for more than just first launch initialization (multi-boot launches, update on launch (OS, AV), notify/action on launch). While this is situational, it is still a supported and documented feature of AWS.

Given the above does it not make sense to include this functionality in the provider?

Since there are multiple use cases this could be implemented with an additional argument, such as:
'user_data_update_action'
Example options: reboot, recreate, ignore

To maintain current functionality this could default to 'recreate' and then allow those that need it to adjust from there.
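
One possible reading of the proposed options, written as a shell sketch. Neither user_data_update_action nor its values exist in the provider; the names come from the proposal above, and the function only prints the steps each option would imply rather than calling AWS.

```shell
#!/bin/sh
# Sketch of the proposed (hypothetical) user_data_update_action semantics.
# describe_update_action is an illustrative helper, not a real tool.
describe_update_action() {
    # Default to "recreate" to preserve the current behaviour.
    case "${1:-recreate}" in
        reboot)   echo "stop instance; update user data; start instance" ;;
        recreate) echo "destroy instance; create replacement with new user data" ;;
        ignore)   echo "leave the running instance and its user data untouched" ;;
        *)        echo "unknown action: $1" >&2; return 1 ;;
    esac
}
```

Called with no argument, the function falls back to recreate, which mirrors the suggestion that existing configurations should keep behaving as they do today.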

@phinze, @bflad, or anyone else at Hashicorp, could you please ping someone internally to look into this. It's been open for almost 2 years now...

This would be very useful for me as well.

++^
This would be very helpful. I currently have this issue and I need a workaround so it doesn't destroy the directory servers I need.
I'm not sure how the user data got out of sync either.

> Yup. The focus here is that user data only runs once. Updating it on an existing instance does not trigger its execution.

This is not strictly true. To understand why, it's important to consider the difference between user data, and what any given image _does_ with the user data. This is true across many different clouds, not just AWS.

In AWS, when you specify user data, it becomes available on the metadata endpoint for the instance. The fact that cloudinit does not respond to updates is entirely orthogonal to the ability to update this configuration at the provider - there are many images in use which do not use cloudinit at all, and instead pass other types of configuration data via user data. One such example is modern Windows images, which use ec2config instead of cloudinit.

_However_, Terraform does not have the concept of an update which requires a restart of a resource in this context, but it probably should learn to do so and allow opting in where it makes sense.

I guess that's fair, but my point is that an admin is aware of this context and probably just wants to record changes intended for the next invocation of an instance, but I've started using lifecycle rules to ignore this change, which suits my use case. At this point, I'd rather y'all focus on cleaning up your backlog.

Would a PR to add a distinct aws_instance_user_data resource be likely to be accepted?

Updating instance_type (which also requires a similar stop-update-start cycle) has been supported for a while: https://github.com/terraform-providers/terraform-provider-aws/issues/4838

Could we adopt a similar solution here? We had to engineer a different way around updating instance configuration (via SSM Document attachment), in large part because of this limitation in Terraform for user_data.

EDIT: for context, we have a third-party proprietary application that has to pass some instance configuration via user_data, which makes any update to this configuration highly problematic. This application also can't tolerate an instance replacement, since it is tied to a particular instance ID for license activation.

> Yup. The focus here is that user data only runs once. Updating it on an existing instance does not trigger its execution.

This statement is not true on AWS. It is possible to add a <persist> tag, which ensures that user data is executed on each boot.

Example of user data with the tag persist:

<script>
net start codedeployagent
net start AmazonCloudWatchAgent
</script>
<persist>true</persist>