I'm using Terraform to deploy a Chef cookbook which is mostly a series of Chef bash commands like
bash 'install_command1' do
.
.
.
end
There are 8 Chef bash scripts run in series by one Terraform Chef provisioner; the Chef cookbook executes them in order based on options set in the provisioner. I'm using the Chef provisioner inside an aws_instance resource. If any one of the Chef bash commands takes longer than 6 minutes, then when that bash script returns, Terraform reports the Chef provisioner as complete even though chef-client is still running and has moved on to the next bash script. Neither Terraform nor Chef returns an error.
The time is not cumulative: multiple bash commands adding up to more than 6 minutes is not a problem. It only occurs when a single bash command takes longer than 6 minutes, and the problem manifests when that command completes, not at the 6-minute mark. For example, if a bash command takes 12 minutes to run, Terraform reports chef-client as complete when the bash script returns after 12 minutes, even though chef-client is not complete and continues running the remaining bash scripts.
I tried to draw a picture of what I'm seeing:

               provision
Terraform -----------------------> Chef
   ---                             ---
   ---                             ---
   ---                         bash script
   ---                             ---
   ---                             ---   <- 6 minute mark
   ---                             ---
   ---                             ---
  apply   <----------------------- bash returns
 complete                          ---
                                   ---
                          next bash script runs
                                   ---
                                   ---
                            next bash script
When the problem occurs, the Terraform debug output contains the following lines:
(internal) 2017/07/11 19:08:14 remote command exited with '0': sudo chef-client -j "/etc/chef/first-boot.json" -E "dev-us-east-1"
(chef): execute "bash" -x 2>>/tmp/chef.log 1>>/tmp/chef.log "/tmp/chef-script20170711-9485-1e5a8v5"
Despite that log line, the remote command had not actually exited: chef-client was still running with the same PID. Only the bash script had exited.
When I set the following option in the Chef provisioner:
client_options = ["log_level :debug"]
the problem doesn't seem to occur, which makes me think the problem is related to having a long-running script that doesn't return any output over the SSH connection.
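For reference, this is roughly how that option sits in the provisioner configuration (a minimal sketch; apart from client_options and the environment name quoted in the logs above, every value here, including the server URL, key paths, run list, and node name, is a placeholder, not taken from the original config):

```hcl
# Hypothetical excerpt: the chef provisioner block inside an aws_instance
# resource. All names and paths are illustrative placeholders.
provisioner "chef" {
  environment = "dev-us-east-1"
  run_list    = ["my_cookbook::default"]   # placeholder run list
  node_name   = "example-node"             # placeholder
  server_url  = "https://chef.example.com/organizations/example"
  user_name   = "provisioner"
  user_key    = "${file("keys/provisioner.pem")}"

  # Extra lines appended to client.rb. Forcing debug logging makes
  # chef-client emit output continuously, which keeps the SSH channel
  # busy and appears to avoid the 6-minute hang described above.
  client_options = ["log_level :debug"]
}
```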
Terraform version: 0.9.11 running on MacOS X to AWS
I'm experimenting with creating a simple Chef cookbook that replicates the issue. I did see the problem with a simple recipe that just loops waiting. I could build a simple cookbook with the loop and post the debug output.
To be provided. This is just one part of a large system with shared remote state, etc.; I'm attempting to create a simpler test case.
Expected behavior: Terraform doesn't report that the apply is complete before chef-client completes.
Actual behavior: Terraform reports that the apply is complete while chef-client is still running.
I added a simple Chef recipe on top of my cookbook and was able to reproduce the problem: when the script looped for more than 6 minutes, I saw the problem. This is how I determined that it occurs when a script runs longer than 6 minutes.
bash 'loop_command' do
  flags "#{node['BASH_ARGS']}"
  code <<-FOH
    i=0
    while [ "$i" -ne #{node['LOOP']} ]
    do
      sleep 60
      date
      i=$((i+1))
    done
  FOH
end
Nothing important, just using an aws_instance Terraform resource with a chef provisioner.
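The overall shape of the resource is roughly the following (a sketch only; the AMI, key names, cookbook name, and Chef server details are placeholders, since the real config is part of a larger system):

```hcl
# Hypothetical reproduction resource: an aws_instance running the loop
# recipe above via the chef provisioner. All values are placeholders.
resource "aws_instance" "chef_repro" {
  ami           = "ami-00000000"   # placeholder AMI
  instance_type = "t2.micro"
  key_name      = "example-key"    # placeholder key pair

  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = "${file("keys/example-key.pem")}"
  }

  provisioner "chef" {
    environment = "dev-us-east-1"
    run_list    = ["repro_cookbook::default"]  # runs the loop recipe
    node_name   = "chef-repro"
    server_url  = "https://chef.example.com/organizations/example"
    user_name   = "provisioner"
    user_key    = "${file("keys/provisioner.pem")}"
  }
}
```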
@rsgoodman we are evaluating Terraform - is this still an issue for you?
@begleybrothers-dev I'm no longer doing dev-ops, so I don't know whether this is still an issue.
I'm closing this issue because we announced tool-specific (vendor or 3rd-party) provisioner deprecation in mid-September 2020. Additionally, we added a deprecation notice for tool-specific provisioners in 0.13.4. On a practical level this means we will no longer be reviewing or merging PRs for built-in plugins like the chef provisioner.
The discuss post linked above explains this in more depth, but the basic reason we're making this change is that these vendor provisioners have been extremely challenging for us to maintain, and are a weak spot in the Terraform user experience. People reach for them not realizing the bugs and UX limitations, and they're areas that are difficult for us to maintain because of the huge surface area of integrating with a bunch of different tools (Puppet, Chef, Salt, etc.) that each require deep domain knowledge to do right. For example, testing each of these against all the versions of those tools, on multiple platforms, is prohibitive, and so we don't - but users have a reasonable expectation that everything in the Terraform Core codebase is well tested. Similarly, it's tough to accept PRs, even for useful improvements, because we don't have anyone on the core team with deep Chef knowledge, and we have not been able to get community volunteers to own PR review for this codebase, so it's a shot in the dark whether a given PR makes things better or worse from the perspective of an experienced Chef + Terraform user.
For the time being, the best option if you want to fix this bug is to work with the community to build a standalone Chef provisioner, fix it there, and distribute it as a plugin binary, similar to how the Ansible provisioner is distributed.
I'm aware of the limitations of this approach, but it's the best option compared to coupling provisioner development to the Terraform Core release lifecycle. We believe the benefit to users of having provisioner development decoupled from core, exceeds the convenience of having these provisioners built in to core. We want to provide a better user experience in the future, and our hope here is that the ability to improve, fix and repair provisioners without us blocking their development, much like providers, will help make a strong case for what's next.
I think it’s also important to highlight that we have no plans to remove the generic provisioners or the pluggable functionality during Terraform's 1.0 lifecycle.
I appreciate your input here to improve Terraform, and am always happy to talk. Please feel free to reach out to me or Petros Kolyvas if you would like to talk more about this change.
@danieldreier, happy with that. In our limited experience, we think HashiCorp Packer is the proper way to do provisioning.
The issue here is passing Terraform apply-time data to an instance. For an obvious use case, think of standing up WireGuard client and server VM instances, where their public IPs are assigned by third parties.
At this point we suspect that any setup where we need to pass Terraform apply-time data to a Packer build is a code smell that we are doing it wrong.
However, we haven't run into any best practice statement that confirms our instinct (from limited experience).
Until suggested otherwise, we are reworking things so that Terraform and Packer builds are untethered from each other - is this always possible?
Well, as the WireGuard use case makes clear: not always. If the VM vendor does not provide static IPs, you're pretty much hosed. Workarounds such as providing a public DNS name aren't always available, and right now we don't see an obvious way Terraform can resolve this other than hacking with remote-exec and local-exec provisioners wrapped in a null_resource.
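That workaround looks roughly like this (a sketch with hypothetical resource names, paths, and commands; it shows apply-time data, here the server's public IP, being pushed to another instance once both exist):

```hcl
# Hypothetical sketch: push the WireGuard server's apply-time public IP
# to the client instance. Resource names, users, and paths are placeholders.
resource "null_resource" "configure_wireguard_client" {
  # Re-run this provisioner if either instance is replaced.
  triggers = {
    server_ip = "${aws_instance.wg_server.public_ip}"
    client_id = "${aws_instance.wg_client.id}"
  }

  connection {
    type        = "ssh"
    host        = "${aws_instance.wg_client.public_ip}"
    user        = "ubuntu"
    private_key = "${file("keys/example-key.pem")}"
  }

  provisioner "remote-exec" {
    inline = [
      # Substitute the apply-time IP into a pre-staged config, then restart.
      "sudo sed -i 's/SERVER_IP/${aws_instance.wg_server.public_ip}/' /etc/wireguard/wg0.conf",
      "sudo systemctl restart wg-quick@wg0",
    ]
  }
}
```

The null_resource exists only to host the provisioner, so the remote-exec step can depend on both instances without belonging to either.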
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.