This is a follow-on from https://github.com/mitchellh/vagrant/pull/5551, which was an initial attempt to fix a race condition in the ansible provisioner.
The race occurs when bringing multiple VMs up in parallel with the ansible provisioner as the first provisioner. It only appears with providers where the SSH information is obtained by interrogating the system after it has booted (vagrant-libvirt behaves this way). That said, a different race, in the form of not being able to connect to the other machines, would likely also occur with a provider that exposes SSH information as soon as a machine starts booting but where SSH is not reachable until boot has finished, whenever the Vagrant environment wishes to override the `--limit` option to reference more than one host.
Using the vagrant-libvirt provider, the race condition exhibits the following errors from the ansible subprocess:

`provided hosts list is empty`

or

`Specified --limit does not match any hosts`
The author of the recent rewrite of the ansible provisioner to support in-guest execution of Ansible appears to be aware of at least part of the issue, given the comment at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/plugins/provisioners/ansible/provisioner/base.rb#L84
The race condition remains: the shared generated inventory file is truncated by another machine thread while an ansible subprocess spawned by the current thread is still attempting to read it.
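To make the failure mode concrete, here is a minimal standalone Ruby sketch (an illustration I put together, not Vagrant's actual code): each "machine" thread truncates and rewrites a shared inventory-style file and then reads it back the way a spawned ansible-playbook process would, and a reader that loses the race against another thread's truncation sees an empty hosts list.

```ruby
# Standalone illustration of the truncate-while-reading race (not Vagrant code):
# several "machine" threads each rewrite a shared inventory-like file and then
# read it back, the way a spawned ansible-playbook process would parse it.
inventory = "inventory.tmp"

threads = (1..4).map do |i|
  Thread.new do
    100.times do
      # Writer: opening with mode "w" truncates the shared file before the new
      # contents are written.
      File.write(inventory, "machine#{i} ansible_ssh_host=192.168.121.#{i}\n")
      # Reader: stands in for the ansible-playbook subprocess reading the file.
      content = File.read(inventory)
      puts "thread #{i} saw an empty hosts list" if content.empty?
    end
  end
end

threads.each(&:join)
File.delete(inventory)
```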
The solution will likely come down to a few specific items to look at. There may be other items worth considering, but I think some guidance here on the best way to proceed with a fix would be useful; one possible direction is sketched below.
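As one hedged illustration of a possible direction (not a patch against Vagrant itself; the inventory path and helper names here are assumptions made for the sketch), a writer could take an exclusive `flock` on the generated inventory while rewriting it, and a reader could hold a shared `flock` for the duration of the ansible run:

```ruby
# Hypothetical sketch: coordinate access to the generated inventory file with
# advisory file locks, so one machine thread cannot truncate it while another
# thread's ansible-playbook invocation is still reading it.
INVENTORY_PATH = ".vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory" # assumed path

def write_inventory(content)
  File.open(INVENTORY_PATH, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX)   # exclusive lock: blocks readers and other writers
    f.truncate(0)
    f.write(content)
    f.flush
  end                        # lock released when the file is closed
end

def with_inventory_read_lock
  File.open(INVENTORY_PATH, File::RDONLY) do |f|
    f.flock(File::LOCK_SH)   # shared lock: multiple readers, no writers
    yield
  end
end

# Usage sketch: hold the shared lock for the whole ansible run, so a concurrent
# writer cannot truncate the inventory underneath the subprocess.
# with_inventory_read_lock do
#   system("ansible-playbook", "-i", INVENTORY_PATH, "playbook.yml")
# end
```

Another possible direction with the same effect would be to generate a separate inventory file per machine, so that parallel threads never share a file at all.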
See also https://groups.google.com/forum/#!topic/vagrant-up/X8-boUpvJfU
@electrofelix Thanks for your detailed report. I'll try to give some elements of an answer within the next few days, but in the meantime can you please give more insight on the following points:
* Do you also hit the problem when running `vagrant up --no-parallel`, or do you always run a plain parallel `vagrant up`?
* Could you please also share a sample project that we can use to reproduce the problem?
Many thanks in advance :-)
Unfortunately this is very difficult to trigger reliably. I've been gathering some stats, and we're seeing it occur in fewer than 1 in 25 runs (approx 600+ runs a day), but that's still enough for it to concern us.
The following is something along the lines of what I think is needed to trigger it, assuming vagrant-libvirt is being used. I still need to go through the various network options we configure vagrant-libvirt to use, but basically the config below should eventually hit it provided you run it often enough.
```ruby
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provider "libvirt" do |v, override|
    override.vm.box = "baremettle/ubuntu-14.04"
  end
  config.vm.synced_folder ".", "/vagrant", disabled: true

  (1..7).each do |i|
    config.vm.define "machine#{i}" do |machine|
      machine.vm.provision "ansible" do |ansible|
        ansible.playbook = "sleep.yml"
        ansible.host_key_checking = false
      end
    end
  end
end
```
I suspect the length of time the image takes to boot far enough to provide SSH is a factor. With the above config, when run enough times, I've also seen the job trip over a different issue: multiple SSH connections are somehow made to the same machine from different threads, which confuses the communicator's replacement of the insecure public SSH key with a generated one, since one thread performs the swap while the other thinks it has already been done. But that's a separate issue.
We're running a plain `vagrant up`. Since we've got an idea of what is causing this, I think it would be preferable for us to help fix the issue in Vagrant for future releases. I'll update with a better config once I've got it reproducing a little more reliably with a minimal setup.
@electrofelix thanks for the very informative update! I'll answer you as soon as I can (hopefully this week). I hope we can find a quick fix soon, and perhaps work out a fully clean solution in the future.
@electrofelix I propose below a possible solution using existing Vagrant capabilities.
The idea is to take full advantage of Ansible's own parallelism, instead of running `ansible-playbook` N times. Note that it must still be considered a _trick_, since in the current Vagrant perspective the provision command is strongly tied to a single target machine. That said, I think it is a fast and robust way to achieve your goal...
```ruby
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provider "libvirt" do |v, override|
    override.vm.box = "baremettle/ubuntu-14.04"
  end
  config.vm.synced_folder ".", "/vagrant", disabled: true

  # optionally disable ssh key replacement (for a little speed-up)
  # config.ssh.insert_key = false

  N = 7
  (1..N).each do |i|
    config.vm.define "machine#{i}" do |machine|
      if i == N
        machine.vm.provision "ansible" do |ansible|
          ansible.playbook = "playbook.yml"
          ansible.limit = "all"
        end
      end
    end
  end
end
```
And then you run it in two phases:
* first boot all the machines without starting the provision phase
* kick off a single `vagrant provision` that impacts all your nodes in parallel (the Ansible provisioner is only attached to the last defined machine, so a single `ansible-playbook` run with `limit = "all"` covers every node)
$ vagrant up --provider=libvirt --no-provision && vagrant provision
Could you please give it a try, and compare the total provisioning time against the non-safe situation?
@electrofelix the race condition issue remains of course... (to be discussed later).
One more note: for "network availability" safety, I initially added the following pre-task to my testing playbook to be 100% sure to start `ansible-playbook` at the right moment (I noticed that libvirt and its Vagrant provider can report the IP address some time before the address is actually reachable).
```yaml
pre_tasks:
  - name: "wait for the node to be ready"
    local_action: wait_for host={{ ansible_ssh_host }} port={{ ansible_ssh_port }}
```
After having success with it, I also tried without this pre-task, and all my (few) test runs still passed. I guess that the `vagrant up` stage (or the delays it introduces) is enough to "ensure" that all the machines are reachable when `vagrant provision` starts. Sorry, these are purely empirical results, as I didn't take time to dig into the relevant code. Hope it helps! I'm looking forward to hearing about your tests...
@electrofelix until a fix lands (e.g. #7190), could you use the "parallel provisioning trick" mentioned above?
I'm curious about the performance difference between:
_parallel machines boot without provisioning, followed by a single Ansible parallel provisioning run_
$ vagrant up --provider=libvirt --no-provision && vagrant provision
and
_vagrant up in parallel, using a distinct Ansible provisioning run for each machine_ (with the #7190 fix)
$ vagrant up --provider=libvirt --parallel
It's on my todo list; I just have to refactor some things to do it that way, so that the same playbook is used for all 7 nodes (one of them has a different playbook, but it's a superset of the generic playbook applied to the other 6).
@electrofelix great news, looking forward to receiving your benchmark results :boom: