Vagrant: Ansible provisioner has a race when used with a multi-machine configuration

Created on 17 Nov 2015 · 9 comments · Source: hashicorp/vagrant

This is a follow on from https://github.com/mitchellh/vagrant/pull/5551, which was an initial attempt to fix a race condition in the ansible provisioner.

The race occurs when bringing multiple VMs up in parallel with the ansible provisioner as the first provisioner. It only appears with providers where the SSH information has to be obtained by interrogating the system after it has booted (vagrant-libvirt behaves this way). A different race would likely occur with providers that make the SSH information available as soon as a machine starts booting but before SSH is actually reachable: if the Vagrant environment overrides the '--limit' option to reference more than one host, ansible would fail to connect to the machines that have not finished booting.

Using the vagrant-libvirt provider, the race condition exhibits the following errors from the ansible subprocess:
provided hosts list is empty
or
Specified --limit does not match any hosts

The author of the recent rewrite of the ansible provisioner to support in-guest execution of ansible appears to be aware of parts of this issue, given the comment at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/plugins/provisioners/ansible/provisioner/base.rb#L84

The race condition remains due to the following:

  • the same inventory file is used for all ansible provisioner subprocesses
  • each machine thread generates the inventory file contents and attempts to update the file if contents differ
  • the code incorrectly assumes that all valid running machines will be able to provide ssh_info as soon as the first machine starts executing its ansible provision

The errors above occur because the inventory file is truncated by another machine thread while an ansible subprocess in the current thread is attempting to read it.
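A minimal illustration of that failure mode (a standalone sketch, not Vagrant's actual code; the path and host entries are made up):

# Opening the shared inventory with mode "w" truncates it immediately; the new
# contents only land afterwards. Another machine's ansible-playbook subprocess
# that reads the file inside that window sees an empty inventory, matching the
# "provided hosts list is empty" / "does not match any hosts" errors above.
inventory_path = "vagrant_ansible_inventory"
File.write(inventory_path, "machine1 ansible_ssh_host=192.168.121.10\n") # existing contents

writer = Thread.new do
  File.open(inventory_path, "w") do |f|                    # truncates here...
    sleep 0.1                                              # ...window where the file is empty
    f.write("machine1 ansible_ssh_host=192.168.121.10\n" \
            "machine2 ansible_ssh_host=192.168.121.11\n")  # ...new contents land here
  end
end

reader = Thread.new do                                     # stands in for `ansible-playbook -i <inventory>`
  sleep 0.05
  puts(File.read(inventory_path).empty? ? "reader saw an empty inventory" : "reader saw contents")
end

[writer, reader].each(&:join)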

The solution will likely come down to looking at the following items (a rough sketch of the first two follows the list):

  • unique inventory file for each machine

    • retain the inventory directory behaviour and generate a unique inventory file for each machine containing just that machine

    • revert to passing the path to the inventory file and generate a unique inventory file for each machine containing information for all machines

  • wait until all machines that could be part of the inventory are running before generating it

    • retry multiple times to get the ssh_info for each active machine

    • this is somewhat inefficient without a way to determine if some machines will never be accessible because they failed to boot

    • register an internal status, or have a code pattern to determine when each machine has reached certain built-in actions.

    • internal status: currently the machine state is provided by the provider, which means it can be unique to that provider, forcing a provisioner to know what states each provider will return in order to support it

    • code pattern / action mechanism: being able to deterministically wait until machines are ready to respond to SSH connections, or to spot that they have been destroyed and therefore should not be waited on
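A rough sketch of the first two ideas, i.e. a per-machine inventory file plus retrying ssh_info until the provider can answer. The helper names and the inventory path below are hypothetical; only machine.name, machine.ssh_info and machine.env.local_data_path are existing Vagrant APIs being leaned on:

require "fileutils"

# Hypothetical helper: poll a machine's ssh_info instead of assuming it is
# available as soon as the first provisioner thread runs. Returns nil if the
# machine never becomes reachable (e.g. it failed to boot), so the caller can
# leave it out of the generated inventory rather than waiting forever.
def wait_for_ssh_info(machine, tries: 30, delay: 2)
  tries.times do
    info = machine.ssh_info
    return info if info
    sleep delay
  end
  nil
end

# Hypothetical helper: write a dedicated inventory file per machine, so no
# provisioner thread ever truncates a file that another thread's
# ansible-playbook subprocess is currently reading.
def write_machine_inventory(machine)
  dir = machine.env.local_data_path.join("provisioners/ansible/inventory")
  FileUtils.mkdir_p(dir)

  info = wait_for_ssh_info(machine)
  return nil unless info

  path = dir.join("vagrant_ansible_inventory_#{machine.name}")
  File.write(path, "#{machine.name} ansible_ssh_host=#{info[:host]} " \
                   "ansible_ssh_port=#{info[:port]}\n")
  path.to_s
end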

There may be other items worth looking at, but I think some guidance here on the best way to proceed with a fix would be useful.

bug has-pr needs info provisioners/ansible

All 9 comments

See also https://groups.google.com/forum/#!topic/vagrant-up/X8-boUpvJfU

@electrofelix Thanks for your detailed report. I'll try to give some elements of an answer within the next few days, but in the meantime could you please give more insight on the following points:

  • Can you work around this issue by running vagrant up --no-parallel ?
  • Have you considered using an Ansible custom dynamic inventory instead of the Vagrant provisioner auto-generated inventory ?

Could you please also share a sample project that we can use to reproduce the problem?

Many thanks in advance :-)

Unfortunately this is very difficult to trigger reliably. I've been gathering some stats, and we're seeing it occur in fewer than 1 in 25 runs (approx. 600+ runs a day), but that's still enough to concern us.

The following is along the lines of what I think is needed to trigger it, assuming vagrant-libvirt is used. I still need to go through the various network options we configure for vagrant-libvirt, but basically the config below should eventually hit it provided you run it often enough.

Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.provider "libvirt" do |v, override|
    override.vm.box = "baremettle/ubuntu-14.04"
  end

  config.vm.synced_folder ".", "/vagrant", disabled: true

  (1..7).each do |i|
    config.vm.define "machine#{i}" do |machine|
      machine.vm.provision "ansible" do |ansible|
        ansible.playbook = "sleep.yml"
        ansible.host_key_checking = false
      end
    end
  end

end

I suspect the length of time it takes the image to boot and provide SSH is a factor, and with the above config I've seen the job trip over a different issue when run enough times: somehow multiple SSH connections are made to the same machine from different threads, which trips up the communicator when one thread swaps the insecure public SSH key for a generated one while the other thinks the swap has already been made. But that's a separate issue.

  • Running without --parallel would avoid the problem, but it is a significant slowdown for us given that the ansible we run takes about 2-3 minutes per machine and we have 7 machines. That results in it taking about 15 minutes to provision all the machines instead of 5.
  • Yes, we could switch to not provisioning the machines with vagrant at all and instead use a dynamic inventory script to perform the needed actions from outside (a rough sketch of that approach follows this list), but as this is being used by developers internally I think we'd prefer the vagrant environment to be ready to go after a vagrant up. Since we have an idea of what is causing this, we'd prefer to help fix the issue in Vagrant for future releases.
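For completeness, the dynamic inventory alternative we considered would look roughly like the script below. It is a generic sketch built on `vagrant ssh-config` output, not something we actually run; the pre-2.0 ansible_ssh_* variable names match the Ansible versions discussed here:

#!/usr/bin/env ruby
# Rough sketch of an Ansible dynamic inventory driven by `vagrant ssh-config`.
# Usage: ansible-playbook -i ./vagrant_inventory.rb playbook.yml
require "json"

hosts   = {}
current = nil

# `vagrant ssh-config` prints an ssh_config-style block per running machine.
`vagrant ssh-config 2>/dev/null`.each_line do |line|
  key, value = line.strip.split(" ", 2)
  case key
  when "Host"         then current = value; hosts[current] = {}
  when "HostName"     then hosts[current]["ansible_ssh_host"] = value
  when "Port"         then hosts[current]["ansible_ssh_port"] = value.to_i
  when "User"         then hosts[current]["ansible_ssh_user"] = value
  when "IdentityFile" then hosts[current]["ansible_ssh_private_key_file"] = value
  end
end

if ARGV.first == "--list"
  puts JSON.pretty_generate({
    "all"   => { "hosts" => hosts.keys },
    "_meta" => { "hostvars" => hosts }
  })
else # invoked as: --host <name>
  puts JSON.generate(hosts.fetch(ARGV[1], {}))
end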

I'll update with a better config once I've got it reproducing a little more reliably with a minimal setup.

@electrofelix thanks for the very informative update! I'll answer as soon as I can (hopefully this week). I hope we can find a quick fix soon, and perhaps work out a fully clean solution in the future.

@electrofelix I propose below a possible solution using existing Vagrant capabilities.

The idea is to take full advantage of Ansible's own parallelism, instead of running ansible-playbook N times. Note that it must still be considered a _trick_, since the provision command is strongly tied to a single target machine in the (current) Vagrant perspective. That said, I think it is a fast and robust way to achieve your goals...

Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.provider "libvirt" do |v, override|
    override.vm.box = "baremettle/ubuntu-14.04"
  end

  config.vm.synced_folder ".", "/vagrant", disabled: true

  # optionally disable ssh key replacement (for little speed up)
  # config.ssh.insert_key = false

  N = 7
  (1..N).each do |i|
    config.vm.define "machine#{i}" do |machine|

      if i == N
        machine.vm.provision "ansible" do |ansible|
          ansible.playbook = "playbook.yml"
          ansible.limit = "all"
        end
      end
    end
  end

end

You then run in two phases:
* first boot all the machines without starting the provision phase
* kick off a single `provision` run that impacts all your nodes in parallel

$ vagrant up --provider=libvirt --no-provision && vagrant provision

Could you please give it a try and compare the total provisioning time against the non-safe situation?

@electrofelix the race condition issue remains of course... (to be discussed later).

One more note: for "network availability" safety, I initially added the following pre-task to my testing playbook to be 100% sure to start ansible-playbook at the right moment (I noticed that libvirt and its vagrant provider can deliver the IP address some time before the address is actually reachable).

  pre_tasks:
    - name: "wait for the node to be ready"
      local_action: wait_for host={{ ansible_ssh_host }} port={{ ansible_ssh_port }}

After having success, I blindly tried without this pre-task, and all my (few) test runs still passed. I guess the vagrant up stage (or the resulting delays) is enough to "ensure" that all the machines are reachable by the time vagrant provision starts. Sorry, these are purely empirical results, as I didn't take the time to dig into the relevant code. Hope it helps! I'm looking forward to hearing about your tests...

@electrofelix before having a fix (e.g. #7190), could you use the "parallel provisioning trick" mentioned above?

I'm curious about the performance difference between:

_parallel machines boot without provisioning, followed by a single Ansible parallel provisioning run_

$ vagrant up --provider=libvirt --no-provision && vagrant provision

and

_vagrant up in parallel, using a distinct Ansible provision run for each machine_ (with #7190 fix)

$ vagrant up --provider=libvirt --parallel

It's on my todo list; I just have to refactor some stuff to do it that way, using the same playbook for all 7 nodes (one of them has a different playbook, but it is a superset of the generic playbook applied to the other 6).

@electrofelix great news, looking forward to receiving your benchmark results :boom:
