david@davids:~/git/puppet-resource_api$ bundle exec rspec --seed 40589 --bisect=verbose --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :fork
Running suite to find failures...^C
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
david@davids:~/git/puppet-resource_api$
At the point of ctrl+c, the process had already been sitting much longer than the test suite would normally take to run.
stracing the processes shows the following situation:
david@davids:~$ strace -s 10000 -p 6090 -p 6096
strace: Process 6090 attached
strace: Process 6096 attached
[pid 6090] wait4(6096, <unfinished ...>
[pid 6096] ppoll([{fd=10, events=POLLOUT}], 1, NULL, NULL, 8^Cstrace: Process 6090 detached
strace: Process 6096 detached
<detached ...>
david@davids:~$
where pid 6090 is the --bisect process and pid 6096 is the child rspec process. From other traces I've been running, I understand that the ppoll is waiting on an IO event after/while writing the results to fd 10. Meanwhile the main process is hanging in waitpid at https://github.com/rspec/rspec-core/blob/2800fe118fa2efeb87b7bdc7888476373debf6ca/lib/rspec/core/bisect/fork_runner.rb#L95
A common reason why this might not show up in testing is if the result report in the tests is smaller than the underlying OS's pipe buffer size. In that case the runner process exits after writing to the buffer, and the parent happily continues reading from the buffer. In my case the test suite results are ~93kB and the processes deadlock.
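To see the mechanism in isolation, here is a minimal Ruby sketch of the same parent/child deadlock, independent of RSpec; the 200_000-byte payload is just an arbitrary value above the usual 64 KiB pipe capacity:

read_io, write_io = IO.pipe

pid = fork do
  read_io.close
  write_io.write("x" * 200_000) # larger than the pipe buffer, so the write blocks
  write_io.close
end

write_io.close
Process.waitpid(pid)   # blocks forever: the child cannot exit until its write finishes
payload = read_io.read # never reached, so the pipe is never drained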
To reproduce: bundle install the gems according to the Gemfile.lock. Then this command hangs:
david@davids:~/git/puppet-resource_api$ bundle exec rspec --bisect=verbose --seed 40589 --pattern 'spec/**{,/*/**}/*_spec.rb' --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :fork
Running suite to find failures...
while this creates results:
david@davids:~/git/puppet-resource_api$ bundle exec rspec --bisect=verbose --require shell --seed 40589 --pattern 'spec/**{,/*/**}/*_spec.rb' --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--require shell --seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :shell
Running suite to find failures... (2 minutes 48.2 seconds)
- Failing examples (1):
- ./spec/puppet/resource_api_spec.rb[1:16:2:1]
- Non-failing examples (1398):
- ./spec/acceptance/array_spec.rb[1:1:1]
[...]
Thanks a lot for this detailed answer and the clues.
I would love to take a deeper look at this one. In the meantime, did you see this discussion: https://github.com/rspec/rspec-rails/issues/1353 ?
@DavidS I'd love it if you could provide us with an isolated reproduction of this. My time is quite limited at the moment and I won't be able to get to a reproduction containing complex code like puppet; another reason for this is to rule out the possibility of puppet itself causing the deadlock.
If it is just a buffer issue as you suggest, it should be possible to trigger it with only "RSpec" code, right?
Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself?
> Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself?
Definitely 💚
require 'rspec'
RSpec.describe "a bunch of nothing" do
(0...3000).each do |t|
it { expect(t).to eq t }
end
end
is an example that immediately deadlocks for me when running under rspec --bisect:
david@davids:~/tmp/rspec-deadlock-example$ bundle exec rspec spec/nil_spec.rb --bisect
Bisect started using options: "spec/nil_spec.rb"
Running suite to find failures...^C
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
david@davids:~/tmp/rspec-deadlock-example$ cat spec/nil_spec.rb
root@davids:~# strace -p 27326 -p 27329
strace: Process 27326 attached
strace: Process 27329 attached
[pid 27326] wait4(27329, <unfinished ...>
[pid 27329] ppoll([{fd=8, events=POLLOUT}], 1, NULL, NULL, 8^Cstrace: Process 27326 detached
strace: Process 27329 detached
<detached ...>
root@davids:~#
By changing the 3000, you can make the result set arbitrarily large if your OS has a larger default buffer size.
Thanks, yes, that triggers the issue for me. As a workaround the shell runner of course works, but then I guess that's why you have #2638 open :joy:
Exactly :-D
Thanks a lot for the time and work y'all put into rspec.
Thanks. I was able to reproduce it on my mac with a very basic example like the one you mentioned.
A very interesting answer has been posted by @palkan here: https://github.com/benoittgt/rspec_repro_bisect_deadlock/issues/1
I also had success removing Process.waitpid, but I was not sure it was a good idea. I didn't understand its usage here.
@palkan comes to the same conclusion as my original analysis. Process.waitpid does some OS-level cleanup that is sometimes necessary for the overall health of the OS. In this case, with a single child process, we can be lazy and not do it at all, or defer it until all output is read and processed.
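As a rough illustration of the cleanup in question (a sketch, not RSpec code): until the parent reaps an exited child, the child lingers as a zombie in the process table.

pid = fork { exit 0 }
sleep 0.5
system("ps -o pid,stat,comm -p #{pid}")  # on Linux, STAT shows Z and the command is <defunct>
Process.waitpid(pid)                     # reap the child; the zombie entry disappears
system("ps -o pid,stat,comm -p #{pid}")  # the pid is no longer listed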
Thanks a lot @DavidS for this additional information.
What do you think about the proposal of @palkan?
> Use IO#write_nonblock should work. That would require changes in both send and receive methods.
I tried this in utilities.rb:
def send(message)
packet = Marshal.dump(message)
- @write_io.write("#{packet.bytesize}\n#{packet}")
+ @write_io.write_nonblock("#{packet.bytesize}\n#{packet}")
end
# rubocop:disable Security/MarshalLoad
def receive
packet_size = Integer(@read_io.gets)
- Marshal.load(@read_io.read(packet_size))
+ Marshal.load(@read_io.read_nonblock(packet_size))
but I got:
Traceback (most recent call last):
18: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/bin/rspec:23:in `<main>'
17: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/bin/rspec:23:in `load'
16: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/exe/rspec:4:in `<top (required)>'
15: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/runner.rb:45:in `invoke'
14: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/runner.rb:69:in `run'
13: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/invocations.rb:36:in `call'
12: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:17:in `bisect_with'
11: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:27:in `bisect'
10: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:49:in `start_bisect_runner'
9: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:38:in `start'
8: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:31:in `block in bisect'
7: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:20:in `find_minimal_repro'
6: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:117:in `prep'
5: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:155:in `track_duration'
4: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:118:in `block in prep'
3: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:59:in `original_results'
2: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:71:in `dispatch_run'
1: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/utilities.rb:47:in `receive'
/Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/utilities.rb:47:in `load': marshal data too short (ArgumentError)
I am OK with removing the waitpid.
Reading through the docs of write_nonblock (https://apidock.com/ruby/IO/write_nonblock), I also think that in the original error case you would just get an Errno::EWOULDBLOCK instead of writing the data out.
> in the original error case you would just get an Errno::EWOULDBLOCK
I tried this: it fills the buffer and returns 65536. Not sure when Errno::EWOULDBLOCK could happen.
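A small sketch of when the exception would show up: write_nonblock does a partial write as long as there is room, and only raises IO::EAGAINWaitWritable (on Linux, Errno::EWOULDBLOCK is the same errno as EAGAIN) once the pipe is already full. That matches the 65536 you saw.

r, w = IO.pipe
filled = 0
begin
  loop { filled += w.write_nonblock("x" * 4096) }
rescue IO::WaitWritable
  # raised only once the pipe buffer is full (about 64 KiB by default on Linux)
  puts "pipe accepted #{filled} bytes before it would block"
end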
So, that would be a partial write (and that's why we see "marshal data too short").
My initial suggestion that we could leverage write_nonblock was not correct: we need to initiate the read before calling waitpid. That's the main problem.
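Applied to the minimal pipe example from earlier in the thread, that ordering looks roughly like this (a sketch of the principle, not the actual RSpec code): drain the pipe first, reap the child afterwards.

read_io, write_io = IO.pipe

pid = fork do
  read_io.close
  write_io.write("x" * 200_000)
  write_io.close
end

write_io.close
payload = read_io.read  # read first, so the child can finish writing and exit
Process.waitpid(pid)    # then reap it without any risk of deadlock
puts payload.bytesize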
Probably, the following refactoring could be a bit better than just dropping Process.waitpid:
def dispatch_run(run_descriptor)
- @run_dispatcher.dispatch_specs(run_descriptor)
+ pid = @run_dispatcher.dispatch_specs(run_descriptor)
@channel.receive.tap do |result|
# ...
Process.waitpid(pid)
end
end
@palkan do you feel like working up a patch with something like that?
@JonRowe Not in the next couple of weeks. I can try to find another _evil martian_ to help with this)
It is working with @palkan's proposal:
@@ -1,9 +1,10 @@
def dispatch_run(run_descriptor)
- @run_dispatcher.dispatch_specs(run_descriptor)
+ pid = @run_dispatcher.dispatch_specs(run_descriptor)
@channel.receive.tap do |result|
if result.is_a?(String)
raise BisectFailedError.for_failed_spec_run(result)
end
+ Process.waitpid(pid)
end
end
@@ -23,6 +24,5 @@
end
def dispatch_specs(run_descriptor)
- pid = fork { run_specs(run_descriptor) }
- Process.waitpid(pid)
+ fork { puts run_specs(run_descriptor) }
end
But I haven't yet verified what David mentioned:
> Process.waitpid does some OS-level cleanup that is sometimes necessary for the overall health of the OS
I dug into the MRI source code, but at the moment I have no clue what can happen when it doesn't "clean up". I don't see it.
Another proposal would be to use WNOHANG. From the waitpid doc:
> If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately (...)
--- a/lib/rspec/core/bisect/fork_runner.rb
+++ b/lib/rspec/core/bisect/fork_runner.rb
@@ -92,7 +92,7 @@ module RSpec
def dispatch_specs(run_descriptor)
pid = fork { run_specs(run_descriptor) }
- Process.waitpid(pid)
+ Process.waitpid(pid, Process::WNOHANG)
It is working too.
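For context, a small sketch of the WNOHANG behaviour in Ruby (nothing bisect-specific): the call returns nil immediately while the child is still running, instead of blocking.

pid = fork { sleep 1 }
p Process.waitpid(pid, Process::WNOHANG)  # => nil, child not yet waitable, no blocking
sleep 2
p Process.waitpid(pid, Process::WNOHANG)  # => pid, the exited child is reaped now

The trade-off is that a nil return means the child has not actually been reaped at that point, so something still has to collect it later (or it lingers until the parent exits).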
I recently used my last proposal with Process::WNOHANG in a project with a stuck bisect. It helped us get a result instead of a locked bisect.
Don't you think a PR with that change would be a good idea? I'm not sure how we can test this at the moment; I have to think about it.
@benoittgt's Process::WNOHANG solution also worked for me.
I just ran into this: I'm on rspec-core 3.9.2 on OS X and I still seem to get the hanging behavior right before it outputs the failing and non-failing examples. Using the shell runner works... Does anyone have suggestions on how to debug what is going on?
Hello @agibralter
There are many commands in the PR that may help you understand what is happening.
If you manage to find an easy way to reproduce it, I would be very happy to patch it.
I've contributed a PR which addresses a specific situation in which rspec --bisect hangs: when the encoding is set to UTF-8 (e.g. by Rails) and the forked child process tries to send BINARY/ASCII-8BIT encoded data back to the parent, the communication channel (Bisect::Channel) silently rejects it due to the encoding mismatch. The parent process, whose output you're seeing, then waits indefinitely for the child process, which has since errored out.
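As a generic illustration of that kind of UTF-8 vs ASCII-8BIT clash (not the actual code path inside Bisect::Channel): Marshal.dump always returns a binary string, and combining it with UTF-8 text raises once both sides contain non-ASCII bytes.

utf8_text = "résumé"                # UTF-8 with non-ASCII characters
packet    = Marshal.dump(utf8_text) # ASCII-8BIT, contains the high-bit bytes of "é"
utf8_text + packet                  # raises Encoding::CompatibilityError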