david@davids:~/git/puppet-resource_api$ bundle exec rspec --seed 40589 --bisect=verbose --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :fork
Running suite to find failures...^C
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
david@davids:~/git/puppet-resource_api$
At the point of ctrl+c, the process had already been sitting much longer than the test suite would normally take to run.
stracing the processes shows the following situation:
david@davids:~$ strace -s 10000 -p 6090 -p 6096
strace: Process 6090 attached
strace: Process 6096 attached
[pid 6090] wait4(6096, <unfinished ...>
[pid 6096] ppoll([{fd=10, events=POLLOUT}], 1, NULL, NULL, 8^Cstrace: Process 6090 detached
strace: Process 6096 detached
<detached ...>
david@davids:~$
where pid 6090 is the --bisect process and pid 6096 is the child rspec process. From other traces I've been running, I understand that the ppoll is waiting on an IO event after/while writing the results to fd 10. Meanwhile the main process is hanging in waitpid at https://github.com/rspec/rspec-core/blob/2800fe118fa2efeb87b7bdc7888476373debf6ca/lib/rspec/core/bisect/fork_runner.rb#L95
A common reason why this might not show up in testing is if the result report in the tests is smaller than the underlying OS's pipe buffer size. In that case the runner process exits after writing to the buffer, and the parent happily continues reading from the buffer. In my case the test suite results are ~93kB and the processes deadlock.
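To see the mechanism in isolation, here is a minimal Ruby sketch of the same parent/child deadlock, independent of RSpec; the 200_000-byte payload is just an arbitrary value above the usual 64 KiB pipe capacity:

read_io, write_io = IO.pipe

pid = fork do
  read_io.close
  write_io.write("x" * 200_000) # larger than the pipe buffer, so the write blocks
  write_io.close
end

write_io.close
Process.waitpid(pid)   # blocks forever: the child cannot exit until its write finishes
payload = read_io.read # never reached, so the pipe is never drained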
To reproduce: bundle install the gems according to the Gemfile.lock. Then this command hangs:
david@davids:~/git/puppet-resource_api$ bundle exec rspec --bisect=verbose --seed 40589 --pattern 'spec/**{,/*/**}/*_spec.rb' --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :fork
Running suite to find failures...
while this creates results:
david@davids:~/git/puppet-resource_api$ bundle exec rspec --bisect=verbose --require shell --seed 40589 --pattern 'spec/**{,/*/**}/*_spec.rb' --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--require shell --seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :shell
Running suite to find failures... (2 minutes 48.2 seconds)
- Failing examples (1):
- ./spec/puppet/resource_api_spec.rb[1:16:2:1]
- Non-failing examples (1398):
- ./spec/acceptance/array_spec.rb[1:1:1]
[...]
Thanks a lot for this detailed answer and the clues.
I would love to take a deeper look at this one. In the meantime, did you see this discussion: https://github.com/rspec/rspec-rails/issues/1353 ?
@DavidS I'd love it if you could provide us with an isolated reproduction of this. My time is quite limited at the moment and I won't be able to get to a reproduction containing complex code like puppet; another reason for this is to rule out the possibility of puppet itself causing the deadlock.
If it is just a buffer issue as you suggest, it should be possible to trigger it with only "RSpec" code, right?
Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself?
> Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself?
Definitely 💚
require 'rspec'
RSpec.describe "a bunch of nothing" do
(0...3000).each do |t|
it { expect(t).to eq t }
end
end
is an example that immediately deadlocks for me when running under rspec --bisect:
david@davids:~/tmp/rspec-deadlock-example$ bundle exec rspec spec/nil_spec.rb --bisect
Bisect started using options: "spec/nil_spec.rb"
Running suite to find failures...^C
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
Bisect aborted!
The most minimal reproduction command discovered so far is:
(Not yet enough information to provide any repro command)
david@davids:~/tmp/rspec-deadlock-example$ cat spec/nil_spec.rb
root@davids:~# strace -p 27326 -p 27329
strace: Process 27326 attached
strace: Process 27329 attached
[pid 27326] wait4(27329, <unfinished ...>
[pid 27329] ppoll([{fd=8, events=POLLOUT}], 1, NULL, NULL, 8^Cstrace: Process 27326 detached
strace: Process 27329 detached
<detached ...>
root@davids:~#
By changing the 3000, you can make the result set arbitrarily large if your OS has a larger default buffer size.
Thanks, yes, that triggers the issue for me. As a workaround the shell runner of course works, but then I guess that's why you have #2638 open :joy:
Exactly :-D
Thanks a lot for the time and work y'all put into rspec.
Thanks. I was able to reproduce it on my mac with a very basic example like the one you mentioned.
A very interesting answer has been posted by @palkan here: https://github.com/benoittgt/rspec_repro_bisect_deadlock/issues/1
I also had success removing Process.waitpid, but I was not sure it was a good idea. I didn't understand its usage here.
@palkan comes to the same conclusion as my original analysis. Process.waitpid does some OS-level cleanup that is sometimes necessary for the overall health of the OS. In this case, with a single child process, we can be lazy and not do it at all, or defer it until all output is read and processed.
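As a rough illustration of the cleanup in question (a sketch, not RSpec code): until the parent reaps an exited child, the child lingers as a zombie in the process table.

pid = fork { exit 0 }
sleep 0.5
system("ps -o pid,stat,comm -p #{pid}")  # on Linux, STAT shows Z and the command is <defunct>
Process.waitpid(pid)                     # reap the child; the zombie entry disappears
system("ps -o pid,stat,comm -p #{pid}")  # the pid is no longer listed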
Thanks a lot @DavidS for this additional information.
What do you think about the proposal of @palkan?
> Use IO#write_nonblock should work. That would require changes in both send and receive methods.
I tried this in utilities.rb:
def send(message)
packet = Marshal.dump(message)
- @write_io.write("#{packet.bytesize}\n#{packet}")
+ @write_io.write_nonblock("#{packet.bytesize}\n#{packet}")
end
# rubocop:disable Security/MarshalLoad
def receive
packet_size = Integer(@read_io.gets)
- Marshal.load(@read_io.read(packet_size))
+ Marshal.load(@read_io.read_nonblock(packet_size))
but I got:
Traceback (most recent call last):
18: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/bin/rspec:23:in `<main>'
17: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/bin/rspec:23:in `load'
16: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/exe/rspec:4:in `<top (required)>'
15: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/runner.rb:45:in `invoke'
14: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/runner.rb:69:in `run'
13: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/invocations.rb:36:in `call'
12: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:17:in `bisect_with'
11: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:27:in `bisect'
10: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:49:in `start_bisect_runner'
9: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:38:in `start'
8: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:31:in `block in bisect'
7: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:20:in `find_minimal_repro'
6: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:117:in `prep'
5: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:155:in `track_duration'
4: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:118:in `block in prep'
3: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:59:in `original_results'
2: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:71:in `dispatch_run'
1: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/utilities.rb:47:in `receive'
/Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/utilities.rb:47:in `load': marshal data too short (ArgumentError)
I am OK with removing the waitpid.
Reading through the docs of write_nonblock (https://apidock.com/ruby/IO/write_nonblock), I also think that in the original error case you would just get an Errno::EWOULDBLOCK instead of writing the data out.
> in the original error case you would just get an Errno::EWOULDBLOCK
I tried this: it fills the buffer and returns 65536. Not sure when Errno::EWOULDBLOCK could happen.
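A small sketch of when the exception would show up: write_nonblock does a partial write as long as there is room, and only raises IO::EAGAINWaitWritable (on Linux, Errno::EWOULDBLOCK is the same errno as EAGAIN) once the pipe is already full. That matches the 65536 you saw.

r, w = IO.pipe
filled = 0
begin
  loop { filled += w.write_nonblock("x" * 4096) }
rescue IO::WaitWritable
  # raised only once the pipe buffer is full (about 64 KiB by default on Linux)
  puts "pipe accepted #{filled} bytes before it would block"
end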
So, that would be a partial write (and that's why we see "marshal data too short").
My initial suggestion that we could leverage write_nonblock was not correct: we need to initiate the read before calling waitpid. That's the main problem.
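Applied to the minimal pipe example from earlier in the thread, that ordering looks roughly like this (a sketch of the principle, not the actual RSpec code): drain the pipe first, reap the child afterwards.

read_io, write_io = IO.pipe

pid = fork do
  read_io.close
  write_io.write("x" * 200_000)
  write_io.close
end

write_io.close
payload = read_io.read  # read first, so the child can finish writing and exit
Process.waitpid(pid)    # then reap it without any risk of deadlock
puts payload.bytesize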
Probably, the following refactoring could be a bit better than just dropping Process.waitpid:
def dispatch_run(run_descriptor)
- @run_dispatcher.dispatch_specs(run_descriptor)
+ pid = @run_dispatcher.dispatch_specs(run_descriptor)
@channel.receive.tap do |result|
# ...
Process.waitpid(pid)
end
end
@palkan do you feel like working up a patch with something like that?
@JonRowe Not in the next couple of weeks. I can try to find another _evil martian_ to help with this)
It is working with @palkan's proposal:
@@ -1,9 +1,10 @@
def dispatch_run(run_descriptor)
- @run_dispatcher.dispatch_specs(run_descriptor)
+ pid = @run_dispatcher.dispatch_specs(run_descriptor)
@channel.receive.tap do |result|
if result.is_a?(String)
raise BisectFailedError.for_failed_spec_run(result)
end
+ Process.waitpid(pid)
end
end
@@ -23,6 +24,5 @@
end
def dispatch_specs(run_descriptor)
- pid = fork { run_specs(run_descriptor) }
- Process.waitpid(pid)
+ fork { puts run_specs(run_descriptor) }
end
But I haven't yet verified what David mentioned:
> Process.waitpid does some OS-level cleanup that is sometimes necessary for the overall health of the OS
I dug into the MRI source code, but at the moment I have no clue what can happen when it doesn't "clean up". I don't see it.
Another proposal would be to use WNOHANG. From the waitpid doc:
> If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately (...)
--- a/lib/rspec/core/bisect/fork_runner.rb
+++ b/lib/rspec/core/bisect/fork_runner.rb
@@ -92,7 +92,7 @@ module RSpec
def dispatch_specs(run_descriptor)
pid = fork { run_specs(run_descriptor) }
- Process.waitpid(pid)
+ Process.waitpid(pid, Process::WNOHANG)
It is working too.
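For context, a small sketch of the WNOHANG behaviour in Ruby (nothing bisect-specific): the call returns nil immediately while the child is still running, instead of blocking.

pid = fork { sleep 1 }
p Process.waitpid(pid, Process::WNOHANG)  # => nil, child not yet waitable, no blocking
sleep 2
p Process.waitpid(pid, Process::WNOHANG)  # => pid, the exited child is reaped now

The trade-off is that a nil return means the child has not actually been reaped at that point, so something still has to collect it later (or it lingers until the parent exits).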
I recently used my last proposal with Process::WNOHANG in a project with a stuck bisect. It helped us get a result instead of a locked bisect.
Don't you think a PR with that change would be a good idea? I'm not sure how we can test this at the moment; I have to think about it.
@benoittgt's Process::WNOHANG solution also worked for me.
I just ran into this: I'm on rspec-core 3.9.2 on OS X and I still seem to get the hanging behavior right before it outputs the failing and non-failing examples. Using the shell runner works... Does anyone have suggestions on how to debug what is going on?
Hello @agibralter
There are many commands in the PR that may help you understand what is happening.
If you manage to find an easy way to reproduce it, I would be very happy to patch it.
I've contributed a PR which addresses a specific situation in which rspec --bisect hangs: when the encoding is set to UTF-8 (e.g. by Rails) and the forked child process tries to send BINARY/ASCII-8BIT encoded data back to the parent, the communication channel (Bisect::Channel) silently rejects it due to the encoding mismatch. The parent process, whose output you're seeing, then waits indefinitely for the child process, which has since errored out.
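As a generic illustration of that kind of UTF-8 vs ASCII-8BIT clash (not the actual code path inside Bisect::Channel): Marshal.dump always returns a binary string, and combining it with UTF-8 text raises once both sides contain non-ASCII bytes.

utf8_text = "résumé"                # UTF-8 with non-ASCII characters
packet    = Marshal.dump(utf8_text) # ASCII-8BIT, contains the high-bit bytes of "é"
utf8_text + packet                  # raises Encoding::CompatibilityError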