Google-cloud-ruby: Subscriptions on threads segfault on exit

Created on 14 Mar 2017 · 21Comments · Source: googleapis/google-cloud-ruby

Minimal repro:

ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-darwin15]
google-cloud-core (0.21.1)
google-cloud-pubsub (0.23.2)
google-protobuf (3.2.0.1)
googleapis-common-protos (1.3.5)
grpc (1.1.2)

require 'google/cloud/pubsub'
s = Google::Cloud::Pubsub::Service.new("czi-dev", Google::Cloud::Credentials.default)
t = Thread.new { puts s.pull("test.walter.pull").inspect }
t.join
# Hit ctrl-C while waiting for pull to return

Crash:

Exception Type:        EXC_BAD_ACCESS (SIGABRT)
Exception Codes:       KERN_INVALID_ADDRESS at 0x00007000089af808

VM Regions Near 0x7000089af808:
    Stack                  00007000088ac000-00007000088b2000 [   24K] rw-/rwx SM=COW  thread 1
--> 
    Stack                  0000700008b39000-0000700008b3b000 [    8K] rw-/rwx SM=PRV  

...
5   libruby.2.3.0.dylib             0x000000010f9fe754 sigsegv + 68
6   libsystem_platform.dylib        0x00007fffb5e50bba _sigtramp + 26
7   ???                             000000000000000000 0 + 0
8   grpc_c.bundle                   0x0000000110405c54 grpc_exec_ctx_flush + 84
9   grpc_c.bundle                   0x0000000110414313 grpc_call_cancel_with_status + 323
10  grpc_c.bundle                   0x00000001104140e8 grpc_call_destroy + 376
11  grpc_c.bundle                   0x00000001103f3ddb 0x1103f1000 + 11739
12  libruby.2.3.0.dylib             0x000000010f953ff0 finalize_list + 64 (gc.c:2678)
13  libruby.2.3.0.dylib             0x000000010f947ef8 rb_gc_call_finalizer_at_exit + 1272 (gc.c:2842)
14  libruby.2.3.0.dylib             0x000000010f931c61 ruby_cleanup + 721 (eval.c:224)

It appears that c->cb points to unallocated address space here: https://github.com/grpc/grpc/blob/46357c882df1afc28f7a5228c40fde522093fa32/src/core/lib/iomgr/exec_ctx.c#L76

Perhaps a cleanup issue similar to #1096?

pubsub p1 acknowledged blocked bug

Source

wrs

Most helpful comment

@apolcyn 1.2.1.pre2 no longer faults in my test program or in my actual program. :) 👍

wrs on 27 Mar 2017

🎉2

All 21 comments

Thanks for the issue, and especially the repro, @wrs! This is great.

@swcloud Is this something that you can escalate to the GRPC team?

blowmage on 14 Mar 2017

Adding @apolcyn

swcloud on 14 Mar 2017

This does sound very similar to root cause of #1096. When the process terminates before gRPC connection is able to terminate properly (ctrl C before thr.join returns in this case) results in connection leak and segfault.

@wrs Follow up quesiton, does the thr.join eventually return, or it hangs indefinitely?

hxiong388 on 14 Mar 2017

@hxiong388 When the pull completes, the join returns and there's a normal process exit.

wrs on 14 Mar 2017

This bug indeed looks within grpc, and it does looks similar to #1096.

This repros every time on OS X with this plain grpc repro:

tweaking the greeter_client.rb(https://github.com/grpc/grpc/blob/master/examples/ruby/greeter_client.rb) example main to:

def main
  stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)
  user = ARGV.size > 0 ?  ARGV[0] : 'world'
  t = Thread.new do
    message = stub.say_hello(Helloworld::HelloRequest.new(name: user)).message
    p "Greeting: #{message}"
  end
  t.join
end

and change the greeter_server.rb "greeter handler" to:

# GreeterServer is simple server that implements the Helloworld Greeter server.
class GreeterServer < Helloworld::Greeter::Service
  # say_hello implements the SayHello rpc method.
  def say_hello(hello_req, _unused_call)
    sleep 10
    Helloworld::HelloReply.new(message: "Hello #{hello_req.name}")
  end
end

and then ctrl-c client while the server handler.

This sort of thing should be probably be handled within grpc though - can't be easily avoided by joining threads. cc @murgatroid99

(again, it looks like thread is getting garbage collected before the call on it completes. then references on that thread's stack get invalidly uses it when call objects GC function is called).

apolcyn on 20 Mar 2017

❤1

I have to ask: how important is it really that gRPC exits cleanly when the process is interrupted in the middle of execution? Does it really matter, as long as it exits in a timely manner?

It may be possible to correct this in gRPC, depending on what exactly is happening here. It looks like the thread's stack is getting deallocated while something has a reference to it. If that is indeed the case, fixing this is just a matter of allocating some things on the heap instead of the stack. But if Ruby is deallocating objects without running destructors, or something like that, there may not be much we can do about it.

murgatroid99 on 20 Mar 2017

Well, there's "not exiting cleanly", and then there's segfaulting! I don't think having services that core dump when you stop them with a simple SIGINT is a normal or reasonable situation.

wrs on 20 Mar 2017

@apolcyn Nice example, very easy to repro the leak. On my Ubuntu machine, the example doesn't produce a segfault, but I still get the "connection leaked" error message.

Besides termination with SIGINT, I'm also experiencing the connection leak with other thread terminations. For example, if we tweak the greeter_client.rb example with a t.kill:

def main
  stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)
  user = ARGV.size > 0 ?  ARGV[0] : 'world'
  t = Thread.new do
    message = stub.say_hello(Helloworld::HelloRequest.new(name: user)).message
    p "Greeting: #{message}"
  end
  t.run
  sleep 1
  t.kill
  t.join
end

The child thread would return normally in this example. But when the process exits, the "connection leak" error message is still thrown.

hxiong388 on 20 Mar 2017

Just on the feasibility though, it looks like the crash on invalid stack reference might be doable to fix, but it looks like getting rid of the connection leaked error would be complicated

apolcyn on 20 Mar 2017

also one other idea in the meantime: is it doable to trap the SIGINT and wait for a bit for the call to complete before exiting?

apolcyn on 24 Mar 2017

I'm not sure how that would work…if you trap the SIGINT then how would grpc know it was supposed to be exiting? Especially with blocking APIs like pubsub's listen.

wrs on 24 Mar 2017

@apolcyn I tried the trap approach and tried to explicitly terminate/destroy the grpc connection. The closest destructor I could find was grpc_stub.instance_variable_get(:@ch).close, assuming this would close the active communication channel. But the connection still leaks. Do you happen to know a better way to close the grpc connection without waiting for it to complete?

hxiong388 on 24 Mar 2017

@hxiong388 there isn't such an API if you're making the call through the standard high-level stub.

but https://github.com/grpc/grpc/pull/10306#pullrequestreview-28975934 should fix this crash anyways, once that is in.

apolcyn on 24 Mar 2017

@apolcyn Is it possible to get that released in a 1.2.1.pre2 gem to make it easier for folks to verify?

blowmage on 24 Mar 2017

To be clear, the "connection leaked" message is just a warning from gRPC. It's pretty much unavoidable if you're forcibly killing the process, and it shouldn't cause any problems. Any open sockets will still be closed by the kernel when the process dies.

murgatroid99 on 24 Mar 2017

@blowmage I think that can be done soon.

apolcyn on 24 Mar 2017

@hxiong388 @blowmage personally I'd tentatively think that the pre2 with that can be made on monday or early next week

apolcyn on 25 Mar 2017

@hxiong388 @blowmage the grpc-1.2.1-pre2 was just published with the fix for this (fix the client side segfault)

apolcyn on 27 Mar 2017

❤1 🎉1

@wrs Can you upgrade to grpc 1.2.1-pre2 and let us know if it addresses this issue?

blowmage on 27 Mar 2017

@apolcyn FWIW, I'm no longer seeing a segfault when using grpc 1.2.1-pre2. 👍

blowmage on 27 Mar 2017

@apolcyn 1.2.1.pre2 no longer faults in my test program or in my actual program. :) 👍

wrs on 27 Mar 2017

🎉2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Non-UTF8 strings in Google-Cloud-Language response

bendillinger · 3Comments

When can we expect AutoML support?

echan00 · 4Comments

In Pubsub: NameError: uninitialized constant Google::Cloud::Pubsub::V1::Subscriber

mackeee-orange · 3Comments

Monitoring Docs example for MetricServiceClient#list_time_series returns an error

thagomizer · 4Comments

google-cloud-ruby doesn't bundle for ruby 2.5.0

shivanshgaur · 3Comments