Minimal repro:
require 'google/cloud/pubsub'
s = Google::Cloud::Pubsub::Service.new("czi-dev", Google::Cloud::Credentials.default)
t = Thread.new { puts s.pull("test.walter.pull").inspect }
t.join
# Hit ctrl-C while waiting for pull to return
Crash:
Exception Type: EXC_BAD_ACCESS (SIGABRT)
Exception Codes: KERN_INVALID_ADDRESS at 0x00007000089af808
VM Regions Near 0x7000089af808:
Stack 00007000088ac000-00007000088b2000 [ 24K] rw-/rwx SM=COW thread 1
-->
Stack 0000700008b39000-0000700008b3b000 [ 8K] rw-/rwx SM=PRV
...
5 libruby.2.3.0.dylib 0x000000010f9fe754 sigsegv + 68
6 libsystem_platform.dylib 0x00007fffb5e50bba _sigtramp + 26
7 ??? 000000000000000000 0 + 0
8 grpc_c.bundle 0x0000000110405c54 grpc_exec_ctx_flush + 84
9 grpc_c.bundle 0x0000000110414313 grpc_call_cancel_with_status + 323
10 grpc_c.bundle 0x00000001104140e8 grpc_call_destroy + 376
11 grpc_c.bundle 0x00000001103f3ddb 0x1103f1000 + 11739
12 libruby.2.3.0.dylib 0x000000010f953ff0 finalize_list + 64 (gc.c:2678)
13 libruby.2.3.0.dylib 0x000000010f947ef8 rb_gc_call_finalizer_at_exit + 1272 (gc.c:2842)
14 libruby.2.3.0.dylib 0x000000010f931c61 ruby_cleanup + 721 (eval.c:224)
It appears that c->cb points to unallocated address space here: https://github.com/grpc/grpc/blob/46357c882df1afc28f7a5228c40fde522093fa32/src/core/lib/iomgr/exec_ctx.c#L76
Perhaps a cleanup issue similar to #1096?
Thanks for the issue, and especially the repro, @wrs! This is great.
@swcloud Is this something that you can escalate to the GRPC team?
Adding @apolcyn
This does sound very similar to root cause of #1096. When the process terminates before gRPC connection is able to terminate properly (ctrl C before thr.join returns in this case) results in connection leak and segfault.
@wrs Follow up quesiton, does the thr.join eventually return, or it hangs indefinitely?
@hxiong388 When the pull completes, the join returns and there's a normal process exit.
This bug indeed looks within grpc, and it does looks similar to #1096.
This repros every time on OS X with this plain grpc repro:
tweaking the greeter_client.rb(https://github.com/grpc/grpc/blob/master/examples/ruby/greeter_client.rb) example main to:
def main
stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)
user = ARGV.size > 0 ? ARGV[0] : 'world'
t = Thread.new do
message = stub.say_hello(Helloworld::HelloRequest.new(name: user)).message
p "Greeting: #{message}"
end
t.join
end
and change the greeter_server.rb "greeter handler" to:
# GreeterServer is simple server that implements the Helloworld Greeter server.
class GreeterServer < Helloworld::Greeter::Service
# say_hello implements the SayHello rpc method.
def say_hello(hello_req, _unused_call)
sleep 10
Helloworld::HelloReply.new(message: "Hello #{hello_req.name}")
end
end
and then ctrl-c client while the server handler.
This sort of thing should be probably be handled within grpc though - can't be easily avoided by joining threads. cc @murgatroid99
(again, it looks like thread is getting garbage collected before the call on it completes. then references on that thread's stack get invalidly uses it when call objects GC function is called).
I have to ask: how important is it really that gRPC exits cleanly when the process is interrupted in the middle of execution? Does it really matter, as long as it exits in a timely manner?
It may be possible to correct this in gRPC, depending on what exactly is happening here. It looks like the thread's stack is getting deallocated while something has a reference to it. If that is indeed the case, fixing this is just a matter of allocating some things on the heap instead of the stack. But if Ruby is deallocating objects without running destructors, or something like that, there may not be much we can do about it.
Well, there's "not exiting cleanly", and then there's segfaulting! I don't think having services that core dump when you stop them with a simple SIGINT is a normal or reasonable situation.
@apolcyn Nice example, very easy to repro the leak. On my Ubuntu machine, the example doesn't produce a segfault, but I still get the "connection leaked" error message.
Besides termination with SIGINT, I'm also experiencing the connection leak with other thread terminations. For example, if we tweak the greeter_client.rb example with a t.kill:
def main
stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)
user = ARGV.size > 0 ? ARGV[0] : 'world'
t = Thread.new do
message = stub.say_hello(Helloworld::HelloRequest.new(name: user)).message
p "Greeting: #{message}"
end
t.run
sleep 1
t.kill
t.join
end
The child thread would return normally in this example. But when the process exits, the "connection leak" error message is still thrown.
Just on the feasibility though, it looks like the crash on invalid stack reference might be doable to fix, but it looks like getting rid of the connection leaked error would be complicated
also one other idea in the meantime: is it doable to trap the SIGINT and wait for a bit for the call to complete before exiting?
I'm not sure how that would work鈥f you trap the SIGINT then how would grpc know it was supposed to be exiting? Especially with blocking APIs like pubsub's listen.
@apolcyn I tried the trap approach and tried to explicitly terminate/destroy the grpc connection. The closest destructor I could find was grpc_stub.instance_variable_get(:@ch).close, assuming this would close the active communication channel. But the connection still leaks. Do you happen to know a better way to close the grpc connection without waiting for it to complete?
@hxiong388 there isn't such an API if you're making the call through the standard high-level stub.
but https://github.com/grpc/grpc/pull/10306#pullrequestreview-28975934 should fix this crash anyways, once that is in.
@apolcyn Is it possible to get that released in a 1.2.1.pre2 gem to make it easier for folks to verify?
To be clear, the "connection leaked" message is just a warning from gRPC. It's pretty much unavoidable if you're forcibly killing the process, and it shouldn't cause any problems. Any open sockets will still be closed by the kernel when the process dies.
@blowmage I think that can be done soon.
@hxiong388 @blowmage personally I'd tentatively think that the pre2 with that can be made on monday or early next week
@hxiong388 @blowmage the grpc-1.2.1-pre2 was just published with the fix for this (fix the client side segfault)
@wrs Can you upgrade to grpc 1.2.1-pre2 and let us know if it addresses this issue?
@apolcyn FWIW, I'm no longer seeing a segfault when using grpc 1.2.1-pre2. 馃憤
@apolcyn 1.2.1.pre2 no longer faults in my test program or in my actual program. :) 馃憤
Most helpful comment
@apolcyn 1.2.1.pre2 no longer faults in my test program or in my actual program. :) 馃憤