We are seeing segfaults on CI running Ruby 2.5. Currently only seeing this on OS X, but the segfaults are consistent. Every run segfaults.
Here are some of the failed builds:
cc @jbolinger
From https://travis-ci.org/GoogleCloudPlatform/google-cloud-ruby/jobs/341727015:
/Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/parse/lr_table.rb:265: [BUG] Segmentation fault at 0x00007000078c5390
ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-darwin16]
-- Crash Report log information --------------------------------------------
See Crash Report log file under the one of following:
* ~/Library/Logs/DiagnosticReports
* /Library/Logs/DiagnosticReports
for more details.
Don't forget to include the above Crash Report log file in bug reports.
-- Control frame information -----------------------------------------------
c:0045 p:0012 s:0261 e:000257 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/parse/lr_table.rb:265
c:0044 p:0009 s:0249 e:000248 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/parse/lr_table.rb:314 [FINISH]
c:0043 p:---- s:0244 e:000243 CFUNC :each_key
c:0042 p:0024 s:0240 e:000239 METHOD /Users/travis/.rvm/rubies/ruby-2.5.0/lib/ruby/2.5.0/set.rb:338
c:0041 p:0081 s:0235 E:000140 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/parse/lr_table.rb:313
c:0040 p:0047 s:0227 e:000226 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/parse/lr_table.rb:44
c:0039 p:0137 s:0217 e:000216 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/yacc.rb:316
c:0038 p:0064 s:0211 e:000210 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/rly-0.2.3/lib/rly/yacc.rb:17
c:0037 p:0008 s:0206 e:000205 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/google-gax-1.0.1/lib/google/gax/path_template.rb:55 [FINISH]
c:0036 p:---- s:0201 e:000200 CFUNC :new
c:0035 p:0020 s:0196 e:000195 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/google-gax-1.0.1/lib/google/gax/path_template.rb:134 [FINISH]
c:0034 p:---- s:0190 e:000189 CFUNC :new
c:0033 p:0095 s:0185 e:000183 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/google-gax-1.0.1/lib/google/gax/path_template.rb:180 [FINISH]
c:0032 p:---- s:0178 e:000177 CFUNC :each
c:0031 p:0015 s:0174 e:000173 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/google-gax-1.0.1/lib/google/gax/path_template.rb:173
c:0030 p:0016 s:0166 e:000165 METHOD /Users/travis/build/[secure]/google-cloud-ruby/google-cloud-spanner/lib/google/cloud/spanner/admin/database/v1/datab
It looks like this is happening in rly, rather than gRPC/protobuf, which is where we've frequently seen segfaults in dependencies in the past. This is pretty surprising, since rly appears to be a pure-Ruby library.
GAX uses rly to parse path templates. From practical experience, it seems like using a full-blown lexer/parser to handle path templates is probably overkill; I believe Go GAPICs were able to switch successfully to just using string concatenation to do the same tasks. If this winds up being a problem in rly removing this dependency might be a potential way forward.
Yes, @dazuma and I discussed removing rly and using addressable last week. This would be a workaround, and we are still unable to diagnose the cause.
@blowmage If you conclude that this is best worked around in GAX rather than fixed elsewhere, please file an issue in https://github.com/googleapis/gax-ruby/, and either @jbolinger or I can take a look at swapping out rly for addressable.
I removed all calls to rly and still see a segfault, which makes me think the issue is not specific to rly.
-- Control frame information -----------------------------------------------
c:0030 p:---- s:0163 e:000162 CFUNC :to_s
c:0029 p:---- s:0160 e:000159 CFUNC :String
c:0028 p:0012 s:0155 e:000153 BLOCK /Users/travis/build/[secure]/google-cloud-ruby/google-cloud-spanner/lib/google/cloud/spanner/data.rb:118 [FINISH]
c:0027 p:---- s:0150 e:000149 CFUNC :find_index
c:0026 p:0029 s:0146 e:000145 METHOD /Users/travis/.rvm/rubies/ruby-2.5.0/lib/ruby/2.5.0/forwardable.rb:229
c:0025 p:0092 s:0139 e:000138 METHOD /Users/travis/build/[secure]/google-cloud-ruby/google-cloud-spanner/lib/google/cloud/spanner/data.rb:118
c:0024 p:0421 s:0132 e:000131 METHOD /Users/travis/build/[secure]/google-cloud-ruby/google-cloud-spanner/test/google/cloud/spanner/snapshot/read_test.rb:
c:0023 p:0164 s:0125 e:000124 BLOCK /Users/travis/build/[secure]/google-cloud-ruby/google-cloud-spanner/test/google/cloud/spanner/snapshot/read_test.rb: [FINISH]
c:0022 p:0024 s:0119 e:000118 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest/test.rb:98
c:0021 p:0002 s:0116 e:000115 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest/test.rb:195
c:0020 p:0006 s:0111 e:000110 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest/test.rb:95
c:0019 p:0014 s:0108 e:000107 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:265
c:0018 p:0006 s:0103 e:000102 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest/test.rb:94
c:0017 p:0029 s:0100 E:000d60 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:360
c:0016 p:0044 s:0092 E:000a58 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest/test.rb:211
c:0015 p:0005 s:0085 E:002470 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest/test.rb:93
c:0014 p:0010 s:0081 e:000080 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:960
c:0013 p:0026 s:0074 e:000072 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:334
c:0012 p:0011 s:0066 e:000065 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:321 [FINISH]
c:0011 p:---- s:0062 e:000061 CFUNC :each
c:0010 p:0007 s:0058 e:000057 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:320
c:0009 p:0029 s:0055 E:002430 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:360
c:0008 p:0029 s:0047 E:001528 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:347
c:0007 p:0109 s:0040 E:000ee8 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:319
c:0006 p:0010 s:0031 e:000030 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:159 [FINISH]
c:0005 p:---- s:0027 e:000026 CFUNC :map
c:0004 p:0038 s:0023 e:000022 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:159
c:0003 p:0154 s:0014 e:000013 METHOD /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:136
c:0002 p:0059 s:0007 E:0019f8 BLOCK /Users/travis/.rvm/gems/ruby-2.5.0/gems/minitest-5.11.3/lib/minitest.rb:63 [FINISH]
c:0001 p:0000 s:0003 E:001780 (none) [FINISH]
Confirmed, I can still repro it on my macbook pro.
Now, I'm getting the crash from a variety of different tests and contexts, but according to the C backtrace, it's always coming from the same place in the GC.
-- C level backtrace information -------------------------------------------
0 ruby 0x00000001025aeee7 rb_vm_bugreport + 135
1 ruby 0x00000001024346b8 rb_bug_context + 472
2 ruby 0x00000001025234a1 sigsegv + 81
3 libsystem_platform.dylib 0x00007fff6d0fef5a _sigtramp + 26
4 ruby 0x000000010244dba3 rb_gc_mark_machine_stack + 99
5 ruby 0x000000010259df49 rb_execution_context_mark + 137
6 ruby 0x000000010241c3bb cont_mark + 27
7 ruby 0x0000000102458a12 gc_marks_rest + 146
8 ruby 0x00000001024573d0 gc_start + 2816
9 ruby 0x000000010245674f newobj_slowpath + 1055
10 ruby 0x0000000102456304 newobj_slowpath_wb_protected + 20
11 ruby 0x0000000102544716 rb_sym_to_s + 38
12 ruby 0x00000001025a9480 vm_call0_body + 560
13 ruby 0x00000001025aa208 rb_call0 + 152
14 ruby 0x0000000102597aae rb_funcall_with_block + 62
15 ruby 0x0000000102597fbe rb_yield + 158
(etc...)
This is the same C backtrace that is being reported in https://github.com/colszowka/simplecov/issues/652
I re-tried applying the patch for https://bugs.ruby-lang.org/issues/14357 but I can still reproduce the segfault. I also removed simplecov from the bundle, and can still reproduce the segfault. I also reconfirmed that the segfault goes away when I disable the GC completely.
Based on this, I reopened https://bugs.ruby-lang.org/issues/14334 and added some information.
Thank you, I鈥檓 still very weak from the flu. Can you run point on this for now?
Yeah, continuing to investigate.
For now, C backtrace when I recompile Ruby with -O0. The segfault still happens in the same place:
-- C level backtrace information -------------------------------------------
0 libruby.2.5.dylib 0x0000000103f93e19 rb_print_backtrace + 25
1 libruby.2.5.dylib 0x0000000103f93f28 rb_vm_bugreport + 136
2 libruby.2.5.dylib 0x0000000103d866f2 rb_bug_context + 450
3 libruby.2.5.dylib 0x0000000103ed84ee sigsegv + 94
4 libsystem_platform.dylib 0x00007fff6d0fef5a _sigtramp + 26
5 libruby.2.5.dylib 0x0000000103db65a1 mark_locations_array + 49
6 libruby.2.5.dylib 0x0000000103da75bb gc_mark_locations + 75
7 libruby.2.5.dylib 0x0000000103da77d9 mark_stack_locations + 41
8 libruby.2.5.dylib 0x0000000103da779f rb_gc_mark_machine_stack + 79
9 libruby.2.5.dylib 0x0000000103f75868 rb_execution_context_mark + 264
10 libruby.2.5.dylib 0x0000000103d5f63e cont_mark + 46
11 libruby.2.5.dylib 0x0000000103d5f572 fiber_mark + 146
12 libruby.2.5.dylib 0x0000000103dac4c6 gc_mark_children + 1094
13 libruby.2.5.dylib 0x0000000103db434c gc_mark_stacked_objects + 108
14 libruby.2.5.dylib 0x0000000103db4a5b gc_mark_stacked_objects_all + 27
15 libruby.2.5.dylib 0x0000000103db3cb1 gc_marks_rest + 129
16 libruby.2.5.dylib 0x0000000103db5787 gc_marks + 103
17 libruby.2.5.dylib 0x0000000103db22e2 gc_start + 802
18 libruby.2.5.dylib 0x0000000103db1d8c heap_prepare + 172
19 libruby.2.5.dylib 0x0000000103db1c7c heap_get_freeobj_from_next_freepage + 44
20 libruby.2.5.dylib 0x0000000103db1b9f heap_get_freeobj + 95
21 libruby.2.5.dylib 0x0000000103db1ab1 newobj_slowpath + 305
22 libruby.2.5.dylib 0x0000000103db191c newobj_slowpath_wb_protected + 76
23 libruby.2.5.dylib 0x0000000103da4ed9 newobj_of + 265
24 libruby.2.5.dylib 0x0000000103da4f52 rb_wb_protected_newobj_of + 50
25 libruby.2.5.dylib 0x0000000103ef773c str_alloc + 28
26 libruby.2.5.dylib 0x0000000103ef8763 str_duplicate + 51
27 libruby.2.5.dylib 0x0000000103ef898d rb_str_resurrect + 205
28 libruby.2.5.dylib 0x0000000103f6307f vm_exec_core + 2879
29 libruby.2.5.dylib 0x0000000103f73fe6 vm_exec + 182
30 libruby.2.5.dylib 0x0000000103f8b608 invoke_block + 216
31 libruby.2.5.dylib 0x0000000103f8b472 invoke_iseq_block_from_c + 434
32 libruby.2.5.dylib 0x0000000103f8b16e invoke_block_from_c_bh + 158
33 libruby.2.5.dylib 0x0000000103f8b0bb vm_yield + 107
34 libruby.2.5.dylib 0x0000000103f6ea43 rb_yield_0 + 35
35 libruby.2.5.dylib 0x0000000103f6ed3b rb_yield_values2 + 27
The C source suggests there is a bad pointer on the stack. As far as I can tell, this is the C instruction that is triggering the seg fault. (gc.c, around line 4064)
static void
mark_locations_array(rb_objspace_t *objspace, register const VALUE *x, register long n)
{
VALUE v;
while (n--) {
v = *x; // <----- Seems to be crashing here?
gc_mark_maybe(objspace, v);
x++;
}
}
Progress!
The following much smaller code snippet now reliably triggers the segfault on my machine (mid 2015 model Macbook Pro on OSX 10.13.3).
gem "concurrent-ruby" # gem version 1.0.5 installed
require "concurrent"
array = [nil]
enum = array.to_enum
thread_pool = Concurrent::FixedThreadPool.new 1
Concurrent::Future.new(executor: thread_pool) do
enum.peek
end.execute
thread_pool.shutdown
GC.start # Does not segfault. Just to prove that the next line sets up the correct conditions
thread_pool.wait_for_termination 60
GC.start # Segfault here.
The following is a C level backtrace on a Ruby 2.5.0 compiled with -O0:
-- C level backtrace information -------------------------------------------
0 libruby.2.5.dylib 0x0000000101e31e19 rb_print_backtrace + 25
1 libruby.2.5.dylib 0x0000000101e31f28 rb_vm_bugreport + 136
2 libruby.2.5.dylib 0x0000000101c246f2 rb_bug_context + 450
3 libruby.2.5.dylib 0x0000000101d764ee sigsegv + 94
4 libsystem_platform.dylib 0x00007fff6a779f5a _sigtramp + 26
5 libruby.2.5.dylib 0x0000000101c545a1 mark_locations_array + 49
6 libruby.2.5.dylib 0x0000000101c455bb gc_mark_locations + 75
7 libruby.2.5.dylib 0x0000000101c457d9 mark_stack_locations + 41
8 libruby.2.5.dylib 0x0000000101c4579f rb_gc_mark_machine_stack + 79
9 libruby.2.5.dylib 0x0000000101e13868 rb_execution_context_mark + 264
10 libruby.2.5.dylib 0x0000000101bfd63e cont_mark + 46
11 libruby.2.5.dylib 0x0000000101bfd572 fiber_mark + 146
12 libruby.2.5.dylib 0x0000000101c4a4c6 gc_mark_children + 1094
13 libruby.2.5.dylib 0x0000000101c5234c gc_mark_stacked_objects + 108
14 libruby.2.5.dylib 0x0000000101c52a5b gc_mark_stacked_objects_all + 27
15 libruby.2.5.dylib 0x0000000101c51cb1 gc_marks_rest + 129
16 libruby.2.5.dylib 0x0000000101c53787 gc_marks + 103
17 libruby.2.5.dylib 0x0000000101c502e2 gc_start + 802
18 libruby.2.5.dylib 0x0000000101c47a18 garbage_collect + 56
19 libruby.2.5.dylib 0x0000000101c4cf7d gc_start_internal + 493
20 libruby.2.5.dylib 0x0000000101e1cf2a call_cfunc_m1 + 42
21 libruby.2.5.dylib 0x0000000101e1bd1d vm_call_cfunc_with_frame + 605
22 libruby.2.5.dylib 0x0000000101e1741d vm_call_cfunc + 173
23 libruby.2.5.dylib 0x0000000101e168fe vm_call_method_each_type + 190
24 libruby.2.5.dylib 0x0000000101e16690 vm_call_method + 160
25 libruby.2.5.dylib 0x0000000101e165e5 vm_call_general + 53
26 libruby.2.5.dylib 0x0000000101e0284e vm_exec_core + 8974
27 libruby.2.5.dylib 0x0000000101e11fe6 vm_exec + 182
28 libruby.2.5.dylib 0x0000000101e12d5b rb_iseq_eval_main + 43
29 libruby.2.5.dylib 0x0000000101c2f208 ruby_exec_internal + 232
30 libruby.2.5.dylib 0x0000000101c2f111 ruby_exec_node + 33
31 libruby.2.5.dylib 0x0000000101c2f0d0 ruby_run_node + 64
32 ruby 0x0000000101b8bf2f main + 95
Note that I've been able to remove all google-cloud code from the repro case.
Update: concurrent-ruby is not the culprit. The following code triggers the segfault:
enumerator = Enumerator.new { |y| y << 1 }
thread = Thread.new do
enumerator.peek
end
thread.join
GC.start # <- Segfault here
This is now a Ruby-core bug. I'll report it upstream.
Waiting for action from Ruby-lang, should we keep this issue open until then?
I still think we should keep it open, but maybe mark it as on hold or something. We should still:
We don't have a label for "on hold", so I'll mark this issue as "blocked".
I've also seen this using Ruby 2.5.1.
@dazuma @blowmage Can we close this issue since we are unable to resolve it in google-cloud-ruby? Because it is a p1 bug older than 42 days, it is out of SLO.
Last time we discussed this we were told to keep it open, even if it was outside of SLO. But perhaps @JustinBeckwith would prefer we treat it differently now?
Update: There's finally a patch in ruby-core: https://bugs.ruby-lang.org/issues/15362 which they expect to backport to the 2.5 branch. I haven't verified it yet for our case, but assuming the fix is real, we should be okay on Ruby >= 2.5.4
Fixed in Ruby 2.5.4.