Scylla: Core dump after decommission (seems to be hinted handoff related)

Created on 8 Oct 2018 · 22 Comments · Source: scylladb/scylla

Installation details
Scylla version (or git commit hash): 666.development-0.20181007.b839f551c
Cluster size: 5 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0b705281b4cd52622

Ran a longevity test using user profiles that create materialized views and secondary indexes.
During I/O, node4 was decommissioned.
About 1.5 minutes after the decommission finished, a core dump was created and a segmentation fault was found in journalctl on all nodes:

0x00000000006a79d2
0x00000000005bec5c
0x00000000005bef05
0x00000000005bef53
/lib64/libpthread.so.0+0x000000000000f6cf
0x0000000002d7f4bf
0x0000000002b14eda
0x0000000002b8a20e
0x00000000023d03dd
0x00000000006b4ff7
0x000000000059b874
0x000000000066e9ee
0x00000000006734ba
0x0000000000756b6d
/lib64/libpthread.so.0+0x0000000000007e24
/lib64/libc.so.6+0x00000000000febac

Decoded

void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /usr/src/debug/scylla-666.development/seastar/util/backtrace.hh:56
seastar::backtrace_buffer::append_backtrace() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:410
 (inlined by) print_with_backtrace at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:431
seastar::print_with_backtrace(char const*) at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:438
sigsegv_action at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:4022
 (inlined by) operator() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:4008
 (inlined by) _FUN at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:4004
?? ??:0
gms::gossiper::unregister_(seastar::shared_ptr<gms::i_endpoint_state_change_subscriber>) at /opt/scylladb/include/c++/7/bits/list.tcc:330
db::hints::manager::can_hint_for(gms::inet_address) const at /usr/src/debug/scylla-666.development/db/hints/manager.cc:484
db::hints::manager::store_hint(gms::inet_address, seastar::lw_shared_ptr<schema const>, seastar::lw_shared_ptr<frozen_mutation const>, tracing::trace_state_ptr) at /usr/src/debug/scylla-666.development/db/hints/manager.cc:280 (discriminator 2)
unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}::operator()(gms::inet_address) at /usr/src/debug/scylla-666.development/service/storage_proxy.cc:1810
 (inlined by) bool __gnu_cxx::__ops::_Iter_pred<unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>::operator()<std::__detail::_Node_const_iterator<gms::inet_address, true, true> >(std::__detail::_Node_const_iterator<gms::inet_address, true, true>) at /opt/scylladb/include/c++/7/bits/predefined_ops.h:283
 (inlined by) std::iterator_traits<std::__detail::_Node_const_iterator<gms::inet_address, true, true> >::difference_type std::__count_if<std::__detail::_Node_const_iterator<gms::inet_address, true, true>, __gnu_cxx::__ops::_Iter_pred<unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}> >(std::__detail::_Node_const_iterator<gms::inet_address, true, true>, std::__detail::_Node_const_iterator<gms::inet_address, true, true>, __gnu_cxx::__ops::_Iter_pred<unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>) at /opt/scylladb/include/c++/7/bits/stl_algo.h:3194
 (inlined by) std::iterator_traits<std::__detail::_Node_const_iterator<gms::inet_address, true, true> >::difference_type std::count_if<std::__detail::_Node_const_iterator<gms::inet_address, true, true>, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>(std::__detail::_Node_const_iterator<gms::inet_address, true, true>, std::__detail::_Node_const_iterator<gms::inet_address, true, true>, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}) at /opt/scylladb/include/c++/7/bits/stl_algo.h:4108
 (inlined by) boost::range_difference<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const>::type boost::range::count_if<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> >, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>(std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}) at /opt/scylladb/include/boost/range/algorithm/count_if.hpp:44
 (inlined by) unsigned long service::storage_proxy::hint_to_dead_endpoints<std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr) at /usr/src/debug/scylla-666.development/service/storage_proxy.cc:1809
 (inlined by) operator() at /usr/src/debug/scylla-666.development/service/storage_proxy.cc:467
 (inlined by) _M_invoke at /opt/scylladb/include/c++/7/bits/std_function.h:316
seastar::noncopyable_function<void ()>::operator()() const at /usr/src/debug/scylla-666.development/seastar/util/noncopyable_function.hh:145
 (inlined by) complete_timers<seastar::timer_set<seastar::timer<seastar::lowres_clock>, &seastar::timer<seastar::lowres_clock>::_link>, boost::intrusive::list<seastar::timer<seastar::lowres_clock>, boost::intrusive::member_hook<seastar::timer<seastar::lowres_clock>, boost::intrusive::list_member_hook<>, &seastar::timer<seastar::lowres_clock>::_link>, void, void, void>, seastar::reactor::do_expire_lowres_timers()::<lambda()> > at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:769
 (inlined by) seastar::reactor::do_expire_lowres_timers() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:2746
 (inlined by) seastar::reactor::lowres_timer_pollfn::poll() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:2912
seastar::reactor::poll_once() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:3366
 (inlined by) operator() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:3261
 (inlined by) _M_invoke at /opt/scylladb/include/c++/7/bits/std_function.h:302
std::function<bool ()>::operator()() const at /opt/scylladb/include/c++/7/bits/std_function.h:706
 (inlined by) seastar::reactor::run() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:3286
seastar::smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()() const at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:4336
std::function<void ()>::operator()() const at /opt/scylladb/include/c++/7/bits/std_function.h:706
 (inlined by) seastar::posix_thread::start_routine(void*) at /usr/src/debug/scylla-666.development/seastar/core/posix.cc:52
bug showstopper

All 22 comments

@juliayakovlev Just to clarify, the core has been created on node4 - the node that has been decommissioned, right?

No, the core was created on another node, one that was not decommissioned, after the decommissioning had finished.

Please note that Tomek is fixing a number of bugs related to gossip state on shards other than shard 0; not sure it's related.

@slivne @tgrabiec Looking at the backtrace, it seems that HH is in it by accident. According to the backtrace, the SIGSEGV comes from gms::gossiper::unregister_(...), which is called by GossipingPropertyFileSnitch.

I was unable to reproduce the issue by simply running c-s and decommissioning one of the 5 nodes under load.
I suspect that this issue is not related to HH.
I'd advise trying to reproduce it and, if it's reproducible, trying to reproduce it with HH disabled.

@lauranovich
In general, I'd like to ask for more details in the issue description:

  • What is the exact cluster configuration:

    • Snitch type.

    • Any other non-default parameters.

  • How exactly the test runs and how one can reproduce it "at home".

Looking at the HH and GossipingPropertyFileSnitch code around the calls present in the backtrace above, I couldn't find anything suspicious.
The crash may be caused by memory corruption that took place earlier.
I'd also suggest running this test with a scylla binary compiled in debug mode.

Since I can't reproduce the issue and haven't found anything obvious, I don't know how to proceed further here.

I think the gossiper is there by accident, and that the fault is indeed somewhere in HH. That said, the backtrace is weird:

gms::gossiper::unregister_(seastar::shared_ptr<gms::i_endpoint_state_change_subscriber>) at /opt/scylladb/include/c++/7/bits/list.tcc:330

Since when is gms::gossiper::unregister_ at /opt/scylladb/include/c++/7/bits/list.tcc:330?

@duarten The rest of the backtrace suggests that the gossiper is not there by accident, since gossiper::unregister_() is called by GossipingPropertyFileSnitch from a timer callback.

However, I think the whole backtrace is irrelevant because the crash is just a side effect of earlier memory corruption.

Running scylla compiled in debug mode produced the following report (on at least 2 nodes):

Oct 11 10:08:02 ip-172-30-0-213 scylla: =================================================================
Oct 11 10:08:02 ip-172-30-0-213 scylla: ==3543==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6cb49d3dc4 at pc 0x00000798999a bp 0x7f6cb49d3690 sp 0x7f6cb49d3680
Oct 11 10:08:02 ip-172-30-0-213 scylla: READ of size 4 at 0x7f6cb49d3dc4 thread T1 (reactor-1)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #0 0x7989999  (/usr/bin/scylla+0x7989999)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #1 0x68012a2  (/usr/bin/scylla+0x68012a2)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #2 0x67f88b9  (/usr/bin/scylla+0x67f88b9)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #3 0x5f54115  (/usr/bin/scylla+0x5f54115)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #4 0x5f5dc6b  (/usr/bin/scylla+0x5f5dc6b)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #5 0x5f6baa8  (/usr/bin/scylla+0x5f6baa8)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #6 0x5f6a607  (/usr/bin/scylla+0x5f6a607)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #7 0x9d35801  (/usr/bin/scylla+0x9d35801)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #8 0x9d3a71c  (/usr/bin/scylla+0x9d3a71c)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #9 0x52f908e  (/usr/bin/scylla+0x52f908e)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #10 0x54a55f0  (/usr/bin/scylla+0x54a55f0)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #11 0x5496829  (/usr/bin/scylla+0x5496829)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #12 0x54810e8  (/usr/bin/scylla+0x54810e8)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #13 0x5459060  (/usr/bin/scylla+0x5459060)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #14 0x5459a17  (/usr/bin/scylla+0x5459a17)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #15 0x543227f  (/usr/bin/scylla+0x543227f)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #16 0x54818ef  (/usr/bin/scylla+0x54818ef)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #17 0x5459a17  (/usr/bin/scylla+0x5459a17)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #18 0x54c0131  (/usr/bin/scylla+0x54c0131)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #19 0x54c0276  (/usr/bin/scylla+0x54c0276)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #20 0x54c0459  (/usr/bin/scylla+0x54c0459)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #21 0x54bacac  (/usr/bin/scylla+0x54bacac)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #22 0xa6ae05  (/usr/bin/scylla+0xa6ae05)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #23 0xa72f66  (/usr/bin/scylla+0xa72f66)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #24 0x4814a2  (/usr/bin/scylla+0x4814a2)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #25 0x485b4b  (/usr/bin/scylla+0x485b4b)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #26 0x48c8fb  (/usr/bin/scylla+0x48c8fb)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #27 0x49e080  (/usr/bin/scylla+0x49e080)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #28 0x59a75a  (/usr/bin/scylla+0x59a75a)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #29 0x791d35  (/usr/bin/scylla+0x791d35)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #30 0xb55dda  (/usr/bin/scylla+0xb55dda)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #31 0x7f6cc0d25e24 in start_thread (/lib64/libpthread.so.0+0x7e24)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #32 0x7f6cc0a4fbac in __clone (/lib64/libc.so.6+0xfebac)
Oct 11 10:08:02 ip-172-30-0-213 scylla: Address 0x7f6cb49d3dc4 is located in stack of thread T1 (reactor-1) at offset 1716 in frame
Oct 11 10:08:02 ip-172-30-0-213 scylla: #0 0x7987584  (/usr/bin/scylla+0x7987584)
Oct 11 10:08:02 ip-172-30-0-213 scylla: This frame has 30 object(s):
Oct 11 10:08:02 ip-172-30-0-213 scylla: [32, 33) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [96, 97) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [160, 164) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [224, 228) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [288, 292) 'mixed_count'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [352, 356) 'mixed_surplus'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [416, 420) 'my_surplus'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [480, 484) 'i'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [544, 548) 'count'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [608, 612) 'diff'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [672, 676) 'i'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [736, 740) 'diff'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [800, 804) 'i'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [864, 868) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [928, 932) 'i'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [992, 996) 'mix_i'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1056, 1060) 'last_deficit'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1120, 1124) 'diff'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1184, 1188) 'i'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1248, 1252) 'j'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1312, 1320) 'it'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1376, 1384) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1440, 1448) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1504, 1512) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1568, 1576) '__for_begin'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1632, 1640) '__for_end'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1696, 1712) 'sorted_deficits' <== Memory access at offset 1716 overflows this variable
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1760, 1784) 'deficit'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1824, 1856) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: [1888, 1920) '<unknown>'
Oct 11 10:08:02 ip-172-30-0-213 scylla: HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
Oct 11 10:08:02 ip-172-30-0-213 scylla: (longjmp and C++ exceptions *are* supported)
Oct 11 10:08:02 ip-172-30-0-213 scylla: Thread T1 (reactor-1) created by T0 here:
Oct 11 10:08:02 ip-172-30-0-213 scylla: #0 0x7f6cc51f609f in pthread_create (/opt/scylladb/lib64/libasan.so.4+0x3809f)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #1 0xb56594  (/usr/bin/scylla+0xb56594)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #2 0xb55f8a  (/usr/bin/scylla+0xb55f8a)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #3 0x8a8a03  (/usr/bin/scylla+0x8a8a03)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #4 0x839113  (/usr/bin/scylla+0x839113)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #5 0x83937b  (/usr/bin/scylla+0x83937b)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #6 0x7c3162  (/usr/bin/scylla+0x7c3162)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #7 0x49c058  (/usr/bin/scylla+0x49c058)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #8 0x4a165b  (/usr/bin/scylla+0x4a165b)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #9 0xc804cf  (/usr/bin/scylla+0xc804cf)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #10 0x1b168f8  (/usr/bin/scylla+0x1b168f8)
Oct 11 10:08:02 ip-172-30-0-213 scylla: #11 0x7f6cc0973444 in __libc_start_main (/lib64/libc.so.6+0x22444)
Oct 11 10:08:02 ip-172-30-0-213 scylla: SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/bin/scylla+0x7989999)
Oct 11 10:08:02 ip-172-30-0-213 scylla: Shadow bytes around the buggy address:
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee16932760: f2 f2 f2 f2 f2 f2 04 f2 f2 f2 f2 f2 f2 f2 04 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee16932770: f2 f2 f2 f2 f2 f2 04 f2 f2 f2 f2 f2 f2 f2 04 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee16932780: f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2 f2 f2 00 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee16932790: f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2 f2 f2 00 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee169327a0: f2 f2 f2 f2 f2 f2 f8 f2 f2 f2 f2 f2 f2 f2 f8 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: =>0x0fee169327b0: f2 f2 f2 f2 f2 f2 00 00[f2]f2 f2 f2 f2 f2 00 00
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee169327c0: 00 f2 f2 f2 f2 f2 f8 f8 f8 f8 f2 f2 f2 f2 f8 f8
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee169327d0: f8 f8 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee169327e0: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 01 f2 f2 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee169327f0: f2 f2 f2 f2 f8 f2 f2 f2 f2 f2 f2 f2 f8 f2 f2 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: 0x0fee16932800: f2 f2 f2 f2 00 00 00 f2 f2 f2 f2 f2 00 00 00 f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: Shadow byte legend (one shadow byte represents 8 application bytes):
Oct 11 10:08:02 ip-172-30-0-213 scylla: Addressable:           00
Oct 11 10:08:02 ip-172-30-0-213 scylla: Partially addressable: 01 02 03 04 05 06 07
Oct 11 10:08:02 ip-172-30-0-213 scylla: Heap left redzone:       fa
Oct 11 10:08:02 ip-172-30-0-213 scylla: Freed heap region:       fd
Oct 11 10:08:02 ip-172-30-0-213 scylla: Stack left redzone:      f1
Oct 11 10:08:02 ip-172-30-0-213 scylla: Stack mid redzone:       f2
Oct 11 10:08:02 ip-172-30-0-213 scylla: Stack right redzone:     f3
Oct 11 10:08:02 ip-172-30-0-213 scylla: Stack after return:      f5
Oct 11 10:08:02 ip-172-30-0-213 scylla: Stack use after scope:   f8
Oct 11 10:08:02 ip-172-30-0-213 scylla: Global redzone:          f9
Oct 11 10:08:02 ip-172-30-0-213 scylla: Global init order:       f6
Oct 11 10:08:02 ip-172-30-0-213 scylla: Poisoned by user:        f7
Oct 11 10:08:02 ip-172-30-0-213 scylla: Container overflow:      fc
Oct 11 10:08:02 ip-172-30-0-213 scylla: Array cookie:            ac
Oct 11 10:08:02 ip-172-30-0-213 scylla: Intra object redzone:    bb
Oct 11 10:08:02 ip-172-30-0-213 scylla: ASan internal:           fe
Oct 11 10:08:02 ip-172-30-0-213 scylla: Left alloca redzone:     ca
Oct 11 10:08:02 ip-172-30-0-213 scylla: Right alloca redzone:    cb
Oct 11 10:08:02 ip-172-30-0-213 scylla: ==3543==ABORTING
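
For reference, the shadow dump in the report can be tied back to the faulting address with a little arithmetic. The following is a minimal sketch, assuming this build used the default AddressSanitizer shadow mapping for x86-64 Linux (shadow = (addr >> 3) + 0x7fff8000); the helper and constant names are hypothetical and only serve the illustration:

```cpp
#include <cassert>
#include <cstdint>

// Assumption: default AddressSanitizer shadow mapping on x86-64 Linux,
// where one shadow byte describes 8 application bytes.
constexpr std::uint64_t kShadowScale  = 3;
constexpr std::uint64_t kShadowOffset = 0x7fff8000ULL;

// Hypothetical helper: shadow byte address for an application address.
constexpr std::uint64_t shadow_address(std::uint64_t addr) {
    return (addr >> kShadowScale) + kShadowOffset;
}

// Faulting address taken from the report above.
constexpr std::uint64_t kFaultAddr = 0x7f6cb49d3dc4ULL;

// The "=>" row of the shadow dump starts at 0x0fee169327b0 and the
// bracketed byte sits at index 8 of that row.
static_assert(shadow_address(kFaultAddr) == 0x0fee169327b0ULL + 8,
              "faulting address maps to the bracketed shadow byte");

// Frame offsets from the report: 'sorted_deficits' occupies [1696, 1712)
// and the read happened at offset 1716.
constexpr int kBytesPastEnd = 1716 - 1712;
static_assert(kBytesPastEnd == 4, "read starts 4 bytes past the object");
```

Under that assumption the bracketed shadow byte is f2 (stack mid redzone), which agrees with the report's claim that the 4-byte read at frame offset 1716 starts 4 bytes past the end of sorted_deficits. The report's own HINT line still applies: this may be a false positive if a custom stack mechanism is in play.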

Decoding the first back trace (T1):

redistribute(std::vector<float, std::allocator<float> > const&, unsigned int, unsigned int) at /usr/src/debug/scylla-666.development/db/heat_load_balance.cc:359
std::vector<gms::inet_address, std::allocator<gms::inet_address> > miss_equalizing_combination<gms::inet_address>(std::vector<std::pair<gms::inet_address, float>, std::allocator<std::pair<gms::inet_address, float> > > const&, unsigned int, int, bool) at /usr/src/debug/scylla-666.development/db/heat_load_balance.hh:14
db::filter_for_query(db::consistency_level, keyspace&, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::read_repair_decision, gms::inet_address*, table*) at /usr/src/debug/scylla-666.development/db/consistency_level.cc:2
service::storage_proxy::get_read_executor(seastar::lw_shared_ptr<query::read_command>, seastar::lw_shared_ptr<schema const>, nonwrapping_range<dht::ring_position>, db::consistency_level, db::read_repair_decision, tracing::trace_state_ptr, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&) at /
service::storage_proxy::query_singular(seastar::lw_shared_ptr<query::read_command>, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > >&&, db::consistency_level, service::storage_proxy::coordinator_query_options) at /usr/src/debug/scylla-666.development/service/
service::storage_proxy::do_query(seastar::lw_shared_ptr<schema const>, seastar::lw_shared_ptr<query::read_command>, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > >&&, db::consistency_level, service::storage_proxy::coordinator_query_options) at /usr/src/debug
service::storage_proxy::query(seastar::lw_shared_ptr<schema const>, seastar::lw_shared_ptr<query::read_command>, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > >&&, db::consistency_level, service::storage_proxy::coordinator_query_options) at /usr/src/debug/sc
service::pager::query_pager::do_fetch_page(unsigned int, std::chrono::time_point<gc_clock, std::chrono::duration<int, std::ratio<1l, 1l> > >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at /usr/src/debug/scylla-666.development/service/pager/query_pagers.cc:226
service::pager::query_pager::fetch_page_generator(unsigned int, std::chrono::time_point<gc_clock, std::chrono::duration<int, std::ratio<1l, 1l> > >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, cql3::cql_stats&) at /usr/src/debug/scylla-666.development/service/
cql3::statements::select_statement::do_execute(service::storage_proxy&, service::query_state&, cql3::query_options const&) at /usr/src/debug/scylla-666.development/cql3/statements/select_statement.cc:343
seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > std::__invoke_impl<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement::* const&)(service::stora
std::__invoke_result<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement::* const&)(service::storage_proxy&, service::query_state&, cql3::query_options const&), cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::que
decltype (__invoke((*this)._M_pmf, (forward<cql3::statements::select_statement*>)({parm#1}), (forward<service::storage_proxy&>)({parm#1}), (forward<service::query_state&>)({parm#1}), (forward<cql3::query_options const&>)({parm#1}))) std::_Mem_fn_base<seastar::future<seastar::shared_ptr<cql_transport::messages::result
seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::direct_vtable_for<std::_Mem_fn<seastar::future<seastar::shared_ptr<cql_transport::messages::r
seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::operator()(cql3::statements::select_statement*, service::storage_proxy&, service::query_state
seastar::inheriting_concrete_execution_stage<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>::make_stage_for_group(seastar::scheduling_group)::{lambda(cql3::statements::sele
seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::direct_vtable_for<seastar::inheriting_concrete_execution_stage<seastar::future<seastar::share
seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::operator()(cql3::statements::select_statement*, service::storage_proxy&, service::query_state
seastar::apply_helper<seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, std::tuple<cql3::statements::select_statement*, service::storage_proxy
auto seastar::apply<seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, cql3::statements::select_statement*, service::storage_proxy&, service::q
seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > seastar::futurize<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > >::apply<seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::selec
seastar::concrete_execution_stage<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>::do_flush() at /usr/src/debug/scylla-666.development/seastar/core/execution_stage.hh:242
operator() at /usr/src/debug/scylla-666.development/seastar/core/execution_stage.cc:140
run_and_dispose at /usr/src/debug/scylla-666.development/seastar/core/task.hh:48
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:2697
seastar::reactor::run_some_tasks() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:3120 (discriminator 2)
seastar::reactor::run() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:3267
operator() at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:4336 (discriminator 3)
_M_invoke at /opt/scylladb/include/c++/7/bits/std_function.h:316

Decoding the second backtrace (where T1 was created by T0):

seastar::posix_thread::posix_thread(seastar::posix_thread::attr, std::function<void ()>) at /usr/src/debug/scylla-666.development/seastar/core/posix.cc:83
seastar::posix_thread::posix_thread(std::function<void ()>) at /usr/src/debug/scylla-666.development/seastar/core/posix.cc:57
void __gnu_cxx::new_allocator<seastar::posix_thread>::construct<seastar::posix_thread, std::function<void ()> >(seastar::posix_thread*, std::function<void ()>&&) at /opt/scylladb/include/c++/7/ext/new_allocator.h:136
void std::allocator_traits<std::allocator<seastar::posix_thread> >::construct<seastar::posix_thread, std::function<void ()> >(std::allocator<seastar::posix_thread>&, seastar::posix_thread*, std::function<void ()>&&) at /opt/scylladb/include/c++/7/bits/alloc_traits.h:475
void std::vector<seastar::posix_thread, std::allocator<seastar::posix_thread> >::_M_realloc_insert<std::function<void ()> >(__gnu_cxx::__normal_iterator<seastar::posix_thread*, std::vector<seastar::posix_thread, std::allocator<seastar::posix_thread> > >, std::function<void ()>&&) at /opt/scylladb/include/c++/7/bits/vector.tcc:415
seastar::posix_thread& std::vector<seastar::posix_thread, std::allocator<seastar::posix_thread> >::emplace_back<std::function<void ()> >(std::function<void ()>&&) at /opt/scylladb/include/c++/7/bits/vector.tcc:105
seastar::smp::create_thread(std::function<void ()>) at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:3992
seastar::smp::configure(boost::program_options::variables_map) at /usr/src/debug/scylla-666.development/seastar/core/reactor.cc:4312 (discriminator 6)
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /usr/src/debug/scylla-666.development/seastar/core/app-template.cc:163 (discriminator 1)

@nyh Could you please take a look at the code?
@slivne Is there a simple way to disable heat_load_balance? I can't find anything obvious in config.hh. I'd like to disable it and rerun to verify whether it's related.

@nyh For instance, the line it complains about is (heat_load_balance.cc, line 359):

auto last_deficit = sorted_deficits.back().second;

I don't see any check ensuring that sorted_deficits is non-empty before this call, and calling back() on an empty container is undefined behavior.
We need somebody who knows this code better to take a look.
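
To illustrate the kind of guard that seems to be missing, here is a minimal sketch. It is not the actual heat_load_balance.cc code: the element type of sorted_deficits, the helper name last_deficit_or and the fallback value are all assumptions made for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical element type: (node index, deficit). The real type used in
// heat_load_balance.cc may differ.
using deficit_entry = std::pair<unsigned, float>;

// Returns the deficit of the last entry, or `fallback` when the vector is
// empty. Calling back() on an empty std::vector is undefined behavior, so
// the emptiness check has to come first.
float last_deficit_or(const std::vector<deficit_entry>& sorted_deficits, float fallback) {
    if (sorted_deficits.empty()) {
        return fallback;
    }
    return sorted_deficits.back().second;
}
```

Whether returning a fallback is the right reaction, or whether an empty sorted_deficits indicates a bug earlier in the algorithm, is something only someone familiar with the code can decide.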

I think I was able to reproduce this with a dtest

Scylla version (or git commit hash): 98332de26897b6062caa16e1753fa56611453e74

dtest materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test

 gdb ../scylla/build/release/scylla core.5205 
#0  std::_Hashtable<gms::inet_address, std::pair<gms::inet_address const, std::chrono::time_point<seastar::lowres_system_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > > >, std::allocator<std::pair<gms::inet_address const, std::chrono::time_point<seastar::lowres_system_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > > > >, std::__detail::_Select1st, std::equal_to<gms::inet_address>, std::hash<gms::inet_address>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find (
    __k=<synthetic pointer>..., this=0x2b8) at /home/shlomi/scylla/seastar/net/ip.hh:118
#1  std::unordered_map<gms::inet_address, std::chrono::time_point<seastar::lowres_system_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, std::chrono::time_point<seastar::lowres_system_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > > > > >::find (__x=<synthetic pointer>..., 
    this=0x2b8) at /usr/include/c++/8/bits/unordered_map.h:924
#2  gms::gossiper::get_endpoint_downtime (this=0x0, ep=...) at gms/gossiper.cc:816
#3  0x000000000297feaa in db::hints::manager::can_hint_for(gms::inet_address) const () at ./db/hints/manager.hh:598
#4  0x00000000029e00ef in db::hints::manager::store_hint(gms::inet_address, seastar::lw_shared_ptr<schema const>, seastar::lw_shared_ptr<frozen_mutation const>, tracing::trace_state_ptr) ()
    at db/hints/manager.cc:280
#5  0x00000000022c01aa in unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}::operator()(gms::inet_address) (target=..., this=<optimized out>) at service/storage_proxy.cc:138
#6  __gnu_cxx::__ops::_Iter_pred<unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>::operator()<__gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > > >(__gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >) (__it=..., this=0x7ffd57d7c690) at /usr/include/c++/8/bits/predefined_ops.h:283
#7  std::__count_if<__gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, __gnu_cxx::__ops::_Iter_pred<unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}> >(__gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, __gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, __gnu_cxx::__ops::_Iter_pred<unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>) (__pred=..., __last={_addr = {ip = {raw = 24576}}}, __first=
      {_addr = {ip = {raw = 2130706435}}}) at /usr/include/c++/8/bits/stl_algo.h:3194
#8  std::count_if<__gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>(__gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, __gnu_cxx::__normal_iterator<gms::inet_address const*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}) (__pred=..., __last=..., __first=...) at /usr/include/c++/8/bits/stl_algo.h:4105
#9  boost::range::count_if<std::vector<gms::inet_address, std::allocator<gms::inet_address> >, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}>(std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, unsigned long service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > >(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >&, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, db::write_type, tracing::trace_state_ptr)::{lambda(gms::inet_address)#1}) (pred=..., rng=std::vector of length 1, capacity 1 = {...}) at /usr/include/boost/range/algorithm/count_if.hpp:44
#10 service::storage_proxy::hint_to_dead_endpoints<std::vector<gms::inet_address, std::allocator<gms::inet_address> > > (tr_state=..., type=<optimized out>, targets=std::vector of length 1, capacity 1 = {...}, 
    mh=std::unique_ptr<service::mutation_holder> = {...}, this=<optimized out>) at service/storage_proxy.cc:1884
#11 service::storage_proxy::hint_to_dead_endpoints(unsigned long, db::consistency_level) () at service/storage_proxy.cc:1306
#12 0x000000000234ac23 in service::storage_proxy::<lambda(service::storage_proxy::unique_response_handler&)>::operator() (protected_response=..., __closure=0x7ffd57d7ca80) at service/storage_proxy.cc:1342
#13 seastar::futurize<seastar::future<> >::apply<service::storage_proxy::mutate_begin(std::vector<service::storage_proxy::unique_response_handler>, db::consistency_level, std::experimental::fundamentals_v1::optional<std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > > >)::<lambda(service::storage_proxy::unique_response_handler&)>, service::storage_proxy::unique_response_handler&> (func=...) at ./seastar/core/future.hh:1399
#14 seastar::futurize_apply<service::storage_proxy::mutate_begin(std::vector<service::storage_proxy::unique_response_handler>, db::consistency_level, std::experimental::fundamentals_v1::optional<std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > > >)::<lambda(service::storage_proxy::unique_response_handler&)>, service::storage_proxy::unique_response_handler&> (
    func=...) at ./seastar/core/future.hh:1471
#15 seastar::future<> seastar::parallel_for_each<__gnu_cxx::__normal_iterator<service::storage_proxy::unique_response_handler*, std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> > >, service::storage_proxy::mutate_begin(std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> >, db::consistency_level, std::experimental::fundamentals_v1::optional<std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > > >)::{lambda(service::storage_proxy::unique_response_handler&)#1}>(__gnu_cxx::__normal_iterator<service::storage_proxy::unique_response_handler*, std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> > >, seastar::future<>, service::storage_proxy::mutate_begin(std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> >, db::consistency_level, std::experimental::fundamentals_v1::optional<std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > > >)::{lambda(service::storage_proxy::unique_response_handler&)#1}&&) () at ./seastar/core/future-util.hh:129
#16 0x0000000002398eaf in seastar::parallel_for_each<std::vector<service::storage_proxy::unique_response_handler>&, service::storage_proxy::mutate_begin(std::vector<service::storage_proxy::unique_response_handler>, db::consistency_level, std::experimental::fundamentals_v1::optional<std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > > >)::<lambda(service::storage_proxy::unique_response_handler&)> > (func=..., range=...) at service/storage_proxy.cc:1704
#17 service::storage_proxy::mutate_begin (timeout_opt=..., cl=db::consistency_level::ANY, ids=..., this=0x60000104d400) at service/storage_proxy.cc:1349
#18 service::storage_proxy::<lambda(std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> >)>::operator() (__closure=<optimized out>, 
    __closure=<optimized out>, ids=...) at service/storage_proxy.cc:1704
#19 seastar::apply_helper<service::storage_proxy::send_to_endpoint(std::unique_ptr<service::mutation_holder>, gms::inet_address, std::vector<gms::inet_address>, db::write_type, service::storage_proxy::write_stats&)::<lambda(std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> >)>, std::tuple<std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> > >&&, std::integer_sequence<long unsigned int, 0> >::apply (args=..., func=...) at ./seastar/core/apply.hh:35
#20 seastar::apply<service::storage_proxy::send_to_endpoint(std::unique_ptr<service::mutation_holder>, gms::inet_address, std::vector<gms::inet_address>, db::write_type, service::storage_proxy::write_stats&)::<lambda(std::vector<service::storage_proxy::unique_response_handler>)>, std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> > > (args=..., 
    func=...) at ./seastar/core/apply.hh:43
#21 seastar::futurize<seastar::future<> >::apply<service::storage_proxy::send_to_endpoint(std::unique_ptr<service::mutation_holder>, gms::inet_address, std::vector<gms::inet_address>, db::write_type, service::storage_proxy::write_stats&)::<lambda(std::vector<service::storage_proxy::unique_response_handler>)>, std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> > > (args=..., func=...) at ./seastar/core/future.hh:1389
#22 seastar::future<std::vector<service::storage_proxy::unique_response_handler, std::allocator<service::storage_proxy::unique_response_handler> > >::then<service::storage_proxy::send_to_endpoint(std::unique_ptr<service::mutation_holder>, gms::inet_address, std::vector<gms::inet_address>, db::write_type, service::storage_proxy::write_stats&)::<lambda(std::vector<service::storage_proxy::unique_response_handler>)> > (
    func=..., this=<optimized out>) at ./seastar/core/future.hh:952
#23 service::storage_proxy::send_to_endpoint(std::unique_ptr<service::mutation_holder, std::default_delete<service::mutation_holder> >, gms::inet_address, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, db::write_type, service::storage_proxy_stats::write_stats&) () at service/storage_proxy.cc:1703
#24 0x0000000002399e94 in service::storage_proxy::send_to_endpoint(mutation, gms::inet_address, std::vector<gms::inet_address, std::allocator<gms::inet_address> >, db::write_type, service::storage_proxy_stats::write_stats&) () at /usr/include/c++/8/new:169
#25 0x0000000002aecc51 in db::view::mutate_MV(dht::token const&, std::vector<mutation, std::allocator<mutation> >, db::view::stats&) () at ./seastar/core/reactor.hh:1203
#26 0x000000000117fa7c in table::<lambda(auto:216&&)>::<lambda(auto:217)>::operator()<seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> > (units=..., 
    this=<optimized out>) at ./seastar/core/future.hh:1341
#27 seastar::apply_helper<table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)> mutable [with auto:216 = std::vector<mutation>]::<lambda(auto:217)>, std::tuple<seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> >&&, std::integer_sequence<long unsigned int, 0> >::apply (args=..., func=...) at ./seastar/core/apply.hh:35
#28 seastar::apply<table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)> mutable [with auto:216 = std::vector<mutation>]::<lambda(auto:217)>, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> > (args=..., func=...) at ./seastar/core/apply.hh:43
#29 seastar::future<> seastar::do_void_futurize_helper<void>::apply_tuple<table::generate_and_propagate_view_updates(seastar::lw_shared_ptr<schema const> const&, std::vector<view_ptr, std::allocator<view_ptr> >&&, mutation&&, seastar::optimized_optional<flat_mutation_reader>, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) const::auto {lambda(auto:1&&)#1}::operator()<std::vector<mutation, std::allocator<mutation> > >(std::vector<mutation, std::allocator<mutation> >&&)::{lambda(auto:1)#1}, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> >(std::vector<mutation, std::allocator<mutation> >&&, std::tuple<seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> >&&) () at ./seastar/core/future.hh:1343
#30 0x0000000001237efb in seastar::futurize<void>::apply<table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)> mutable [with auto:216 = std::vector<mutation>]::<lambda(auto:217)>, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> > (args=..., 
    func=...) at ./seastar/core/future.hh:346
#31 seastar::future<seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock> >::then<table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)> mutable [with auto:216 = std::vector<mutation>]::<lambda(auto:217)> > (func=..., this=0x7ffd57d7cf90)
    at ./seastar/core/future.hh:952
#32 table::<lambda(auto:216&&)>::operator()<std::vector<mutation> > (updates=..., __closure=0x7ffd57d7cfe8) at database.cc:4462
#33 seastar::apply_helper<table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)>, std::tuple<std::vector<mutation, std::allocator<mutation> > >&&, std::integer_sequence<long unsigned int, 0> >::apply (args=..., func=...) at ./seastar/core/apply.hh:35
#34 seastar::apply<table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)>, std::vector<mutation, std::allocator<mutation> > > (args=..., func=...) at ./seastar/core/apply.hh:43
#35 seastar::future<> seastar::futurize<seastar::future<> >::apply<table::generate_and_propagate_view_updates(seastar::lw_shared_ptr<schema const> const&, std::vector<view_ptr, std::allocator<view_ptr> >&&, mutation&&, seastar::optimized_optional<flat_mutation_reader>, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) const::{lambda(auto:1&&)#1}, std::vector<mutation, std::allocator<mutation> > >(table::generate_and_propagate_view_updates(seastar::lw_shared_ptr<schema const> const&, std::vector<view_ptr, std::allocator<view_ptr> >&&, mutation&&, seastar::optimized_optional<flat_mutation_reader>, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) const::{lambda(auto:1&&)#1}&&, std::tuple<std::vector<mutation, std::allocator<mutation> > >&&) () at ./seastar/core/future.hh:1389
#36 0x000000000123c128 in seastar::future<std::vector<mutation, std::allocator<mutation> > >::<lambda(auto:2&&)>::operator()<seastar::future_state<std::vector<mutation, std::allocator<mutation> > > > (
    state=..., this=0x6000004048b0) at ./seastar/core/future.hh:413
#37 seastar::continuation<seastar::future<T>::then(Func&&) [with Func = table::generate_and_propagate_view_updates(const schema_ptr&, std::vector<view_ptr>&&, mutation&&, flat_mutation_reader_opt, seastar::lowres_clock::time_point) const::<lambda(auto:216&&)>; Result = seastar::future<>; T = {std::vector<mutation, std::allocator<mutation> >}]::<lambda(auto:2&&)>, std::vector<mutation, std::allocator<mutation> > >::run_and_dispose(void) (this=0x600000404880) at ./seastar/core/future.hh:414
#38 0x000000000097fe48 in seastar::reactor::run_tasks (this=this@entry=0x600000020000, tq=...) at core/reactor.cc:2699
#39 0x000000000098002f in seastar::reactor::run_some_tasks (this=this@entry=0x600000020000) at core/reactor.cc:3122
#40 0x0000000000a3b03e in seastar::reactor::run_some_tasks (this=0x600000020000) at core/reactor.cc:3269
#41 seastar::reactor::run() () at core/reactor.cc:3269
#42 0x0000000000b504c8 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) () at /home/shlomi/scylla/seastar/core/reactor.hh:1203
#43 0x00000000009135ca in main () at /usr/include/c++/8/bits/std_function.h:87
#44 0x00007feea69ca24b in __libc_start_main () from /lib64/libc.so.6
#45 0x0000000000979e7a in _start () at core/reactor.cc:4938
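
A note on reading frames #0 to #2: the gossiper is called with this=0x0, while the unordered_map inside it is called with this=0x2b8. That pattern is what a member access through a null object pointer typically looks like, because the member's address is just the null base plus the member's offset. The sketch below reproduces only the arithmetic; fake_gossiper and its fields are hypothetical stand-ins, not the real gms::gossiper layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a class whose interesting member happens to
// live 0x2b8 bytes into the object.
struct fake_gossiper {
    char other_state[0x2b8];              // everything laid out before the member
    std::uint64_t downtime_map_stand_in;  // stand-in for the unordered_map member
};

// The address a debugger would show as `this` for the member when the
// enclosing object pointer is null: 0 + offset of the member.
constexpr std::size_t member_address_through_null_base() {
    return offsetof(fake_gossiper, downtime_map_stand_in);
}

static_assert(member_address_through_null_base() == 0x2b8,
              "member sits at offset 0x2b8 in this sketch");
```

So a small non-zero `this` in the innermost frame, combined with `this=0x0` one or two frames up, points at the outer object being null rather than at a corrupted map.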

@slivne @duarten I was able to reproduce the issue (on Shlomi's machine) and I have a fix (hinted_handoff_dont_create_hints_until_started-1@https://github.com/vladzcloudius/scylla.git).
The root cause of the failure is that hints::manager::store_hint() has no protection against being called before hints::manager::start() completes. This matters because _gossiper_anchor is dereferenced in hints::manager::store_hint(), and _gossiper_anchor is only set in hints::manager::start().
So, regardless of whether such a call is correct, hints::manager::store_hint() should be prepared to drop hints in that case.
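
The idea of the protection can be sketched as follows. This is a simplified illustration, not the code on the branch: hints_manager_sketch, fake_gossiper, the dropped counter and the string-based signatures are all hypothetical, and the real manager is asynchronous and sharded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for gms::gossiper.
struct fake_gossiper {
    bool is_alive(const std::string& /*endpoint*/) const { return false; }
};

class hints_manager_sketch {
    const fake_gossiper* _gossiper_anchor = nullptr;  // set only by start()
    std::vector<std::string> _stored;
    std::size_t _dropped = 0;

public:
    void start(const fake_gossiper& g) { _gossiper_anchor = &g; }
    bool started() const { return _gossiper_anchor != nullptr; }

    // Returns true if the hint was stored. Before start() has run the hint
    // is dropped instead of dereferencing the still-null anchor.
    bool store_hint(const std::string& endpoint, const std::string& mutation) {
        if (!started()) {
            ++_dropped;
            return false;
        }
        if (_gossiper_anchor->is_alive(endpoint)) {
            return false;  // no hint needed for a live endpoint
        }
        _stored.push_back(endpoint + ":" + mutation);
        return true;
    }

    std::size_t stored() const { return _stored.size(); }
    std::size_t dropped() const { return _dropped; }
};
```

With such a guard, a store_hint() call that arrives before start() is counted as a dropped hint and returns false, and the same call succeeds once start() has set the anchor.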

The patches on the branch above add such protection and do fix the SIGSEGV. However, the test itself starts failing after that.

I see a lot of these in the logs of node2:

ERROR 2018-10-15 23:53:18,169 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-15 23:53:18,169 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-15 23:53:18,169 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-15 23:53:18,169 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)

Maybe the original reason why hints were attempted to be stored in the context of MV (during node2.start(...) at materialized_views_test.py, line 3243) is bogus.
@duarten, could you take a look, please?

@vladzcloudius The test looks good. We wait for hints::resource_manager::start in main(), no? Maybe this is a bug that my patches fix (regarding the non-registration of a hints::manager), which haven't been merged yet. Can you test with my tree applied? https://github.com/duarten/scylla hh-manager-backlog/v2

@slivne @duarten

Apparently MV is sending some updates before hints::manager::start() is done.

With this patch applied to Duarte's branch rebased on top of my branch:

diff --git a/db/view/view.cc b/db/view/view.cc
index 5e1c0b61e..8873c7d2d 100644
--- a/db/view/view.cc
+++ b/db/view/view.cc
@@ -934,6 +934,11 @@ future<> mutate_MV(const dht::token& base_token, std::vector<mutation> mutations
         auto keyspace_name = mut.schema()->ks_name();
         auto paired_endpoint = get_view_natural_endpoint(keyspace_name, base_token, view_token);
         auto pending_endpoints = service::get_local_storage_service().get_token_metadata().pending_endpoints_for(view_token, keyspace_name);
+        if (paired_endpoint) {
+            vlogger.info("Going to apply mutation for token {} and endpoint {}", view_token, *paired_endpoint);
+        } else {
+            vlogger.info("Going to apply mutation for token {} and endpoint {}", view_token, "null endpoint");
+        }
         if (paired_endpoint) {
             // When paired endpoint is the local node, we can just apply
             // the mutation locally, unless there are pending endpoints, in
diff --git a/main.cc b/main.cc
index 0c416d0ea..91b979a98 100644
--- a/main.cc
+++ b/main.cc
@@ -740,6 +740,7 @@ int main(int ac, char** av) {
             api::set_server_gossip_settle(ctx).get();

             supervisor::notify("starting hinted handoff manager");
+            printf("starting hinted handoff manager\n");
             if (hinted_handoff_enabled) {
                 db::hints::manager::rebalance(cfg->hints_directory()).get();
             }
@@ -752,6 +753,7 @@ int main(int ac, char** av) {
             static sharded<db::view::view_builder> view_builder;
             if (cfg->view_building()) {
                 supervisor::notify("starting the view builder");
+                printf("starting the view builder\n");
                 view_builder.start(std::ref(db), std::ref(sys_dist_ks), std::ref(mm)).get();
                 view_builder.invoke_on_all(&db::view::view_builder::start).get();
             }

this is what we get in the logs:

$ egrep "view - |starting" ~/.dtest/dtest-2B27Kd/test/node2/logs/system.log 
<-- snip --->
DEBUG 2018-10-16 18:03:48,369 [shard 0] view - Sending view update to endpoint 127.0.0.1, with pending endpoints = {}
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token -8710206672207430939 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token 5315054604842538780 and endpoint 127.0.0.3
DEBUG 2018-10-16 18:03:48,369 [shard 0] view - Sending view update to endpoint 127.0.0.3, with pending endpoints = {}
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token 7837921608087054476 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token -945925827106861732 and endpoint 127.0.0.3
DEBUG 2018-10-16 18:03:48,369 [shard 0] view - Sending view update to endpoint 127.0.0.3, with pending endpoints = {}
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token 6090196138962193393 and endpoint 127.0.0.3
DEBUG 2018-10-16 18:03:48,369 [shard 0] view - Sending view update to endpoint 127.0.0.3, with pending endpoints = {}
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token -1465220724065171608 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token -8464455714536249460 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token 2073066339800131626 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token -6686854263819514396 and endpoint 127.0.0.1
DEBUG 2018-10-16 18:03:48,369 [shard 0] view - Sending view update to endpoint 127.0.0.1, with pending endpoints = {}
INFO  2018-10-16 18:03:48,369 [shard 0] view - Going to apply mutation for token -3426264072449612391 and endpoint 127.0.0.3
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.3, with pending endpoints = {}
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token 6369203241176819985 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token -8991797711124883163 and endpoint 127.0.0.1
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.1, with pending endpoints = {}
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token 7509725172173062567 and endpoint 127.0.0.1
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.1, with pending endpoints = {}
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token -8096226462579055331 and endpoint 127.0.0.1
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.1, with pending endpoints = {}
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token 5424744040590144439 and endpoint 127.0.0.3
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.3, with pending endpoints = {}
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token 6796959405005444069 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token -2131870702971550566 and endpoint 127.0.0.3
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.3, with pending endpoints = {}
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token 4876065684923426064 and endpoint 127.0.0.2
INFO  2018-10-16 18:03:48,370 [shard 0] view - Going to apply mutation for token 4004946518944965628 and endpoint 127.0.0.1
DEBUG 2018-10-16 18:03:48,370 [shard 0] view - Sending view update to endpoint 127.0.0.1, with pending endpoints = {}
starting hinted handoff manager
starting the view builder
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,353 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
ERROR 2018-10-16 18:03:58,363 [shard 0] view - Error applying view update to 127.0.0.3: exceptions::mutation_write_timeout_exception (Operation timed out for ks.users_by_state - received only 0 responses from 1 CL=ANY.)
<-- snip --->

@slivne I'll send my patches but unfortunately they are not going to fix the issue.

We set up the hints manager only after joining the cluster and starting the messaging service. This means a base replica can generate view updates for a view replica that is unavailable, and then also fail to store a hint for them.

The cleanest thing we can do is to start HH before initializing the messaging service, so we are able to store hints, but defer sending hints out until after the messaging service is initialized, so we can receive replies. @vladzcloudius suggested this so he gets to write it! :)
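
The two-phase startup proposed above can be sketched roughly as follows. This is a simplified, standalone model and not Scylla's hints manager: `hints_manager_sketch`, `hint`, `start_storing`, `allow_replay`, `store_hint` and `replay` are all hypothetical names invented for illustration, and real hints are persisted to disk rather than kept in an in-memory queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical, simplified model; none of these names come from Scylla.
struct hint {
    std::string endpoint;  // destination replica, e.g. "127.0.0.3"
    std::string mutation;  // stand-in for the serialized view update
};

class hints_manager_sketch {
    enum class state { stopped, storing, replaying };
    state _state = state::stopped;
    std::deque<hint> _stored;

public:
    // Phase 1: called before the messaging service is initialized.
    // From here on, failed view updates can be kept as hints.
    void start_storing() {
        assert(_state == state::stopped);
        _state = state::storing;
    }

    // Phase 2: called once the messaging service is initialized,
    // i.e. when replies to replayed hints can actually be received.
    void allow_replay() {
        assert(_state == state::storing);
        _state = state::replaying;
    }

    // Returns false when the manager has not been started yet, which
    // models the "fail to store a hint" case described in the thread.
    bool store_hint(hint h) {
        if (_state == state::stopped) {
            return false;
        }
        _stored.push_back(std::move(h));
        return true;
    }

    // Tries to send stored hints in order and stops at the first failure.
    // Does nothing until replay has been allowed. Returns the number sent.
    std::size_t replay(const std::function<bool(const hint&)>& send) {
        if (_state != state::replaying) {
            return 0;
        }
        std::size_t sent = 0;
        while (!_stored.empty() && send(_stored.front())) {
            _stored.pop_front();
            ++sent;
        }
        return sent;
    }

    std::size_t pending() const { return _stored.size(); }
};
```

In this model, a startup sequence would call `start_storing()` before bringing up messaging and `allow_replay()` afterwards, so a view update that times out in between is queued instead of being dropped.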

On Thu, Oct 11, 2018 at 5:51 PM vladzcloudius wrote:

@nyh For instance, the line it complains about is (heat_load_balance.cc, line 359):

auto last_deficit = sorted_deficits.back().second;

And I don't see any check that would ensure that sorted_deficits is not
empty before this call.
We need somebody who knows this code better to take a look.

If I remember correctly, we know at this point that it cannot be empty. But
I have to admit, I don't remember why. Looking at the rest of this issue,
it seems you no longer think it's related to heat_load_balance, but if you
still suspect this, let me know, and I'll review the whole code again and
try to remember why we can assume that at this point in the code.
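
For reference, calling `back()` on an empty `std::vector` is undefined behavior, so the concern is valid unless the invariant really holds. A minimal sketch of the two usual ways to make the assumption explicit, assuming `sorted_deficits` is a vector of pairs (the element type `deficit_entry` and both helper names are hypothetical; the thread only shows that `.back().second` is read):

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical element type, chosen only for illustration.
using deficit_entry = std::pair<int, float>;

// Option A: document the invariant and have debug builds verify it.
inline float last_deficit_asserted(const std::vector<deficit_entry>& sorted_deficits) {
    assert(!sorted_deficits.empty() && "caller must guarantee at least one entry");
    return sorted_deficits.back().second;
}

// Option B: make the empty case an explicit, checkable result.
inline std::optional<float> last_deficit_checked(const std::vector<deficit_entry>& sorted_deficits) {
    if (sorted_deficits.empty()) {
        return std::nullopt;
    }
    return sorted_deficits.back().second;
}
```

Option A fits if the surrounding code can prove the vector is never empty; Option B fits if that is still in doubt.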

@duarten / @vladzcloudius what's the status?

@vladzcloudius did you send your patches? If so, can you provide the patch name?

@duarten - what is missing? Is it in MV/SI or in HH?

@slivne I think it's in HH. We should have HH working when we start to receive mutations, so that if we generate view updates but fail to deliver them, we can store them as hints and not lose them.

@nyh We know about another issue. We do not know for sure that's the only issue. Verifying that the code in question is ok wouldn't hurt. If it is - you can also put a tiny comment there so that we don't need to bother you the next time. ;)

@slivne It's work in progress. Patches are not ready yet.

Closing this as I think Vlad's patches fixed the issue. If it happens again, we can reopen.
