Installation details
Scylla version (or git commit hash): 3.2.2-0.20200222.0b23e7145d0
Cluster size: 3 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 18.04
My cluster of 3 nodes has been misbehaving since I upgraded it from 2.3 to 3.0 to 3.1 to 3.2 over the weekend.
It segfaults every 15 minutes to 1 hour.
I've attached a logfile.
I've also submitted a coredump from one of my nodes; the report_uuid is 541efd51-0c6f-4744-bc0a-ed42c130c5c9
The crashes stopped between 6 pm and midnight, but happened again last night.
Here's a table mapping node names to the IPs that appear in the logs:
| name | IP |
|-|-|
| scylla-1 | 192.168.0.1 |
| scylla-2 | 192.168.0.2 |
| scylla-3 | 192.168.0.3 |
No prior logs
Mar 03 00:06:25 scylla-02 scylla[29120]: [shard 0] gossip - InetAddress 192.168.0.3 is now DOWN, status = NORMAL
Mar 03 00:06:42 scylla-02 scylla[29120]: [shard 5] rpc - client 192.168.0.1:7000: client connection dropped: read: Connection reset by peer
Mar 03 00:06:42 scylla-02 scylla[29120]: [shard 1] rpc - client 192.168.0.1:7000: client connection dropped: read: Connection reset by peer
Mar 03 00:06:42 scylla-02 scylla[29120]: [shard 5] rpc - client 192.168.0.1:7000: client connection dropped: read: Connection reset by peer
Mar 03 00:06:42 scylla-02 scylla[29120]: [shard 0] rpc - client 192.168.0.1:63168: server connection dropped: read: Connection reset by peer
Mar 03 00:06:42 scylla-02 scylla[29120]: [shard 0] rpc - client 192.168.0.1:7000: fail to connect: Connection reset by peer
Mar 03 00:06:42 scylla-02 scylla[29120]: Segmentation fault on shard 1.
@gleb-cloudius Looks like there's a crash in storage_proxy:
#0 seastar::net::inet_address::operator== (this=0x0, o=...) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/net/inet_address.cc:117
#1 0x0000000001498b3c in std::rel_ops::operator!=<seastar::net::inet_address> (__y=..., __x=...) at /usr/include/c++/9/bits/stl_relops.h:87
#2 gms::operator!= (y=..., x=...) at ./gms/inet_address.hh:84
#3 service::storage_proxy::get_read_executor(seastar::lw_shared_ptr<query::read_command>, seastar::lw_shared_ptr<schema const>, nonwrapping_range<dht::ring_position>, db::consistency_level, db::read_repair_decision, tracing::trace_state_ptr, std::vector<gms::inet_address, std::allocator<gms::inet_address> > const&, bool&, service_permit) () at service/storage_proxy.cc:3397
#4 0x00000000014998f3 in service::storage_proxy::query_singular(seastar::lw_shared_ptr<query::read_command>, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > >&&, db::consistency_level, service::storage_proxy::coordinator_query_options) ()
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/shared_ptr.hh:289
#5 0x00000000014e9ea2 in service::storage_proxy::do_query(seastar::lw_shared_ptr<schema const>, seastar::lw_shared_ptr<query::read_command>, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > >&&, db::consistency_level, service::storage_proxy::coordinator_query_options) () at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/shared_ptr.hh:289
#6 0x00000000014ec7b6 in service::storage_proxy::query(seastar::lw_shared_ptr<schema const>, seastar::lw_shared_ptr<query::read_command>, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > >&&, db::consistency_level, service::storage_proxy::coordinator_query_options) () at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/shared_ptr.hh:289
#7 0x000000000206bb72 in service::pager::query_pager::do_fetch_page(unsigned int, std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) ()
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/shared_ptr.hh:289
#8 0x0000000002070a8d in service::pager::query_pager::fetch_page_generator(unsigned int, std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, cql3::cql_stats&) ()
at service/pager/query_pagers.cc:217
#9 0x00000000011c34ad in cql3::statements::select_statement::do_execute(service::storage_proxy&, service::query_state&, cql3::query_options const&) ()
at cql3/statements/select_statement.cc:370
#10 0x00000000011d777c in std::__invoke_impl<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement::* const&)(service::storage_proxy&, service::query_state&, cql3::query_options const&), cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&> (
__f=<optimized out>, __f=<optimized out>, __t=<synthetic pointer>) at /usr/include/c++/9/bits/invoke.h:89
#11 std::__invoke<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement::* const&)(service::storage_proxy&, service::query_state&, cql3::query_options const&), cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:96
#12 std::_Mem_fn_base<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement::*)(service::storage_proxy&, service::query_state&, cql3::query_options const&), true>::operator()<cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&> (this=<optimized out>) at /usr/include/c++/9/functional:114
#13 seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::direct_vtable_for<std::_Mem_fn<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement::*)(service::storage_proxy&, service::query_state&, cql3::query_options const&)> >::call(seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)> const*, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&) (func=<optimized out>, args#0=<optimized out>, args#1=..., args#2=..., args#3=...)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/util/noncopyable_function.hh:99
#14 0x00000000011d779e in seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::operator()(cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&) const (args#3=..., args#2=..., args#1=..., args#0=<optimized out>,
this=<optimized out>) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/util/noncopyable_function.hh:181
#15 seastar::inheriting_concrete_execution_stage<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>::make_stage_for_group(seastar::scheduling_group)::{lambda(cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)#1}::operator()(cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&) const (this=<optimized out>, args#3=..., args#2=..., args#1=...,
args#0=<optimized out>) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/execution_stage.hh:329
#16 seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::direct_vtable_for<seastar::inheriting_concrete_execution_stage<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>::make_stage_for_group(seastar::scheduling_group)::{lambda(cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)#1}>::call(seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)> const*, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&) (func=<optimized out>, args#0=<optimized out>, args#1=...,
args#2=..., args#3=...) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/util/noncopyable_function.hh:99
#17 0x00000000011d9149 in seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>::operator()(cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&) const (args#3=..., args#2=..., args#1=..., args#0=0x60100d5d0710,
this=0x6010058673c0) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/util/noncopyable_function.hh:181
#18 seastar::apply_helper<seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, std::tuple<cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>&&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul> >::apply(seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, std::tuple<cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>&&) (args=..., func=...) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/apply.hh:36
#19 seastar::apply<seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>(seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, std::tuple<cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>&&) (args=..., func=...)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/apply.hh:44
#20 seastar::futurize<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > >::apply<seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>(seastar::noncopyable_function<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> > (cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&)>&, std::tuple<cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>&&) (args=..., func=...) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:1537
#21 seastar::concrete_execution_stage<seastar::future<seastar::shared_ptr<cql_transport::messages::result_message> >, cql3::statements::select_statement*, service::storage_proxy&, service::query_state&, cql3::query_options const&>::do_flush (this=0x601005867348)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/execution_stage.hh:247
#22 0x0000000002a02e7d in seastar::execution_stage::<lambda()>::operator() (__closure=0x60100dbe4178)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/core/execution_stage.cc:140
#23 seastar::lambda_task<seastar::execution_stage::flush()::<lambda()> >::run_and_dispose(void) (this=0x60100dbe4168)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/task.hh:48
#24 0x0000000002a52f62 in seastar::reactor::run_tasks (this=this@entry=0x601000020000, tq=...)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:2109
#25 0x0000000002a53170 in seastar::reactor::run_some_tasks (this=this@entry=0x601000020000)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:2533
#26 0x0000000002abd0c6 in seastar::reactor::run_some_tasks (this=0x601000020000)
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:2674
#27 seastar::reactor::run() () at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/core/reactor.cc:2674
#28 0x0000000002ad185d in seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::{lambda()#3}::operator()() const ()
at /jenkins/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/reactor.hh:740
#29 0x0000000002a35bbe in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/9/bits/std_function.h:685
#30 seastar::posix_thread::start_routine (arg=<optimized out>) at /jenkins/workspace/scylla-3.2/build/scylla/seastar/src/core/posix.cc:52
#31 0x00007fe95f9925a2 in start_thread () from /opt/scylladb/libreloc/libpthread.so.0
#32 0x00007fe95f07c303 in clone () from /opt/scylladb/libreloc/libc.so.6
What is your RF and what kind of driver are you using (token aware or not)?
Commit 56f3bda4c7aedfb852ca96afa998093f6e8bd0fc may be to blame, since it added an access to the first element of all_replicas without checking that there is an element at all.
The RF for all of my keyspaces is replication = {'class': 'NetworkTopologyStrategy', 'fra1': '3'}
The driver used is https://rubygems.org/gems/cassandra-driver (https://github.com/datastax/ruby-driver)
I'm checking if the default setup involves token awareness
EDIT: it seems that token awareness is default behavior https://github.com/datastax/ruby-driver/blob/0a0be648115964d3cc9a45bf602286d8e5025849/lib/cassandra/execution/profile.rb#L58
Well, with RF 3 and a 3-node cluster it is hard to see how all_replicas can
be empty.
--
Gleb.
Should I run some tests or make changes on my side, like disabling a feature?
I'll try to see what I can find in the core.
--
Gleb.
From the commit message of 56f3bda4c7aedfb852ca96afa998093f6e8bd0fc, I understand it involves queries using the IN operator.
I use a lot of queries like this, and we stumbled upon the error "clustering key cartesian product size 600 is greater than maximum 100".
I reduced the number of values in the IN list and increased the number of queries.
If it helps somehow
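For anyone hitting the same cartesian product error, the workaround described above (fewer values per IN list, more queries) can be sketched roughly as follows. This is only an illustration: `chunk_in_values` is a hypothetical helper, not part of any driver, and it assumes a single IN restriction, so the chunk size equals the limit from the error message. With IN lists on several clustering columns, the product of the list sizes is what has to stay under the limit.

```python
from typing import Iterator, List, Sequence


def chunk_in_values(values: Sequence, max_size: int = 100) -> Iterator[List]:
    """Yield consecutive slices of `values`, each at most `max_size` long.

    Each slice is meant to be bound to the IN list of one query, so that
    no single query exceeds the limit reported in the error message.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(values), max_size):
        yield list(values[start:start + max_size])


# 600 values (the size from the error above) become 6 queries of 100 values.
batches = list(chunk_in_values(list(range(600)), max_size=100))
```

Each element of `batches` would then be passed as the IN parameter of a separate query, and the results merged on the client side.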
On Tue, Mar 03, 2020 at 04:58:20AM -0800, Solvik wrote:
> If it helps somehow
I do not think so.
Is the teeanalytics_staging keyspace configured the way you described
above as well? Can you provide the output of "describe keyspace teeanalytics_staging"?
--
Gleb.
CREATE KEYSPACE teeanalytics_staging WITH replication = {'class': 'NetworkTopologyStrategy', 'fra-1': '3'} AND durable_writes = true;
What kind of snitch are you using and can you send its configuration?
--
Gleb.
It looks like you are using the property file snitch and you have the fra1 DC
configured, but for this keyspace you are using fra-1, and that is why you
are hitting the bug.
--
Gleb.
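To illustrate why the mismatched DC name matters, here is a deliberately simplified model of per-DC replica selection. It is not Scylla's actual NetworkTopologyStrategy code: `replicas_for_keyspace` and `node_dcs` are hypothetical names, and the node-to-DC mapping assumes all three nodes report `fra1`, as stated in the comment above.

```python
from typing import Dict, List


def replicas_for_keyspace(replication: Dict[str, str],
                          node_dcs: Dict[str, str]) -> List[str]:
    """Pick up to RF nodes from each data center named in `replication`.

    A DC name that no node reports contributes no replicas at all.
    """
    replicas: List[str] = []
    for dc, rf in replication.items():
        if dc == "class":  # the strategy name, not a data center
            continue
        nodes_in_dc = [node for node, node_dc in node_dcs.items() if node_dc == dc]
        replicas.extend(nodes_in_dc[:int(rf)])
    return replicas


# Assumption: every node is in the DC the snitch calls "fra1".
node_dcs = {"192.168.0.1": "fra1", "192.168.0.2": "fra1", "192.168.0.3": "fra1"}

with_typo = {"class": "NetworkTopologyStrategy", "fra-1": "3"}
corrected = {"class": "NetworkTopologyStrategy", "fra1": "3"}

print(replicas_for_keyspace(with_typo, node_dcs))   # -> []
print(replicas_for_keyspace(corrected, node_dcs))   # all three nodes
```

In this model the keyspace defined with `fra-1` ends up with an empty replica list even though RF is 3 and the cluster has 3 nodes, which is consistent with all_replicas being empty in the crash.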
I've altered the keyspace to fix this typo, nice find!
@kostja we need to fix 56f3bda4c7aedfb852ca96afa998093f6e8bd0fc regardless, so that it does not access a non-existent element. This may happen with a correct configuration as well.
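The kind of guard being asked for can be sketched like this. It is written in Python for brevity rather than in Scylla's C++, and `first_replica_or_none` is a hypothetical name; the actual fix in storage_proxy.cc may look quite different.

```python
from typing import Optional, Sequence


def first_replica_or_none(all_replicas: Sequence[str]) -> Optional[str]:
    """Return the first replica, or None when there are no replicas.

    The emptiness check comes before the element access, so the caller
    can handle the "no replicas" case instead of reading a missing element.
    """
    if not all_replicas:
        return None
    return all_replicas[0]
```

A caller would then treat `None` as "no replica available for this range" and fail the request cleanly instead of crashing.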
@gleb-cloudius / @kostja please assign to whoever is sending a fix.
Backported to 3.3 (newer versions have the fix).