Installation details
Scylla version (or git commit hash): 4.6.dev-0.20210406.5109bf8b9 with build-id a95fe5cafd391ffe03125747ffba753244dd6d06
Cluster size: 1
OS (RHEL/CentOS/Ubuntu/AWS AMI): dtest
Latest version from master (s3://downloads.scylladb.com/unstable/scylla/master/relocatable/2021-04-06T22:36:26Z/) seems to failing to boot with follow error when running any file from the thrift_test.py:
scylla: ./seastar/include/seastar/core/sharded.hh:523: seastar::sharded<service::memory_limiter>::~sharded() [Service = service::memory_limiter]: Assertion `_instances.empty()' failed.
Aborting on shard 0.
Backtrace:
0x3fc7988
0x3ff9ad2
0x7f234bb251df
/home/fruch/.ccm/scylla-repository/master_2021-04-06T22_36_26Z/scylla/libreloc/libc.so.6+0x3d9d4
/home/fruch/.ccm/scylla-repository/master_2021-04-06T22_36_26Z/scylla/libreloc/libc.so.6+0x268a3
/home/fruch/.ccm/scylla-repository/master_2021-04-06T22_36_26Z/scylla/libreloc/libc.so.6+0x26788
/home/fruch/.ccm/scylla-repository/master_2021-04-06T22_36_26Z/scylla/libreloc/libc.so.6+0x36025
0xe97ebb
/home/fruch/.ccm/scylla-repository/master_2021-04-06T22_36_26Z/scylla/libreloc/libc.so.6+0x281e1
0xe946ed
sounds like related to:
https://github.com/scylladb/scylla/commit/115324f71ab4263e127cd396e2c69f317fcc60e5
@psarna @nyh, can you take a look at it ?
let me see
I just checked locally multiple times and didn't manage to reproduce once - I ran thrift_tests.py file, with an s at the end, but I assume it's the one?
Could you provide the full log from a run? Maybe this error is just a symptom of something bad that happened before. Thrift uses the memory limiter semaphore in exactly the same way that CQL does, so I don't see any indication why it should cause such a failure. I suppose it might happen if the semaphore is broken, which causes memory_limiter::stop() to fail, but since I'm not able to reproduce this failure, I cannot be sure.
@psarna here is the log and the scylla.yaml
https://gist.github.com/fruch/775404707337c27bc06c9cac12bac407
@psarna I think I'm barking the wrong tree here... I'm trying out older version (4.6.dev-0.20210405.a9835ec128 with build-id bf30ff0eb01ebf52613317b2fc08ddce0d044eb3), still fails the same.
I'll continue "bisecting" it...
The whole service_memory_limiter thing is fairly new, added by @xemul in commit 359e9caf54df2a450bfbad8e753e7574735603d0 in January. The call to service_memory_limiter.stop().get() on shutdown is commented out. Doesn't this mean that we don't shut down the service, so this assertion should fail if main() ever exitts (this service_memory_limiter is a local variable in main())?
I'm puzzled how this ever worked, and why this crash is specific to the thrift test. The only thing I can think of is that usually, Scylla never exits (it is usually killed), certainly not during boot. If it crashes with this error during boot, maybe there is some boot error (did you check the logs for additional errors before this crash?) and when Scylla tries to clean up and exit after the error, it crashes in this manner? Maybe this boot failure is somehow specific to the Thrift test?
main() calls exit(0) before the service_memory_limiter destroys (@nyh )
It looks like this
int main() {
sharded<service::memory_limiter> ml;
return app.run([&] {
return async([&] {
try {
...
} catch (...) {
_exit(1);
return 1;
}
_exit(0);
return 0;
});
});
}
@xemul the code path that does _exit(1) prints "Startup failed" to the log, and the one that does _exit(0) prints "shutdown complete". None of these are in the log (https://gist.github.com/fruch/775404707337c27bc06c9cac12bac407). The log is suspiciously short, and indicates we are destroying a sharded<service::memory_limiter> in the beginning of the initialization.
The last message in the log is INFO 2021-04-07 10:32:40,472 [shard 0] database - Keyspace system_schema: Reading CF keyspaces id=abac5682-dea6-31c5-b535-b3d6cffd0fb6 version=e79ca8ba-6556-3f7d-925a-7f20cf57938c and I suspect we are failing in distributed_loader::init_system_keyspace(db).get(). However, I can't explain why we don't reach that _exit(1) case which you found... So I'm out of ideas.
@fruch can you please try to addr2line one of these failure reports, maybe we'll figure out where the destructor is called from, and maybe we'll find something surprising? Because we believe it should not have been called, and the question is who called it.
Seen a similar issue in 4.3 on startup:
https://jenkins.scylladb.com/view/scylla-4.3/job/scylla-4.3/job/dtest-release/66/artifact/logs-release.2/1622049382992_pushed_notifications_test.TestPushedNotifications.wait_and_restart_node_localhost_test/node1.log
INFO 2021-05-26 17:06:25,466 [shard 0] init - starting native transport
scylla: ./seastar/include/seastar/core/sharded.hh:518: seastar::sharded<T>::~sharded() [with Service = cql_transport::cql_server]: Assertion `_instances.empty()' failed.
Aborting on shard 0.
Backtrace:
0x0000000002b72b22
0x0000000002b731b0
0x0000000002b73455
0x0000000002b734a0
0x00007f496e273a8f
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6+0x000000000003c9e4
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6+0x0000000000025894
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6+0x0000000000025768
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6+0x0000000000034e75
0x0000000000ccd52a
0x0000000000ccd57f
0x0000000000d17fc1
0x0000000000d2a903
0x0000000002e0eb4c
Decoded from version d737d56a0 build-id 1b72726e11926c76c44aa89f11b442d89d23aa7c:
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at reactor.cc:?
seastar::print_with_backtrace(seastar::backtrace_buffer&) at reactor.cc:?
(inlined by) print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:752
seastar::print_with_backtrace(char const*) at reactor.cc:?
void seastar::install_oneshot_signal_handler<6, &seastar::sigabrt_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at reactor.cc:?
(inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3466
(inlined by) _FUN at ./build/release/seastar/./seastar/src/core/reactor.cc:3462
?? ??:0
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6 0x000000000003c9e4
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6 0x0000000000025894
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6 0x0000000000025768
/jenkins/workspace/scylla-4.3/dtest-release/scylla/.ccm/scylla-repository/d737d56a0894da5045090ac59e35d6a2916dcdde/scylla/libreloc/libc.so.6 0x0000000000034e75
seastar::sharded<cql_transport::cql_server>::~sharded() at ././seastar/include/seastar/core/sharded.hh:518 (discriminator 1)
std::default_delete<seastar::sharded<cql_transport::cql_server> >::operator()(seastar::sharded<cql_transport::cql_server>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
(inlined by) std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
(inlined by) cql_transport::controller::~controller() at ./transport/controller.hh:38
operator() at /usr/include/c++/10/any:289
__invoke_impl<int, main(int, char**)::<lambda()>::<lambda()> > at /usr/include/c++/10/bits/invoke.h:60
(inlined by) __invoke<main(int, char**)::<lambda()>::<lambda()> > at /usr/include/c++/10/bits/invoke.h:95
(inlined by) __apply_impl<main(int, char**)::<lambda()>::<lambda()>, std::tuple<> > at /usr/include/c++/10/tuple:1723
(inlined by) apply<main(int, char**)::<lambda()>::<lambda()>, std::tuple<> > at /usr/include/c++/10/tuple:1734
(inlined by) apply<main(int, char**)::<lambda()>::<lambda()> > at ././seastar/include/seastar/core/future.hh:2104
(inlined by) operator() at ././seastar/include/seastar/core/thread.hh:258
(inlined by) call at ././seastar/include/seastar/util/noncopyable_function.hh:116
seastar::noncopyable_function<void ()>::operator()() const at /usr/include/c++/10/bits/basic_string.h:323
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:297
As for this particular issue about the memory limiter, see https://github.com/scylladb/scylla/blob/359e9caf54df2a450bfbad8e753e7574735603d0/main.cc#L835-L838
It is not stopped yet...
@bhalevy still fails for me on ubuntu with a build from yesterday.
INFO 2021-06-09 17:14:31,401 [shard 0] init - Scylla version 4.6.dev-0.20210608.65bc81e0d with build-id 24a6413ca4bf0d78b40dac21cf663c36d212fc14 starting
...
scylla: ./seastar/include/seastar/core/sharded.hh:523: seastar::sharded<service::memory_limiter>::~sharded() [Service = service::memory_limiter]: Assertion `_instances.empty()' failed.
Aborting on shard 0.
Backtrace:
0x40e6ea8
0x4118912
0x7f4f479741df
/home/fruch/.ccm/scylla-repository/master_2021-06-08T23_06_06Z/scylla/libreloc/libc.so.6+0x3d9d4
/home/fruch/.ccm/scylla-repository/master_2021-06-08T23_06_06Z/scylla/libreloc/libc.so.6+0x268a3
/home/fruch/.ccm/scylla-repository/master_2021-06-08T23_06_06Z/scylla/libreloc/libc.so.6+0x26788
/home/fruch/.ccm/scylla-repository/master_2021-06-08T23_06_06Z/scylla/libreloc/libc.so.6+0x36025
0xed8c9c
/home/fruch/.ccm/scylla-repository/master_2021-06-08T23_06_06Z/scylla/libreloc/libc.so.6+0x281e1
0xed540d
What's this version (65bc81e0d)? I don't see it in the git tree?
@bhalevy http://downloads.scylladb.com/unstable/scylla/master/relocatable/2021-06-08T23%3A06%3A06Z/
also fail like that for any rc version of 4.4 or 4.5
looks like this version was never promoted so I don't know if it contains the fix or not
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/3613/ failed
@penberg @hagitsegev how come the artifacts of a failed next job are being used?