Rethinkdb: Node crash with "Guarantee failed: [refcount == 0]"

Created on 30 Apr 2017 · 5Comments · Source: rethinkdb/rethinkdb

Production node, with some moderate load. Noting unusual about set up: a 5 node cluster, with around 5M values, in 8 tables. We do some long running operations, like iterating and/or replacing full tables now and then. Nothing else suspicious happened around the time. Rest of the nodes are fine.

2017-04-29T08:29:30.922582177 4104758.855155s notice: Disconnected from server "rethink_mb3z" 5953ad46-505c-4993-955a-da5352c2afaa
2017-04-29T08:29:30.931105390 4104758.863678s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T08:29:30.989628640 4104758.922201s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T08:29:31.158262058 4104759.090834s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T08:29:31.581387515 4104759.513960s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T08:29:31.967367774 4104759.899940s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T08:29:32.488237400 4104760.420810s notice: Connected to server "rethink_mb3z" 5953ad46-505c-4993-955a-da5352c2afaa
2017-04-29T10:19:34.761926839 4111362.694499s notice: Disconnected from server "rethink_mb3z" 5953ad46-505c-4993-955a-da5352c2afaa
2017-04-29T10:19:34.773095935 4111362.705668s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T10:19:34.780149345 4111362.712722s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one 
is open already.
2017-04-29T10:19:34.796365435 4111362.728938s error: Error in ./src/concurrency/auto_drainer.hpp at line 79:
2017-04-29T10:19:34.797976181 4111362.730548s error: Guarantee failed: [refcount == 0] 
2017-04-29T10:19:34.797996261 4111362.730568s error: Backtrace:
2017-04-29T10:19:34.812979423 4111362.745551s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:34.855369519 4111362.787942s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:35.449625188 4111363.382200s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:35.449704521 4111363.382277s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:35.720574621 4111363.653150s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:35.766105250 4111363.698678s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:35.870649279 4111363.803221s info: Rejected a connection from server 5953ad46-505c-4993-955a-da5352c2afaa since one is open already.
2017-04-29T10:19:36.020617137 4111363.953190s error: Sat Apr 29 10:19:34 2017\n\n1 [0xae7500]: backtrace_t::backtrace_t() at 0xae7500 (/usr/bin/rethinkdb)\n2 [0xae7879]: format_backtrace(bool) at 0xae7879 (/usr/bin/rethinkdb)\n3 [0xd9f6c3]: report_fatal_error(char const*, int, char const*, ...) at 0xd9f6c3 (/usr/bin/rethinkdb)\n4 [0x9d9335]: linux_tcp_conn_t::rethread(threadnum_t) at 0x9d9335 (/usr/bin/rethinkdb)\n5 [0xa4e5da]: rethread_tcp_conn_stream_t::~rethread_tcp_conn_stream_t() at 0xa4e5da (/usr/bin/rethinkdb)\n6 [0x9fb375]: connectivity_cluster_t::run_t::handle(keepalive_tcp_conn_stream_t*, boost::optional<peer_id_t>, boost::optional<peer_address_t>, auto_drainer_t::lock_t, bool*, int) at 0x9fb375 (/usr/bin/rethinkdb)\n7 [0x9fcb4a]: connectivity_cluster_t::run_t::on_new_connection(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t) at 0x9fcb4a (/usr/bin/rethinkdb)\n8 [0x9fdd9a]: std::_Function_handler<void (scoped_ptr_t<linux_tcp_conn_descriptor_t>&), std::_Bind<std::_Mem_fn<void (connectivity_cluster_t::run_t::*)(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t)> (connectivity_cluster_t::run_t*, std::_Placeholder<1>, int, auto_drainer_t::lock_t)> >::_M_invoke(std::_Any_data const&, scoped_ptr_t<linux_tcp_conn_descriptor_t>&) at 0x9fdd9a (/usr/bin/rethinkdb)\n9 [0x9df626]: linux_nonthrowing_tcp_listener_t::handle(int) at 0x9df626 (/usr/bin/rethinkdb)\n10 [0x9f2c47]: coro_t::run() at 0x9f2c47 (/usr/bin/rethinkdb)

high bug

Source

cristicbz

All 5 comments

Ah, seems related to #5783 . Nodes are running 2.3.5, btw.

cristicbz on 30 Apr 2017

Thanks for the bug report.

Here's the formatted backtrace:

error: Error in ./src/concurrency/auto_drainer.hpp at line 79:
error: Guarantee failed: [refcount == 0] 
error: Backtrace:
1 [0xae7500]: backtrace_t::backtrace_t() at 0xae7500 (/usr/bin/rethinkdb)
2 [0xae7879]: format_backtrace(bool) at 0xae7879 (/usr/bin/rethinkdb)
3 [0xd9f6c3]: report_fatal_error(char const*, int, char const*, ...) at 0xd9f6c3 (/usr/bin/rethinkdb)
4 [0x9d9335]: linux_tcp_conn_t::rethread(threadnum_t) at 0x9d9335 (/usr/bin/rethinkdb)
5 [0xa4e5da]: rethread_tcp_conn_stream_t::~rethread_tcp_conn_stream_t() at 0xa4e5da (/usr/bin/rethinkdb)
6 [0x9fb375]: connectivity_cluster_t::run_t::handle(keepalive_tcp_conn_stream_t*, boost::optional<peer_id_t>, boost::optional<peer_address_t>, auto_drainer_t::lock_t, bool*, int) at 0x9fb375 (/usr/bin/rethinkdb)
7 [0x9fcb4a]: connectivity_cluster_t::run_t::on_new_connection(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t) at 0x9fcb4a (/usr/bin/rethinkdb)
8 [0x9fdd9a]: std::_Function_handler<void (scoped_ptr_t<linux_tcp_conn_descriptor_t>&), std::_Bind<std::_Mem_fn<void (connectivity_cluster_t::run_t::*)(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t)> (connectivity_cluster_t::run_t*, std::_Placeholder<1>, int, auto_drainer_t::lock_t)> >::_M_invoke(std::_Any_data const&, scoped_ptr_t<linux_tcp_conn_descriptor_t>&) at 0x9fdd9a (/usr/bin/rethinkdb)
9 [0x9df626]: linux_nonthrowing_tcp_listener_t::handle(int) at 0x9df626 (/usr/bin/rethinkdb)
10 [0x9f2c47]: coro_t::run() at 0x9f2c47 (/usr/bin/rethinkdb)

It looks almost identical to the one in #5783. @danielmewes do you have any clue about this?

AtnNn on 8 May 2017

I am having apparently the same issue in my production application recently.

2020-01-22T12:42:35.497915211 418.411302s error: Error in ./src/concurrency/auto_drainer.hpp at line 79:
2020-01-22T12:42:35.497963701 418.411350s error: Guarantee failed: [refcount == 0]
2020-01-22T12:42:35.497973464 418.411359s error: Backtrace:
2020-01-22T12:42:35.611520361 418.524909s error: Wed Jan 22 12:42:35 2020

1 [0xb2bb5a]: backtrace_t::backtrace_t() at 0xb2bb5a (/usr/bin/rethinkdb)
2 [0xb2c03a]: format_backtrace[abi:cxx11](bool) at 0xb2c03a (/usr/bin/rethinkdb)
3 [0xb2d66c]: report_fatal_error(char const*, int, char const*, ...) at 0xb2d66c (/usr/bin/rethinkdb)
4 [0x7d2446]: linux_tcp_conn_t::rethread(threadnum_t) at 0x7d2446 (/usr/bin/rethinkdb)
5 [0x7c2b5a]: rethread_tcp_conn_stream_t::~rethread_tcp_conn_stream_t() at 0x7c2b5a (/usr/bin/rethinkdb)
6 [0x77760e]: connectivity_cluster_t::run_t::handle(keepalive_tcp_conn_stream_t*, boost::optional<peer_id_t>, boost::optional<peer_address_t>, auto_drainer_t::lock_t, bool*, int) at 0x77760e (/usr/bin/rethinkdb)
7 [0x777a69]: connectivity_cluster_t::run_t::on_new_connection(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t) at 0x777a69 (/usr/bin/rethinkdb)
8 [0x7789da]: std::_Function_handler<void (scoped_ptr_t<linux_tcp_conn_descriptor_t>&), std::_Bind<std::_Mem_fn<void (connectivity_cluster_t::run_t::*)(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t)> (connectivity_cluster_t::run_t*, std::_Placeholder<1>, int, auto_drainer_t::lock_t)> >::_M_invoke(std::_Any_data const&, scoped_ptr_t<linux_tcp_conn_descriptor_t>&) at 0x7789da (/usr/bin/rethinkdb)
9 [0x7d4220]: linux_nonthrowing_tcp_listener_t::handle(int) at 0x7d4220 (/usr/bin/rethinkdb)
10 [0x7f0ac8]: coro_t::run() at 0x7f0ac8 (/usr/bin/rethinkdb)
2020-01-22T12:42:35.611609676 418.524996s error: Exiting.

It happens when a periodical script is doing heavy operations. Setup is a cluster of 3 AWS m5d.2xlarge with attached nvme drives being accessed trough an rethinkdb proxy running on a AWS c5.large instance. At some point, the proxy is using all the memory (around 3GB) and being killed somehow. I don't know if the culprit is one of the shards crashing and then proxy takes all memory or instead proxy dying and then the shard crashes.

Anyone knows if this issue is reproducible on 2.4.0?

irae on 22 Jan 2020

I don't think that issue has been addressed for 2.4.0.

srh on 22 Jan 2020

😕1

I'm running 2.4.0 and this just started happening. I'm running 3 proxy instances and 15 nodes within Kubernetes. Here's the backtrace I got from a Node that seemed to cause all of the issues:

2020-03-31T12:54:16.927859265Z Version: rethinkdb 2.4.0~0bionic (CLANG 6.0.0 (tags/RELEASE_600/final))
2020-03-31T12:54:16.927897992Z error: Error in thread 1 in ./src/concurrency/auto_drainer.hpp at line 79:
2020-03-31T12:54:16.927903386Z error: Guarantee failed: [refcount == 0] 
2020-03-31T12:54:16.927907459Z error: Backtrace:
2020-03-31T12:54:17.857949185Z error: Tue Mar 31 12:54:16 2020
2020-03-31T12:54:17.858027606Z        1 [0xcd05a0]: backtrace_t::backtrace_t() at 0xcd05a0 (rethinkdb)
2020-03-31T12:54:17.858035827Z        2 [0xcd1216]: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at 0xcd1216 (rethinkdb)
2020-03-31T12:54:17.858044257Z        3 [0xccffa8]: format_backtrace[abi:cxx11](bool) at 0xccffa8 (rethinkdb)
2020-03-31T12:54:17.858052219Z        4 [0xc8898d]: report_fatal_error(char const*, int, char const*, ...) at 0xc8898d (rethinkdb)
2020-03-31T12:54:17.858060077Z        5 [0xd13b95]: linux_tcp_conn_t::rethread(threadnum_t) at 0xd13b95 (rethinkdb)
2020-03-31T12:54:17.858067740Z        6 [0xd31f0c]: rethread_tcp_conn_stream_t::~rethread_tcp_conn_stream_t() at 0xd31f0c (rethinkdb)
2020-03-31T12:54:17.858075610Z        7 [0x76c35d]: connectivity_cluster_t::run_t::handle(keepalive_tcp_conn_stream_t*, optional<peer_id_t>, optional<peer_address_t>, optional<server_id_t>, auto_drainer_t::lock_t, bool*, int) at 0x76c35d (rethinkdb)
2020-03-31T12:54:17.858103218Z        8 [0x769201]: connectivity_cluster_t::run_t::on_new_connection(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t) at 0x769201 (rethinkdb)
2020-03-31T12:54:17.858112564Z        9 [0x77385e]: std::_Function_handler<void (scoped_ptr_t<linux_tcp_conn_descriptor_t>&), std::_Bind<void (connectivity_cluster_t::run_t::*(connectivity_cluster_t::run_t*, std::_Placeholder<1>, int, auto_drainer_t::lock_t))(scoped_ptr_t<linux_tcp_conn_descriptor_t> const&, int, auto_drainer_t::lock_t)> >::_M_invoke(std::_Any_data const&, scoped_ptr_t<linux_tcp_conn_descriptor_t>&) at 0x77385e (rethinkdb)
2020-03-31T12:54:17.858141497Z        10 [0xd15fb1]: linux_nonthrowing_tcp_listener_t::handle(int) at 0xd15fb1 (rethinkdb)
2020-03-31T12:54:17.858149044Z        11 [0xd1e5a4]: coro_t::run() at 0xd1e5a4 (rethinkdb)
2020-03-31T12:54:17.858344507Z error: Exiting.

Right before that, my proxies reported: 2020-03-31 05:54:15.821 PDT error: Heartbeat timeout, killing connection to peer 10.44.2.5:29015. That particular IP Address is the one assigned to the Node which reported the above backtrace.

After that, each of the proxies and nodes all reported this within a few seconds of eachother.

2020-03-31T12:54:59.284085423Z Version: rethinkdb 2.4.0~0bionic (CLANG 6.0.0 (tags/RELEASE_600/final))
2020-03-31T12:54:59.284114034Z error: Error in thread 8 in src/rpc/connectivity/cluster.cc at line 395:
2020-03-31T12:54:59.284119110Z error: Guarantee failed: [peer.ips().size() > 0] 
2020-03-31T12:54:59.284123363Z error: Backtrace:
2020-03-31T12:54:59.313099384Z error: Tue Mar 31 12:54:59 2020
2020-03-31T12:54:59.313131288Z        
2020-03-31T12:54:59.313136138Z        1 [0xcd05a0]: backtrace_t::backtrace_t() at 0xcd05a0 (rethinkdb)
2020-03-31T12:54:59.313140469Z        2 [0xcd1216]: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at 0xcd1216 (rethinkdb)
2020-03-31T12:54:59.313158223Z        3 [0xccffa8]: format_backtrace[abi:cxx11](bool) at 0xccffa8 (rethinkdb)
2020-03-31T12:54:59.313161721Z        4 [0xc8898d]: report_fatal_error(char const*, int, char const*, ...) at 0xc8898d (rethinkdb)
2020-03-31T12:54:59.313165094Z        5 [0x76ce9a]: connectivity_cluster_t::run_t::join_blocking(peer_address_t const&, optional<peer_id_t>, optional<server_id_t>, int, auto_drainer_t::lock_t) at 0x76ce9a (rethinkdb)
2020-03-31T12:54:59.313169324Z        6 [0x7fec60]: rethinkdb() [0x7fec60] at 0x7fec60 ()
2020-03-31T12:54:59.313172441Z        7 [0xd1e5a4]: coro_t::run() at 0xd1e5a4 (rethinkdb)
2020-03-31T12:54:59.313175635Z error: Exiting.

They all eventually started back up and recovered. The whole incident took about 4 minutes to happen and recover from.