Hi,
Got this error on one of the nodes in production, I haven't seen it before. It's hard to say if there was anything unusual going on, but besides the usual production load, we've been doing two things:
r.db("foo")
.table("bar")
.between(r.iso8601("2016-07-15T00:00Z"), r.iso8601("2016-08-01T00:00Z"), {"index": "created"})
.filter(function (var127) { return r.not(r.row.hasFields("updated")); })
.update(function (var128) { return r.object("updated", r.row("created")); })
Which was updating hundreds of thousands of docs in batches, and
r.db("ormFoo")
.table("ormBar")
.update({
// 20 or so fields omitted
"id": "a87aef60-3533-415a-9769-5b16c5959380",
"created": r.iso8601("2017-04-18T12:46:40.462Z"),
"updated": r.iso8601("2017-04-18T12:46:40.660Z"))
which, of course, is a horrible thing as it tries to update ~20k documents one by one, and fails on all except for the one correct ID.
Here's the crash trace:
Version: rethinkdb 2.3.5~0jessie (GCC 4.9.2)
error: Error in src/rdb_protocol/query_cache.cc at line 325:
error: Guarantee failed: [entry->state == entry_t::state_t::STREAM]
error: Backtrace:
error: Tue Apr 18 12:31:15 2017
1 [0xae7500]: backtrace_t::backtrace_t() at ??:?
2 [0xae7879]: format_backtrace(bool) at ??:?
3 [0xd9f6c3]: report_fatal_error(char const*, int, char const*, ...) at ??:?
4 [0x72ffac]: ql::query_cache_t::ref_t::serve(ql::env_t*, ql::response_t*) at ??:?
5 [0x7323f2]: ql::query_cache_t::ref_t::fill_response(ql::response_t*) at ??:?
6 [0x805e63]: rdb_query_server_t::run_query(ql::query_params_t*, ql::response_t*, signal_t*) at ??:?
7 [0xae4317]: void save_exception<void query_server_t::connection_loop<json_protocol_t>(linux_tcp_conn_t*, unsigned long, ql::query_cache_t*, signal_t*)::{lambda()#1}::operator()() const::{lambda()#1}>(std::__exception_ptr::exception_ptr*, std::string*, cond_t*, void query_server_t::connection_loop<json_protocol_t>(linux_tcp_conn_t*, unsigned long, ql::query_cache_t*, signal_t*)::{lambda()#1}::operator()() const::{lambda()#1}&&) at ??:?
8 [0xae48ba]: void query_server_t::connection_loop<json_protocol_t>(linux_tcp_conn_t*, unsigned long, ql::query_cache_t*, signal_t*)::{lambda()#1}::operator()() const at ??:?
9 [0x9f2c47]: coro_t::run() at ??:?
error: Exiting.
ping @AtnNn @marshall007 @mglukhovsky
@analytik are you able to reproduce this reliably or was there just the one-off crash?
Luckily/sadly it hasn't happened again. We got rid of the horrible broken update queries, which were the only anomaly I can think of. Does the error message hint at anything useful? Some strange racing condition in query cache?
The cluster is 5 nodes, most tables have 5 shards, 5 replicas, there was enough of memory available at the time, RethinkDB CPU usage was about 700-1000% altogether, total server load was around 400% on each server (8 cores). There wasn't any spike preceding the crash that I can see.
I see a possible race condition in the query cache which could explain this error: query_cache_t::stop_query calls terminate_internal which modifies entry->state without holding a lock (ref_t) on the entry. This can happen asynchronously while query_cache_t::ref_t::serve is running.
Fixed by #6564. (Presumably.)