Rethinkdb: Guarantee failed: [entry->state == entry_t::state_t::STREAM]

Created on 18 Apr 2017 · 5Comments · Source: rethinkdb/rethinkdb

Hi,

Got this error on one of the nodes in production, I haven't seen it before. It's hard to say if there was anything unusual going on, but besides the usual production load, we've been doing two things:

r.db("foo")
 .table("bar")
 .between(r.iso8601("2016-07-15T00:00Z"), r.iso8601("2016-08-01T00:00Z"), {"index": "created"})
 .filter(function (var127) {             return r.not(r.row.hasFields("updated"));         })
 .update(function (var128) {             return r.object("updated", r.row("created"));         })

Which was updating hundreds of thousands of docs in batches, and

r.db("ormFoo")
 .table("ormBar")
 .update({
          // 20 or so fields omitted
          "id": "a87aef60-3533-415a-9769-5b16c5959380",
          "created": r.iso8601("2017-04-18T12:46:40.462Z"),
          "updated": r.iso8601("2017-04-18T12:46:40.660Z"))

which, of course, is a horrible thing as it tries to update ~20k documents one by one, and fails on all except for the one correct ID.

Here's the crash trace:

Version: rethinkdb 2.3.5~0jessie (GCC 4.9.2)
error: Error in src/rdb_protocol/query_cache.cc at line 325:
error: Guarantee failed: [entry->state == entry_t::state_t::STREAM]
error: Backtrace:
error: Tue Apr 18 12:31:15 2017

       1 [0xae7500]: backtrace_t::backtrace_t() at ??:?
       2 [0xae7879]: format_backtrace(bool) at ??:?
       3 [0xd9f6c3]: report_fatal_error(char const*, int, char const*, ...) at ??:?
       4 [0x72ffac]: ql::query_cache_t::ref_t::serve(ql::env_t*, ql::response_t*) at ??:?
       5 [0x7323f2]: ql::query_cache_t::ref_t::fill_response(ql::response_t*) at ??:?
       6 [0x805e63]: rdb_query_server_t::run_query(ql::query_params_t*, ql::response_t*, signal_t*) at ??:?
       7 [0xae4317]: void save_exception<void query_server_t::connection_loop<json_protocol_t>(linux_tcp_conn_t*, unsigned long, ql::query_cache_t*, signal_t*)::{lambda()#1}::operator()() const::{lambda()#1}>(std::__exception_ptr::exception_ptr*, std::string*, cond_t*, void query_server_t::connection_loop<json_protocol_t>(linux_tcp_conn_t*, unsigned long, ql::query_cache_t*, signal_t*)::{lambda()#1}::operator()() const::{lambda()#1}&&) at ??:?
       8 [0xae48ba]: void query_server_t::connection_loop<json_protocol_t>(linux_tcp_conn_t*, unsigned long, ql::query_cache_t*, signal_t*)::{lambda()#1}::operator()() const at ??:?
       9 [0x9f2c47]: coro_t::run() at ??:?
error: Exiting.

bug

Source

analytik

👍1

All 5 comments

ping @AtnNn @marshall007 @mglukhovsky

v3ss0n on 19 Apr 2017

@analytik are you able to reproduce this reliably or was there just the one-off crash?

marshall007 on 19 Apr 2017

Luckily/sadly it hasn't happened again. We got rid of the horrible broken update queries, which were the only anomaly I can think of. Does the error message hint at anything useful? Some strange racing condition in query cache?

The cluster is 5 nodes, most tables have 5 shards, 5 replicas, there was enough of memory available at the time, RethinkDB CPU usage was about 700-1000% altogether, total server load was around 400% on each server (8 cores). There wasn't any spike preceding the crash that I can see.

analytik on 26 Apr 2017

I see a possible race condition in the query cache which could explain this error: query_cache_t::stop_query calls terminate_internal which modifies entry->state without holding a lock (ref_t) on the entry. This can happen asynchronously while query_cache_t::ref_t::serve is running.