Scylla: Cannot bootstrap a node in presence of Everywhere strategy tables with RBO enabled

Created on 22 Apr 2021 · 10 Comments · Source: scylladb/scylla

Commit: current master (2ad09d0bf83c5bd47b93efd3021858d69a00ab6d)

  1. Create a single-node cluster with repair-based node operations (RBO) enabled.
  2. Run in cqlsh:
     cqlsh> create keyspace ks with replication = {'class': 'EverywhereStrategy'};
     cqlsh> create table ks.t (pk int primary key);
  3. Try bootstrapping a second node.

Result (on the bootstrapping node):

INFO  2021-04-22 11:14:28,917 [shard 2] repair - Repair 30 out of 513 ranges, id=[id=3, uuid=c1c82429-eba5-4bbe-8421-9346132456e2], shard=2, keyspace=ks, table={t}, range=(7764568581638937715, 7776790348170289012]
scylla: ./seastar/include/seastar/core/gate.hh:101: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"' failed.
Aborting on shard 0.
Backtrace:
  0x24c6ffb
  0x24c6fbc
  0x2492c7d
  0x24b496c
  0x24b49ea
  0x24b49ba
  0x24b4985
  0x7fcb4144ea8f
  /lib64/libc.so.6+0x3c9e4
  /lib64/libc.so.6+0x25894
  /lib64/libc.so.6+0x25768
  /lib64/libc.so.6+0x34e75
  0xf36a61
  0x1e0c252
  0x1e0bf31
  0x1e0bddd
  0x1e0bae6
  0x1dcc587
  0x1dcb643
  0x244ed26
  0x2548656

decoded:

seastar::gate::close() at main.cc:?
repair_meta::stop() at row_level.cc:?
repair_meta::repair_row_level_stop(gms::inet_address, seastar::basic_sstring<char, unsigned int, 15u, true>, seastar::basic_sstring<char, unsigned int, 15u, true>, nonwrapping_interval<dht::token>) at row_level.cc:?
row_level_repair::run()::{lambda()#1}::operator()() const::{lambda(gms::inet_address const&)#1}::operator()(gms::inet_address const) const at row_level.cc:?
seastar::future<void> seastar::parallel_for_each<__gnu_cxx::__normal_iterator<gms::inet_address*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, row_level_repair::run()::{lambda()#1}::operator()() const::{lambda(gms::inet_address const&)#1}>(__gnu_cxx::__normal_iterator<gms::inet_address*, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, seastar::future, row_level_repair::run()::{lambda()#1}::operator()() const::{lambda(gms::inet_address const&)#1}&&) at row_level.cc:?
row_level_repair::run()::{lambda()#1}::operator()() const at row_level.cc:?
seastar::async<row_level_repair::run()::{lambda()#1}>(seastar::thread_attributes, std::decay&&, (std::decay<row_level_repair::run()::{lambda()#1}>::type&&)...)::{lambda()#1}::operator()() const at row_level.cc:?
seastar::noncopyable_function<void ()>::operator()() const at future.cc:?
seastar::thread_context::main() at thread.cc:?
Labels: Backport candidate, bug, high, repair-based-operations

All 10 comments

@slivne @asias Could someone look at this? It is blocking CDC fix.

@haaawk The issue is gone with PR https://github.com/scylladb/scylla/pull/8536. You can continue testing CDC + repair-based node ops on top of it. However, I see CDC fail with the PR:

INFO  2021-04-22 15:42:29,450 [shard 0] cdc - Inserting new generation data at UUID f5bf81d0-94a1-402d-aa65-ae2f86eb8883
INFO  2021-04-22 15:42:29,456 [shard 0] init - Shutting down storage service notifications
INFO  2021-04-22 15:42:29,456 [shard 0] init - Shutting down storage service notifications was successful
INFO  2021-04-22 15:42:29,456 [shard 0] init - Shutting down system distributed keyspace
INFO  2021-04-22 15:42:29,456 [shard 0] init - Shutting down system distributed keyspace was successful
INFO  2021-04-22 15:42:29,456 [shard 0] init - Shutting down gossiping 
...
INFO  2021-04-22 15:42:29,669 [shard 0] init - Shutting down sighup
INFO  2021-04-22 15:42:29,669 [shard 0] init - Shutting down sighup was successful
ERROR 2021-04-22 15:42:29,669 [shard 0] init - Startup failed: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 2, alive 0)
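
The final error line reports an availability check for consistency level ALL: two replicas were required and none were considered alive. As a rough illustration of that arithmetic only, here is a small C++ sketch. `check_cl_all` is a hypothetical helper written for this note, not Scylla's implementation, and it only mirrors the wording of the logged message.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not Scylla code): CL ALL needs every required
// replica to be alive, otherwise the operation is reported unavailable.
void check_cl_all(std::size_t required_replicas, std::size_t alive_replicas) {
    if (alive_replicas < required_replicas) {
        throw std::runtime_error(
            "Cannot achieve consistency level for cl ALL. Requires " +
            std::to_string(required_replicas) + ", alive " +
            std::to_string(alive_replicas));
    }
}
```

Calling `check_cl_all(2, 0)` throws with the same "Requires 2, alive 0" text as the log, while `check_cl_all(2, 2)` returns normally.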

@asias
Is it ok that the bootstrapping node self netaddr appears in row_level_repair::run master.all_nodes()?
Is it ok that it appears there twice?

Is it ok for the self ip address to be present in bootstrap_with_repair old_endpoints_in_local_dc?
And from there it gets to neighbors in the everywhere_topology case.
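
The assertion in the backtrace fires inside a per-node loop (`parallel_for_each` over a vector of `gms::inet_address`). If the same address is listed twice, a stop routine guarded by a close-once gate could plausibly run twice against the same state. The C++ sketch below illustrates that hypothesis with a toy gate. `toy_gate`, `unique_nodes`, and `stop_on_all` are hypothetical names for illustration; this is neither `seastar::gate` nor Scylla's repair code.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a close-once resource (NOT seastar::gate).
class toy_gate {
    bool _stopped = false;
public:
    void close() {
        if (_stopped) {
            throw std::logic_error("close() cannot be called more than once");
        }
        _stopped = true;
    }
};

// Hypothetical helper: drop duplicate addresses, keeping first-seen order.
std::vector<std::string> unique_nodes(const std::vector<std::string>& nodes) {
    std::vector<std::string> out;
    for (const auto& n : nodes) {
        if (std::find(out.begin(), out.end(), n) == out.end()) {
            out.push_back(n);
        }
    }
    return out;
}

// Closes the local gate once for every entry equal to `self`.
// Returns false if a second close was attempted, true otherwise.
bool stop_on_all(const std::vector<std::string>& nodes, const std::string& self) {
    toy_gate local_gate;
    try {
        for (const auto& n : nodes) {
            if (n == self) {
                local_gate.close();
            }
        }
    } catch (const std::logic_error&) {
        return false;
    }
    return true;
}
```

With a list such as `{"10.0.0.2", "10.0.0.1", "10.0.0.2"}` and `self` set to `"10.0.0.2"`, `stop_on_all` reports a double close. Passing the list through `unique_nodes` first avoids it.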

@asias

With #8536 Everywhere tables stop working completely (RBO and without RBO).

> @asias
>
> With #8536 Everywhere tables stop working completely (RBO and without RBO).

See https://github.com/scylladb/scylla/issues/8533.

I think #8536 exposed more problems with the Everywhere topology.

> @asias
> Is it ok that the bootstrapping node self netaddr appears in row_level_repair::run master.all_nodes()?

Yes.

> Is it ok that it appears there twice?

No.

> Is it ok for the self ip address to be present in bootstrap_with_repair old_endpoints_in_local_dc?

No. We have bugs with the Everywhere topology; see https://github.com/scylladb/scylla/pull/8536. That's why it showed up in the old endpoints list.

> And from there it gets to neighbors in the everywhere_topology case.

Why are you pointing to #8533? It refers to your own branch, how can you be so sure that you didn't break something on that branch with your custom code? The issue in #8533 may be with your code, not with how currently Everywhere works on master.

I tested Everywhere tables extensively without RBO and _everything worked._

> Why are you pointing to #8533?

Because it shows the problem with the Everywhere strategy.

> It refers to your own branch, how can you be so sure that you didn't break something on that branch with your custom code? The issue in #8533 may be with your code, not with how currently Everywhere works on master.

I did not say my custom code did not break anything. I was just saying I found issues.

> I tested Everywhere tables extensively without RBO and _everything worked._

I tested the latest https://github.com/scylladb/scylla/pull/8536. The Everywhere strategy read issue is gone. Repair-based node ops + CDC is working too. @kbr- @haaawk, can you try?

@asias Thanks. I checked simple bootstrapping of a few nodes and operations with CDC, and it seems to work. I'm gonna try rerunning all the tests I did with my patchset rebased on top of this PR with RBO enabled on Monday.
