Scylla: no_read_amplification_on_repair_with_mv_test dtest times out

Created on 7 Jun 2020  路  88Comments  路  Source: scylladb/scylla

Seen since https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/506/testReport/read_amplification_test/ReadAmplificationTest/no_read_amplification_on_repair_with_mv_test/
Scylla version 97cb2892b28222e680a68cdb210c5fe1770f731d
(Last successful run on master was at dtest-release #505 9d9d54c804dd400d90edfdb84c8f5319d06b1633)

Error Message
'no_read_amplification_on_repair_with_mv_test (read_amplification_test.ReadAmplificationTest)'
Stacktrace
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/nose/plugins/multiprocess.py", line 812, in run
    test(orig)
  File "/usr/local/lib/python3.7/site-packages/nose/case.py", line 46, in __call__
    return self.run(*arg, **kwarg)
  File "/usr/local/lib/python3.7/site-packages/nose/case.py", line 134, in run
    self.runTest(result)
  File "/usr/local/lib/python3.7/site-packages/nose/case.py", line 152, in runTest
    test(result)
  File "/usr/lib64/python3.7/unittest/case.py", line 693, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python3.7/unittest/case.py", line 645, in run
    testMethod()
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/read_amplification_test.py", line 40, in no_read_amplification_on_repair_with_mv_test
    self.no_read_amplification_on_repair(with_mv=True)
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/read_amplification_test.py", line 93, in no_read_amplification_on_repair
    time.sleep(3)
  File "/usr/local/lib/python3.7/site-packages/nose/plugins/multiprocess.py", line 276, in signalhandler
    raise TimedOutException()
nose.plugins.multiprocess.TimedOutException: 'no_read_amplification_on_repair_with_mv_test (read_amplification_test.ReadAmplificationTest)'

It appears like node repair does not return although node2.log indicates repair completes successfully.

bug dtest high materialized-views repair

All 88 comments

Correction, shard 1 completed successfully, but not shard 0.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/508/artifact/logs-release.2/orphaned/1591420389_read_amplification_test.ReadAmplificationTest.no_read_amplification_on_repair_with_mv_test/node2.log

INFO  2020-06-06 05:21:22,543 [shard 0] repair - Repair 317 out of 761 ranges, id=1, shard=0, keyspace=ks, table={cf}, range=(-1763854417102999149, -1732979021398154179]

(from here on only shard 1 shows progress)
...
INFO  2020-06-06 05:21:55,468 [shard 1] repair - repair id 1 on shard 1 completed successfully, keyspace=ks

I tried to reproduce the repair does not finish issue (main issue in this report). So far I haven't. I noticed that time to insert data in the test has big variation. Sometimes it took 1.5 minutes sometimes it took 6 minutes.

Another bigger issue in the test is that, the test checks streaming metrics. However with row level repair, streaming is not used any more.

Finally, I can reproduce the repair hang with the dtest with reloc binary only. Scylla commit 97cb2892b28222e680a68cdb210c5fe1770f731d. On node 2, repair on shard 0 hangs.

@denesb, in the test, there is only a single repair instance in parallel , 2 shards per node, each shard has 512MB of memory.

repair - Setting max_repair_memory=53687091, max_repair_memory_per_range=33554432, max_repair_ranges_in_parallel=1
(gdb) p/x _repair_metas_masters 
$1 = std::unordered_map with 1 element = {
  [{
    ip = {
      _addr = {
        _in_family = 0x2,
        {
          _in = {
            s_addr = 0x218007f
          },
          _in6 = {
            __in6_u = {
              __u6_addr8 = {0x7f, 0x0, 0x18, 0x2, 0x70, 0x72, 0x6f, 0x6d, 0x65, 0x74, 0x68, 0x65, 0x75, 0x73, 0x0, 0x0},
              __u6_addr16 = {0x7f, 0x218, 0x7270, 0x6d6f, 0x7465, 0x6568, 0x7375, 0x0},
              __u6_addr32 = {0x218007f, 0x6d6f7270, 0x65687465, 0x7375}
            }
          }
        },
        _scope = 0xffffffff,
        static invalid_scope = 0xffffffff
      }
    },
    repair_meta_id = 0x35d
  }] = {
    _p = 0x6000046a0800
  }
}

(gdb) p (*(seastar::shared_ptr_no_esft<repair_meta>*)0x6000046a0800)._value._state
4

State 4 means repair master stuck in negotiate_sync_boundary.

(gdb) p/x pending_reads
$4 = std::unordered_set with 1 element = {
  [0] = {
    initiator = {
      _addr = {
        _in_family = 0x2,
        {
          _in = {
            s_addr = 0x218007f
          },
          _in6 = {
            __in6_u = {
              __u6_addr8 = {0x7f, 0x0, 0x18, 0x2, 0x70, 0x72, 0x6f, 0x6d, 0x65, 0x74, 0x68, 0x65, 0x75, 0x73, 0x0, 0x0},
              __u6_addr16 = {0x7f, 0x218, 0x7270, 0x6d6f, 0x7465, 0x6568, 0x7375, 0x0},
              __u6_addr32 = {0x218007f, 0x6d6f7270, 0x65687465, 0x7375}
            }
          }
        },
        _scope = 0xffffffff,
        static invalid_scope = 0xffffffff
      }
    },
    repair_meta_id = 0x35d,
    read_id = 0x3cb
  }
}



md5-5d812f99dd04549c219db02f9650f8bb



(gdb) scylla databases 
    0 (database*)0x600003e46010
    1 (database*)0x601003c46010

(gdb)  p (*(database*)0x600003e46010)._streaming_concurrency_sem

$2 = {
  _resources = {
    count = 10,
    memory = -2416875
  },
  _wait_list = {
    _front = {
      _M_t = {
        <std::__uniq_ptr_impl<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry> >> = {
          _M_t = {
            <std::_Tuple_impl<0, seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry*, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry> >> = {
              <std::_Tuple_impl<1, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry> >> = {
                <std::_Head_base<1, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry>, true>> = {
                  <std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry>> = {<No data fields>}, <No data fields>}, <No data fields>}, 
              <std::_Head_base<0, seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry*, false>> = {
                _M_head_impl = 0x60000328da80
              }, <No data fields>}, <No data fields>}
        }, <No data fields>}
    },
    _list = {
      _front_chunk = 0x600004cd0000,
      _back_chunk = 0x600004cd0000,
      _nchunks = 1,
      _free_chunks = 0x0,
      _nfree_chunks = 0
    },
    _on_expiry = {
      _semaphore_name = "_streaming_concurrency_sem"
    },
    _size = 2
  },
  _name = "_streaming_concurrency_sem",
  _max_queue_length = 18446744073709551615,
  _prethrow_action = {
    <std::_Maybe_unary_or_binary_function<void>> = {<No data fields>}, 
    <std::_Function_base> = {
      static _M_max_size = 16,
      static _M_max_align = 8,
      _M_functor = {
        _M_unused = {
          _M_object = 0x600003e46010,
          _M_const_object = 0x600003e46010,
          _M_function_pointer = 0x600003e46010,
          _M_member_pointer = (void (std::_Undefined_class::*)(std::_Undefined_class * const)) 0x600003e46010, this adjustment 110367774633583
        },
        _M_pod_data = "\020`\344\003\000`\000\000on\000\000ad\000"
      },
      _M_manager = 0x0
    }, 
    members of std::function<void()>:
    _M_invoker = 0xc468a0 <std::_Function_handler<void(), database::database(const db::config&, database_config, service::migration_notifier&, gms::feature_service&, locator::token_metadata&, seastar::abort_source&)::<lambda()> >::_M_invoke(const std::_Any_data &)>
  },
--Type <RET> for more, q to quit, c to continue without paging--
  _next_id = 1,
  _inactive_reads = {
    _M_t = {
      _M_impl = {
        <std::allocator<std::_Rb_tree_node<std::pair<unsigned long const, std::unique_ptr<reader_concurrency_semaphore::inactive_read, std::default_delete<reader_concurrency_semaphore::inactive_read> > > > >> = {
          <__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<unsigned long const, std::unique_ptr<reader_concurrency_semaphore::inactive_read, std::default_delete<reader_concurrency_semaphore::inactive_read> > > > >> = {<No data fields>}, <No data fields>}, 
        <std::_Rb_tree_key_compare<std::less<unsigned long> >> = {
          _M_key_compare = {
            <std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}
        }, 
        <std::_Rb_tree_header> = {
          _M_header = {
            _M_color = std::_S_red,
            _M_parent = 0x0,
            _M_left = 0x600003e46f50,
            _M_right = 0x600003e46f50
          },
          _M_node_count = 0
        }, <No data fields>}
    }
  },
  _inactive_read_stats = {
    permit_based_evictions = 0,
    population = 0
  }
}

(gdb)  p (*(database*)0x601003c46010)._streaming_concurrency_sem

$3 = {
  _resources = {
    count = 10,
    memory = 10737418
  },
  _wait_list = {
    _front = {
      _M_t = {
        <std::__uniq_ptr_impl<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry> >> = {
          _M_t = {
            <std::_Tuple_impl<0, seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry*, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry> >> = {
              <std::_Tuple_impl<1, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry> >> = {
                <std::_Head_base<1, std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry>, true>> = {
                  <std::default_delete<seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry>> = {<No data fields>}, <No data fields>}, <No data fields>}, 
              <std::_Head_base<0, seastar::expiring_fifo<reader_concurrency_semaphore::entry, reader_concurrency_semaphore::expiry_handler, seastar::lowres_clock>::entry*, false>> = {
                _M_head_impl = 0x0
              }, <No data fields>}, <No data fields>}
        }, <No data fields>}
    },
    _list = {
      _front_chunk = 0x0,
      _back_chunk = 0x0,
      _nchunks = 0,
      _free_chunks = 0x0,
      _nfree_chunks = 0
    },
    _on_expiry = {
      _semaphore_name = "_streaming_concurrency_sem"
    },
    _size = 0
  },
  _name = "_streaming_concurrency_sem",
  _max_queue_length = 18446744073709551615,
  _prethrow_action = {
    <std::_Maybe_unary_or_binary_function<void>> = {<No data fields>}, 
    <std::_Function_base> = {
      static _M_max_size = 16,
      static _M_max_align = 8,
      _M_functor = {
        _M_unused = {
          _M_object = 0x601003c46010,
          _M_const_object = 0x601003c46010,
          _M_function_pointer = 0x601003c46010,
          _M_member_pointer = (void (std::_Undefined_class::*)(std::_Undefined_class * const)) 0x601003c46010, this adjustment 720686308153912943
        },
        _M_pod_data = "\020`\304\003\020`\000\000on\000\000ad\000\n"
      },
      _M_manager = 0x0
    }, 
    members of std::function<void()>:
    _M_invoker = 0xc468a0 <std::_Function_handler<void(), database::database(const db::config&, database_config, service::migration_notifier&, gms::feature_service&, locator::token_metadata&, seastar::abort_source&)::<lambda()> >::_M_invoke(const std::_Any_data &)>
  },

--Type <RET> for more, q to quit, c to continue without paging--
  _next_id = 1,
  _inactive_reads = {
    _M_t = {
      _M_impl = {
        <std::allocator<std::_Rb_tree_node<std::pair<unsigned long const, std::unique_ptr<reader_concurrency_semaphore::inactive_read, std::default_delete<reader_concurrency_semaphore::inactive_read> > > > >> = {
          <__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<unsigned long const, std::unique_ptr<reader_concurrency_semaphore::inactive_read, std::default_delete<reader_concurrency_semaphore::inactive_read> > > > >> = {<No data fields>}, <No data fields>}, 
        <std::_Rb_tree_key_compare<std::less<unsigned long> >> = {
          _M_key_compare = {
            <std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}
        }, 
        <std::_Rb_tree_header> = {
          _M_header = {
            _M_color = std::_S_red,
            _M_parent = 0x0,
            _M_left = 0x601003c46f50,
            _M_right = 0x601003c46f50
          },
          _M_node_count = 0
        }, <No data fields>}
    }
  },
  _inactive_read_stats = {
    permit_based_evictions = 0,
    population = 0
  }
}

  _resources = {
    count = 10,
    memory = -2416875
  },

This is strange. There are no active readers, yet there is no memory available. Can you run this gdb script on the shard where the memory of the semaphore is negative?

def restricted_readers():
    for obj_addr, vtable_addr in find_vptrs():
        vtable_pfx = 'vtable for restricting_mutation_reader'
        name = resolve(vtable_addr, startswith=vtable_pfx)
        if name is None:
            continue

        rr = gdb.parse_and_eval('$downcast_vptr({})'.format(int(obj_addr)))
        schema = schema_ptr(rr['_schema'])
        state = std_variant(rr['_state'])
        index = state.index()
        if index == 0:
            semaphore = rr['_semaphore']['_name']
        else:
            semaphore = 'Unknown'

        gdb.write('0x{:016x} {:30} {} {} \n'.format(int(obj_addr), str(semaphore), index, schema.table_name()))

Just copy this into scylla-gdb.py, source it in gdb, then run it with:

(gdb) python restricted_readers()

BTW do you happen to have any of the permit-dropping patches I developed for https://github.com/scylladb/scylla/issues/6272 applied?

BTW do you happen to have any of the permit-dropping patches I developed for #6272 applied?

No. I did not include your patches.

python restricted_readers()

(gdb) python restricted_readers()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "~/src/cloudius-systems/scylla/scylla-gdb.py", line 3704, in restricted_readers
gdb.error: There is no member or method named _semaphore.
Error while executing Python code.
def restricted_readers():
    for obj_addr, vtable_addr in find_vptrs():
        vtable_pfx = 'vtable for restricting_mutation_reader'
        name = resolve(vtable_addr, startswith=vtable_pfx)
        if name is None:
            continue

        rr = gdb.parse_and_eval('$downcast_vptr({})'.format(int(obj_addr)))
        schema = schema_ptr(rr['_schema'])
        state = std_variant(rr['_state'])
        index = state.index()
        if index == 0:
            semaphore = state.get()['_semaphore']['_name']
        else:
            semaphore = 'Unknown'

        gdb.write('0x{:016x} {:30} {} {} \n'.format(int(obj_addr), str(semaphore), index, schema.table_name()))

Try this.

def restricted_readers():
    for obj_addr, vtable_addr in find_vptrs():
        vtable_pfx = 'vtable for restricting_mutation_reader'
        name = resolve(vtable_addr, startswith=vtable_pfx)
        if name is None:
            continue

        rr = gdb.parse_and_eval('$downcast_vptr({})'.format(int(obj_addr)))
        schema = schema_ptr(rr['_schema'])
        state = std_variant(rr['_state'])
        index = state.index()
        if index == 0:
            semaphore = state.get()['_semaphore']['_name']
        else:
            semaphore = 'Unknown'

        gdb.write('0x{:016x} {:30} {} {} \n'.format(int(obj_addr), str(semaphore), index, schema.table_name()))

Try this.

(gdb) python restricted_readers_v2()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "~/src/cloudius-systems/scylla/scylla-gdb.py", line 3722, in restricted_readers_v2
gdb.error: There is no member or method named _semaphore.
Error while executing Python code.

I added this

3721         if index == 0:
3722             print(state.get())
3723             semaphore = state.get()['_semaphore']['_name']

to print state directly.

{
  reader_factory = {
    _ms = {
      _fn = {
        _p = 0x6000035aade8
      },
      _presence_checker_factory = {
        _p = 0x6000035aa140
      }
    },
    _s = {
      _p = 0x600003a42380
    },
    _permit = {
      _semaphore = 0x600003e46eb0
    },
    _range = {
      _M_data = 0x6000035c1ee8
    },
    _slice = {
      _M_data = 0x6000035c1e48
    },
    _pc = {
      _M_data = 0x7fd2b4e23308
    },
    _trace_state = {
      _state_ptr = {
        _p = 0x0
      }
    },
    _fwd = {
      _value = false,
      static yes = {
        _value = true,
        static yes = <same as static member of an already seen type>,
        static no = {
          _value = false,
          static yes = <same as static member of an already seen type>,
          static no = <same as static member of an already seen type>
        }
      },
      static no = <same as static member of an already seen type>
    },
    _fwd_mr = {
      _value = false,
      static yes = {
        _value = true,
        static yes = <same as static member of an already seen type>,
        static no = {
          _value = false,
          static yes = <same as static member of an already seen type>,
          static no = <same as static member of an already seen type>
        }
      },
      static no = <same as static member of an already seen type>
    }
--Type <RET> for more, q to quit, c to continue without paging--
  }
}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "~/src/cloudius-systems/scylla/scylla-gdb.py", line 3723, in restricted_readers_v2
gdb.error: There is no member or method named _semaphore.
Error while executing Python code.

With:

3710 def restricted_readers_v2():
3711     for obj_addr, vtable_addr in find_vptrs():
3712         vtable_pfx = 'vtable for restricting_mutation_reader'
3713         name = resolve(vtable_addr, startswith=vtable_pfx)
3714         if name is None:
3715             continue
3716 
3717         rr = gdb.parse_and_eval('$downcast_vptr({})'.format(int(obj_addr)))
3718         schema = schema_ptr(rr['_schema'])
3719         state = std_variant(rr['_state'])
3720         index = state.index()
3721         if index == 0:
3722             #print(state.get())
3723             #semaphore = state.get()['_semaphore']['_name']
3724             semaphore = state.get()['reader_factory']['_permit']['_semaphore']['_name']
3725         else:
3726             semaphore = 'Unknown'
3727 
3728         gdb.write('0x{:016x} {:30} {} {} \n'.format(int(obj_addr), str(semaphore), index, schema.table_name()))
(gdb) python restricted_readers_v2()
0x00006000006813c0 "_streaming_concurrency_sem"   0 "ks"."cf" 
0x00006000037c65a0 "_streaming_concurrency_sem"   0 "ks"."cf"

Sorry it has to be:

semaphore = state.get()['reader_factory']['_permit']['_semaphore']['_name']

In the else branch, you can put:

semaphore = state.get()['units']['_semaphore']['_name']

Can you please generate a core and upload somewhere I can access it?
This is very strange, I'd like to have a look.

Sorry it has to be:

semaphore = state.get()['reader_factory']['_permit']['_semaphore']['_name']

In the else branch, you can put:

semaphore = state.get()['units']['_semaphore']['_name']

After changing the else branch:

(gdb) python restricted_readers_v2()
0x00006000006813c0 "_streaming_concurrency_sem"   0 "ks"."cf" 
0x00006000037c65a0 "_streaming_concurrency_sem"   0 "ks"."cf" 

Can you please generate a core and upload somewhere I can access it?
This is very strange, I'd like to have a look.

I will give you both the core and the reloc package.

Here it is:

gs://scratch.scylladb.com/asias/repair_hang_6603_for_botond

This is another case of https://github.com/scylladb/scylla/commit/0c6bbc84cd30df1c46a6cd81d87170b99d5a8144 uncovering a a bug that used to hide in dark corners before.

Initially I suspected a permit leak, there are no active restricted readers, but a lot of memory is consumed from the streaming concurrency semaphore. Searching for references to the semaphore revealed 1468 references to it. Digging deeper, I followed the references back to 3 combined mutation readers, one of them combining a whopping 207 sstable readers. So the amount of memory these three are consuming indirectly seems legit. Following references further I found that these readers are owned by a seastar thread. The thread is already joined, but its _done task is part of this fiber:

0x600000640a40: 0x4aa5a8 <_ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIJEEEZZNS_6futureIJEE14then_impl_nrvoIZNS_5asyncIZN2db4view21view_update_generator5startEvEUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSE_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSF_DpOSI_EUlvE0_S5_EET0_SS_ENKUlvE_clEvEUlRS3_ONS_12future_stateIJEEEE_JEEE+16>
(gdb) scylla fiber 0x600000640a40
Starting task: (task*) 0x0000600000640a40 0x00000000004aa5a8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<>, seastar::future<>::then_impl_nrvo<seastar::async<db::view::view_update_generator::start()::{lambda()#1}>(seastar::thread_attributes, std::decay&&, (std::decay<db::view::view_update_generator::start()::{lambda()#1}>::type&&)...)::{lambda()#2}, seastar::future<> >(seastar::thread_attributes)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, seastar::future_state<>&&)#1}> + 16 
#0  (task*) 0x0000600003302968 0x00000000004aa5c0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<>, seastar::future<>::then_wrapped_nrvo<seastar::future<>, seastar::future<>::finally_body<seastar::async<db::view::view_update_generator::start()::{lambda()#1}>(seastar::thread_attributes, std::decay&&, (std::decay<db::view::view_update_generator::start()::{lambda()#1}>::type&&)...)::{lambda()#3}, false> >(seastar::future<>::finally_body<seastar::async<db::view::view_update_generator::start()::{lambda()#1}>(seastar::thread_attributes, std::decay&&, (std::decay<db::view::view_update_generator::start()::{lambda()#1}>::type&&)...)::{lambda()#3}, false>&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, seastar::future_state<>&&)#1}> + 16 

The readers seems to be part of a view update generator, which explains how we can have sstable readers not going through a restricted reader. Reads executed on behalf of view update generation are not restricted, but since https://github.com/scylladb/scylla/commit/0c6bbc84cd30df1c46a6cd81d87170b99d5a8144 they correctly register their memory usage with the streaming read concurrency semaphore. In this test they happen to consume so much memory, that there is none left for repair, so the repair readers are not admitted.

We will need to think about whether this is acceptable or whether we need to restrict reads executed on behalf of view update generation.

/cc @nyh, @psarna

@denesb I'm not an expert in those permit details, nor of the view update generator (@psarna is the bigger expert there and recently fixed bugs there) so I have some questions / comments.

First, the observation that we are reading 207 sstables concurrently. This is bad. For example, compaction always limits the number of sstables read together, because we know that reading 207 sstables together is a bad idea. So why should the view update generator read so many at once, and why didn't we consider that a bad idea too? If understand correctly, we get all these separate sstables from repair - because separate token ranges etc. create separate sstables. But is it strictly necessary to build all the view updates together? @psarna I seem to remember you recently explained to me that we need these updates to happen atomically in some sense, not in steps, but I don't remember the details. In the past not only did we not have this "staging" directory collecting 207 sstables - we actually created each view update at a time - what was wrong with that?

Second, Botond, you said that even after the view building was done, the readers remained. Why is this? Is there something we can do to have these sstable readers disappear once we finished reading them? When we create staging_sstable_reader, we say forwarding::no, and we only consume from it once. What leads the readers to think they should stay behind - and cause the repair readers to be blocked indefinitely? If I understand correctly, this is not a classic deadlock - the repair is waiting for the staging_sstable_reader to finish reading and leave memory, but staging_sstable_reader doesn't (IIUC) wait for repair to finish. So if it's not a deadlock why do we have a hang? Is it just a bug that we forget to free the memory used by staging_sstable_reader?

Second, Botond, you said that even after the view building was done, the readers remained. Why is this?

I didn't say the view building was done, I said thread::join() was called, which just means that somebody is waiting on it. This may have been an irrelevant and confusing piece of information.

Is there something we can do to have these sstable readers disappear once we finished reading them? When we create staging_sstable_reader, we say forwarding::no, and we only consume from it once. What leads the readers to think they should stay behind - and cause the repair readers to be blocked indefinitely?

Again, I don't think the readers are staying around indefinitely. To be honest I don't know how long repair has been waiting on these readers, I received a core to look at, which doesn't contain this information.

If I understand correctly, this is _not_ a classic deadlock - the repair is waiting for the staging_sstable_reader to finish reading and leave memory, but staging_sstable_reader doesn't (IIUC) wait for repair to finish. So if it's not a deadlock why do we have a hang? Is it just a bug that we forget to free the memory used by staging_sstable_reader?

I think we don't have a hang or deadlock here at all, repair is just waiting on the view building and people were suprised by this new wait and though that something must have hung or deadlocked. We also don't forget to free memory. What I think the bug is -- and sorry for just throwing the "bug" word out there without explaining -- is that reads done on behalf of view generation are not restricted. They can start reading when they please, regardless of how many resourced dedicated to the streaming class (to which they belong) are available. By doing this they might starve repair. Consider the very relevant example of a store where only 10 people can be inside at any time and a queue formed outside with people who want to enter. Now if there is a stream of other people who are cutting the line and entering the store regardless of how many people are already inside, the ones standing politely in the queue will be left to wait endlessly.

Again, I don't think the readers are staying around indefinitely. To be honest I don't know how long repair has been waiting on these readers, I received a core to look at, which doesn't contain this information.

Normally, the test took 480 seconds. When it hangs, it does not finish in around 50 minutes.

First, the observation that we are reading 207 sstables concurrently. This is bad. For example, compaction always limits the number of sstables read together, because we know that reading 207 sstables together is a bad idea. So why should the view update generator read so many at once, and why didn't we consider that a bad idea too? If understand correctly, we get all these separate sstables from repair - because separate token ranges etc. create separate sstables. But is it strictly necessary to build all the view updates _together_? @psarna I seem to remember you recently explained to me that we need these updates to happen atomically in some sense, not in steps, but I don't remember the details. In the past not only did we not have this "staging" directory collecting 207 sstables - we actually created each view update at a time - what was wrong with that?

Two more notes on reading 207 sstables at the same time:

  1. Shouldn't all these tables have disjoint token ranges, so we can open and close them as needed instead opening all of them at once? Or are we lacking this feature (I'm not up to date there...).
  2. If there is no other alternative (I still hope there is) and we really need to read hundreds of sstables in parallel, another option is to reduce their number via compaction. As I noted earlier, this is what compaction itself does - it never compacts 200 tables into one, it first compacts groups of 16, and then continues to compact the result of these smaller compactions.

@nyh / @psarna / @eliransin I don;t understand view building is not restricted ?

@nyh / @psarna / @eliransin I don;t understand view building is not restricted ?

Unfortunately, I'm not sure how it ought to be "restricted". The existing staging-update code has:

                    flat_mutation_reader staging_sstable_reader = ::make_range_sstable_reader(s,
                            _db.make_query_class_config().semaphore.make_permit(),
                            std::move(ssts),
                            query::full_partition_range,
                            s->full_slice(),
                            service::get_local_streaming_priority(),
                            nullptr,
                            ::streamed_mutation::forwarding::no,
                            ::mutation_reader::forwarding::no);

Is this make_permit() thing good enough? Does it make sense to "make" a single permit and then read 207 sstables in parallel? I need advice from people who are more familiar with the permit mechanism on what's actually wrong in this code.

Two more notes on reading 207 sstables at the same time:

1. Shouldn't all these tables have disjoint token ranges, so we can open and close them as needed instead opening all of them at once? Or are we lacking this feature (I'm not up to date there...).

2. If there is no other alternative (I still hope there is) and we really need to read hundreds of sstables in parallel, another option is to reduce their number via compaction. As I noted earlier, this is what compaction itself does - it never compacts 200 tables into one, it first compacts groups of 16, and then continues to compact the result of these smaller compactions.

The core was generated from node2 (the repair master, 3 repair followers). node2 receives data from 2 nodes (node 1 and node3). node2 generates 1 sstables per vnode range per peer nodes. The sstables generated for each peer node is disjoint. But sstables for the same range but came from different nodes are not disjoint.

In the tests, there were 803 ranges to repair. On each shard, there was 803 sstables from node1 and 803 sstables from node3.

@nyh / @psarna / @eliransin I don;t understand view building is not restricted ?

Unfortunately, I'm not sure how it ought to be "restricted". The existing staging-update code has:

                    flat_mutation_reader staging_sstable_reader = ::make_range_sstable_reader(s,
                            _db.make_query_class_config().semaphore.make_permit(),
                            std::move(ssts),
                            query::full_partition_range,
                            s->full_slice(),
                            service::get_local_streaming_priority(),
                            nullptr,
                            ::streamed_mutation::forwarding::no,
                            ::mutation_reader::forwarding::no);

Is this make_permit() thing good enough? Does it make sense to "make" a single permit and then read 207 sstables in parallel? I need advice from people who are more familiar with the permit mechanism on what's actually wrong in this code.

make_permit() is not good enough, making a permit just ensures that used resources will be accounted for, but the reader will not magically wait for admission. make_range_sstable_reader() bypasses the restriction mechanism, table::make_sstable_reader() is the one that waits for admission before starting the actual reading.
Its fine to use a single permit for any number of actual sstable reads, the permit is per logical read, not per actual sstable read. We do have deficiencies around estimating the amount of resources that the read will use (https://github.com/scylladb/scylla/issues/4297 and https://github.com/scylladb/scylla/issues/4295).

@denesb I'm out of my depths here... Some questions:

  1. If "make_range_sstable_reader()" doesn't do the right thing, should we attempt to hide it and audit and fix all the places it is used, not just this one?
  2. If I understand correctly, if we make this specific code wait for the permit, all that would change is that the repair-caused view updates will sometimes block. But how does that help anything? I mean, the view updates already block - it happens when they have nothing to do - i.e., when the repair is not making any progress, no data is collected for view updates. The view updates cannot monopolize the permits because they don't have endless amount of work to do - they only do work that the repair previously generated for them. So I'm worried we still don't understand the real problem. I have the feeling it is not that this code is not blocking when it should.

The core was generated from node2 (the repair master, 3 repair followers). node2 receives data from 2 nodes (node 1 and node3). node2 generates 1 sstables per vnode range per peer nodes. The sstables generated for each peer node is disjoint. But sstables for the same range but came from different nodes are _not_ disjoint.

In the tests, there were 803 ranges to repair. On each shard, there was 803 sstables from node1 and 803 sstables from node3.

Doesn't this mean that while scanning this data, although we have 1606 (!) sstables, at any single point (i.e., token) during our scan we just need two reading sstables, because all the others are disjoint? Botond reported seeing 207 reading sstables, and I'm wondering why. I thought that long ago we added code which only starts reading from an sstable when its token range becomes relevant, and stops reading from it immediately when its token range becomes irrelevant. Or did we?

@denesb I'm out of my depths here... Some questions:

1. If "make_range_sstable_reader()" doesn't do the right thing, should we attempt to hide it and audit and fix all the places it is used, not just this one?

make_range_sstable_reader() is used by compaction (and maybe more, we should indeed check) which has its own ways to restrict concurrency. Also going through the semaphore might cause some of the reads belonging to a compaction to be admitted and some not, which is highly undesirable. So I think we should keep this function, but we should rename it. Despite what their name suggests the only difference between make_range_sstable_reader() and table::make_sstable_reader() is whether the read is restricted or not (there used to be more differences but they are gone now). So we should rename these two methods to make_unrestricted_sstable_reader() and make_restricted_sstable_reader() respectively to reflect this.

The core was generated from node2 (the repair master, 3 repair followers). node2 receives data from 2 nodes (node 1 and node3). node2 generates 1 sstables per vnode range per peer nodes. The sstables generated for each peer node is disjoint. But sstables for the same range but came from different nodes are _not_ disjoint.
In the tests, there were 803 ranges to repair. On each shard, there was 803 sstables from node1 and 803 sstables from node3.

Doesn't this mean that while scanning this data, although we have 1606 (!) sstables, at any single point (i.e., token) during our scan we just need _two_ reading sstables, because all the others are disjoint? Botond reported seeing 207 reading sstables, and I'm wondering why. I thought that long ago we added code which only starts reading from an sstable when its token range becomes relevant, and stops reading from it immediately when its token range becomes irrelevant. Or did we?

In practice, it is less than 1606 because not all ranges contain data the repair master needed.

2. If I understand correctly, if we make this specific code wait for the permit, all that would change is that the repair-caused view updates will sometimes block. But how does that help anything? I mean, the view updates already block - it happens when they have nothing to do - i.e., when the repair is not making any progress, no data is collected for view updates. The view updates cannot monopolize the permits because they don't have endless amount of work to do - they only do work that the repair previously generated for them. So I'm worried we still don't understand the real problem. I have the feeling it is not that this code is not blocking when it should.

This is a very good question. Indeed just changing make_range_sstable_reader() to table::make_sstable_reader() won't magically make this problem go away. Restricting the view readers won't make the competition for resources go away, it will just make it more explicit. AFAIK view building is driven by repair. Repair creates staging sstables which it registers with view building. View building then picks up these sstables (how exactly, does it use batching?) and generates view updates for them. So if I understand correctly, view update generation and repair is interleaved. Making the view readers restricted could possibly cause a situation, where repair grinds to a halt because view update readers took up all the available resources, this is especially likely as repair generates lots of sstables that then take a lot of memory to read. This is undesirable. What we should do instead is to defer view update building to after repair has finished and also after the generated sstables have been compacted off-strategy so reading them is cheaper. This way view building can block another repair which was started after the previous one has finished. This is actually desirable, we don't want our compaction and view-update backlog to grow, before we've dealt with the one generated by a previous repair.

The core was generated from node2 (the repair master, 3 repair followers). node2 receives data from 2 nodes (node 1 and node3). node2 generates 1 sstables per vnode range per peer nodes. The sstables generated for each peer node is disjoint. But sstables for the same range but came from different nodes are _not_ disjoint.
In the tests, there were 803 ranges to repair. On each shard, there was 803 sstables from node1 and 803 sstables from node3.

Doesn't this mean that while scanning this data, although we have 1606 (!) sstables, at any single point (i.e., token) during our scan we just need _two_ reading sstables, because all the others are disjoint? Botond reported seeing 207 reading sstables, and I'm wondering why. I thought that long ago we added code which only starts reading from an sstable when its token range becomes relevant, and stops reading from it immediately when its token range becomes irrelevant. Or did we?

We indeed have this code, but it is only used for LCS and ICS(?). We don't use it for compaction strategies which don't normally keep disjoint sstables. We could use it in this case, but better yet would be to do an off-strategy compaction on these sstables before we start generating view updates.

Doesn't this mean that while scanning this data, although we have 1606 (!) sstables, at any single point (i.e., token) during our scan we just need _two_ reading sstables, because all the others are disjoint? Botond reported seeing 207 reading sstables, and I'm wondering why. I thought that long ago we added code which only starts reading from an sstable when its token range becomes relevant, and stops reading from it immediately when its token range becomes irrelevant. Or did we?

We indeed have this code, but it is only used for LCS and ICS(?). We don't use it for compaction strategies which don't normally keep disjoint sstables. We could use it in this case, but better yet would be to do an off-strategy compaction on these sstables before we start generating view updates.

So I think beyond all the rest of the ideas raised in this thread about permits and why we even need to read all the "staging" sstables at once - we found a concrete bug. Repair is known to create hundreds of sstables with disjoint ranges, yet we read them without making use of code we have exactly for this case. I'll create separate issue about that.

The core was generated from node2 (the repair master, 3 repair followers). node2 receives data from 2 nodes (node 1 and node3). node2 generates 1 sstables per vnode range per peer nodes. The sstables generated for each peer node is disjoint. But sstables for the same range but came from different nodes are _not_ disjoint.
In the tests, there were 803 ranges to repair. On each shard, there was 803 sstables from node1 and 803 sstables from node3.

Doesn't this mean that while scanning this data, although we have 1606 (!) sstables, at any single point (i.e., token) during our scan we just need _two_ reading sstables, because all the others are disjoint? Botond reported seeing 207 reading sstables, and I'm wondering why. I thought that long ago we added code which only starts reading from an sstable when its token range becomes relevant, and stops reading from it immediately when its token range becomes irrelevant. Or did we?

We indeed have this code, but it is only used for LCS and ICS(?). We don't use it for compaction strategies which don't normally keep disjoint sstables. We could use it in this case, but better yet would be to do an off-strategy compaction on these sstables before we start generating view updates.

Cc @raphaelsc

Doesn't this mean that while scanning this data, although we have 1606 (!) sstables, at any single point (i.e., token) during our scan we just need _two_ reading sstables, because all the others are disjoint? Botond reported seeing 207 reading sstables, and I'm wondering why. I thought that long ago we added code which only starts reading from an sstable when its token range becomes relevant, and stops reading from it immediately when its token range becomes irrelevant. Or did we?

We indeed have this code, but it is only used for LCS and ICS(?). We don't use it for compaction strategies which don't normally keep disjoint sstables. We could use it in this case, but better yet would be to do an off-strategy compaction on these sstables before we start generating view updates.

So I think beyond all the rest of the ideas raised in this thread about permits and why we even need to read all the "staging" sstables at once - we found a concrete bug. Repair is known to create hundreds of sstables with disjoint ranges, yet we read them without making use of code we have exactly for this case. I'll create separate issue about that.

Note that this still doesn't solve the starving of repair readers. View generation can create readers during normal writes as well, that can run parallel to repair, yet these readers will still snatch the semaphore's resources from repair. It is a good question whether it would be acceptable to have such view update generation be queued up behind repair? The fact of the matter is that view update generation readers are unrestricted and can consume an unbounded amount of memory as of now. I think we will have to restrict them one way or another, but this will open a whole new can of worms, as these view updates are generated in the background, so if we suddenly restrict their concurrency, they might start queueing up without being able to provide back-pressure to the source of the writes.

I check how many sstables passed to staging_sstable_reader in the test.


@@ -44,6 +48,8 @@ future<> view_update_generator::start() {
                 // Copy what we have so far so we don't miss new updates
                 auto sstables = std::exchange(_sstables_with_tables[t], {});

+                vug_logger.info("Handle keyspace={}, table={}, sstables_nr={}", s->ks_name(), s->cf_name(), sstables.size());
+
                 try {



$ cat ~/.dtest/dtest-soo97tmf/test/node2/logs/system.log|grep sstables_nr|wc -l
549914


cat ~/.dtest/dtest-soo97tmf/test/node2/logs/system.log|grep sstables_nr|grep -v 'sstables_nr=0'|grep -v 'sstables_nr=1$'|less

INFO  2020-06-29 08:57:28,319 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=5
INFO  2020-06-29 08:57:29,469 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=9
INFO  2020-06-29 08:57:30,356 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=29
INFO  2020-06-29 08:57:30,762 [shard 1] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:57:33,043 [shard 1] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:57:34,690 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=15
INFO  2020-06-29 08:57:36,150 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=12
INFO  2020-06-29 08:57:37,242 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=15
INFO  2020-06-29 08:57:42,240 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=10
INFO  2020-06-29 08:57:43,602 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=64
INFO  2020-06-29 08:57:51,250 [shard 1] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:57:54,317 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=88
INFO  2020-06-29 08:58:05,138 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:05,372 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:05,972 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=9
INFO  2020-06-29 08:58:06,825 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=59
INFO  2020-06-29 08:58:10,330 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=155
INFO  2020-06-29 08:58:19,902 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=112
INFO  2020-06-29 08:58:24,825 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:24,968 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=3
INFO  2020-06-29 08:58:25,096 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:25,218 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:25,313 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:25,418 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:25,525 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:25,743 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=4
INFO  2020-06-29 08:58:25,873 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=2
INFO  2020-06-29 08:58:26,237 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=5
INFO  2020-06-29 08:58:26,500 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=8
INFO  2020-06-29 08:58:26,939 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=29
INFO  2020-06-29 08:58:27,993 [shard 0] view_update_generator - Handle keyspace=ks, table=cf, sstables_nr=44
Was this page helpful?
0 / 5 - 0 ratings