Scylla: Post-repair view updates keep too many sstable readers open

Created on 24 Jun 2020 · 14Comments · Source: scylladb/scylla

This is a spinoff from issue #6603.

During repair, the repair master creates hundreds of small sstables - one per each token range and remote replica - and puts them in a staging directory. It the table has a materialized view, later the code in db/view/view_update_generator.cc reads all these sstables together to generate view updates for this new data. It reads these sstables together using make_range_sstable_reader().

We know that during this scan, for each token position there are at most two (RF-1) relevant sstables. However, it turns out that make_range_sstable_reader() doesn't take advantage of this and still keeps readers open on all the hundreds of sstables, which uses a lot of memory.

We do have code that knows to only start reading an sstable when its token range becomes relevant, but as @denesb explained in #6603 it is only used in specific compaction strategies (ICS and LCS) and is not used generally by make_range_sstable_reader().

We should either make make_range_sstable_reader() always use this optimization (if it can), or, alternatively, we should have another reader function (or use one that already exists?) which explicitly asks for this optimization. We know it will be worthwhile in this case.

By the way, we should also consider what happens without materialized views. In that case, we move the hundreds of tables from the staging directory to the root directory, and we're supposed to start compacting them. In that case too, we could optimize the compaction to read all the sstables regardless of the usual tendency of the compaction strategy to only, say, compact 16 at a time.

materialized-views repair

Source

nyh

Most helpful comment

I just realized that we always use the incremental selector for the combining reader when reading from sstables. So whether this optimization can be made use of depends on the sstable set created. It might be enough to do something as simple as changing the code below to always create an ordered set, instead of whatever is specific of the compaction strategy (which is not relevant as the staging sstables don't adhere to its invariant anyway):
https://github.com/scylladb/scylla/blob/bbd3ed9d47d70207dc122bdd61cd69aaf3ec1a8a/db/view/view_update_generator.cc#L54-L60

denesb on 1 Jul 2020

😄2 🎉1 👍1

All 14 comments

I think the decision on whether this optimization can be used or not should happen above make_range_sstable_reader(), in the layer which actually knows this. There is really no way to find out whether you can use this optimization or not reliably. The way this optimization works is that you arrange your sstables in a special partitioned_sstable_set. The thing is you can create this set from any sstable set, but for most cases it will have huge overlaps between sstables and you won't be able to make use of it and hence you just wasted a bunch of time building your interval map.

denesb on 25 Jun 2020

Earlier I had a chat with @psarna, @denesb, @slivne and others about this issue.

It turns out that this is a regression starting in commit 71ac6ebcc52d8c451fa6a9550691a8f0b4a65040 (and perhaps other related commits). In the past, we built the view updates for each sstable separately. But since that commit by @glommer, we have a background fiber which once in a while grabs several sstables from the staging directory (placed there by the repair), and builds the view updates for all of them together (note that the "building view updates" part is only relevant for tables with views - in normal tables this only moves the tables)
.
The intention (stated in that commit) was that we shouldn't move (or view-build) these many small tables, but that we should compact them to fewer tables first. This would have solved the present issue - the compaction should be done efficiently (we should force it to use the partitioned_sstable_set optimization, if that's not already the default) - and then the read will just be a normal read. However, this followup was never done - the compaction call is still missing. That's probably what we should do to fix this issue - but we should make sure that the compaction uses the disjoint-sstable optimization.

Note that this compaction is important not just for view building, we also want to do it for the health of the base table - we don't want to bring in thousands of tiny sstables from the repair. These can hurt not only view building (as in this issue) but also regular base-table reads (and especially scans), until the normal background compaction catches up with them.

nyh on 29 Jun 2020

denesb on 1 Jul 2020

😄2 🎉1 👍1

@denesb wrote elsewhere that:

It seems we are always using the incremental selector, regardless of the compaction strategy. What determines whether the optimization can be used is the sstable_set that is used.

So, can we easily fix view_update_generator.cc to use, instead of make_range_sstable_reader() (or alternatively, fix that existing function), something which creates sstable_set in a way that can take advantage of the disjoint token ranges - and then this issue will be miraculously solved?

Later, we can also compact these hundreds of sstables (@raphaelsc calls this "off-strategy compaction") but we won't have to.

nyh on 1 Jul 2020

On Wed, Jul 1, 2020 at 11:36 AM nyh notifications@github.com wrote:

@denesb https://github.com/denesb wrote elsewhere that:

It seems we are always using the incremental selector, regardless of the
compaction strategy. What determines whether the optimization can be used
is the sstable_set that is used.

So, can we easily fix view_update_generator.cc to use, instead of
make_range_sstable_reader() (or alternatively, fix that existing
function), something which creates sstable_set in a way that can take
advantage of the disjoint token ranges - and then this issue will be
miraculously solved?

If we know that SSTables in view_update_generator::start() are
non-overlapping, we can make it instantiate explicitly the
partitioned_sstable_set, regardless of the strategy, when creating the
sstable set. And yes, the problem would be solved.

Later, we can also compact these hundreds of sstables (@raphaelsc

https://github.com/raphaelsc calls this "off-strategy compaction") but
we won't have to.

Don't we need to integrate those hundreds of SSTables in the staging
directory into the main directory? Sorry, not familiar at all with the view
build generator.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/6707#issuecomment-652457068,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAKYA445Y66BDCGKCN73CQLRZNCXXANCNFSM4OGQ7X7Q
.

raphaelsc on 1 Jul 2020

Don't we need to integrate those hundreds of SSTables in the staging directory into the main directory? Sorry, not familiar at all with the view build generator.

Eventually yes, but this issue is about what happens before we move the sstables into the main directory. If the table has materialized views, we _scan_ the staging sstables - all of them together - and build (and write) view updates. It is this scan that is done inefficiently (keeping hundreds of sstables open). We _could_ compact before doing this scan, but an alternative will be just to do this scan efficiently (keeping only RF-1 sstables open at a time) - and then it doesn't matter when we do this "off-strategy" compaction (which today we don't do at all).

nyh on 1 Jul 2020

@denesb since it seems you are familiar with area of the code, and how to run the relevant tests, could you prepare a patch for this issue based on your idea above? I'm a bit rusty in the sstable_set stuff and the relevant tests. Thanks.

nyh on 1 Jul 2020

On Wed, Jul 1, 2020 at 1:03 PM Raphael S. Carvalho raphaelsc@scylladb.com
wrote:

>
>

On Wed, Jul 1, 2020 at 11:36 AM nyh notifications@github.com wrote:

@denesb https://github.com/denesb wrote elsewhere that:

It seems we are always using the incremental selector, regardless of the
compaction strategy. What determines whether the optimization can be used
is the sstable_set that is used.

So, can we easily fix view_update_generator.cc to use, instead of
make_range_sstable_reader() (or alternatively, fix that existing
function), something which creates sstable_set in a way that can take
advantage of the disjoint token ranges - and then this issue will be
miraculously solved?

If we know that SSTables in view_update_generator::start() are
non-overlapping, we can make it instantiate explicitly the
partitioned_sstable_set, regardless of the strategy, when creating the
sstable set. And yes, the problem would be solved.

The patch would be something as simple as this:
diff --git a/db/view/view_update_generator.cc
b/db/view/view_update_generator.cc
index 11c42ef2f..342ee3b6a 100644
--- a/db/view/view_update_generator.cc
+++ b/db/view/view_update_generator.cc
@@ -54,7 +54,7 @@ future<> view_update_generator::start() {
// temporary: need an sstable set for the flat
mutation reader, but the
// compaction_descriptor takes a vector. Soon this
will become a compaction
// so the transformation to the SSTable set will not
be needed.
- auto ssts =
make_lw_shared(t->get_compaction_strategy().make_sstable_set(s));
+ auto ssts =
make_lw_shared(make_partitioned_sstable_set(s, false));
for (auto& sst : sstables) {
ssts->insert(sst);
}

Note the second parameter set to false. It tells partitioned_sstable_set
to discard the level metadata information. If set to true, level 0 sstables
are placed at a vector instead of interval map, which is what we're
targeting, to avoid a quadratic space complexity issue.

Later, we can also compact these hundreds of sstables (@raphaelsc

https://github.com/raphaelsc calls this "off-strategy compaction") but
we won't have to.

Don't we need to integrate those hundreds of SSTables in the staging
directory into the main directory? Sorry, not familiar at all with the view
build generator.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/6707#issuecomment-652457068,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAKYA445Y66BDCGKCN73CQLRZNCXXANCNFSM4OGQ7X7Q
.

raphaelsc on 1 Jul 2020

On Wed, Jul 1, 2020 at 1:23 PM nyh notifications@github.com wrote:
>

Don't we need to integrate those hundreds of SSTables in the staging directory into the main directory? Sorry, not familiar at all with the view build generator.

Eventually yes, but this issue is about what happens before we move the sstables into the main directory. If the table has materialized views, we scan the staging sstables - all of them together - and build (and write) view updates. It is this scan that is done inefficiently (keeping hundreds of sstables open). We could compact before doing this scan, but an alternative will be just to do this scan efficiently (keeping only RF-1 sstables open at a time) - and then it doesn't matter when we do this "off-strategy" compaction (which today we don't do at all).

Understood. Thanks for the explanation!

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

raphaelsc on 1 Jul 2020

@nyh @raphaelsc I already have a patch, I just need to test it.

denesb on 1 Jul 2020

I started testing my patch but I keep finding more strange stuff. For starters the semaphore which is intended to limit the concurrency of adding sstables to the staging directory is unbalanced. When deducting from the semaphore, one unit per sstable is deducted:
https://github.com/scylladb/scylla/blob/de82b3efaec3a9f285e5e676d32d88c163cee262/db/view/view_update_generator.cc#L123-L137

When returning the unit, we return one unit per table processed, this in theory can result in all units but one drying up after a while:
https://github.com/scylladb/scylla/blob/de82b3efaec3a9f285e5e676d32d88c163cee262/db/view/view_update_generator.cc#L92

The next strange thing is that we only restrict concurrency, after start() is called. Who is expected to call register_staging_sstable() before we call start()? In the core (#6603) I see a staging reader created with a whopping 207 readers, and a further 34 units being available. This means that someone managed to sneak in 241 (!!!) sstables somehow, despite the semaphore being initialized with 5 units.

denesb on 6 Jul 2020

I opened a separate issue (#6774) for the issue I described above, as I found it to be more severe than I thought: instead of limiting the concurrency to 1, it can grow it to infinity!

denesb on 6 Jul 2020

@avikivity @scylladb/scylla-maint please backport also to 4.1 due to #6875