Cockroach: storage: dropping a large table will brick a cluster due to compactions

Created on 19 Mar 2018  ·  140 Comments  ·  Source: cockroachdb/cockroach

$ roachprod create USER-FOO -n10
$ roachprod run USER-FOO 'mkdir -p /mnt/data1/cockroach && gsutil -m -q cp -r gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/stores=10/$((10#$(hostname | grep -oE [0-9]+$)))/* /mnt/data1/cockroach'

Wait ~10m for stores to download. Then drop the 2TiB table:

ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}';
DROP TABLE bank.bank;

The cluster explodes a few minutes later as RocksDB tombstones pile up. I can no longer execute any SQL queries that read from/write to disk.

Very closely related to #21901, but thought I'd file a separate tracking issue.

/cc @spencerkimball

A-storage S-1-stability

All 140 comments

Ok, I'm counting 22k range deletion tombstones in aggregate:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l | tail -n+2 | cut -f2 -d: | paste -sd+ - | bc
22194

Broken down by node:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l 
benesch-drop-2: grep -a -A 10000 'Range del... 10/10
   1: 61
   2: 2134
   3: 2342
   4: 2292
   5: 2252
   6: 5080
   7: 34
   8: 862
   9: 2992
  10: 4145

I'm surprised there's such a high variance—the replicas looked balanced when I dropped the table.

Breaking it down on a per-sst basis, most of the 4k ssts have zero tombstones. There are a handful that have hundreds:

007620_dump.txt     1
007840_dump.txt     1
008611_dump.txt     1
008641_dump.txt     1
008681_dump.txt     1
008703_dump.txt     1
008826_dump.txt     1
008822_dump.txt     106
008817_dump.txt     112
007623_dump.txt     2
008596_dump.txt     2
008820_dump.txt     241
008818_dump.txt     247
007615_dump.txt     3
008824_dump.txt     3
008728_dump.txt     4
008816_dump.txt     4
008823_dump.txt     4
008825_dump.txt     6
008814_dump.txt     632
008819_dump.txt     662
008815_dump.txt     7
008821_dump.txt     92

Digging into individual values on 008819, we see that the end key of one range and the start key of the next are adjacent in cockroach terms but different at the rocksdb level. This is the start and end key of two adjacent tombstones:

{/Table/51/1/8035458/0 0.000000000,0}
{/Table/51/1/8036092 0.000000000,0}
{/Table/51/1/8036092/0 0.000000000,0}
{/Table/51/1/8039365 0.000000000,0}

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

Let's make sure this becomes a variant of the drop roachtest!

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

Do you think that would help that much? A sufficiently scattered table can always produce maximally discontiguous ranges.

True, but empirically it would make a big difference in this case. A very large fraction of the range tombstones in 008819 are contiguous (or would be with this change).

Another upstream fix would be for rocksdb to prioritize sstables with multiple range tombstones for compaction. These are likely to be easy to compact away to the lowest level, and leaving them in higher levels is disproportionately expensive.

Note that I'm speculating about this - I haven't found any code in rocksdb that would join two consecutive range deletions. I'm not sure if it's possible - each range tombstone has a sequence number as a payload, but I think it may be possible for those to be flattened away on compaction.

range_del_aggregator.cc has a collapse_deletions_ flag, but I haven't been able to trace what sets it.

The CPU profile showed that we're spending all our time inside the if (collapse_deletions_) block, so it's definitely getting set. But from looking at what that flag does, I can't see where it would join two adjacent ranges.

I experimented with a change that replaced the ClearRange with a (post-Raft, i.e. not WriteBatched) ClearIterRange. Presumably due to the high parallelism with which these queries are thrown at RocksDB by DistSender, the nodes ground to a halt, missing heartbeats and all. So this alone isn't an option. I'll run this again with limited parallelism but that will likely slow it down a lot.

If you have time on your hands, you might also want to experiment with turning on the CompactOnDeletionCollectorFactory. Not my hacked-together version for range deletion tombstones, but the official upstream one for triggering compactions whenever an SST contains too many normal deletion tombstones within a window:
https://github.com/facebook/rocksdb/blob/master/utilities/table_properties_collectors/compact_on_deletion_collector.h
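
For reference, wiring that collector into the RocksDB options looks roughly like this. This is a sketch against the stock RocksDB C++ API; the window size and trigger are made-up values for illustration, not tuned recommendations:

#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

// Sketch: ask RocksDB to schedule a compaction for any sstable that
// accumulates >= 1000 point deletions within a sliding window of 10000
// entries. The thresholds here are illustrative only.
rocksdb::Options MakeOptionsWithDeletionTriggeredCompactions() {
  rocksdb::Options options;
  options.table_properties_collector_factories.push_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/10000, /*deletion_trigger=*/1000));
  return options;
}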

Good find! I'll try those out.

Ping https://github.com/cockroachdb/cockroach/issues/17229#issuecomment-362483650

I tried running with https://github.com/facebook/rocksdb/pull/3635/files (on top of stock master) and it doesn't help. Not sure whether it's because of a bug in the code or because of @benesch's observation that it doesn't really address our problem; either way we see goroutines stuck in RocksDB seeks for >20 minutes at a time (they eventually come back and go into the next syscall, as far as I can tell, but it's clearly not anywhere close to working).

I then added @benesch's PR on top and restarted the nodes in the hope that I would see compaction activity while the node is stuck in initialization. However, that doesn't seem to happen; we only see one single thread maxing out one CPU:

image

Back to the ClearIterRange version, even with an added CompactOnDeletionCollectorFactory it's pretty bad (which makes sense, we've added more work):

root@localhost:26257/> create database foo;
CREATE DATABASE

Time: 1m49.87228088s

image

goroutine 52246 [syscall, 15 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    _cgo_gotypes.go:549 +0x74
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0xba
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc42a9b87e0, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0x16b
github.com/cockroachdb/cockroach/pkg/storage/engine.dbIterate(0x7f4e510990c0, 0x2657fc0, 0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2391 +0x166
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Iterate(0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, 0x8, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1548 +0xfc
github.com/cockroachdb/cockroach/pkg/storage/batcheval.ClearRange(0x26486a0, 0xc42f06f440, 0x7f4e65c12328, 0xc42be3fb00, 0x26701c0, 0xc424c7c000, 0x151e0641f6a360bb, 0x0, 0x100000001, 0x3, ...)

It might not be a popular suggestion, but we could explore reinstating the original changes to simply drop SSTables for large aggregate suggested compactions. This requires that we set an extra flag on the suggestion indicating it's a range of data which will never be rewritten. This is true after table drops and truncates, but not true after rebalances.

Or am I misunderstanding, and dropping the files would be independent of the range tombstone problem?

I was thinking about that too, I'm just spooked by the unintended consequences this can have as it pulls out the data even from open snapshots.

Do you expect it would alleviate the problem? I have a PR that does it if you want to try.

It's likely that it would alleviate the problem, but actually introducing this is not an option that late in the 2.0 cycle, so I'm investing my energies elsewhere. That said, if you want to try this, go ahead! Would be fun to see it in action.

Like I said in issue facebook/rocksdb#3634, that PR can only solve non-first-seek performance issues, so @benesch is right: PR facebook/rocksdb#3635 cannot solve your problem, because cockroachdb requires multiple iterators and each iterator only does a small number of seeks.

For this issue, I have some rough suggestions for consideration:

  1. A table can span multiple ranges. When dropping a table, delete the entire table in one shot rather than deleting range by range, which reduces the number of tombstones. But I think this only alleviates the problem, because the large deleted range may still be split up when RocksDB does a level compaction, and it may be difficult to merge deletions in a rebalance scenario.
  2. Only truly delete a dropped range, i.e. call RocksDB's DeleteRange(), once some mechanism guarantees it will never be accessed again, for example a delayed deletion that waits out a transaction timeout. Then you can set ReadOptions.ignore_range_deletions to true, which makes reads skip range tombstones; since the deleted data no longer needs to be accessed, it is safe to ignore them (see the sketch below).
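
The read-side knob mentioned in point 2 is a one-line option in stock RocksDB. A minimal sketch, assuming an already-open rocksdb::DB* db; this is only safe once nothing will ever read the dropped keyspace again:

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: create an iterator that skips range tombstones entirely, avoiding
// the cost of building a RangeDelAggregator. Deleted-but-uncompacted keys
// remain visible to this iterator, so it is only safe for keyspace that is
// guaranteed never to be read again.
rocksdb::Iterator* NewIterIgnoringRangeDeletions(rocksdb::DB* db) {
  rocksdb::ReadOptions read_options;
  read_options.ignore_range_deletions = true;
  return db->NewIterator(read_options);
}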

Thanks @lingbin. We're aware of these options, but it's tricky for us to implement them because we've so far relied on RocksDB giving us consistent snapshots, and we have various mechanisms that check that the multiple copies of the data we keep are perfectly synchronized. That said, somewhere down the road we're going to have to use some technique that involves either vastly improving the performance of deletion tombstone seeks (what @benesch said upstream), DeleteFilesInRange, and/or skipping deletion tombstones.
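
For reference, the DeleteFilesInRange escape hatch mentioned above is exposed by stock RocksDB roughly as follows (a sketch; db, begin, and end are placeholders). It drops whole files without regard for open snapshots, which is exactly the hazard discussed earlier in the thread:

#include <rocksdb/convenience.h>
#include <rocksdb/db.h>

// Sketch: unlink every sstable whose keys fall entirely within [begin, end)
// instead of writing range tombstones. Files that only partially overlap the
// span are left alone and still need tombstones or a compaction.
rocksdb::Status DropWholeFilesInSpan(rocksdb::DB* db,
                                     const rocksdb::Slice& begin,
                                     const rocksdb::Slice& end) {
  return rocksdb::DeleteFilesInRange(db, db->DefaultColumnFamily(),
                                     &begin, &end);
}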

@bdarnell, re the below:

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short-term fix, this seems to be a promising avenue if it's true. Were you ever able to figure out if that should be happening? At least for testing, I can make these holes disappear, but it would be good to know if it's any good in the first place.

I'm also curious what the compactions necessary to clean up this table would do even if seeks weren't affected at all. Running a manual compaction takes hours on this kind of dataset; we were implicitly assuming that everything would just kind of evaporate quickly, but that doesn't seem to be what's happening here. Maybe this is due to the "holes" we leave in the keyspace, but it's unclear to me. We need this to be reasonably efficient or the ClearRange option isn't one at all, even without the seek problem. Something to investigate.

Ping @a-robinson just in case anything from https://github.com/cockroachdb/cockroach/issues/21528 is applicable here. I believe the dataset there was much smaller, right?

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

image

I'm not advocating that that should be our strategy for 2.0, but it's one of the so far equally bad strategies.

Thinking more about ClearRange coalescence, it's not clear how RocksDB would manage it when they come in as separate commands. If you have ClearRange [a, b) @ t1 and ClearRange [b, c) @ t3, you'd need to preserve both of them for proper handling of any keys @ t2.

RocksDB keeps track of the lowest sequence number still needed by an open snapshot, and at compaction time it can collapse entries whose sequence numbers are no longer relevant.

Ping @a-robinson just in case anything from #21528 is applicable here. I believe the dataset there was much smaller, right?

It took about 5 days of running kv --read-percent=0 --max-rate=1200 before problems started showing up on the 6-node cluster under chaos. I'm not sure if I was using a custom block size or how much data had been generated by the time the problems started.

I don't think there's anything particularly applicable here that we learned from that issue. The main lessons were:

  1. manual compactions are slow
  2. manual compactions of key ranges that have had data written to them since the deletion are particularly slow and expensive
  3. using the exclusive_manual_compaction setting is a bad idea if you care about foreground writes

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short term fix, this seems to be a promising venue if it's true. Were you ever able to figure out if that should be happening?

I haven't found any code that would do this, but I can't rule it out. It would need to happen on compaction but not, I think, on all uses of the range tombstones so I'm not sure I've looked in the right places. Might be worth trying an empirical test.

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

I'm on the fence about that -- with the range deletion tombstones, you can run a full compaction and you're back. That option does not exist with the ClearIterRange change; you're kinda screwed until it clears up by itself.

Digging into individual values on 008819, we see that the end key of one range and the start key of the next are adjacent in cockroach terms but different at the rocksdb level. This is the start and end key of two adjacent tombstones:

@bdarnell, do you understand why that is? For two adjacent ranges, we would first clear a range with the right boundary MVCCKey{Key: middle} and then an adjacent range starting at MVCCKey{Key: middle}. Where does the hole come from?

Also, do you or @benesch still have the magic invocation around to dump the number of range tombstones?

From https://github.com/cockroachdb/cockroach/issues/24029#issuecomment-374384340, it appears that the difference is that the start keys contain a /0 column family suffix, while the end keys do not. I haven't investigated why that would be the case.

Still trying to figure out how to dump the tombstones. My best so far (but it doesn't include them):

./cockroach debug rocksdb dump --path=.../000007.sst --stats

You want to compile the sst_dump program (it's not part of ldb for whatever reason) and run that on each of the SST files. I forget exactly what options you need but it should be pretty obvious.

cd c-deps/rocksdb && make sst_dump

FWIW, these are the (compaction queue) compactions I'm seeing while running this.

$ roachprod ssh tobias-drop:1-10 'grep "processing compact" logs/cockroach.log'
tobias-drop: grep "processing compact" l... 10/10
   1: I180519 06:08:35.644623 260 storage/compactor/compactor.go:300  [n1,s1] processing compaction #1-372/589 (/Table/51/1/0/0-/Table/51/1/2528853) for 9.3 GiB (reasons: size=true used=false avail=false)
   2: I180519 06:08:33.889164 174 storage/compactor/compactor.go:300  [n8,s8] processing compaction #1-370/718 (/Min-/Table/51/1/3470314) for 8.1 GiB (reasons: size=true used=false avail=false)
I180519 06:10:15.772722 174 storage/compactor/compactor.go:300  [n8,s8] processing compaction #600-628/718 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB (reasons: size=true used=false avail=false)
   3: I180519 06:09:07.943102 872 storage/compactor/compactor.go:300  [n4,s4] processing compaction #1-725/2031 (/Table/51/1/311532/0-/Table/51/1/6289196) for 18 GiB (reasons: size=true used=true avail=false)
   4: I180519 06:08:36.240354 395 storage/compactor/compactor.go:300  [n3,s3] processing compaction #1-340/524 (/System/tsd-/Table/51/1/2258625/0) for 7.4 GiB (reasons: size=true used=false avail=false)
I180519 06:12:57.768175 395 storage/compactor/compactor.go:300  [n3,s3] processing compaction #349-450/524 (/Table/51/1/2623759-/Table/51/1/3487708) for 2.1 GiB (reasons: size=true used=false avail=false)
   5: I180519 06:08:34.510402 207 storage/compactor/compactor.go:300  [n6,s6] processing compaction #1-267/562 (/Table/51/1/6546/0-/Table/51/1/2528853/0) for 6.7 GiB (reasons: size=true used=false avail=false)
   6: I180519 06:08:46.769268 898 storage/compactor/compactor.go:300  [n5,s5] processing compaction #1-672/2300 (/Table/51/1/0/0-/Table/51/1/4442685/0) for 14 GiB (reasons: size=true used=false avail=false)
   7: I180519 06:08:34.357633 168 storage/compactor/compactor.go:300  [n7,s7] processing compaction #1-303/507 (/Table/13-/Table/51/1/3034152/0) for 7.9 GiB (reasons: size=true used=false avail=false)
I180519 06:12:40.379005 168 storage/compactor/compactor.go:300  [n7,s7] processing compaction #1-976/1135 (/Table/51/1/1808697/0-/Table/51/1/12112569/0) for 26 GiB (reasons: size=true used=true avail=true)
I180519 06:14:08.812506 168 storage/compactor/compactor.go:300  [n7,s7] processing compaction #976-1023/1135 (/Table/51/1/12169617-/Table/51/1/13091303) for 1.3 GiB (reasons: size=true used=false avail=false)
   8: I180519 06:08:35.980654 236 storage/compactor/compactor.go:300  [n10,s10] processing compaction #1-326/502 (/System/""-/Table/51/1/3467041) for 8.7 GiB (reasons: size=true used=false avail=false)
   9: I180519 06:09:12.348539 799 storage/compactor/compactor.go:300  [n2,s2] processing compaction #1-760/1716 (/System/tsd-/Table/51/1/6864500) for 19 GiB (reasons: size=true used=true avail=true)
  10: I180519 06:08:35.364444 233 storage/compactor/compactor.go:300  [n9,s9] processing compaction #1-392/578 (/Table/51/1/9819/0-/Table/51/1/3399280) for 10 GiB (reasons: size=true used=false avail=false)

(this shows the fragmentation -- one of the compactions is for 427mb, and another for 19gb -- it depends how many contiguous replicas you have). Pushing ClearRange into the compactor should give a constant-factor improvement, which seems to be fairly large here (~10gb/64mb, roughly 160x), but it doesn't technically solve the problem of too many tombstones.

(it actually does technically solve the problem if you compact after putting down each tombstone, as it guarantees that there's only ever one range tombstone originating from the DROP in RocksDB, at least if the compactions always "resolve" the tombstone -- which I hope they do -- and don't just carry it into higher levels).

I'll try the approach this week.

RocksDB->CompactRange, assuming that's actually what we call, claims to compact all the way down to the lowest level, so the tombstones should definitely be dropped (modulo bugs in RDB): https://github.com/facebook/rocksdb/wiki/Manual-Compaction
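
For reference, a manual compaction of the dropped span against the stock RocksDB C++ API looks roughly like this (a sketch; db, begin, and end are placeholders, and the option values are illustrative rather than what the cockroach engine wrapper actually passes):

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: force the compaction all the way to the bottommost level so the
// range tombstones themselves are dropped, and keep it non-exclusive so
// foreground writes are not starved.
rocksdb::Status CompactDroppedSpan(rocksdb::DB* db,
                                   const rocksdb::Slice& begin,
                                   const rocksdb::Slice& end) {
  rocksdb::CompactRangeOptions opts;
  opts.exclusive_manual_compaction = false;
  opts.bottommost_level_compaction =
      rocksdb::BottommostLevelCompaction::kForce;
  return db->CompactRange(opts, &begin, &end);
}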

The more I look at this, the more it seems that we don't even need range tombstones to make this kind of thing a mess.

I'm running a custom branch in which ClearRange doesn't do anything except make the suggestion to the compactor, and the compactor adds a range tombstone and immediately compacts it away, plus some other hacks like disabling the consistency checker and never aggregating compactions that are not adjacent because that would delete lots of data now.

The result: still a broken cluster. Initially (at night) there were compactions that took lots of time; now (in the morning) there's a steady stream of small compactions, each taking around 1s. It does this because the logical bytes of the store are now ~0, so even a 32mb compaction triggers the heuristic that is supposed to reclaim space for very small stores (which this is not). That this effectively strangles the cluster is troubling, because it implies that a few 32mb compactions back to back will have a troubling effect. (@bdarnell is this perhaps related to the large MANIFEST files? The first node is on /mnt/data1/cockroach/MANIFEST-009972 which seems like a high number).

I180522 10:17:10.335792 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6181/6205 (/Table/51/1/62105934/0-/Table/51/1/62109206) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:11.317257 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6181/6205 (/Table/51/1/62105934/0-/Table/51/1/62109206) for 32 MiB in 981.411309ms
I180522 10:17:11.317318 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6182/6205 (/Table/51/1/62109206/0-/Table/51/1/62112478) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:12.139546 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6182/6205 (/Table/51/1/62109206/0-/Table/51/1/62112478) for 32 MiB in 822.211223ms
I180522 10:17:12.139579 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6183/6205 (/Table/51/1/62112478/0-/Table/51/1/62115750) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:12.981165 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6183/6205 (/Table/51/1/62112478/0-/Table/51/1/62115750) for 32 MiB in 841.533492ms
I180522 10:17:12.981194 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6184/6205 (/Table/51/1/62115750/0-/Table/51/1/62119022) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:13.842772 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6184/6205 (/Table/51/1/62115750/0-/Table/51/1/62119022) for 32 MiB in 861.534854ms
I180522 10:17:13.842802 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6185/6205 (/Table/51/1/62119022/0-/Table/51/1/62122294) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:14.686480 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6185/6205 (/Table/51/1/62119022/0-/Table/51/1/62122294) for 32 MiB in 843.570175ms
I180522 10:17:14.686522 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6186/6205 (/Table/51/1/62122294/0-/Table/51/1/62125566) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:15.533905 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6186/6205 (/Table/51/1/62122294/0-/Table/51/1/62125566) for 32 MiB in 847.347856ms
I180522 10:17:15.533935 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6187/6205 (/Table/51/1/62125566/0-/Table/51/1/62128838) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:16.412174 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6187/6205 (/Table/51/1/62125566/0-/Table/51/1/62128838) for 32 MiB in 878.192735ms
I180522 10:17:16.412220 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6188/6205 (/Table/51/1/62128838/0-/Table/51/1/62128931) for 933 KiB (reasons: size=false used=true avail=false)
I180522 10:17:17.135783 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6188/6205 (/Table/51/1/62128838/0-/Table/51/1/62128931) for 933 KiB in 723.501353ms
I180522 10:17:17.135812 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6189/6205 (/Table/51/1/62128931/0-/Table/51/1/62132203) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:17.992488 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6189/6205 (/Table/51/1/62128931/0-/Table/51/1/62132203) for 32 MiB in 856.625562ms
I180522 10:17:17.992517 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6190/6205 (/Table/51/1/62132203/0-/Table/51/1/62135475) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:18.853890 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6190/6205 (/Table/51/1/62132203/0-/Table/51/1/62135475) for 32 MiB in 861.308991ms
I180522 10:17:18.853920 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6191/6205 (/Table/51/1/62135475/0-/Table/51/1/62137901) for 24 MiB (reasons: size=false used=true avail=false)
I180522 10:17:19.669350 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6191/6205 (/Table/51/1/62135475/0-/Table/51/1/62137901) for 24 MiB in 815.383245ms
I180522 10:17:19.669378 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6192/6205 (/Table/51/1/62137901/0-/Table/51/1/62141173) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:19.997260 355 server/status/runtime.go:219  [n1] runtime stats: 5.5 GiB RSS, 706 goroutines, 371 MiB/44 MiB/538 MiB GO alloc/idle/total, 3.9 GiB/5.0 GiB CGO alloc/total, 462.10cgo/sec, 0.94/0.11 %(u

The cluster is still unhappy. I disabled the compactor and it took ~30s to set the cluster setting. So at the end of the day, it looks like these tombstones may just be another 1000 nails in a coffin that already has lots of nails in it.

By the way, I really like the health alerts I recently introduced:

W180522 06:39:30.125437 357 server/node.go:802  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:requests.slow.distsender Value:67} {StoreID:1 Category:METRICS Description:queue.raftsnapshot.process.failure Value:1}]}
[...]
W180522 06:41:41.961410 357 server/node.go:802  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:requests.slow.distsender Value:67}]}
[...]
W180522 10:26:20.088564 357 server/node.go:802  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:requests.slow.distsender Value:94} {StoreID:1 Category:METRICS Description:queue.raftlog.process.failure Value:1} {StoreID:1 Category:METRICS Description:requests.slow.lease Value:12}]}

I was hoping that the cluster would become healthy again after disabling the compactor, but it hasn't during the last 10min.

For anyone who can read them (@petermattis?) here is a compaction stats dump that is pretty typical of others while the compactor was running:

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.2      0.2       0.0   1.0      0.0      2.5        72      1640    0.044       0      0
  L2      0/0    0.00 KB   0.0      0.3     0.2      0.1       0.2      0.1       0.0   1.4     25.1     22.5        10         5    2.091     12M    16K
  L3      0/0    0.00 KB   0.0      0.3     0.1      0.2       0.3      0.1       0.1   2.9     68.4     68.1         4         4    1.051    461K    40K
  L4      0/0    0.00 KB   0.0      1.6     0.2      1.4       1.6      0.2       0.1   6.8     66.4     65.2        25         6    4.204   1127K    50K
  L5      1/1    1.41 KB   0.0      2.8     1.6      1.1       2.7      1.6       0.0   1.7     25.2     25.0       112      1333    0.084     17M    40K
  L6    780/1   67.63 GB   0.0    412.0     2.1    409.9     339.3    -70.7       0.0 163.4     53.3     43.9      7911      1566    5.052    458M    19M
 Sum    781/2   67.63 GB   0.0    416.9     4.2    412.7     344.3    -68.4       0.2 1965.5     52.5     43.3      8136      4554    1.786    490M    20M
 Int      0/0    0.00 KB   0.0     45.6     0.0     45.6      45.6      0.0       0.0 66037.6     82.9     82.9       562      1398    0.402    140M   3667
Uptime(secs): 8437.5 total, 600.0 interval
Flush(GB): cumulative 0.175, interval 0.001
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 344.29 GB write, 41.78 MB/s write, 416.94 GB read, 50.60 MB/s read, 8135.7 seconds
Interval compaction: 45.56 GB write, 77.76 MB/s write, 45.56 GB read, 77.76 MB/s read, 562.4 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Looks like over 600s we write ~50gb to disk, at 70mb/s reads and 70mb/s writes. That seems pretty hefty. The original idea with the tombstones was that compacting them down would be cheap, as they delete everything. In what I'm doing, where I removed compactions, this is not true of course because the compactions are rather small (and so you rewrite each bottom SST a few times, maybe 4-6).

I can improve my prototype so that it aggregates over gaps as well, without covering them with tombstones, to see how that would fare.

What else should I be looking at to evaluate this performance? Feels that our toolbox to understand these issues is not great because we don't reason about the RocksDB layer very well.

Ah, the cluster is still recovering because the Raft snapshot queue is backed up. We might be limited by the quota pool which then causes requests to be held back.

image

... and I won't be able to investigate the quota pool because neither does it have a cluster setting nor does it export any metrics. Womp womp. Filed https://github.com/cockroachdb/cockroach/issues/25799.

Also, I just saw a flurry of these fly by:

W180522 11:20:50.602707 1626794 kv/dist_sender.go:1301  [n1] have been waiting 1m0s sending RPC to r42488 (currently pending: [(n3,s3):1]) for batch: ClearRange [/Table/51/1/64734528/0,/Table/51/1/64737800)

I think what's happening here is that the schema change didn't run to completion and so now it's running again. @bdarnell that seems like a good explanation for the Manifest file growth seen elsewhere -- it just retries and retries, and if the thing being dropped is just one SST, you're going to rewrite that SST over and over over time. What I don't understand is that if that's the case, why wouldn't the schema change succeed? I can understand why it wouldn't in the experiment which I'm running.

@tschottdorf Re: compaction stats. A W-Amp number of 1965.5 is huge. This is the write amplification number (I'm not sure precisely how it is collected) and is related to how many times a particular byte of data is rewritten. Generally you want to see this in the low double digits. This is a pretty good indication that we're doing something very wrong with our compaction suggestions.

That this effectively strangles the cluster is troubling, because it implies that a few 32mb compactions back to back will have a troubling effect.

Note that it's misleading to think of these as 32MB compactions. They're rewriting at least one 128MB sstable, and likely two.

But my guess is that the problem has less to do with the amount of IO and more to do with the level of synchronization required to do any compaction. This argues for being much more conservative in how often we trigger manual compactions. Compactions don't really become "cheap" until they can discard entire sstables (and even then unless the boundaries line up they may have to rewrite adjacent SSTs).

is this perhaps related to the large MANIFEST files? The first node is on /mnt/data1/cockroach/MANIFEST-009972 which seems like a high number

Yes, it's related, because every compaction increases the size of the manifest file. (The number in the manifest filename doesn't mean much because all rocksdb filenames draw their numbers from a shared sequence; what matters is the highest number across every file in the directory.)

I looked at the rocksdb code to see what sort of synchronization is involved in a compaction and found that DBImpl::RunManualCompaction holds DBImpl::mutex_ for what looks like the bulk of its running time. On its face, this looks extremely expensive, although there are other indications (the exclusive_manual_compaction option, which we set to false) which suggest that manual compactions are expected to run concurrently with other operations. Note that background compactions do not appear to acquire this lock, but reads and writes do.

On its face, this looks extremely expensive

D'oh, nevermind. It actually spends nearly all of its time in a CondVar::Wait with the lock released.

Well, I can reproduce badness using #25837. Not sure what is going on other than a shit-ton of disk activity (both reads and writes). Presumably, that is out of control compactions, but I haven't had time to verify that.

I'm not at all familiar with this issue or the debugging that has gone on so far or even the systems involved, so I started working on it from the beginning. First, I'm able to reproduce the cluster wedging using the clearrange roachtest from #25837 with the compactor.enabled cluster setting turned off. So whatever is happening, it isn't the compaction queue going crazy (unless there is something broken with the cluster setting). I then noticed that when the cluster wedges there are a lot of ClearRange slow RPC messages. Thousands of them. Tracing through the code, I see that dropping a table results in a single ClearRange for the entire table key range:

I180525 23:39:16.142667 257 sql/tablewriter_delete.go:184  [n9] ClearRange /Table/51 - /Table/52

DistSender is presumably splitting this single operation into thousands of requests to the 10s of thousands of ranges in this table. Interestingly, this is not the only ClearRange operation logged from that line of code in the cluster:

I180525 23:24:02.838137 314 sql/tablewriter_delete.go:184  [n3] ClearRange /Table/51 - /Table/52
I180525 23:29:07.004769 303 sql/tablewriter_delete.go:184  [n10] ClearRange /Table/51 - /Table/52
I180525 23:34:11.599280 252 sql/tablewriter_delete.go:184  [n7] ClearRange /Table/51 - /Table/52

Notice that the timestamps are ~5min apart. I believe this corresponds to the 5min SchemaChangeLeaseDuration. I haven't verified it yet, but I'd be willing to bet that the ClearRange operation is taking longer than 5min to complete which results in the schema change lease for this table being picked up by another node which then executes another ClearRange operation which will take longer than 5min and on and on. Why this eventually wedges the cluster is unclear.

I'm going to try bumping SchemaChangeLeaseDuration to a much higher value to see if that allows the drop to succeed without wedging the cluster. I'm also going to turn on additional logging at the DistSender level to make sure that the ClearRange operation is making progress and not retrying RPCs for individual ranges over and over.

After bumping SchemaChangeLeaseDuration to 100min I no longer see the multiple ClearRange operations from tablewriter_delete.go. Unfortunately, the cluster still wedged. DistSender log messages indicate that it was processing several thousand RPCs per minute. Considering there are ~40k ranges in this test, the expected time for the ClearRange operation is ~7min. I accidentally fat-fingered the cluster logs, but before doing so I saw DistSender log messages showing it was still processing the ClearRange for more than 20min.

There are likely multiple levels of badness here, with each one causing problems individually:

  1. laying down many RocksDB range tombstones, which slow down RocksDB to a crawl
  2. compactions
  3. the extremely large cross-range RPC that may not finish fast enough.

My past experiments were a) no compactor b) lay down ClearRange just before compacting (though that resulted in frequent small compactions in my prototype due to the way I hacked it together). I had also noticed that the schema changer seemed to time out, but at that point it wasn't clear why.

The ClearRange RPC also has a code path that recomputes stats, so if that gets hit a lot (it shouldn't), some slowness may also be explained by it.

As a smoke test, with a return inserted here, the roachtest should pass trivially -- the schema-change turns into a sophisticated no-op.

As a second (more involved) smoke test, you can remove the batch.ClearRange and instead let the compactor do it just as it aggregates the suggestion into a compaction, which minimizes the amount of time the tombstone is around for. I had a bad WIP for this (which didn't deal well with the feature of the compactor that compacts across "gaps" between the suggestions) and the result was.. not good.

I think one general problem here is that the original thinking was that the compactions would be really cheap because the keyspace compacted is completely covered by RocksDB range tombstones. But that's not true because each node has a more-or-less random subset of replicas, and so the compactions we run typically touch either only a part of an SST, or span multiple SSTs (but not gapless range tombstones).

If this is indeed the problem, we could introduce a smarter mechanism that "covers up" the holes in the keyspace with fresh range deletion tombstones, so that the compaction (in the case of the DROP) really gets to see a completely range-deleted keyspace. I wasn't able to play with this because the naive way of prototyping it (running engine.ClearRange on the key span to be compacted) trips (*Replica).assertStateLocked. Not sure if that's because some replicas receive ClearRange really late, or because the span got extended beyond that of the dropped table.

@tschottdorf have you guys bothered to give this a try: https://github.com/cockroachdb/cockroach/pull/24137

@tschottdorf I see a reference to a CPU profile showing badness in RocksDB due to large numbers of range tombstones. How was that gathered? I haven't had a chance to poke at this any more this weekend, but so far I haven't seen evidence of RocksDB badness (to be fair, I haven't really been looking). I'm mentioning this because I'm working my way down towards RocksDB and want to understand what evidence led to the conclusion that large numbers of range tombstones are problematic.

I think one general problem here is that the original thinking was that the compactions would be really cheap because the keyspace compacted is completely covered by RocksDB range tombstones. But that's not true because each node has a more-or-less random subset of replicas, and so the compactions we run typically touch either only a part of an SST, or span multiple SSTs (but not gapless range tombstones).

If this is indeed the problem, we could introduce a smarter mechanism that "covers up" the holes in the keyspace with fresh range deletion tombstones, so that the compaction (in the case of the DROP) really gets to see a completely range-deleted keyspace. I wasn't able to play with this because the naive way of prototyping it (running engine.ClearRange on the key span to be compacted) trips (*Replica).assertStateLocked. Not sure if that's because some replicas receive ClearRange really late, or because the span got extended beyond that of the dropped table.

I'd like to better understand what is going on at the RocksDB level before thinking too much about possible solutions. I worry that we could be papering over the real problem.

@petermattis When Nikhil originally encountered the problem, he tried to restart one of the affected nodes and it never booted up, because iterating over the replicas' range descriptors was extremely slow. I think a CPU profile showed activity in a method related to tombstones, and upstream confirms that these are handled in a wildly inefficient manner (plus, we found hundreds of tombstones in some SSTs by manual dumping). I think this also proves that the compactor is not the single root problem, as it's not running at that point in time (though maybe it is).

I think the smoke test (in which you simply don't lay down tombstones as suggested above) can help reason about above-RocksDB problems to some extent (at least we all agree that there shouldn't be anything going wrong in that case, except not actually deleting anything).

DistSender is presumably splitting this single operation into thousands of requests to the 10s of thousands of ranges in this table.

DistSender limits the parallelism, but I think it's to ~500. That's a lot if each of these requests does work. Not out of the question that there are cascading failures here.

I think a CPU profile showed activity in a method related to tombstones, and upstream confirms that these are handled in a wildly inefficient manner (plus, we found hundreds of tombstones in some SSTs by manual dumping).

We're pretty much only creating range tombstones on the cluster in the clearrange roachtest. Seems reasonable that behavior would result in sstables with hundreds of range tombstones.

I think this also proves that the compactor is not the single root problem, as it's not running at that point in time (though maybe it is).

By compactor, do you mean RocksDB compactions, or the compaction queue?

I think the smoke test (in which you simply don't lay down tombstones as suggested above) can help reason about above-RocksDB problems to some extent (at least we all agree that there shouldn't be anything going wrong in that case, except not actually deleting anything).

I'll definitely try that test soon.

I think the smoke test (in which you simply don't lay down tombstones as suggested above) can help reason about above-RocksDB problems to some extent (at least we all agree that there shouldn't be anything going wrong in that case, except not actually deleting anything).

Ok, I just tried this out and the clearrange roachtest passed without problem and the cluster looked good the whole time. The ClearRange operation was broken into 43777 RPCs which took 4m9s to complete. One thing I had noticed in past runs is that the consistency queue processing time was very high (10s of seconds), but during this run the consistency queue processing time was in the low hundreds of milliseconds. Not sure if this is relevant or not. Perhaps the consistency queue scans are hitting the large number of range tombstones and consuming excessive CPU.

@tschottdorf I also tried this more minimal diff to your suggestion:

diff --git a/pkg/storage/batcheval/cmd_clear_range.go b/pkg/storage/batcheval/cmd_clear_range.go
index d2a95cc73..3ff8797d7 100644
--- a/pkg/storage/batcheval/cmd_clear_range.go
+++ b/pkg/storage/batcheval/cmd_clear_range.go
@@ -104,9 +104,9 @@ func ClearRange(
                        },
                },
        }
-       if err := batch.ClearRange(from, to); err != nil {
-               return result.Result{}, err
-       }
+       // if err := batch.ClearRange(from, to); err != nil {
+       //      return result.Result{}, err
+       // }
        return pd, nil
 }

That is, we simply omit the ClearRange RocksDB operation, while doing everything else the same. This too allows the clearrange roachtest to succeed. Some graphs:

screen shot 2018-05-28 at 3 15 15 pm

screen shot 2018-05-28 at 3 15 27 pm

screen shot 2018-05-28 at 3 16 05 pm

Logical bytes per store steadily decreases on the cluster, but because we're not actually deleting the data the disk used bytes doesn't change.

Perhaps the consistency queue scans are hitting the large number of range tombstones and consuming excessive CPU.

I whacked the consistency queue code to force it off and the cluster still wedged. I also took a CPU profile on one of the nodes before the cluster wedged and I can see an excessive amount of time being spent in the RocksDB AddTombstones call. My new theory which I'm going to try and validate: we're adding hundreds of range tombstones per second which are winding up in the memtable and L0 tables. But each ClearRange RPC is also doing a stats computation which creates an iterator and needs to process those tombstones (via a call to AddTombstones). We might be better served by delaying the RocksDB ClearRange operation until immediately before performing the compaction, though that would need to interact properly with the consistency queue.

Yep, I came to a similar conclusion, though my prototype for this was too simplistic. Getting the interaction with the gap spanning compactions right is nontrivial, but might be time to do it. When I naively cleared the span that was going to be compacted, it caused assertions to fail.

Yep, I came to a similar conclusion, though my prototype for this was too simplistic. Getting the interaction with the gap spanning compactions right is nontrivial, but might be time to do it. When I naively cleared the span that was going to be compacted, it caused assertions to fail.

Can you describe your prototype in more detail? (Or point to an existing explanation which I might have missed). Trying to reduce the number of range tombstones feels hacky and fragile. I'd rather put effort into fixing RocksDB to support range tombstones more efficiently (if that is the problem). Or figure out a way to optimistically skip writing range tombstones into the memtable (if range tombstones in L0 is the problem). Now that I'm up to speed on this issue, I think it would be worthwhile to brainstorm next steps tomorrow.

I'll voice my support for patching RocksDB. I think it will be only approximately as much effort as trying to work around the issue in Cockroach.

I'm running into something strange in my latest experiment. I've added a SuggestedCompaction.Clear field which indicates to the compactor that the span should be cleared before being compacted. And when that field is cleared we always perform the compaction. Strangely, the compaction queue doesn't seem to be running in a timely fashion. It's definitely not disabled. I'm adding more logging to figure out what is going on.

PS The idea behind this experiment is to add the range tombstones and immediately flush them from the memtable and compact the associated range with the hope being that they don't affect concurrent operations that access L0.

The strangeness mentioned in my previous comment was a bug. Making the compaction queue perform the RocksDB ClearRange immediately before the compaction seems to help as the concurrent scan operation in the clearrange roachtest is not getting wedged. On the other hand, the cluster doesn't seem to be reclaiming disk space. That could be indicative of some other badness going on. Or perhaps due to the compaction queue not forcing a bottommost-level compaction.

Can you describe your prototype in more detail? (Or point to an existing explanation which I might have missed). Trying to reduce the number of range tombstones feels hacky and fragile.

I wasn't suggesting implementing that properly, but was trying to figure out whether it would solve the problem. I agree that fixing this upstream is the most desirable option, but it's miles away from something I (and I assume, you) can conceivably prototype. To get an idea of whether it can fix this, you can configure our RocksDB reads to skip range tombstones (there's an option somewhere).

I've added a SuggestedCompaction.Clear field which indicates to the compactor that the span should be cleared before being compacted.

That's pretty much what I refer to as "prototype" before, except I did this in a worse way.

Strangely, the compaction queue doesn't seem to be running in a timely fashion. It's definitely not disabled.

It will always wait for at least two minutes from the first suggestion, by design. Were you seeing longer pauses?

My experiment branch is here: https://github.com/tschottdorf/cockroach/tree/roachtest/drop2tb-wip

That is, we simply omit the ClearRange RocksDB operation, while doing everything else the same. This too allows the clearrange roachtest to succeed. Some graphs:

Wait, and this is with the compactor enabled? That's surprising, for the compactions should at least somewhat impair the cluster, as they're now pretty expensive. You're basically compacting 2tb of full keyspace (though it's already all at L6, maybe that makes it close to a noop?). I think the compactions, when doing actual work, would show up. Maybe taking out ClearRange is accidentally masking additional concern coming from the compactor.

PS in my experiments it sometimes took >10 minutes after the ClearRange until things started falling apart. That is, I would generally like to see graphs on somewhat longer timescales -- that's when the compaction pressure really shows -- just to make sure it doesn't go downhill after we stopped looking.

Wait, and this is with the compactor enabled? That's surprising, for the compactions should at least somewhat impair the cluster, as they're now pretty expensive. You're basically compacting 2tb of full keyspace (though it's already all at L6, maybe that makes it close to a noop?). I think the compactions, when doing actual work, would show up. Maybe taking out ClearRange is accidentally masking additional concern coming from the compactor.

Yes, this was with the compactor enabled. It is also possible I did something wrong in the experiment.

PS in my experiments it sometimes took >10 minutes after the ClearRange until things started falling apart. That is, I would generally like to see graphs on somewhat longer timescales -- that's when the compaction pressure really shows -- just to make sure it doesn't go downhill after we stopped looking.

I've seen the same behavior where sometimes the cluster wedges quickly and other times it takes many minutes for it to fall apart.

It will always wait for at least two minutes from the first suggestion, by design. Were you seeing longer pauses?

No, I had a simple bug. I wasn't seeing any of my "clear range" compactions because I had added SuggestedCompaction.Clear which isn't persisted. I needed to move that to SuggestedCompaction.Compaction.Clear.

My latest experiment was an extension of the SuggestedCompaction.Compaction.Clear approach. The ClearRange KV operation is changed to not perform a RocksDB ClearRange operation, but to instead create a suggested compaction with the Clear bit set. The compaction queue is modified to not aggregate compactions with the Clear bit set (this part was important to avoid multi-GB compactions). Before performing a "clearing" compaction, a RocksDB ClearRange operation is performed and then the memtable is flushed. I extended the clearrange roachtest to run for an hour and the concurrent COUNT(*) operations never encountered problems.
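
Here's a minimal sketch of that flow, with an illustrative Engine interface and suggestion shape rather than the actual CockroachDB types:

// Engine is an illustrative stand-in for the storage engine interface; the
// real CockroachDB types differ.
type Engine interface {
    ClearRange(start, end []byte) error   // write a RocksDB range deletion tombstone
    Flush() error                         // flush the memtable so the tombstone reaches an sstable
    CompactRange(start, end []byte) error // manual compaction of the span
}

// suggestedCompaction mirrors the idea of a suggestion carrying a Clear bit;
// the field layout here is assumed.
type suggestedCompaction struct {
    startKey, endKey []byte
    clear            bool // perform ClearRange + Flush immediately before compacting
}

// processSuggestion is roughly what the compaction queue would do for one
// suggestion. Clearing suggestions are not aggregated with others, so each
// one stays small.
func processSuggestion(eng Engine, sc suggestedCompaction) error {
    if sc.clear {
        // Defer the range deletion to just before the compaction so the
        // tombstone exists only briefly before being compacted away.
        if err := eng.ClearRange(sc.startKey, sc.endKey); err != nil {
            return err
        }
        if err := eng.Flush(); err != nil {
            return err
        }
    }
    return eng.CompactRange(sc.startKey, sc.endKey)
}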

screen shot 2018-05-29 at 11 50 31 am

Live bytes on the cluster very quickly dropped to 0.

screen shot 2018-05-29 at 11 50 27 am

Strangely, used bytes (which is computed based on the actual disk usage) increased. A more recent snapshot of this graph shows that used space is now decreasing rapidly.

screen shot 2018-05-29 at 11 57 29 am

screen shot 2018-05-29 at 11 58 55 am

Looks like RocksDB decided to start compacting, though I'm not sure what triggered that. Perhaps the compaction queue. I need to poke around more.

screen shot 2018-05-29 at 11 50 41 am

RocksDB sstables were decreasing, though I believe this metric counts the number of "live" sstables.

screen shot 2018-05-29 at 11 50 51 am

The number of open file descriptors didn't decrease with the number of sstables. This suggests some sort of sstable leak.

I tried flushing the RocksDB memtable whenever a replica proposal contains suggested compactions. The idea being that the suggested compactions correspond to a batch containing a RocksDB ClearRange operation and it is beneficial to flush that operation. The cluster wedged quickly with this experiment. I'm going to add some debug logs to RocksDB to validate my understanding of what is going on at the RocksDB level.

Investigation of RocksDB shows that RangeDelAggregator causes significant performance overhead to iterator creation when there are lots of tombstones in the memtable. The benchmark numbers below show how iterator creation/destruction slows down as the memtable goes from 0-10000 tombstones:

name                                              time/op
RocksDBDeleteRangeNewIterator/tombstones=0-8      1.26µs ± 4%
RocksDBDeleteRangeNewIterator/tombstones=1-8      2.54µs ± 2%
RocksDBDeleteRangeNewIterator/tombstones=10-8     5.57µs ± 2%
RocksDBDeleteRangeNewIterator/tombstones=100-8    50.8µs ± 3%
RocksDBDeleteRangeNewIterator/tombstones=1000-8    673µs ± 2%
RocksDBDeleteRangeNewIterator/tombstones=10000-8  9.19ms ± 2%

Note that in the clearrange test there are 43k ranges spread across 10 nodes, so ~13k replicas per node. Since there is no other traffic on the system, we can expect that the memtables on each of these nodes will contain ~13k tombstones (once the ClearRange operation finishes).

Note that the above slowdown is all CPU spinning. I ran another experiment, the kv95/encrypt=false/nodes=1 roachtest, with a CPU spin-loop in DBNewIter that varied from no-delay to a 10ms delay to see what this did to performance:

0us      21745 ops/sec
1us      21980 ops/sec
10us     21097 ops/sec
100us    16823 ops/sec
1ms      5853  ops/sec
10ms     799   ops/sec

When doing this experiment, CPU was absolutely pegged on my machine, which matches up with what I've seen when running the clearrange test. I'm pretty confident at this point that the use of a large number of range tombstones is the core of this problem. There is an acute problem with large numbers of range tombstones in the memtable and L0, because such tombstones cause significant CPU usage during iterator creation even when accessing unrelated parts of the key space. We can either attempt to fix RocksDB (see #26323), or work around this limitation by being much more careful in how we create tombstones.

I'm not sure what to make of this: I'm seeing DBCompactRange calls that never return. I spotted one goroutine that had been blocked for 19min in a DBCompactRange call for a span of keys that supposedly covers only 32MiB. A CPU profile showed RocksDB churning CPU in snappy::internal::CompressFragment. It is possible that DBCompactRange has some sort of logic bug and is either looping or trying to compact a range of keys significantly larger than requested.

DBCompactRange breaks up a large compaction request into a series of smaller requests. It appears to have a bug in that logic: both the beginning and ending keys of the span to compact are handled incorrectly. This likely has a very bad effect on 2.0 performance whenever the compaction queue requests a compaction.
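
For illustration (the real logic lives in libroach's DBCompactRange, in C++), here's a Go sketch of splitting a manual compaction into sub-spans while clamping to the requested bounds. Mishandling that clamping is exactly how a 32MiB request can end up compacting most of the store:

import "bytes"

// splitCompaction breaks [start, end) into sub-spans at the given sstable
// boundary keys (assumed sorted). The first and last sub-spans are clamped to
// the requested start and end, so the compaction never covers more of the
// keyspace than was asked for. Purely illustrative.
func splitCompaction(start, end []byte, boundaries [][]byte) [][2][]byte {
    var spans [][2][]byte
    cur := start
    for _, b := range boundaries {
        // Only boundaries strictly inside (cur, end) introduce a split point.
        if bytes.Compare(b, cur) <= 0 || bytes.Compare(b, end) >= 0 {
            continue
        }
        spans = append(spans, [2][]byte{cur, b})
        cur = b
    }
    // The final sub-span ends at the requested end key, not at /Max.
    spans = append(spans, [2][]byte{cur, end})
    return spans
}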

Progress. The graphs below are from a run of the clearrange roachtest with the DBCompactRange bug fix and ClearRange performed by the compaction queue instead of by the ClearRange operation (this avoids a build up of thousands of range deletion tombstones).

screen shot 2018-06-03 at 9 57 34 pm

screen shot 2018-06-03 at 9 57 26 pm

This is the first time I've seen the Used bytes metric decreasing. Logging shows that the compaction queue is steadily churning through the ClearRange compactions.

screen shot 2018-06-03 at 9 57 48 pm

screen shot 2018-06-03 at 9 57 59 pm

Note that the compaction queued bytes metric is only updated at the end of each cycle of processing suggested compactions. That is easy to fix and would give a more accurate picture of the progress the compaction queue is making.

That looks promising! I knew something was up in CompactRange. Glad you've found a bug there.

For packing this up into something we can comfortably ship in 2.0.x, a challenge is dealing with the consistency violations created by deferring ClearRange to right before the compactions. I have the following suggestion there, though I don't know if it performs well enough - we'd have to try:

  1. reduce the compaction queue min idle time from two minutes to something like ten seconds
  2. perform engine.ClearRange from the ClearRange RPC, but tie it a little closer to the compaction queue. Instead of simply returning after the suggestion has been made, we block (below Raft per replica) until the compaction queue has processed the suggestion, and if the suggestion was discarded, force the compaction (or put a bit on the suggestion that avoids it being discarded; we 100% want it to happen in this case to get rid of the range tombstone).
  3. this should still make steady progress: DistSender can have up to ~100 requests in flight concurrently, and it also dispatches these requests in ascending key order.
  4. to combat the case in which the schema changer 5 minute timeout is exceeded, make the ClearRange RPC a no-op if there's nothing to clear (we should do that anyway).

This is definitely not ideal, but perhaps it's good enough for a patch release. Working around the consistency issues is likely trickier and implies marking ranges as nuked, which is problematic since we're also trying to start merging ranges.

perform engine.ClearRange from the ClearRange RPC, but tie it a little closer to the compaction queue. Instead of simply returning after the suggestion has been made, we block (below Raft per replica) until the compaction queue has processed the suggestion, and if the suggestion was discarded, force the compaction (or put a bit on the suggestion that avoids it being discarded; we 100% want it to happen in this case to get rid of the range tombstone).

Blocking below Raft makes me anxious. There shouldn't be any other activity on these ranges, but I wonder what unexpected activity we'll see that could cause this to be problematic.

This is definitely not ideal, but perhaps it's good enough for a patch release. Working around the consistency issues is likely trickier and implies marking ranges as nuked, which is problematic since we're also trying to start merging ranges.

Cc @benesch for his insight here. I was thinking we'd have a marker key that indicated a Range had been cleared and could not be written to again. I haven't thought through this very far. Perhaps this will be its own can of worms.

An option here is to revert the use of ClearRange() for 2.0 and get this in better shape for 2.1

An option here is to revert the use of ClearRange() for 2.0 and get this in better shape for 2.1

I believe @tschottdorf tried this and not using ClearRange still left the cluster in a bad state. We should double check this.

Notice that the timestamps are ~5min apart. I believe this corresponds to the 5min SchemaChangeLeaseDuration. I haven't verified it yet, but I'd be willing to bet that the ClearRange operation is taking longer than 5min to complete, which causes the schema change lease for this table to be picked up by another node, which then executes another ClearRange operation that also takes longer than 5min, and so on.

We have schema changes that take much, much longer than five minutes. We just need to call sc.ExtendLease periodically while the ClearRange commands are occurring. We'll presumably want to plumb some notion of "progress" into the schema changer so that we don't extend the lease indefinitely if a ClearRange command really does wedge.

Now that the compaction queue is fixed, can we solve this with just an aggressive rate limit? If the schema changer did its own splitting of the ClearRange command, instead of relying on dist sender, it could send only _n_ ClearRange commands per node per minute.

Blocking below raft (where exactly?) seems like a bad idea. And reducing the idle time of the compaction queue might increase wasted work: the reason it's so high is to allow multiple compactions to be coalesced, minimizing the amount of data rewritten (remember that even though these ranges are empty, they're smaller than an sstable, so there may be non-deleted data next to them).

Now that the compaction queue is fixed, can we solve this with just an aggressive rate limit? If the schema changer did its own splitting of the ClearRange command, instead of relying on dist sender, it could send only n ClearRange commands per node per minute.

This seems like the most viable option to me.

I'll work on a change to read all the ranges affected by ClearRange and then issue ClearRange over chunks of ranges (perhaps 1 to start with), with calls to sc.ExtendLease in between. This is similar to how distsql does physical planning today.
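
Roughly like this, with hypothetical helpers standing in for the real RPC and lease plumbing:

import (
    "context"
    "time"
)

// span is an illustrative key span; the real schema changer works with
// roachpb spans.
type span struct{ start, end []byte }

// clearTableInChunks issues ClearRange over the dropped table's spans in small
// chunks, renewing the schema change lease and pausing between chunks so that
// range tombstones never pile up faster than the compactor can clear them.
// sendClearRange and extendLease are hypothetical stand-ins for the real RPC
// and sc.ExtendLease plumbing.
func clearTableInChunks(
    ctx context.Context,
    spans []span,
    chunkSize int,
    pause time.Duration,
    sendClearRange func(context.Context, span) error,
    extendLease func(context.Context) error,
) error {
    for i := 0; i < len(spans); i += chunkSize {
        j := i + chunkSize
        if j > len(spans) {
            j = len(spans)
        }
        for _, sp := range spans[i:j] {
            if err := sendClearRange(ctx, sp); err != nil {
                return err
            }
        }
        // Keep the schema change lease alive and give the compaction queue a
        // chance to catch up before the next chunk.
        if err := extendLease(ctx); err != nil {
            return err
        }
        select {
        case <-time.After(pause):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return nil
}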

Just chatted with @benesch. He's going to work on the 2.0.x mitigation and has a few experiments to run, such as directly invoking DBCompactRange after committing a batch containing a ClearRange, flushing the memtable after committing a batch containing a ClearRange, and prioritizing the compaction of sstables with too many range deletion tombstones (#24057), which didn't previously show any benefit but whose benefit might have been obscured by the DBCompactRange bug. We're at the point now where we understand the outlines of what needs to be done, but need some more experimentation to determine which approach both works and is not too intrusive.

In parallel, I'm going to work on fixing range tombstones upstream in RocksDB. I doubt this work will be appropriate for back-porting to 2.0.x, but we'll know more in a week or two.

Ok, there's something very broken with the clearrange store dirs, and perhaps more broadly with whatever was used to generate them. On a hunch I applied this diff to the test to remove the DROP TABLE:

diff --git a/pkg/cmd/roachtest/clearrange.go b/pkg/cmd/roachtest/clearrange.go
index 70ace8262..106f48283 100644
--- a/pkg/cmd/roachtest/clearrange.go
+++ b/pkg/cmd/roachtest/clearrange.go
@@ -69,16 +69,6 @@ func registerClearRange(r *registry) {
                    return err
                }

-               // Set a low TTL so that the ClearRange-based cleanup mechanism can kick in earlier.
-               // This could also be done after dropping the table.
-               if _, err := conn.ExecContext(ctx, `ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}'`); err != nil {
-                   return err
-               }
-
-               if _, err := conn.ExecContext(ctx, `DROP TABLE bank.bank`); err != nil {
-                   return err
-               }
-
                // Spend a few minutes reading data with a timeout to make sure the DROP
                // above didn't brick the cluster.
                //

Shouldn't be any compactions happening in that cluster, right?

$ roachprod run benesch-clearrange -- 'grep compactor.go logs/cockroach.log'
benesch-clearrange: grep compactor.go logs/cock... 10/10
   1: 
Process exited with status 1
   2: I180605 05:38:18.680504 169 storage/compactor/compactor.go:326  [n8,s8,compactor] processing compaction #20-92/463 (/Table/51/1/467222/0-/Table/51/1/2495965/0) for 427 MiB (reasons: size=true used=false avail=false)
I180605 05:38:19.210780 169 storage/compactor/compactor.go:342  [n8,s8,compactor] processed compaction #20-92/463 (/Table/51/1/467222/0-/Table/51/1/2495965/0) for 427 MiB in 530.241647ms
I180605 05:38:20.107331 169 storage/compactor/compactor.go:326  [n8,s8,compactor] processing compaction #345-373/463 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB (reasons: size=true used=false avail=false)
I180605 05:40:04.513392 169 storage/compactor/compactor.go:342  [n8,s8,compactor] processed compaction #345-373/463 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB in 1m44.405947506s
   3: I180605 05:38:54.126078 1849 storage/compactor/compactor.go:326  [n4,s4,compactor] processing compaction #419-1441/1442 (/Table/51/1/18669166-/Max) for 4.4 GiB (reasons: size=true used=false avail=false)
   4: 
Process exited with status 1
   5: 
Process exited with status 1
   6: I180605 05:38:32.403214 859 storage/compactor/compactor.go:326  [n5,s5,compactor] processing compaction #55-156/1843 (/Table/51/1/1364893/0-/Table/51/1/3487708) for 651 MiB (reasons: size=true used=false avail=false)
I180605 05:46:38.469327 859 storage/compactor/compactor.go:342  [n5,s5,compactor] processed compaction #55-156/1843 (/Table/51/1/1364893/0-/Table/51/1/3487708) for 651 MiB in 8m6.066106127s
I180605 05:46:40.070311 859 storage/compactor/compactor.go:326  [n5,s5,compactor] processing compaction #221-1842/1843 (/Table/51/1/4610515-/Max) for 5.8 GiB (reasons: size=true used=false avail=false)
   7: 
Process exited with status 1
   8: 
Process exited with status 1
   9: I180605 05:38:52.913420 915 storage/compactor/compactor.go:326  [n2,s2,compactor] processing compaction #39-75/1095 (/Table/51/1/2839711-/Table/51/1/3859338/0) for 257 MiB (reasons: size=true used=false avail=false)
I180605 05:41:04.507447 915 storage/compactor/compactor.go:342  [n2,s2,compactor] processed compaction #39-75/1095 (/Table/51/1/2839711-/Table/51/1/3859338/0) for 257 MiB in 2m11.594000298s
I180605 05:41:05.053543 915 storage/compactor/compactor.go:326  [n2,s2,compactor] processing compaction #239-1094/1095 (/Table/51/1/10569964-/Max) for 3.9 GiB (reasons: size=true used=false avail=false)
  10: I180605 05:38:20.757556 232 storage/compactor/compactor.go:326  [n9,s9,compactor] processing compaction #87-117/218 (/Table/51/1/8422088-/Table/51/1/10253285/0) for 288 MiB (reasons: size=true used=false avail=false)
I180605 05:38:20.870035 232 storage/compactor/compactor.go:342  [n9,s9,compactor] processed compaction #87-117/218 (/Table/51/1/8422088-/Table/51/1/10253285/0) for 288 MiB in 112.445892ms
Error:  Process exited with status 1

Pay special attention to the compactions that extend to /Max:

  • /Table/51/1/18669166-/Max
  • /Table/51/1/4610515-/Max

Sure enough, these pin read/write I/O at ~75MB/s on the affected nodes, since one RocksDB thread is rewriting basically all the data in the store, constantly. This doesn't impact performance enough on its own to fail the roachtest, but you can bet that it's not helping matters. DROP TABLE can't rely on the compaction queue to purge range deletions on the affected nodes because the compaction queue is busy rewriting all the data in the darn store.

So where are these tombstones coming from? I have a hunch (see https://github.com/cockroachdb/cockroach/blob/93da1e6622fd1d943fa93822a38510ce7143226e/pkg/storage/replica.go#L789-L796). I suspect that during the RESTORE to generate this data, the last range at some point in time, say [/Table/51/1/4610515-/Max), got rebalanced from one node to another, leaving a deletion tombstone and compaction suggestion in its place. Unfortunately, RESTORE then split that range approximately a billion times, installed hundreds of gigabytes of data in that area of the keyspace, and 3/10 of those replicas ended up on the node with the compaction suggestion. Installing these compaction suggestions when rebalancing seems fraught with peril.

I'll bet this is part of the reason that none of our strategies, until Peter's, made any dent in the problem. Anything that relies on the compaction queue to compact anything in a timely fashion is hopeless. (~~One of the affected nodes is still compacting after 30m.~~ It took 31m1s.) Peter (inadvertently?) turned off the compaction queue for preexisting suggestions and skirted this nastiness.

Another point of confusion: aren't these compaction suggestions supposed to expire after 24h? How are these suggested compactions still relevant given that the store dirs were generated months ago?

Another point of confusion: aren't these compaction suggestions supposed to expire after 24h? How are these suggested compactions still relevant given that the store dirs were generated months ago?

Oh. Suggestions only age out if they're unactionable for 24h. A two month old suggestion is still valid as long as you can act on it as soon as you see it.

Yowza. Good find @benesch. We should discard compactions immediately when they're above 2x the max age (between 1x and 2x the max age might still be good to compact because of the way the compactor times itself -- once it skips something once, it potentially sleeps for the max age).

Or rather, forget the 2x thing. Just make the main compaction queue loop timer maxAge/2 instead of maxAge as it is today. Then we can categorically not process suggestions older than maxAge.
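
In sketch form (illustrative names, not the compactor's actual fields):

import "time"

// Anything older than the max age is categorically dropped; with the main
// loop ticking every maxSuggestionAge/2, every suggestion is still examined
// at least once before it can expire.
const maxSuggestionAge = 24 * time.Hour

func shouldProcess(suggestedAt, now time.Time) bool {
    return now.Sub(suggestedAt) <= maxSuggestionAge
}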

Nice find. I had noticed these existing suggested compactions (that's why I disabled non-Clear compactions in the compaction queue) but wanted to focus on the ClearRange problem. I meant to revisit but this slipped my mind.

How many bugs will this issue uncover?

PS I'm tracking down another one: the tiny gap between suggested compactions. Definitely not in the compactor code.


Definitely not in the compactor code.

Must be DistSender?

Discussed on Slack.
An alternative is to have the SQL code rename the table upon DROP into some "trash" database, then issue a background job to delete the data. The background job could use smaller ClearRange requests (over subsets of the table) and alternate ClearRange with explicit compaction requests.

Also the work could be paced with some intermediate delay so that it is spread over a long period of time, to alleviate the performance impact on concurrent queries.

DistSender seemed to be doing the right thing when I tried to reproduce on a smaller test scenario. Might be something specific to the clearrange store dirs or how they were generated (i.e., restore/import).


PS I'm tracking down another one: the tiny gap between suggested compactions. Definitely not in the compactor code.

DistSender seemed to be doing the right thing when I tried to reproduce on a smaller test scenario. Might be something specific to the clearrange store dirs or how they were generated (i.e., restore/import).

This is ringing a distant bell about a RESTORE/IMPORT bug. I can't track it down, though.

Any hunches on how RESTORE/IMPORT could mis-generate data to cause this behavior? The range descriptors line up, right?

An alternative is to have the SQL code rename the table upon DROP into some "trash" database, then issue a background job to delete the data.

This happens already! Except that the background job (the schema changer) issues all the ClearRange requests at once. I have a WIP PR to throttle the ClearRange requests.

Erm, seems like this data set might just have a bunch of garbage ranges that span from, e.g., /Table/51/1/948949{-/0}.

This data set is pretty old, right? Not the store dirs, but the backup that they're generated from. Perhaps it's time to throw it away and generate a new one from scratch.

Any hunches on how RESTORE/IMPORT could mis-generate data to cause this behavior? The range descriptors line up, right?

No hunches. I am going to check the range descriptors this morning.

Erm, seems like this data set might just have a bunch of garbage ranges that span from, e.g., /Table/51/1/948949{-/0}.

I think we're seeing this on tpcc fixtures too. See https://github.com/cockroachdb/cockroach/issues/26375.

The range descriptors line up. The problem is the empty ranges (see #26375). The clearrange data set (store dirs created from a bank fixture) shows:

root@localhost:26257/defaultdb> select * from [show testing_ranges from table bank.bank] limit 10;
+-----------+----------+----------+----------+--------------+
| Start Key | End Key  | Range ID | Replicas | Lease Holder |
+-----------+----------+----------+----------+--------------+
| NULL      | /0/0     |       22 | {1,3,5}  |            1 |
| /0/0      | /3273    |       37 | {1,3,5}  |            1 |
| /3273     | /3273/0  |       38 | {4,7,9}  |            4 |
| /3273/0   | /6546    |       23 | {2,5,7}  |            5 |
| /6546     | /6546/0  |       25 | {6,7,10} |            6 |
| /6546/0   | /9819    |       27 | {2,6,7}  |            6 |
| /9819     | /9819/0  |       29 | {6,7,10} |            6 |
| /9819/0   | /13092   |       61 | {5,7,9}  |            9 |
| /13092    | /13092/0 |       63 | {3,4,9}  |            4 |
| /13092/0  | /16365   |       65 | {5,6,7}  |            6 |
+-----------+----------+----------+----------+--------------+

See all of those empty ranges such as /3273-/3273/0? We're correctly sending ClearRange operations to each of those ranges, but because they are empty they don't result in suggested compactions because of this code. That code exists to avoid creating a range tombstone when only a small amount of data is being deleted. The result is that we see suggested compactions for /0/0-/3273, then /3273/0-/6546, then /6546/0-/9819, etc. So this answers the mystery of the gap between suggested compactions. Let's move the discussion to the cause of the empty ranges to #26375.

ClearRange should probably suggest a compaction even if it decides not to use a range tombstone. The compactor will ignore the suggestion if it's too small, but this allows it to stitch up the consecutive gaps.

ClearRange should probably suggest a compaction even if it decides not to use a range tombstone. The compactor will ignore the suggestion if it's too small, but this allows it to stitch up the consecutive gaps.

The compactor already aggregates compactions together which don't overlap. See https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/compactor/compactor.go#L389-L400. But I don't allow aggregation for "clearing" compactions (in my PR). I could certainly have some additional logic to fix this. I'm running an experiment right now to see if it would be worthwhile as we should be able to have a far smaller number of range tombstones and perform far fewer compactions.
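
For reference, here's a minimal sketch of that aggregation logic — merging suggestions whose spans touch or overlap — with illustrative types rather than the compactor's actual ones:

import (
    "bytes"
    "sort"
)

type suggestion struct{ start, end []byte }

// aggregate merges suggestions whose spans touch or overlap into a single
// suggestion, so a run of adjacent cleared ranges becomes one compaction even
// when tiny empty ranges sit between them.
func aggregate(suggs []suggestion) []suggestion {
    sort.Slice(suggs, func(i, j int) bool {
        return bytes.Compare(suggs[i].start, suggs[j].start) < 0
    })
    var out []suggestion
    for _, s := range suggs {
        if n := len(out); n > 0 && bytes.Compare(s.start, out[n-1].end) <= 0 {
            if bytes.Compare(s.end, out[n-1].end) > 0 {
                out[n-1].end = s.end
            }
            continue
        }
        out = append(out, s)
    }
    return out
}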

To return to @tschottdorf 's "blocking below raft" idea, what if we did something similar but blocked above raft? I'm thinking of a new semaphore tracking the number of range tombstones that have been written but not compacted away. We'd acquire the semaphore early when processing ClearRange (in beginCmds, around the time of the command queue), then release it either when ClearRange decides not to write a range tombstone or when the compaction it suggests has been processed (with some tuning of the compactor to not let clearing compactions sit around for too long).

This would limit the number of range tombstones that could be written within a single incarnation of the process. Would we need to persist something across restarts? Maybe, but hopefully not if we set the limit low enough.
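
A rough sketch of such a semaphore, using a buffered channel; the names and the exact release points are assumptions about where this would plug in:

import "context"

// tombstoneSem bounds the number of range tombstones that have been written
// but not yet compacted away. ClearRange would acquire a slot before writing
// a tombstone; the slot is released either when ClearRange decides not to
// write one, or when the suggested compaction for that span has been
// processed.
type tombstoneSem chan struct{}

func newTombstoneSem(limit int) tombstoneSem { return make(tombstoneSem, limit) }

func (s tombstoneSem) acquire(ctx context.Context) error {
    select {
    case s <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func (s tombstoneSem) release() { <-s }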

Check it out:

screenshot 2018-06-06 00 14 19

That's with nothing but some ClearRange rate limiting. See #26449.

Shoot. I just ran an experiment on top of the compactor fix (#26488) and the clear range rate limiter (#26449) that crashed and burned:

screen shot 2018-06-06 at 9 17 33 pm

Only two of the nodes seemed to be truly suffering, which I suppose is an improvement. It was enough to grind the schema change to a halt, though.

I didn't have enough logging on to tell exactly what caused the explosion, but the end result was 40 goroutines stuck in syscalls for _multiple minutes_: https://gist.github.com/benesch/cbe7fff9e1005056ca4355f3840a8c80

My money is on the consistency queue. If it's not the root cause it's certainly exacerbating the situation. There are 31 goroutines in computeChecksumPostApply.func1 in this stack trace, and I captured one earlier with 65 (!) such goroutines.

It seems that the consistency queue is missing a backpressure mechanism. If a leaseholder doesn't hear back from a follower within 5s, it assumes the follower is dead and moves on to the next replica; if the follower is not, in fact, dead, but severely overloaded, these consistency queue checks just pile up and pile up.

@petermattis @tschottdorf can either of you take a look at the stack trace and lend your insight?

Just to confirm your observation of the cons checker as the root cause, could you go through a few runs with the consistency checker turned off? And if that fixes the problem, how about a version where the cons checker runs, but with a very high timeout (instead of the 5s)? Once we have that data, I think we can make an informed decision:

  1. either a consistency check at a time is enough to bring the cluster to its knees, and that would indicate that we need to tackle the range tombstones more directly (if a consistency check becomes really slow, real user traffic would, too -- though we should test that anyway)
  2. if it's really the piling up that causes the problem, we can either avoid the piling up via context cancellation (I assume this just doesn't happen right now - should implement that either way) and/or rate limiting the number of active consistency check post apply triggers.
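
If the pile-up turns out to be the culprit, option 2 could look roughly like this sketch: a small semaphore around the checksum computation that also respects cancellation (illustrative only; the real computeChecksumPostApply plumbing differs):

import "context"

// checksumSem caps the number of consistency-check computations running at
// once on a node; runChecksum stands in for the real checksum work.
var checksumSem = make(chan struct{}, 2)

func maybeRunChecksum(ctx context.Context, runChecksum func(context.Context) error) error {
    select {
    case checksumSem <- struct{}{}:
        defer func() { <-checksumSem }()
        return runChecksum(ctx)
    case <-ctx.Done():
        // If the leaseholder has already given up waiting (the 5s timeout),
        // drop the check instead of letting goroutines pile up.
        return ctx.Err()
    }
}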

I didn't have enough logging on to tell exactly what caused the explosion, but the end result was 40 goroutines stuck in syscalls for multiple minutes: https://gist.github.com/benesch/cbe7fff9e1005056ca4355f3840a8c80

The interesting part to me is that these goroutines are stuck in engine._Cfunc_DBIterSeek. Excessive range tombstones slow down iterator creation, but once the RangeDelAggregator is populated the overhead for seeking should be minimal (tombstones still have to be added for each sstable, but below L0 there shouldn't be many).

Perhaps we've stumbled upon another RocksDB bug. I'm struggling to imagine what could cause a Seek to take multiple minutes.

@petermattis something I've generally wanted to be able to do multiple times was to actually peek into RocksDB while we're in that state. What's the appropriate tool to do so? gdb et al? Assuming the DBIterSeek is at the bottom of the stack, it's probably waiting on some mutex.

edit: there's no way it's at the bottom of the stack. Rather what I'm trying to say is "what's the full stack there and what's at the bottom of it".

@petermattis something I've generally wanted to be able to do multiple times was to actually peek into RocksDB while we're in that state. What's the appropriate tool to do so? gdb et al? Assuming the DBIterSeek is at the bottom of the stack, it's probably waiting on some mutex.

edit: there's no way it's at the bottom of the stack. Rather what I'm trying to say is "what's the full stack there and what's at the bottom of it".

A CPU profile may show where these goroutines are blocked on the C++ side. It's unfortunate that you can't extract the stack for a specific goroutine from a CPU profile. Alternately, it would be nice if the normal goroutine dump showed the C++ stacks as well, though I understand the technical reason they currently don't (Go would have to send a SIGPROF signal to all threads, while the current runtime.Stacks output is produced by looping over the goroutines).

I haven't tried gdb.

We could use github.com/google/pprof/profile to parse a CPU profile and look for a sample that contains _Cfunc_DBIterSeek. Somewhat onerous. Off topic: we should investigate using profiler labels (https://rakyll.org/profiler-labels/) at some point.
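
Something like the following should work, assuming a CPU profile written to cpu.pprof (the path is an assumption); it uses only profile.Parse and string matching on function names:

package main

import (
    "fmt"
    "log"
    "os"
    "strings"

    "github.com/google/pprof/profile"
)

// hasDBIterSeek reports whether any frame in the sample's stack is the cgo
// shim for DBIterSeek.
func hasDBIterSeek(s *profile.Sample) bool {
    for _, loc := range s.Location {
        for _, line := range loc.Line {
            if line.Function != nil && strings.Contains(line.Function.Name, "_Cfunc_DBIterSeek") {
                return true
            }
        }
    }
    return false
}

func main() {
    f, err := os.Open("cpu.pprof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    p, err := profile.Parse(f)
    if err != nil {
        log.Fatal(err)
    }
    // Print the sample values (e.g. CPU nanoseconds) for every sample whose
    // stack passes through DBIterSeek.
    for _, s := range p.Sample {
        if hasDBIterSeek(s) {
            fmt.Println(s.Value)
        }
    }
}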

By the way, there was code added originally with the consistency queue mechanism to avoid scheduling for newly created ranges, such as after splits and rebalancings. Objections kept that feature out, but perhaps it should be revived. We are too aggressive in scheduling these checks.

@spencerkimball on the other hand, that would've potentially hidden this issue here. Your suggestion fits into https://github.com/cockroachdb/cockroach/issues/26151 in spirit, though.

Ok, I have a theory!

Just to confirm your observation of the cons checker as the root cause, could you go through a few runs with the consistency checker turned off?

Turning off the consistency checker helps quite a bit, but doesn't fix the problem. Other goroutines still get wedged and the CPU is still pegged.

So I turned off the scanner and the problem went poof. (I didn't try running a schema change, but I was able to get the cluster up and running again.) This makes quite a bit of sense: the scanner's pace interval is essentially zero because there are more replicas on each node than there are seconds in the default scan interval (10m, I think?). Every replica scanned triggers a few calls to getQueueLastProcessed/setQueueLastProcessed, each of which create an iterator. When iterator creation is slowed down by range tombstones, that can be a few hundred milliseconds of CPU time right there.
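
Back-of-the-envelope, assuming the ~13k replicas per node mentioned earlier and the 10m default scan interval: 13,000 replicas / 600s ≈ 22 replicas scanned per second, each triggering a few iterator creations at up to a few hundred milliseconds of CPU apiece — easily enough to keep several cores pegged without doing any useful work.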

I stopped seeing stuck goroutines after that... except for one.

goroutine 307 [syscall, 4 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7fd83ba1ab10, 0xc427fb85a7, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    _cgo_gotypes.go:666 +0x74
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7fd83ba1ab10, 0xc427fb85a7, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1951 +0xba
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc42f60efc0, 0xc427fb85a7, 0x7, 0x13, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1951 +0x16b
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).fetchSuggestions.func1(0xc427fb85a0, 0x19, 0x1a, 0x0, 0x0, 0xc4334611c0, 0xf, 0xf, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:309 +0x284
github.com/cockroachdb/cockroach/pkg/storage/engine.dbIterate(0x7fd83bb181b0, 0x2b0c0a0, 0xc420678340, 0xc4202b510a, 0x6, 0x6, 0x0, 0x0, 0xc4202b5110, 0x6, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2448 +0x320
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Iterate(0xc420678340, 0xc4202b510a, 0x6, 0x6, 0x0, 0x0, 0xc4202b5110, 0x6, 0x6, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:798 +0xbf
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).fetchSuggestions(0xc4209ae1e0, 0x2afa760, 0xc427bc50b0, 0xc42031a1e0, 0x26c7338, 0x1d, 0x2afa760, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:295 +0x306
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).processSuggestions(0xc4209ae1e0, 0x2afa760, 0xc427bc50b0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:207 +0x11d
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).Start.func1(0x2afa760, 0xc429ae75c0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:147 +0x1d9
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42b0b8c00, 0xc4208085a0, 0xc42bc71c60)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:186 +0xad

The code I added to the compactor is getting bogged down by range tombstones. As you might imagine, this leads to even more range tombstones. 🤦‍♂️

On the bright side, because there was only one thread in C++, it was easy to capture a stacktrace:

#0  snappy::SnappyDecompressor::DecompressAllTags<snappy::SnappyArrayWriter> () at /go/src/github.com/cockroachdb/cockroach/c-deps/snappy/snappy.cc:747
#1  InternalUncompressAllTags<snappy::SnappyArrayWriter> () at /go/src/github.com/cockroachdb/cockroach/c-deps/snappy/snappy.cc:844
#2  InternalUncompress<snappy::SnappyArrayWriter> () at /go/src/github.com/cockroachdb/cockroach/c-deps/snappy/snappy.cc:831
#3  0x0000000002011273 in snappy::RawUncompress () at /go/src/github.com/cockroachdb/cockroach/c-deps/snappy/snappy.cc:1197
#4  snappy::RawUncompress () at /go/src/github.com/cockroachdb/cockroach/c-deps/snappy/snappy.cc:1192
#5  0x0000000001ec4d34 in rocksdb::Snappy_Uncompress () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/compression.h:190
#6  rocksdb::UncompressBlockContentsForCompressionType () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/format.cc:287
#7  0x0000000001ec4ee9 in rocksdb::UncompressBlockContents () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/format.cc:393
#8  0x0000000001ec0406 in rocksdb::BlockFetcher::ReadBlockContents () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_fetcher.cc:221
#9  0x0000000001ead52b in ReadBlockFromFile () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:87
#10 0x0000000001eb006a in rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1612
#11 0x0000000001eb02fc in rocksdb::BlockBasedTable::NewDataBlockIterator ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1489
#12 0x0000000001eb0b73 in rocksdb::BlockBasedTableIterator::InitDataBlock ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1907
#13 0x0000000001eb0d0a in rocksdb::BlockBasedTableIterator::FindKeyForward ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1928
#14 0x0000000001e41c09 in rocksdb::IteratorWrapper::Next () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:61
#15 Next () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:586
#16 0x0000000001ed07c2 in rocksdb::IteratorWrapper::Next () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:61
#17 rocksdb::MergingIterator::Next () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/merging_iterator.cc:170
#18 0x0000000001df058c in rocksdb::DBIter::FindNextUserEntryInternal () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_iter.cc:591
#19 0x0000000001df1648 in rocksdb::DBIter::FindNextUserEntry () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_iter.cc:397
#20 rocksdb::DBIter::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_iter.cc:1180
#21 0x0000000001d2267a in DBIterSeek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/db.cc:549
#22 0x0000000001c88313 in _cgo_cf94bd8886ad_Cfunc_DBIterSeek (v=0xc42f688f48) at cgo-gcc-prolog:572
#23 0x00000000006f47e0 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:688
#24 0x00000000006f2111 in runtime.exitsyscallfast.func1 () at /usr/local/go/src/runtime/proc.go:3034
#25 0x000000c42bc7f680 in ?? ()
#26 0x00000000006ca580 in ?? () at /usr/local/go/src/runtime/proc.go:1087
#27 0x00007fd800000000 in ?? ()
#28 0x0100000000000150 in ?? ()

I watched one of these seeks climb up to 15m in C++, alternating between adding range tombstones and reading new blocks. Then it dawned on me. This sounds a lot like https://www.cockroachlabs.com/blog/adventures-performance-debugging/, doesn't it? Just on a scale of minutes, instead of seconds.

So here's the theory: if you end up with a bunch of contiguous range deletion tombstones over gigabytes of data, seeking to a key within the beginning of that region is going to be a bad time. To complete the seek RocksDB has to load gigabytes of data into memory and then run every single key through ShouldDelete. Oops. And if you run multiple of these seeks at once... they'll thrash the block cache into oblivion.

@petermattis I'd love to get your help in verifying this theory. I know that our iterators support some integration with RocksDB's perf contexts to get stats about a slow scan.

Seems like the solution is to install an iterate_upper_bound on basically every single iterator. In most cases we know exactly the bounds we'd like to iterate; we just don't tell RocksDB about them until it's too late.
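
Roughly, the idea is that every scan which knows its end key should hand that key to the iterator at creation time. The types below are illustrative stand-ins for the engine interfaces, not the actual IterOptions plumbing:

// IterOptions carries an upper bound that maps onto RocksDB's
// iterate_upper_bound.
type IterOptions struct {
    UpperBound []byte
}

type Iterator interface {
    Seek(key []byte)
    Valid() bool
    Next()
    Key() []byte
    Close()
}

type Reader interface {
    NewIterator(opts IterOptions) Iterator
}

// scanRange reads [start, end). Because end is installed as the upper bound,
// RocksDB can stop the seek at end instead of wading through gigabytes of
// range-deleted data beyond it.
func scanRange(r Reader, start, end []byte, visit func(key []byte)) {
    it := r.NewIterator(IterOptions{UpperBound: end})
    defer it.Close()
    for it.Seek(start); it.Valid(); it.Next() {
        visit(it.Key())
    }
}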

Oh, before I forget: to get a cluster with thousands of range deletion tombstones back online, you need a diff like this to avoid paying the cost of creating 10 new iterators for every replica:

diff --git a/pkg/storage/engine/iter.go b/pkg/storage/engine/iter.go
new file mode 100644
index 000000000..c82815933
--- /dev/null
+++ b/pkg/storage/engine/iter.go
@@ -0,0 +1,49 @@
+// Copyright 2018 The Cockroach Authors.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+// implied. See the License for the specific language governing
+// permissions and limitations under the License.
+
+package engine
+
+import (
+   "context"
+
+   "github.com/cockroachdb/cockroach/pkg/util/log"
+)
+
+type IteratorCachingEngine struct {
+   Engine
+   Ctx  context.Context // XXX
+   iter *CachedIterator
+}
+
+func (ice *IteratorCachingEngine) NewIterator(opts IterOptions) Iterator {
+   if ice.iter == nil {
+       ice.iter = &CachedIterator{ice.Engine.NewIterator(opts)}
+   }
+   return ice.iter
+}
+
+func (ice *IteratorCachingEngine) Close() {
+   if ice.iter != nil {
+       ice.iter.Close()
+   }
+}
+
+type CachedIterator struct {
+   Iterator
+}
+
+func (ci *CachedIterator) Close() {
+   // noop
+}

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index 34e6d74..b65b800 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -655,24 +655,24 @@ func newReplica(rangeID roachpb.RangeID, store *Store) *Replica {
 // replicaID should be 0 and the replicaID will be discovered from the
 // descriptor.
 func NewReplica(
-   desc *roachpb.RangeDescriptor, store *Store, replicaID roachpb.ReplicaID,
+   desc *roachpb.RangeDescriptor, store *Store, replicaID roachpb.ReplicaID, e engine.Engine,
 ) (*Replica, error) {
    r := newReplica(desc.RangeID, store)
-   return r, r.init(desc, store.Clock(), replicaID)
+   return r, r.init(desc, store.Clock(), replicaID, e)
 }

 func (r *Replica) init(
-   desc *roachpb.RangeDescriptor, clock *hlc.Clock, replicaID roachpb.ReplicaID,
+   desc *roachpb.RangeDescriptor, clock *hlc.Clock, replicaID roachpb.ReplicaID, e engine.Engine,
 ) error {
    r.raftMu.Lock()
    defer r.raftMu.Unlock()
    r.mu.Lock()
    defer r.mu.Unlock()
-   return r.initRaftMuLockedReplicaMuLocked(desc, clock, replicaID)
+   return r.initRaftMuLockedReplicaMuLocked(desc, clock, replicaID, e)
 }

 func (r *Replica) initRaftMuLockedReplicaMuLocked(
-   desc *roachpb.RangeDescriptor, clock *hlc.Clock, replicaID roachpb.ReplicaID,
+   desc *roachpb.RangeDescriptor, clock *hlc.Clock, replicaID roachpb.ReplicaID, e engine.Engine,
 ) error {
    ctx := r.AnnotateCtx(context.TODO())
    if r.mu.state.Desc != nil && r.isInitializedRLocked() {
@@ -696,7 +696,7 @@ func (r *Replica) initRaftMuLockedReplicaMuLocked(

    var err error

-   if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.store.Engine(), desc); err != nil {
+   if r.mu.state, err = r.mu.stateLoader.Load(ctx, e, desc); err != nil {
        return err
    }

@@ -720,13 +720,13 @@ func (r *Replica) initRaftMuLockedReplicaMuLocked(

    r.rangeStr.store(0, r.mu.state.Desc)

-   r.mu.lastIndex, err = r.mu.stateLoader.LoadLastIndex(ctx, r.store.Engine())
+   r.mu.lastIndex, err = r.mu.stateLoader.LoadLastIndex(ctx, e)
    if err != nil {
        return err
    }
    r.mu.lastTerm = invalidLastTerm

-   pErr, err := r.mu.stateLoader.LoadReplicaDestroyedError(ctx, r.store.Engine())
+   pErr, err := r.mu.stateLoader.LoadReplicaDestroyedError(ctx, e)
    if err != nil {
        return err
    }
@@ -751,7 +751,7 @@ func (r *Replica) initRaftMuLockedReplicaMuLocked(
        return err
    }

-   r.assertStateLocked(ctx, r.store.Engine())
+   r.assertStateLocked(ctx, e)
    return nil
 }

diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 20f6101..705aab1 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -963,9 +963,7 @@ func NewStore(cfg StoreConfig, eng engine.Engine, nodeDesc *roachpb.NodeDescript
    if cfg.TestingKnobs.DisableRaftSnapshotQueue {
        s.setRaftSnapshotQueueActive(false)
    }
-   if cfg.TestingKnobs.DisableScanner {
-       s.setScannerActive(false)
-   }
+   s.setScannerActive(false)

    return s
 }
@@ -1359,13 +1357,17 @@ func (s *Store) Start(ctx context.Context, stopper *stop.Stopper) error {
    // concurrently. Note that while we can perform this initialization
    // concurrently, all of the initialization must be performed before we start
    // listening for Raft messages and starting the process Raft loop.
-   err = IterateRangeDescriptors(ctx, s.engine,
+   ice := &engine.IteratorCachingEngine{Ctx: ctx, Engine: s.engine}
+   defer ice.Close()
+   err = IterateRangeDescriptors(ctx, ice,
        func(desc roachpb.RangeDescriptor) (bool, error) {
            if !desc.IsInitialized() {
                return false, errors.Errorf("found uninitialized RangeDescriptor: %+v", desc)
            }

-           rep, err := NewReplica(&desc, s, 0)
+           log.Eventf(ctx, "initializing %s", desc.StartKey)
+
+           rep, err := NewReplica(&desc, s, 0, ice)
            if err != nil {
                return false, err
            }
@@ -2160,7 +2162,7 @@ func splitPostApply(
    {
        rightRng.mu.Lock()
        // Already holding raftMu, see above.
-       err := rightRng.initRaftMuLockedReplicaMuLocked(&split.RightDesc, r.store.Clock(), 0)
+       err := rightRng.initRaftMuLockedReplicaMuLocked(&split.RightDesc, r.store.Clock(), 0, r.store.Engine())
        rightRng.mu.Unlock()
        if err != nil {
            log.Fatal(ctx, err)
@@ -3960,7 +3962,7 @@ func (s *Store) tryGetOrCreateReplica(
        // TODO(bdarnell): other fields are unknown; need to populate them from
        // snapshot.
    }
-   if err := repl.initRaftMuLockedReplicaMuLocked(desc, s.Clock(), replicaID); err != nil {
+   if err := repl.initRaftMuLockedReplicaMuLocked(desc, s.Clock(), replicaID, s.engine); err != nil {
        // Mark the replica as destroyed and remove it from the replicas maps to
        // ensure nobody tries to use it
        repl.mu.destroyStatus.Set(errors.Wrapf(err, "%s: failed to initialize", repl), destroyReasonRemoved)

and also probably facebook/rocksdb#3800.

@petermattis I'd love to get your help in verifying this theory. I know that our iterators support some integration with RocksDB's perf contexts to get stats about a slow scan.

@tschottdorf Added support for extracting perf context stats during iterator operations. See libroach:ScopedStats. Currently only a single stat is retrieved, but adding more should be straightforward.

Ok, the results are in. Here's the perf context after a 7m seek:

user_key_comparison_count = 6119910, block_cache_hit_count = 6, block_read_count = 1629046, block_read_byte = 49444945773, block_read_time = 289346125311, block_checksum_time = 7171937629, block_decompress_time = 123770617916, get_read_bytes = 0, multiget_read_bytes = 0, iter_read_bytes = 10266, internal_key_skipped_count = 258, internal_delete_skipped_count = 6117839, internal_recent_skipped_count = 0, internal_merge_count = 0, write_wal_time = 0, get_snapshot_time = 0, get_from_memtable_time = 0, get_from_memtable_count = 0, get_post_process_time = 0, get_from_output_files_time = 0, seek_on_memtable_time = 1354, seek_on_memtable_count = 1, next_on_memtable_count = 0, prev_on_memtable_count = 0, seek_child_seek_time = 35141338, seek_child_seek_count = 6, seek_min_heap_time = 716, seek_internal_seek_time = 35143112, find_next_user_entry_time = 437018704126, write_pre_and_post_process_time = 0, write_memtable_time = 0, db_mutex_lock_nanos = 0, db_condition_wait_nanos = 0, merge_operator_time_nanos = 0, write_delay_time = 0, read_index_block_nanos = 0, read_filter_block_nanos = 0, new_table_block_iter_nanos = 427002309018, new_table_iterator_nanos = 1113030843, block_seek_nanos = 2002893, find_table_nanos = 1019574640, bloom_memtable_hit_count = 0, bloom_memtable_miss_count = 0, bloom_sst_hit_count = 0, bloom_sst_miss_count = 0, key_lock_wait_time = 0, key_lock_wait_count = 0, env_new_sequential_file_nanos = 0, env_new_random_access_file_nanos = 0, env_new_writable_file_nanos = 0, env_reuse_writable_file_nanos = 0, env_new_random_rw_file_nanos = 0, env_new_directory_nanos = 0, env_file_exists_nanos = 0, env_get_children_nanos = 0, env_get_children_file_attributes_nanos = 0, env_delete_file_nanos = 0, env_create_dir_nanos = 0, env_create_dir_if_missing_nanos = 0, env_delete_dir_nanos = 0, env_get_file_size_nanos = 0, env_get_file_modification_time_nanos = 0, env_rename_file_nanos = 0, env_link_file_nanos = 0, env_lock_file_nanos = 0, env_unlock_file_nanos = 0, env_new_logger_nanos = 0,

I'd say my theory is confirmed. That seek read 50GiB of data.

Wow!

@benesch I've been experimenting with your two PRs (#26449 and #26488) patched on top of #26526. Initial experiments still resulted in badness, with 2 nodes pegging the CPU in RangeDelAggregator::AddTombstones. I haven't dug too far into why #26526 isn't helping more, but I did notice that the rate limiting of ClearRange operations from the schema changer isn't as effective as we'd like. I tried reducing the batch size from 50 to 10, but that didn't help. What did help was inserting a 1s sleep between batches. With this throttling in place, the compaction queue seems to be keeping up:

screen shot 2018-06-09 at 11 47 15 am

I'm still seeing 1 node pegging the CPU in RangeDelAggregator::AddTombstones. Still investigating.

Hrmm, the problematic node's compactor goroutine is stuck in DBCompactRange:

goroutine 856 [syscall, 21 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBCompactRange(0x7fb4a7d182a0, 0xc4296d2bc7, 0x8, 0xc4296d2ff1, 0x8, 0xc429f9dc00, 0x0, 0x0)
    _cgo_gotypes.go:336 +0x50
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).CompactRange.func1(0x7fb4a7d182a0, 0xc4296d2bc7, 0x8, 0xc4296d2ff1, 0x8, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:889 +0x82
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).CompactRange(0xc4201add40, 0xc4296d2bc7, 0x8, 0x15, 0xc4296d2ff1, 0x8, 0xb, 0x0, 0x1b, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:889 +0xae
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).processCompaction(0xc4204ee240, 0x2b4c360, 0xc428cae180, 0xc4296d2bc7, 0x8, 0x15, 0xc4296d2ff1, 0x8, 0xb, 0x18e13ca8, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:372 +0x8d0
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).processSuggestions(0xc4204ee240, 0x2b4c360, 0xc428cae180, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:263 +0x77e
github.com/cockroachdb/cockroach/pkg/storage/compactor.(*Compactor).Start.func1(0x2b4c360, 0xc428aa7e30)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/compactor/compactor.go:147 +0x1d9
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc428017ab0, 0xc4207341b0, 0xc42a310600)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:186 +0xad

Yeah, I've been running with --vmodule=schema_changer=2,compactor=2,rocksdb=3 in order to see where the compactor's getting stuck.

One other patch you might want to try running with: facebook/rocksdb#3800

Yeah, I was just about to run with rocksdb=3. Interestingly, in several runs I keep seeing problems on the same node (n4, roachprod node 3). The very first compaction it does takes a long time:

I180609 17:02:07.497974 542 storage/compactor/compactor.go:367 [n4,s4,compactor] processing compaction #913-931/1427 (/Table/51/1/37957760-/Table/51/1/38233415/0) for 289 MiB (reasons: size=true used=false avail=false)
I180609 17:09:35.642037 542 storage/compactor/compactor.go:383 [n4,s4,compactor] processed compaction #913-931/1427 (/Table/51/1/37957760-/Table/51/1/38233415/0) for 289 MiB in 7m28.144071063s

Internally, DBCompactRange only issued 2 compactions (which seems appropriate for compacting 289MiB).

Wow, 7m is so slow. I wonder what's actually going on. The RocksDB logs should be illuminating.

(n4, roachprod node 3)

We've got to start generating these store dumps on clusters started with roachprod start --sequential.

Something is horked up with the manual compaction. It seems we're compacting to the end of the key space, but I don't see how. The cycle time on debugging this is irritating. Here are some RocksDB compaction logs generated from a single DB::CompactRange call:

I180609 18:20:00.049641 17 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1497] [default] Compaction start summary: Base version 2 Base level 0, inputs: [10650(270KB)], [10624(8404KB) 10623(8221KB)]
I180609 18:20:00.343483 17 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1497] [default] Compaction start summary: Base version 3 Base level 2, inputs: [10656(8221KB)], [7653(16MB) 7655(3442KB) 7556(8233KB) 7558(6024KB) 7545(8226KB) 7548(8233KB) 9065(3194KB) 9058(5724KB)]
I180609 18:20:01.478244 17 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1497] [default] Compaction start summary: Base version 5 Base level 4, inputs: [10662(2113KB)]
I180609 18:20:01.577256 17 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1497] [default] Compaction start summary: Base version 6 Base level 5, inputs: [10663(2111KB)], [5728(100MB) 5731(100MB) 5733(100MB) 5735(100MB) 5739(100MB) 5741(100MB) 5744(100MB) 5750(100MB) 5752(100MB) 5757(100MB) 5761(100MB) 5764(100MB) 5768(100MB) 5771(100MB) 5777(100MB) 5780(100MB) 5783(100MB) 5786(100MB) 5789(100MB) 5791(100MB) 5794(100MB) 5798(100MB) 5800(100MB) 5803(100MB) 5807(100MB) 5810(100MB) 5813(100MB) 5818(100MB) 5822(100MB) 5823(100MB) 5827(100MB) 5830(100MB) 5834(100MB) 5837(100MB) 5842(100MB) 5845(100MB) 5849(100MB) 5850(100MB) 5854(100MB) 5857(100MB) 5862(100MB) 5864(41MB) 3334(24MB) 3779(18MB) 3784(24MB) 3787(24MB) 3791(1534KB) 3800(24MB) 3812(24MB) 3822(24MB) 3836(6506KB) 3855(20MB) 3874(24MB) 3907(24MB) 3933(24MB) 3968(24MB) 3996(15MB) 4025(24MB) 4054(24MB) 4083(24MB) 4111(24MB) 4139(24MB) 4168(24MB) 4210(14MB) 4246(24MB) 4266(24MB) 4313(24MB) 4339(12MB) 4351(24MB) 4382(19MB) 4406(24MB) 4473(20MB) 4481(24MB) 4497(24MB) 4504(24MB) 4507(24MB) 4511(24MB) 4517(24MB) 4527(24MB) 4536(24MB) 4547(24MB) 4556(17MB) 4565(24MB) 4575(24MB) 4590(24MB) 4599(24MB) 4610(24MB) 4618(7435KB) 4628(24MB) 4639(24MB) 4649(24MB) 4658(10MB) 4669(24MB) 4683(24MB) 4694(16MB) 4704(24MB) 4716(24MB) 4728(24MB) 4735(21MB) 4743(24MB) 4755(24MB) 4766(24MB) 4774(24MB) 3790(24MB) 3795(24MB) 3807(8855KB) 3819(24MB) 3832(24MB) 3851(24MB) 3869(24MB) 3902(24MB) 3927(24MB) 3963(24MB) 3990(24MB) 4019(24MB) 4045(24MB) 4075(24MB) 4106(16MB) 4133(24MB) 4159(24MB) 4203(24MB) 4231(24MB) 4256(24MB) 4287(24MB) 4323(24MB) 4366(24MB) 4417(9607KB) 4394(24MB) 4430(24MB) 4448(24MB) 4476(17MB) 4501(24MB) 4513(24MB) 4525(24MB) 4533(24MB) 4545(24MB) 4555(6683KB) 4567(24MB) 4576(24MB) 4587(24MB) 4597(24MB) 4608(24MB) 4617(24MB) 4627(24MB) 4637(24MB) 4646(933KB) 4656(24MB) 4666(24MB) 4681(24MB) 4692(24MB) 4714(24MB) 4724(24MB) 4733(24MB) 4742(24MB) 4752(24MB) 4764(14MB) 4773(24MB) 4782(24MB) 4790(24MB) 4793(24MB) 4799(24MB) 4804(18MB) 4809(24MB) 4814(24MB) 4819(24MB) 4823(24MB) 4827(24MB) 4831(24MB) 4837(24MB) 4840(24MB) 4845(17MB) 4850(24MB) 4855(24MB) 4859(24MB) 4868(24MB) 4873(24MB) 4877(24MB) 4881(16MB) 4887(24MB) 4891(24MB) 4895(24MB) 4899(24MB) 4904(24MB) 4909(24MB) 4913(24MB) 4916(11MB) 4920(24MB) 4925(24MB) 4929(24MB) 4933(24MB) 4937(24MB) 4941(24MB) 4945(24MB) 4949(20MB) 4955(24MB) 4960(24MB) 4964(24MB) 4967(24MB) 4971(24MB) 4977(24MB) 4981(24MB) 4986(24MB) 4992(24MB)]

Kind of hard to parse, but what is happening is that we're compacting L0 -> L2, L2 -> L3, L3 -> L4, L4 -> L5, and then L5 -> L6 (the L3 -> L4 compaction isn't represented above). The compactions up until L5 -> L6 are speedy. That last compaction covers way too much of the key space and causes badness.

What did help was to insert a 1s sleep in between every batch. With this throttling in place, the compaction queue seems to be keeping up:

My initial patch had code to send no more than one batch every 5s... except I had a bug and the rate limiting wasn't applied. And yet that experiment was the one successful experiment I've run, so I concluded the rate limiting wasn't necessary. I suspect that one node was just slow enough in processing ClearRange requests—they seemed to be taking several seconds—that there was natural rate limiting. Perhaps your fixes sped up ClearRange requests to the point where rate limiting is actually necessary.

So even with my change to the compaction queue to ignore suggestions spanning live data, a few of the nodes with this store dump have some pending suggestions that they try to act upon:

$ roachprod run benesch-clearrange 'grep "compactor.go.*process" logs/cockroach.log' 
benesch-clearrange: grep "compactor.go.*process... 10/10
   1: 
Process exited with status 1
   2: I180609 21:54:38.849945 150 storage/compactor/compactor.go:370  [n8,s8,compactor] processing compaction #49-84/447 (/Table/51/1/1255643-/Table/51/1/2495965/0) for 294 MiB (reasons: size=true used=false avail=false)
I180609 21:54:39.382614 150 storage/compactor/compactor.go:386  [n8,s8,compactor] processed compaction #49-84/447 (/Table/51/1/1255643-/Table/51/1/2495965/0) for 294 MiB in 532.616883ms
I180609 21:54:39.733709 150 storage/compactor/compactor.go:370  [n8,s8,compactor] processing compaction #329-357/447 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB (reasons: size=true used=false avail=false)
I180609 21:56:34.559770 150 storage/compactor/compactor.go:386  [n8,s8,compactor] processed compaction #329-357/447 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB in 1m54.826041583s
   3: I180609 21:55:37.301785 583 storage/compactor/compactor.go:370  [n4,s4,compactor] processing compaction #914-932/1428 (/Table/51/1/37957760-/Table/51/1/38233415/0) for 289 MiB (reasons: size=true used=false avail=false)
I180609 22:04:13.000059 583 storage/compactor/compactor.go:386  [n4,s4,compactor] processed compaction #914-932/1428 (/Table/51/1/37957760-/Table/51/1/38233415/0) for 289 MiB in 8m35.698238964s
I180609 22:04:13.065776 583 storage/compactor/compactor.go:370  [n4,s4,compactor] processing compaction #983-1028/1428 (/Table/51/1/39538412/0-/Table/51/1/39781285) for 362 MiB (reasons: size=true used=false avail=false)
I180609 22:04:13.074273 583 storage/compactor/compactor.go:386  [n4,s4,compactor] processed compaction #983-1028/1428 (/Table/51/1/39538412/0-/Table/51/1/39781285) for 362 MiB in 8.488469ms
I180609 22:04:13.099419 583 storage/compactor/compactor.go:370  [n4,s4,compactor] processing compaction #1037-1066/1428 (/Table/51/1/40770561/0-/Table/51/1/40857017/0) for 398 MiB (reasons: size=true used=false avail=false)
   4: 
Process exited with status 1
   5: 
Process exited with status 1
   6: I180609 21:55:07.840456 352 storage/compactor/compactor.go:370  [n5,s5,compactor] processing compaction #585-614/1823 (/Table/51/1/19637255-/Table/51/1/19907573/0) for 269 MiB (reasons: size=true used=false avail=false)
I180609 21:56:17.033874 352 storage/compactor/compactor.go:386  [n5,s5,compactor] processed compaction #585-614/1823 (/Table/51/1/19637255-/Table/51/1/19907573/0) for 269 MiB in 1m9.193311445s
   7: 
Process exited with status 1
   8: 
Process exited with status 1
   9: 
Process exited with status 1
  10: 
Process exited with status 1
Error:  Process exited with status 1

It took n5 1m and n8 2m to process their suggestions; n4 is at 10m and counting. The problem, as @petermattis discovered, is that even though we're asking for a small range to be compacted, that range is contained within an SST in L5 that covers a very wide key range, spanning over 1k SSTs (!) in L6. Compacting that, unsurprisingly, takes quite a while: with L6 sstables in the 24-100 MiB range, it means rewriting tens of gigabytes to service a suggestion of a few hundred MiB.

I suspect we're tripping over another problem with range tombstones. I need to do a fuller write-up of my understanding of compactions, but the short summary is that RocksDB attempts to align the boundaries of sstables at Ln with those at Ln+1. Somehow it is failing to do so, and the result is a super-large compaction (as alluded to above, an L5 -> L6 compaction involving a single file in L5 overlaps 1247 files in L6).
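
To make the cost concrete, here's a conceptual sketch (not RocksDB's actual code) of why that single wide L5 file is so painful: an Ln -> Ln+1 compaction has to rewrite every Ln+1 file the input's key range overlaps, so the work is governed by the overlapped bytes below, not by the size of the input file itself.

package sketch

import "bytes"

// sstSpan is an illustrative stand-in for an sstable's key range and size.
type sstSpan struct {
	smallest, largest []byte
	sizeBytes         int64
}

// overlapCost counts the next-level files (and their total bytes) that a
// compaction of `input` would have to rewrite. With 1247 overlapping L6
// files, that total dwarfs the input file itself.
func overlapCost(input sstSpan, nextLevel []sstSpan) (files int, totalBytes int64) {
	for _, f := range nextLevel {
		if bytes.Compare(f.largest, input.smallest) < 0 ||
			bytes.Compare(f.smallest, input.largest) > 0 {
			continue // disjoint key ranges
		}
		files++
		totalBytes += f.sizeBytes
	}
	return files, totalBytes
}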

I've been adding logging to try and understand what RocksDB is doing. Some highlights:

L0 -> L2
/Local/RangeID/1/r/AbortSpan/"911ec3ce-6d25-4713-82e4-2fc6c2951af7" /Table/51/1/46687208
    /Table/51/1/23887551/0 /Table/51/1/64697768

L2 -> L3
/Table/51/1/23887551/0 /Table/51/1/64697768/NULL
    /Table/51/1/23887551/0 /Table/51/1/27440183/0
    /Table/51/1/27440184/0 /Table/51/1/27442206/0
    /Table/51/1/27442207/0 /Table/51/1/28781803/0
    /Table/51/1/28781804/0 /Table/51/1/28783823/0
    /Table/51/1/28783824/0 /Table/51/1/64697768

L3 -> L4 (no-op because no overlapping L4 files)

L4 -> L5
/Table/51/1/28783824/0 /Table/51/1/64697768
    /Table/51/1/28783824/0 /Table/51/1/40794462
    /Table/51/1/40804278/0 /Table/51/1/64697768

The first line indicates the key range of the input file (and in these compactions there is always a single input file). The indented lines indicate the key ranges of the output files. So in the L0->L2 compaction we see an output file covering a fairly large key range. That output file becomes the input for the L2->L3 compaction, which generates a few small output files and then one with a large range at the end. That last output file in turn becomes the input for the L4->L5 compaction (the L3->L4 step is a no-op).

Not shown is that the key range /Table/51/1/40804278/0 /Table/51/1/64697768 overlaps 1247 files in L6.

My current hypothesis is that SubcompactionState::ShouldStopBefore is correctly indicating good boundaries for the output files (i.e. boundaries that align well with lower levels), but we have a range tombstone at the end of the file which is causing the end-key boundary to be extended in RangeDelAggregator::AddToBuilder. A bit of support for this theory is that my RocksDB logging indicates the "type" of the key /Table/51/1/64697768 (the end-key for the problematic sstables) as being a range deletion.
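
As a toy illustration of that hypothesis (again, not RocksDB's actual code): when an output file is finished, its effective largest key is the max of the last point key written and the end key of any range tombstone added to the file, so a trailing tombstone can stretch the file's boundary far past the point key where ShouldStopBefore decided to cut.

package sketch

import "bytes"

// fileLargestKey sketches how a trailing range tombstone can extend an output
// file's end boundary beyond the last point key that was written to it.
func fileLargestKey(lastPointKey, rangeTombstoneEnd []byte) []byte {
	if rangeTombstoneEnd != nil && bytes.Compare(rangeTombstoneEnd, lastPointKey) > 0 {
		return rangeTombstoneEnd
	}
	return lastPointKey
}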

@benesch As far as I can tell this is independent of my RocksDB changes. In CompactionJob::FinishCompactionOutputFile, range tombstones for the last generated file are bounded by the input key range, but that can result in the last file in a level covering too large a range.

I'm going to try reproducing this badness in a smaller test as I'm tired of dealing with the 10-15min debug cycle of the clearrange roachtest.

@benesch As far as I can tell this is independent of my RocksDB changes. In CompactionJob::FinishCompactionOutputFile, range tombstones for the last generated file are bounded by the input key range, but that can result in the last file in a level covering too large a range.

Yes, it certainly is. I'm running without your RocksDB patches and observing the same too-large compactions.

I'm going to try reproducing this badness in a smaller test as I'm tired of dealing with the 10-15min debug cycle of the clearrange roachtest.

Running on ZFS-formatted volumes turns out to be really easy:

sudo apt-get update
sudo apt-get -qqy install zfsutils-linux
# replace the existing data mount with a ZFS pool on the data disk
sudo umount /mnt/data
sudo zpool create -f data1 -m /mnt/data1 /dev/sdb
sudo chown -R $(whoami):$(whoami) /mnt/data1
# download store dumps without starting cluster
sudo zfs snapshot data1@pristine

Then you can restore to the pristine state in less than a second with:

sudo zfs rollback data1@pristine

The performance characteristics of ZFS are bound to cause problems for our performance tests, but so far I haven't noticed any.

Here's a list of all of the tombstones on that node: https://gist.github.com/benesch/e72fb0a347382ce1eeca4c99232cfd27

The tombstone you mention spans /Table/51/1/64694496/0 - /Table/51/1/64697768. Doesn't seem especially large on its own. It's preceded by quite a few nearly-adjacent tombstones, though.

While stepping out of the shower this morning I thought of a simple hack that might work around the problem. The TL;DR of the problem is that compactions do not take range tombstones into consideration when determining the output file boundaries. The result is that the first and last output file from a compaction can be extended way too far, so that they cover an excessive number of sstables in the next lower level.

The simple hack: whenever we perform a clear range we also add two additional deletions (for individual keys) at the start and end of the tombstone. These key deletions should allow the compaction logic to select output boundaries correctly. Note that the key deletions have to occur after the range deletion so that they have higher sequence numbers and thus won't get deleted as the range tombstone progresses down through the levels.

Adding the key deletions immediately after a clear range is easy. We might also want to have a migration that runs through the suggested compactions at startup and retroactively adds key deletions for any empty suggested compaction. (We can't do this in the compaction queue because a replica might be concurrently rebalanced onto the node).
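
A minimal sketch of what that hack could look like, assuming a batch-like writer with Clear and ClearRange methods (names approximate, not the exact engine API); the important detail is the ordering, with the point deletions written after the range deletion.

package sketch

// Writer approximates the subset of a write batch used by the hack
// (hypothetical interface; the real engine API differs in detail).
type Writer interface {
	Clear(key []byte) error
	ClearRange(start, end []byte) error
}

// clearRangeWithBoundaryHints lays down the range tombstone and then point
// deletions at its endpoints. Because the point deletions are written after
// the range deletion, they carry higher sequence numbers and survive as the
// tombstone compacts down through the levels, giving the compaction's
// output-file selection concrete user keys to split on. The chosen endpoint
// keys are assumed not to cover any live data.
func clearRangeWithBoundaryHints(w Writer, start, end []byte) error {
	if err := w.ClearRange(start, end); err != nil {
		return err
	}
	if err := w.Clear(start); err != nil {
		return err
	}
	return w.Clear(end)
}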

Ha! That really might just work.

We might be able to skip the migration. Anyone running a 2.0 cluster that's been balanced for more than 24h will have aged out all of its suggested compactions. We're the oddballs for continually rolling back to the state immediately after a restore.

Ha! That really might just work.

Sounds good in theory, but it doesn't seem to have had any effect. I'm still seeing the excessively large L5->L6 compactions. In fact, I'm seeing exactly the same compactions as before. Either my hack has a bug or something is fouling it up.

So my hack does work, but it didn't seem to help the clearrange test. I wrote a small test that demonstrates the hack working. I'm going to see about pulling this out into a separate PR and then get back to investigating the clearrange test.

I've also filed https://github.com/facebook/rocksdb/issues/3977 as the range-tombstone/compaction issue deserves to be addressed upstream.

FYI, I built a new bank stores fixture: gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=2/stores=10,bin-version=2.0-6/. This was built using #26576 which means it has both the clear range hack and the restore empty-range fix. Notice that seed=2 and bin-version=2.0-6.

Awesome! How long did that take?

Awesome! How long did that take?

3-4 hours.

So I found what seems to be a huge perf bug in RocksDB's existing range deletion implementation! Opened a PR upstream: https://github.com/facebook/rocksdb/pull/3992

I'm running a clear range test right now with that fix plus https://github.com/benesch/cockroach/tree/iter-upper-bound and results are looking VERY promising:

[screenshot, 2018-06-13 04:39]

I'm going to regret making that claim in the morning. CPU usage is quite stable, though:

[graph: CPU usage]

Please forgive the graph wonkiness. I filed https://github.com/cockroachdb/cockroach/issues/26679 for that. I also had to turn off the consistency checker because I think whatever is destroying the graphs is also violating replica consistency. (The exact same consistency failures are happening in nightly roachtests so I suspect they're unrelated to my RocksDB patch, but still scary timing.)

One last thought: you'll notice that available storage isn't budging in the screenshot above. It is, in fact, ticking downward as I write this comment.

[graph: available storage]

We'll likely want to continue investigating the fixes for the overlarge compactions we're seeing that wedge up the compaction queue, but mitigating the massive CPU usage of range tombstones really gives us some breathing room.

Just peeked at your cluster... This really does look pretty nice:

[graph: cluster capacity]

That "Capacity" line at the very top deserves a better tool tip by the way... this looks very confusing, though there's probably a good reason it behaves that way.

That "Capacity" line at the very top deserves a better tool tip by the way... this looks very confusing, though there's probably a good reason it behaves that way.

That's due to running on ZFS and having a ZFS snapshot that consumes capacity, reducing the capacity available to the rest of the system as the filesystem "head" diverges from the snapshot.

Ah, TIL. Thanks for the heads up.

I ran a test this morning using https://github.com/petermattis/rocksdb/tree/pmattis/get-overlapping-inputs, which changes VersionStorageInfo::GetOverlappingInputs to not treat the range tombstone sentinel key that sometimes shows up as the largest key in an sstable as part of the user key. Gah, I feel at a loss to explain this right now. Here are some pictures:

[screenshot, 2018-06-13 09:38:24]

[screenshot, 2018-06-13 09:38:34]

This also looks promising, but the CPU graph is much less pretty:

[screenshot: CPU graph, 2018-06-13 09:40]

Grep'ing logs also showed that there were still a number of large compactions. For example, one compaction took 11m12s. Hmm, looks like this might be the compaction queue:

I180613 11:04:05.066713 234 storage/compactor/compactor.go:367  [n3,s3,compactor] processing compaction #1-2023/2023 (/Table/53/1/39713836-/Table/53/1/58425827) for 58 GiB (reasons: size=true used=true avail=true)
...
I180613 11:15:18.032050 234 storage/compactor/compactor.go:383  [n3,s3,compactor] processed compaction #1-2023/2023 (/Table/53/1/39713836-/Table/53/1/58425827) for 58 GiB in 11m12.965283011s

Putting this here so I don't forget.

There's something funny about how the existing code handles empty range tombstones, i.e., range tombstones over [a, a), or inverted ones like [b, a). Here's a test I discovered last night that expects such tombstones to be preserved, even though they could be dropped on the floor the minute they come through: https://github.com/facebook/rocksdb/blob/1f32dc7d2b6721a1fe13eb515d52e5cd6f110f59/db/db_range_del_test.cc#L489-L515

Here's a case where it's more problematic: https://github.com/facebook/rocksdb/pull/3992/files#diff-de8dc928b3d0f5bf738e5dc2cae5cc6dR114. That collapsed map has two entries for key a! Yikes. Pretty sure that's not handled correctly by ShouldDelete. I suspect it's also mishandled by AddToBuilder.
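
For reference, a tiny illustration of why such tombstones are inert: a range deletion over [start, end) covers a key only if start <= key < end, so an empty tombstone like [a, a) or an inverted one like [b, a) can never cover anything and could be dropped as soon as it is written.

package sketch

import "bytes"

// tombstoneCovers reports whether a range deletion over [start, end) covers
// key. For an empty or inverted tombstone (start >= end) this is always false.
func tombstoneCovers(start, end, key []byte) bool {
	return bytes.Compare(start, key) <= 0 && bytes.Compare(key, end) < 0
}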

With the recent improvements to RocksDB range tombstones, the clearrange roachtest now reliably passes:

[screenshot: clearrange roachtest metrics, 2018-07-13 20:27]

Note that the test started slightly after 00:00. This test used a freshly generated bank fixture. For some reason, cockroach was refusing to start using the old store fixtures complaining about the version being too old. This doesn't make any sense and needs to be investigated. I'll file a separate issue when I can reproduce it.

For some reason, cockroach was refusing to start using the old store fixtures complaining about the version being too old. This doesn't make any sense and needs to be investigated.

Why doesn't it make any sense? I think that fixture was generated with 1.1, and you're running a 2.1 beta.

In that graph, why does available stay constant while capacity decreases?

Why doesn't it make any sense? I think that fixture was generated with 1.1, and you're running a 2.1 beta.

See https://github.com/cockroachdb/cockroach/issues/27525. I generated that fixture with a binary at version v2.0-7.

In that graph, why does available stay constant while capacity decreases?

ZFS. For ease of debugging, we take a ZFS snapshot after restoring the store fixtures, which makes subsequent runs of the test fast. The downside is that as we start to delete data, ZFS doesn't actually remove it from disk because it is still referenced by the snapshot.

clearrange has passed 4 nights in a row. I'm closing this issue. https://github.com/cockroachdb/cockroach/issues/26693 remains open to investigate the excessive compactions that can still occur but no longer drive a cluster to its knees.
