cockroach 🚀 - storage: transitive prerequisites are not always transferred on command cancellation

Last time we saw this error was #10501

bdarnell on 1 Jun 2017

I'm seeing this on sky right now while try to pre-split a table into 10000 parts:

F170808 17:31:44.518350 9645 storage/replica_command.go:1326  [n88,s88,r1/23:/{Min-System/}] range lookup of meta key '^C<BB><89><FD>%K<A1><A0>ԯ<BE><82>' found only non-matching ranges:

Cc @tamird

petermattis on 8 Aug 2017

Have the meta ranges split yet? It's possible we're running into #2266 on sky.

bdarnell on 8 Aug 2017

@bdarnell The above happened after 2000 splits. Pretty sure the meta ranges have not split yet. Yesterday I had sky up to 400k ranges (via sized-based splits).

petermattis on 8 Aug 2017

This happened again on sky last night. This time the system had ~1m ranges. The failure is pretty bad as we seem to have walked through the replicas for this meta2 table and triggered the failure on each, which then resulted in loss of quorum for a large swath of ranges:

cockroach.stderr.17:F170809 04:05:45.149824 4508002 storage/replica_command.go:1326  [n11,s11,r272822/3:/Meta2/Table/51/1/-{465…-299…}] range lookup of meta key '���ɖ��&~��' found only non-matching ranges:
cockroach.stderr.37:F170809 04:05:44.611391 12562094 storage/replica_command.go:1326  [n2,s2,r272822/2:/Meta2/Table/51/1/-{465…-299…}] range lookup of meta key '���ɖ��&~��' found only non-matching ranges:
cockroach.stderr.39:F170809 04:05:44.816689 5185165 storage/replica_command.go:1326  [n21,s20,r272822/1:/Meta2/Table/51/1/-{465…-299…}] range lookup of meta key '���ɖ��&~��' found only non-matching ranges:

The raw printing of the meta key makes debugging a bit difficult. I'm going to try and reproduce again with #17540 in place.

petermattis on 9 Aug 2017

Translating that key reveals it to be /Meta2/Table/51/1/-3920777067433132351/0. That key is within the span of r272822 (the negative numbers in the span of r272822 make determining this slightly confusing).

petermattis on 9 Aug 2017

cockroach debug range-data <dir> 272822 reveals (this data has been manually massaged for clarity):

/Meta2/Table/51/1/-3920798135581715988,03:40:51
    [/Table/51/1/-3920807719635987012, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:24:16
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:47
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:45
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:41
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:36
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:09:50
    [/Table/51/1/-3920917233049129889, /Table/51/1/-3920798135581715988)

/Meta2/Table/51/1/-3920754471613951699,03:40:50
    [/Table/51/1/-3920755829099915981, /Table/51/1/-3920754471613951699)
/Meta2/Table/51/1/-3920754471613951699,03:28:20
    [/Table/51/1/-3920798135581715988, /Table/51/1/-3920754471613951699)

Note that there are only 2 keys above. If we look at the 2 most recent versions of the keys we see:

/Meta2/Table/51/1/-3920798135581715988,03:40:51
    [/Table/51/1/-3920807719635987012, /Table/51/1/-3920798135581715988)

/Meta2/Table/51/1/-3920754471613951699,03:40:50
    [/Table/51/1/-3920755829099915981, /Table/51/1/-3920754471613951699)

The key we're looking for (3920777067433132351) lies in the gap between these 2 range descriptors. The versions immediately preceding these is:

/Meta2/Table/51/1/-3920798135581715988,03:24:16
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)

/Meta2/Table/51/1/-3920754471613951699,03:28:20
    [/Table/51/1/-3920798135581715988, /Table/51/1/-3920754471613951699)

Notice that the gap is gone. The end key of the first range descriptor is the start key of the second range descriptor. Still trying to piece together what is going on here. Seems like there was a split at 03:40:50. What happened to it? Did we lose a write? Are we not printing an intent in debug range-data?

petermattis on 9 Aug 2017

Here is the split:

cockroach.stderr.25:I170809 03:40:50.842124 2119452 storage/replica_command.go:2667  [n42,s40,r798869/1:/Table/51/1/-39207{981…-544…}] initiating a split of this range at key /Table/51/1/-3920755829099915981 [r936561]

The keys here exactly match the range-data output, except we're missing the record for the left hand side of the split.

petermattis on 9 Aug 2017

If the theory is that we're simply losing the write (not unheard of in past anomalies), and you can "quickly" repro this issue, then you could add the following logging:

log from evalResolveIntent{,Range} whenever an intent changes (resolved, pushed, aborted).
log from evalEndTransaction when it has a splitTrigger. In particular, the list of intents, and which intents we return upwards to the intentResolver.
log from the intent resolver for the EndTransaction case. If the intent resolver ended up eagerly GC'ing the transaction without really resolving the LHS descriptor intent, that intent would subsequently go AWOL as the absent txn record would count as an aborted txn.

tbg on 9 Aug 2017

This reminds me of https://github.com/cockroachdb/cockroach/issues/9265#issuecomment-246665791. That led to #9399 which proposed introducing additional assertions to MVCCResolveWriteIntent, although in the end we decided not to do that.

What are the bounds of r272822? Doesn't the fact that we have meta2 ranges on a range other than r1 mean that meta2 has split?

bdarnell on 9 Aug 2017

The bounds of r272822 are /Meta2/Table/51/1/-4655368821960641081 and /Meta2/Table/51/1/-2990335433313644993. I think meta2 did split (I calculate we can store 550k range descriptors per meta2 range), but the meta key in question is right in the middle of this range, not near the bounds.

petermattis on 9 Aug 2017

Seems like an assertion in MVCCResolveWriteIntent isn't feasible, but we could at least add a warning.

petermattis on 9 Aug 2017

sky encountered another of these failures:

F170809 20:26:59.279564 767502 storage/replica_command.go:1317 [n5,s5,r236328/1:/Meta2/Table/51/1/{-88…-385…}] range lookup of meta key '/Meta2/Table/51/1/3856042551658982491' found only non-matching ranges:

The span for r236328 is [/Meta2/Table/51/1/-882441077842675592, /Meta2/Table/51/1/3856096413566835965). But in this case, the last key in r236328 is before the desired meta key:

2017-08-09T20:22:33.39724672Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024847892073890, /Table/51/1/3856042551658982491)
2017-08-09T20:22:28.886126068Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024847892073890, /Table/51/1/3856042551658982491)
2017-08-09T20:22:28.41487805Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024847892073890, /Table/51/1/3856042551658982491)
2017-08-09T20:18:57.808850961Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024233464125966, /Table/51/1/3856042551658982491)

Notice that the meta key we were looking up is exactly equal to the end key of the last range descriptor in this meta2 range. I'm going to try and track down the next meta2 range to see what it contains.

petermattis on 9 Aug 2017

And another instance:

F170809 21:13:13.176847 565412 storage/replica_command.go:1317 [n59,s59,r1/4:/M{in-eta2/Table/…}] range lookup of meta key '/Meta2/Table/51/1/1225124192177909585' found only non-matching ranges:

r1 has the bounds [/Min, /Meta2/Table/51/1/1225124729451769362) which means it does contain the desired key, but right at the end of the range:

2017-08-09T21:01:22.608522522Z,11 /Meta2/Table/51/1/1225124192177909585: [/Table/51/1/1225085640252622231, /Table/51/1/1225124192177909585)

The next meta range is r118151 which has the bounds [/Meta2/Table/51/1/1225124729451769362, /System/""). The first meta key is:

2017-08-09T21:13:13.175969315Z,0 /Meta2/Table/51/1/1225124729451769362: [/Table/51/1/1225124192177909585, /Table/51/1/1225124729451769362)

Notice that the meta key aligns with the start key of r118151, but the range descriptor extends before it. Seems like our handling of lookups in meta2 ranges is broken. We need to not only look at the range containing the desired key, but the following range as well.

Cc @nvanbenschoten whom @tschottdorf says has some understanding of this.

Note that the debugging I did in this comment indicated a different issue as the desired meta key was definitely contained within the meta range, not at the end.

petermattis on 9 Aug 2017

PS Seems easy to write a test which splits meta2 and verify that we don't properly handle lookup correctly at the end of the first meta2 range.

petermattis on 9 Aug 2017

@petermattis if I'm understanding your last crash correctly, it looks like the meta2 splits are misaligned with the userspace splits, which is why you suggest

We need to not only look at the range containing the desired key, but the following range as well.

I'd expect that we would make sure to keep the split boundaries between the different meta layers aligned, but a quick search doesn't lead me to believe that we do that. For instance, we wouldn't have this issue if r1 was bounded to [/Min, /Meta2/Table/51/1/1225124192177909585) and r118151 bounded to [/Meta2/Table/51/1/1225124192177909585, /System/""). To guarantee this, the only valid meta2 splitKeys would be RangeMetaKey(any existing userspace range start key). Luckily we don't support range merges yet, meaning that once a key is a range start key, it will always be a range start key. That simplifies things in a number of ways.

nvanbenschoten on 10 Aug 2017

@nvanbenschoten I'm not sure your proposed solution works since every userspace range start key is the end key of the previous range. If r1 was bounded by [/Min, /Meta2/Table/51/1/1225124192177909585), the range [/Table/51/1/1225085640252622231, /Table/51/1/1225124192177909585) would have been placed in the next meta range. I think this approach could work if we chose meta split keys as metaKey.Next(). This would ensure that the last valid key in a range is just before the range split point. This feels fragile to me and maintaining this invariant when we do range merging might be a headache.

It's interesting that a Scan or ReverseScan operation would not have this problem since at the DistSender level those requests can span multiple ranges. Changing DistSender range lookup to send a request to the following range is mildly problematic as we'd have to adjust the key we're looking for so that it falls within the bounds of the following range. This doesn't seem conceptually difficult, but I think you're the most familiar with the range cache code.

In the short term, I'll send a patch to disable splitting of the meta2 range as we clearly don't handle this correctly.

petermattis on 10 Aug 2017

I changed my mind. Changing meta split keys as metaKey.Next() is relatively straightforward. We'll be adding future work to range merging and I'll document that in the associated issue.

petermattis on 10 Aug 2017

I'm not sure I follow. Why wouldn't/shouldn't the range [/Table/51/1/1225085640252622231, /Table/51/1/1225124192177909585) be placed in r1 if we chose a split key of /Table/51/1/1225124192177909585?

nvanbenschoten on 10 Aug 2017

Aren't the range descriptors stored at the range's start key, not their end key? So in your example, we'd have:

range 1 [a, e):
  a -> [a, b)
  b -> [b, c)
  c -> [c, d)
range 2 [d, g):
  d -> [d, e)
  e -> [e, f)
  f -> [f, g)

nvanbenschoten on 10 Aug 2017

Aren't the range descriptors stored at the range's start key, not their end key?

They're stored twice: the range-local copy is stored under the start key, and the meta copy is stored under the end key.

bdarnell on 10 Aug 2017

Aren't the range descriptors stored at the range's start key, not their end key?

No, that's for the local range descriptor. The range addressing key is based on the end key. See https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/addressing.go#L92

petermattis on 10 Aug 2017

Huh, I never realized that. Is the only reason for that so that we can scan forwards instead of backwards when performing a range lookup?

Regardless, in that case I agree that splitting meta ranges at metaKey.Next() makes sense.

EDIT: sorry for the fat finger.

nvanbenschoten on 10 Aug 2017

Is the only reason for that so that we can scan forwards instead of backwards when performing a range lookup?

Probably. ReverseScan wasn't introduced until relatively late.

bdarnell on 10 Aug 2017

Huh, I never realized that. Is the only reason for that so that we can scan forwards instead of backwards when performing a range lookup?

Yes.

petermattis on 10 Aug 2017

Now that ReverseScan is supported in RocksDB, we can efficiently scan backwards during the range lookup requests, so I don't think there are any compelling reasons for meta range descriptors to be stored under their end key other than compatibility. How unreasonable would it be to create a migration that moved these meta descriptors from their end key to their start key? I'd expect this to relieve a lot of future headaches we'll have when splitting meta ranges and performing range addressing.

nvanbenschoten on 10 Aug 2017

Such a migration would require a complete cluster shutdown. Unless I'm missing something about how you're envisioning this.

petermattis on 10 Aug 2017

You could migrate over without a shutdown (after all, it's very similar to adding an index in SQL), but it sounds like a rather large undertaking.

tbg on 10 Aug 2017

Such a migration sounds scary. New proposal: notice when we're querying "off-the-end" of a meta range during RangeLookup and return an error containing the new key to query at (i.e. the end key of the meta range). RangeDescriptorCache would have to handle this error appropriately and retry the RangeLookup. Changing RangeLookup in this way seems straightforward. I'm not familiar enough with RangeDescriptorCache to give an opinion.

For 1.1, I agree we should disable splitting of meta2 ranges.

petermattis on 10 Aug 2017

I think we need to fix this soon. In my experimentation on sky, sizeof(meta-key)+sizeof(range-desc) is ~120 bytes. A 64MB meta range is thus able to hold 550k range desc versions. We add new meta keys whenever we split ranges, but we add a new version whenever we perform a rebalance (we actually add 2 versions, one for the replica addition and one for the deletion). In my experimentation on sky, we split meta2 when the cluster reached ~100k-150k ranges. Clearly, letting meta2 grow to a bit larger (i.e. to 512MB) isn't hugely problematic, but this is a definite scalability bottleneck right now.

petermattis on 10 Aug 2017

Such a migration sounds scary.

Agreed.

New proposal: notice when we're querying "off-the-end" of a meta range during RangeLookup and return an error containing the new key to query at (i.e. the end key of the meta range). RangeDescriptorCache would have to handle this error appropriately and retry the RangeLookup. Changing RangeLookup in this way seems straightforward. I'm not familiar enough with RangeDescriptorCache to give an opinion.

Yes, I think that's what we need. Or instead of an error, we could make it work more like a Scan, as discussed in https://github.com/cockroachdb/cockroach/issues/2266#issuecomment-236919331.

bdarnell on 12 Aug 2017

Reducing the GC TTL for the meta span is another way to stretch this a little further (cc #14990). We never do historical queries here, and while historical information is sometimes useful for debugging, we have the history in the range-local copies of the descriptors (which are not space constrained).

bdarnell on 13 Aug 2017

Note that my observation in this comment about the number of ranges we can fit in meta2 before wanting to split was perturbed by significantly by #17612. Running with #17644, splitting a table into 1m ranges created a meta2 that was 200MB., only 82MB of which was "live".

petermattis on 15 Aug 2017

The range lookup ... found only non-matching ranges assertion happened again on denim. The cluster is currently stopped and I'll investigate more later.

F170819 13:57:55.387542 5638337 storage/replica_command.go:1362  [n6,s6,r1/4:/{Min-System/}] range lookup of meta key '/Meta2/Table/51/1/-5252770944880247516' [/Meta2/Table/51/1/-5252770944880247516] found only non-matching ranges:
r384024:/Table/51/1/-5252{712008722069663-579142777540461} [(n3,s3):1, (n4,s4):2, (n1,s1):3, next=4]
r235870:/Table/51/1/-525257{9142777540461-6442878868637} [(n3,s3):1, (n2,s2):4, (n1,s1):3, next=5]
r425683:/Table/51/1/-52525{76442878868637-24196296821412} [(n3,s3):1, (n2,s2):2, (n1,s1):3, next=4]
r384055:/Table/51/1/-52525{24196296821412-12403099748452} [(n3,s3):1, (n4,s4):2, (n1,s1):3, (n2,s2):4, next=5]
r336522:/Table/51/1/-5252{512403099748452-434618074207912} [(n3,s3):1, (n2,s2):4, (n1,s1):3, next=5]
r424868:/Table/51/1/-52524{34618074207912-24104336565740} [(n3,s3):1, (n2,s2):4, (n1,s1):3, next=5]
r184020:/Table/51/1/-5252{424104336565740-349501922954315} [(n3,s3):1, (n6,s6):4, (n1,s1):3, next=5]
r425877:/Table/51/1/-5252349{501922954315-174204882030} [(n3,s3):1, (n6,s6):2, (n1,s1):3, next=4]

This occurred with the splitting of meta2 disabled.

petermattis on 19 Aug 2017

debug range-data 1 shows:

1503151073.565521987,0 /Meta2/Table/51/1/-5252764971134362292: [/Table/51/1/-5252770944880247516, /Table/51/1/-5252764971134362292)
    Raw:r287880:/Table/51/1/-52527{70944880247516-64971134362292} [(n3,s3):1, (n4,s4):2, (n1,s1):3, next=4]

1503151073.565521987,0 /Meta2/Table/51/1/-5252712008722069663: [/Table/51/1/-5252764971134362292, /Table/51/1/-5252712008722069663)
    Raw:r426696:/Table/51/1/-52527{64971134362292-12008722069663} [(n3,s3):1, (n4,s4):2, (n1,s1):3, next=4]

1503150863.786684851,0 /Meta2/Table/51/1/-5252579142777540461: [/Table/51/1/-5252712008722069663, /Table/51/1/-5252579142777540461)
    Raw:r384024:/Table/51/1/-5252{712008722069663-579142777540461} [(n3,s3):1, (n4,s4):2, (n1,s1):3, next=4]

It very much seems like we should have found the first 2 keys listed above, yet the first one the assertion is indicating was found was the 3rd key. Elsewhere in the logs, I see that there was a split that occurred just before the assertion fired and the split affected the 2 ranges that weren't found:

I170819 13:57:53.565507 12513971 storage/replica_command.go:2703  [n4,s4,r287880/2:/Table/51/1/-52527{709…-120…}] initiating a split of this range at key /Table/51/1/-5252764971134362292 [r426696]

My eyes are hurting from looking at this meta2 history. Going to take a further look at this later to see if I can piece together everything.

petermattis on 19 Aug 2017

Here is a timeline of meta records containing /Table/51/1/-5252770944880247516

@ 13:44:46.852083 (1503150286.852114859,0) r98335 splits creating r287880
  r98335:[/Table/51/1/-5252830313377815514, /Table/51/1/-5252770944880247516)
  r287880:[/Table/51/1/-5252770944880247516, /Table/51/1/-5252579142777540461)

@ 13:54:23.786674 (1503150863.786684851,0) r287880 splits creating r384024
  r287880:???
  r384024:[/Table/51/1/-5252712008722069663, /Table/51/1/-5252579142777540461)

@ 13:57:53.565507 (1503151073.565521987,0) r287880 splits creating r426696
  r287880:[/Table/51/1/-5252770944880247516, /Table/51/1/-5252764971134362292)
  r426696:[/Table/51/1/-5252764971134362292, /Table/51/1/-5252712008722069663)

@ 13:57:55.387542 the range lookup of /Meta2/Table/51/1/-5252770944880247516 fails

The above was pieced together from logs and debug range-data 1. Somewhat obvious when you think about it, for every range split there are 2 keys in meta2 at the same timestamp. For the splits at 13:44:46.852083 and 13:57:53.565507 I can find both such records. For the split at 13:54:23.786674 I can find only one.

I'm not sure why the range lookup at 13:57:55.387542 failed because at that time meta2 looks fine. Perhaps the timestamp in the request predated the split at 13:57:53.565507 committing. Regardless, there seems to be something serious going on here. Unfortunately, I was not running with #17580.

petermattis on 19 Aug 2017

I've copied all of the data off of denim. I'm going to try and reproduce on top of #17580.

petermattis on 20 Aug 2017

Btw, piecing together the state of meta2 and discovering there is a missing key is a pain in the ass. I'd like a debug meta-fsck <range-id> command which would sanity check the contents of a meta range.

petermattis on 20 Aug 2017

Perhaps at this point we should also consider a consistency queue that
periodically checks meta1/2 consistency (of the cluster as a whole).

On Sat, Aug 19, 2017 at 6:45 PM Peter Mattis notifications@github.com
wrote:

Btw, piecing together the state of meta2 and discovering there is a
missing key is a pain in the ass. I'd like a debug meta-fsck
command which would sanity check the contents of a meta range.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/cockroachdb/cockroach/issues/16266#issuecomment-323557897,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AE135PB7O98VsKQlsLrdjHyFeMDJjmxSks5sZ4-dgaJpZM4NtA5o
.

>

-- Tobias

tbg on 20 Aug 2017

I wrote a debug meta-check command which loads up all of the keys in meta2, sorts them by timestamp and then walks over them doing basic sanity checks. One of the sanity checks is that when a new range ID is introduced it is introduced as part of a split and there will be another record at the same timestamp. Running this check on the saved data from denim and it points to a single occurrence of such a "dangling split":

Aug 19 13:54:23.786684851 /Meta2/Table/51/1/-5252579142777540461 -> r384024:[/Table/51/1/-5252712008722069663,/Table/51/1/-5252579142777540461)

This is exactly the split I manually discovered had a problem last night. I'm somewhat reassured by this as it eliminates a concern that we had a lot of problems with meta2 but simply weren't triggering the RangeLookup assertion. This result indicates that the corruption is relatively rare.

I ran my split test over night in a loop on denim. The test uses 384 workers to split a table into 500k ranges. No problems seen so far. I'm going to work on incorporating a meta-check pass after each run.

petermattis on 20 Aug 2017

I'm going to work on incorporating a meta-check pass after each run.

This is done and has been running most of the day. Nothing found so far.

@bdarnell, @tschottdorf I've been refreshing my memory about #9265 and #9377. The symptoms here are identical: we've lost the meta key update for the left hand side of a split. Would appreciate some of your thinking about other sanity checks to add and performing an additional check on the code in question.

petermattis on 20 Aug 2017

We solved #9265 in multiple ways: #9377 fixed the root cause (by ensuring that aborted transactions stay aborted) and #9734 changed the transaction so we avoid the unusual situation of a write before the BeginTransaction. So we're unlikely to be seeing a regression here, and it's a different issue with the same symptoms. There are three basic possibilities:

The intent for the meta record was never written, but the transaction went on to commit anyway.
The intent was written, then was removed (i.e. resolved as if the transaction failed) before the transaction commits.
The intent was written, the transaction committed, and the transaction record was GC'd without the intent being resolved. Later, the intent was resolved, and because the txn record was already gone it was deemed aborted.

The third case strikes me as the most likely, because it involves the most moving parts (the intent resolver and gc queue). Before the GC queue removes a transaction record, it verifies that all the intents contained in that transaction record have been resolved. So if a ResolveIntent request appears to succeed when it should have failed, we might try to remove the txn record too early.

The assertions in #17580 should catch this if the failing ResolveIntent request makes it down to the evaluation stage. This suggests that we may be losing the request earlier, in the intent resolver. The fact that we're saturating the sync intent resolution queue with this workload might be the problem (I don't think we've run into that before). Could we somehow be failing to wait on something when we have a mix of sync and async operations? Does changing the intent resolver task limit change how long it takes to run into this error?

bdarnell on 21 Aug 2017

The assertions in #17580 should catch this if the failing ResolveIntent request makes it down to the evaluation stage. This suggests that we may be losing the request earlier, in the intent resolver.

This hasn't been reproduced with those assertions in place yet, has it?

tamird on 21 Aug 2017

Hmmmmm:

gcKeys, info, err := RunGC(ctx, desc, snap, now, zone.GC,
    func(now hlc.Timestamp, txn *roachpb.Transaction, typ roachpb.PushTxnType) {
        pushTxn(ctx, gcq.store.DB(), now, txn, typ)
    },
    func(intents []roachpb.Intent, poison bool, wait bool) error {
        return repl.store.intentResolver.resolveIntents(ctx, intents, poison, wait)
    })

// later:


if err := resolveIntentsFn(intents, true /* wait */, false /* !poison */); err != nil {
    return nil, GCInfo{}, err
}

if err := resolveIntents(roachpb.AsIntents(txn.Intents, &txn),
    true /* wait */, false /* !poison */); err != nil {
    log.Warningf(ctx, "failed to resolve intents of aborted txn on gc: %s", err)
}

I was really hopeful here (notice how the order of (poison, wait) flips between definition and call sites) but it looks like the intent resolver really takes (wait, poison).

I'll refactor this though, to make sure we can never ever actually mix those two.

tbg on 21 Aug 2017

@petermattis was this cluster tracing to lightstep? If so, the gc queue activity might have been saved (unfortunately only if you set that up as something to track in that particular lightstep project). If we had that it'd be very helpful as we could correlate GC timings with the failure (and hopefully rule out GC this time).

tbg on 21 Aug 2017

@tschottdorf No, lightstep was not enabled. There have been only 2 occurrences of this problem recently (the other similar occurrences were due to meta2 splitting). Still trying to reproduce and will leave tests running overnight.

@tamird says:

This hasn't been reproduced with those assertions in place yet, has it?

No, it hasn't. The first occurrence was before I added the assertions. The most recent occurrence was running with a binary without the assertions.

@bdarnall says:

Does changing the intent resolver task limit change how long it takes to run into this error?

I haven't tried that. I imagine you're suggesting that we try a lower limit to see if it makes the problem reproduce quicker.

petermattis on 21 Aug 2017

@petermattis and can you roughly estimate whether GC might've been running? It shouldn't happen anytime before the GC TTL, so a default of 25 hrs. If you saw the problem before that, it suggests not focusing on the gc queue as @bdarnell suggests (and I think he's right).

Can you enable lightstep for your running tests, and make sure the gc queue is tracked?

tbg on 21 Aug 2017

BTW, this also reminds me of https://github.com/cockroachdb/cockroach/pull/16775 (if we fail the application, we shouldn't commit intents at all), but somehow it's the opposite and I also just looked at the intents handling again, and it seems ok.

tbg on 21 Aug 2017

@tschottdorf For both occurrences, the cluster had been running less than 1hr. The second occurrence, for which I still have all of the logs, the cluster had been running for 27m31s.

Lightstep is problematic for us right now because we're overloading their system. We need to set up our own collectors. Without that, we're losing spans and can't really trust what is present in Lightstep. I need to ping them again about doing so.

Are there --vmodule flags that are worth turning on? Or additional logging?

petermattis on 21 Aug 2017

it should be enough if you make this a Warningf:

log.Eventf(ctx, "done GC'ing, new score is %s", makeGCQueueScore(ctx, repl, repl.store.Clock().Now(), sysCfg))

It seems unlikely that it's the GC queue's fault though. To be even more sure, you could also make GC runs that run on a meta{1,2} range no-ops.

tbg on 21 Aug 2017

I just discovered an abort cache entry with the same timestamp as the missing meta key:

/Local/RangeID/1/r/AbortCache/"00bddd07-bf9d-45b2-a753-10e4ca7c6252": key=/Local/Range/"\xbb\x89\x80\xb7\x1ah\xa7\xef\xeb}$"/RangeDescriptor timestamp=1503150863.786684851,0 priority=1049194

The timestamp here (1503150863.786684851,0) matches up with missing LHS meta key mentioned in this comment.

petermattis on 21 Aug 2017

Isn't that abort cache entry indicative of a ResolveIntent with Status == ABORTED?

petermattis on 21 Aug 2017

I'm not near a computer, but that sounds like you're on a hot trail.

tbg on 21 Aug 2017

Conjecture that doesn't quite work out: an explicit EndTxn with commit=false sent by TxnCoordSender is the culprit. These happen without the client's direct knowledge. I can't think of a way in which that would break this, though. Even if we about the pending txn and immediately GC the entry, how would the actual commit partially apply? That requires BeginTxn. Either way, we should log those prominently. Having the intent resolver block presumably increases our chances of context cancellation in TxnCoordSender.

tbg on 21 Aug 2017

No luck on reproduction running tests over night.

@bdarnell, @tschottdorf Why do evalResolveIntent and evalResolveIntentRange write an abort cache entry for the transaction? Where is that entry read? The only read of the abort cache I can find is in checkIfTxnAborted.

petermattis on 21 Aug 2017

Perhaps related:

W170819 13:54:27.794504 11409718 storage/replica.go:2091 [n4,s4,r287880/2:/Table/51/1/-5252{7709…-5791…}] context deadline exceeded while in command queue: PushTxn [/Local/Range/"\xbb\x89\x80\xb7\x1ah\xa7\xef\xeb}$"/RangeDescriptor,/Min)

Fixing the pretty printing of RangeDescriptor keys, that looks like:

W170819 13:54:27.794504 11409718 storage/replica.go:2091 [n4,s4,r287880/2:/Table/51/1/-5252{7709…-5791…}] context deadline exceeded while in command queue: PushTxn [/Local/Range//Table/51/1/-5252770944880247516/RangeDescriptor,/Min)

This is the same transaction key as the split with the missing LHS meta key and it occurred 3 seconds after the split which corresponds to the base.NetworkTimeout used by the intent resolver.

petermattis on 21 Aug 2017

Why do evalResolveIntent and evalResolveIntentRange write an abort cache entry for the transaction? Where is that entry read? The only read of the abort cache I can find is in checkIfTxnAborted.

That's the one that matters, when a txn's intents are aborted, we don't want that txn to be able to come back and read the keys it previously wrote (for it'd read something else).

tbg on 21 Aug 2017

That's the one that matters, when a txn's intents are aborted, we don't want that txn to be able to come back and read the keys it previously wrote (for it'd read something else).

Ah, got it. So after resolving an intent with status == ABORTED, any operation by the transaction that touches the range containing that intent will get an error? So how did the transaction possibly commit successfully?

petermattis on 21 Aug 2017

It doesn't have to touch that range successfully to commit, does it?

our txn writes to rangeX
our txn writes to meta2
something bad happens and meta2-intent gets killed and meta2 poisoned
but we don't go back, we commit on rangeX

tbg on 21 Aug 2017

(clearly that also shouldn't work because we would have to find we're aborted on rangeX)

tbg on 21 Aug 2017

Right. I'm staring at the IntentResolver code right now trying to see if there is any code path which results in us aborting an intent without having successfully aborting the transaction first.

petermattis on 21 Aug 2017

Just as a sanity check, I put some logging into EndTransaction and am watching the split that happens as a result of CREATE DATABASE foo;. I'm logging the "external intents" produced by the commit.

I'm expecting there to be two external intents, corresponding to both meta2 writes (updating the existing descriptor, writing the new descriptor), and indeed that is the case (plus some more -- perhaps event logging, etc?)

W170821 10:11:31.207897 702 storage/replica_command.go:792
[n1,s1,r14/1:/{Table/19-Max}] EXTERNALINTENTS: [

{Span:/M{eta2/Table/50-in}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/M{eta2/Max-in}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/{Table/13/1/2017-08-21T14:11:31.203478Z/272807661420019712/0-Min}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/{Table/13/1/2017-08-21T14:11:31.203478Z/272807661420019712/2/1-Min}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/{Table/13/1/2017-08-21T14:11:31.203478Z/272807661420019712/3/1-Min}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/{Table/13/1/2017-08-21T14:11:31.203478Z/272807661420019712/4/1-Min}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/{Table/13/1/2017-08-21T14:11:31.203478Z/272807661420019712/5/1-Min}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}

{Span:/{Table/13/1/2017-08-21T14:11:31.203478Z/272807661420019712/6/1-Min}
Txn:{ID:cad2b450-0dea-428f-a2f9-b77904f03432 Isolation:SERIALIZABLE Key:[1 107
18 155 0 1 114 100 115 99] Epoch:0 Timestamp:1503324691.203477833,0
Priority:459063 Sequence:2 BatchIndex:0} Status:COMMITTED}]

I checked this because if we failed to return an intent as "external", we wouldn't even try to resolve it, and so concurrent reads might race with our transactions self-removal (after "successfully" resolving its intents). But it doesn't seem that we're getting off that easily. I'll still audit the code that does the splitting - perhaps it doesn't manifest all the time.

We could also consider an assertion that if there is a split trigger, we're returning two intents on meta2 to the intent resolver.

tbg on 21 Aug 2017

We could also consider an assertion that if there is a split trigger, we're returning two intents on meta2 to the intent resolver.

Yeah, I'll see about adding that to my testing binary.

petermattis on 21 Aug 2017

Apologies if you answered this before, but you didn't find a transaction entry for the split, right? If there were some problem with the premature self-removal of the commit path, then a push could conceivably write an aborted entry (only BeginTxn checks the tsCache), and the corresponding intent resolution would poison, consistent with the abort cache entry you've found.

tbg on 21 Aug 2017

I found one transaction entry and the key is what I'd expect, but it has the wrong timestamp:

/Local/Range/Table/51/1/-5252770944880247516/Transaction/"91483bdf-92c4-45f1-a8b6-cca9f803a6b5": id=91483bdf key=/Local/Range/Table/51/1/-5252770944880247516/RangeDescriptor rw=false pri=0.03844001 iso=SERIALIZABLE stat=ABORTED epo=0 ts=1503151075.631198693,0 orig=1503151075.631198693,0 max=0.000000000,0 wto=false rop=false seq=9

The timestamp here corresponds to 13:57:55. This is exactly when the assertion fired, but well after the fouled up split transaction. And notice that the transaction ID doesn't correspond to the abort cache entry I found.

petermattis on 21 Aug 2017

Another piece of code I think might be problematic, though I don't have a specific scenario in mind: The eager self-removal of the txn entry after resolving its intents makes a special GC key span, but really touches the txn record. When we declare the keys for GC, we just use the supplied span. This means that the command queue doesn't adequately serialize concurrent ops that touch the txn entry. You could, for instance, start evaluating a push while the GC is in flight, or start a GC while the push is underway. Not sure how that would matter.

var gcArgs roachpb.GCRequest
{
    key := keys.MustAddr(txn.Key)
    if localMax := keys.MustAddr(keys.LocalMax); key.Less(localMax) {
        key = localMax
    }
    endKey := key.Next()

    gcArgs.Span = roachpb.Span{
        Key:    key.AsRawKey(),
        EndKey: endKey.AsRawKey(),
    }
}

tbg on 21 Aug 2017

For the record, the last occurrence of this problem was on cce46a31bf67b5fd3fe1d7c47efa171fb915409e. I'm somewhat surprised that it hasn't reproduced in my testing this weekend. I'm running a binary that is close to current master with a mix of patches such as #17580, #17766 and my debug meta-check change. Can't see how those would have fixed the problem, though.

petermattis on 21 Aug 2017

Could you link to the actual branch you're running?

tbg on 21 Aug 2017

For posterity: I looked through mvccResolveWriteIntent as well. What's mildly concerning is that deleting the txn entry is essentially the "catch-all" branch in that code. For example, if we had an intent at epoch 10, and an intent resolution (abort or commit) came in with epoch 9, I think we would actually nuke the intent.

I think that might actually happen in practice: txn writes at epoch 9, txn gets aborted on its home range (which is elsewhere), txn catches an unrelated restart and writes the intent at epoch 10. Then the abort comes in and nukes it. However, that transaction shouldn't be able to commit, so there isn't an obvious problem here.

tbg on 21 Aug 2017

@tschottdorf https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/denim?expand=1

Though I'm not running with the most recent commit there (yet).

petermattis on 21 Aug 2017

It shouldn't happen anytime before the GC TTL, so a default of 25 hrs

Prior to #9374, the intent was written before the txn record. Such an intent is eligible for GC immediately. Shouldn't happen any more, though.

Conjecture that doesn't quite work out: an explicit EndTxn with commit=false sent by TxnCoordSender is the culprit. These happen without the client's direct knowledge. I can't think of a way in which that would break this, though.

Synthesizing a roachpb.Transaction with a status of ABORTED or a TransactionAbortedError without persisting the aborted transaction record can break things in this way (because a client can take that as permission to resolve the intent as aborted). I intercepted a change in one recent code review that could have done that; maybe another one snuck in.

The eager self-removal of the txn entry after resolving its intents makes a special GC key span, but really touches the txn record. When we declare the keys for GC, we just use the supplied span. This means that the command queue doesn't adequately serialize concurrent ops that touch the txn entry.

This code is tricksy, but it's supposed to handle exactly this case, by constructing a non-empty span that covers the transaction record. If it failed to do so in unit tests, we'd know because of the SpanSet assertions. It might be worth enabling those for this testing.

bdarnell on 21 Aug 2017

This code is tricksy, but it's supposed to handle exactly this case, by constructing a non-empty span that covers the transaction record. If it failed to do so in unit tests, we'd know because of the SpanSet assertions. It might be worth enabling those for this testing.

I've enabled the SpanSet assertions on my pmattis/denim branch and I've enabled COCKROACH_ROCKSDB_WAL_TTL=24h. If we do see another occurrence we'll be able to reconstruct precisely what happened at the RocksDB level.

petermattis on 21 Aug 2017

Such an intent is eligible for GC immediately. Shouldn't happen any more, though.

Only when the GC queue runs, though. I assume it didn't, but it would be good to know for sure.

Synthesizing a roachpb.Transaction with a status of ABORTED or a TransactionAbortedError without persisting the aborted transaction record can break things in this way

Yeah, I also double checked that. evalEndTransaction seems to do the right thing, though.
Looking at evalPushTxn though, and having a WTF moment:

    // If already committed or aborted, return success.
    if reply.PusheeTxn.Status != roachpb.PENDING {
        // Trivial noop.
        return EvalResult{}, nil
    }

Won't that mean that if we find a committed pushee, and our push is trying to abort, it will return success to the caller, which would then... proceed to nuke intents, racing with the committer trying to commit them? This code is before we check anything about the persisted txn, really.

tbg on 21 Aug 2017

Won't that mean that if we find a committed pushee, and our push is trying to abort, it will return success to the caller, which would then... proceed to nuke intents, racing with the committer trying to commit them?

The PushTxn will return success, and a Transaction proto with Status==COMMITTED. The caller will use that status to do the right thing.

bdarnell on 21 Aug 2017

😄1 👍1

We have a reoccurrence (found via debug meta-check):

invalid split: missing LHS meta key
  Aug 22 01:26:57.354887271 /Meta2/Table/51/1/-3115641400805916584 -> r445488:[/Table/51/1/-3115670186029628967,/Table/51/1/-3115641400805916584)

And in the logs we have:

W170822 01:27:01.456051 5303470 storage/engine/mvcc.go:1857  [n6,s6,r1/4:/{Min-System/}] unable to find value for /Meta2/Table/51/1/-3115670186029628967 @ 1503365217.354887271,0
W170822 01:27:01.649816 5303603 storage/engine/mvcc.go:1857  [n6,s6,r1/4:/{Min-System/}] unable to find value for /Meta2/Table/51/1/-3115670186029628967 @ 1503365217.354887271,0

This log message corresponds to one of the "assertions" added in #17580:

    // For cases where there's no write intent to resolve, or one exists
    // which we can't resolve, this is a noop.
    if !ok {
        if intent.Status == roachpb.COMMITTED {
            log.Warningf(ctx, "unable to find value for %s @ %s",
                intent.Key, intent.Txn.Timestamp)
        }
        return nil
    }

The missing LHS meta key corresponds to this split:

I170822 01:26:57.354871 4116836 storage/replica_command.go:2703  [n5,s5,r303124/2:/Table/51/1/-31156{704…-414…}] initiating a split of this range at key /Table/51/1/-3115670186029628967 [r445488]

The data directory is is now backed up to cockroach.tar on denim 1-6. The logs are in logs/cockroach.stderr on the same machine. I'll take a further look at this tomorrow (unless someone else wants to figure it out tonight).

petermattis on 22 Aug 2017

The most recent occurrence was on a binary that reenabled the span-set assertions and asserts that split txns contain 2 meta2 intents. Neither of those assertions fired.

What did fire (as a warning) is the check that committing an intent finds either an intent or a key at the desired timestamp. Notice the timestamps of when that warning occurred. The split txn had a timestamp of 01:26:57, but the warnings occurred at 01:27:01.46 and 01:27:01.65. Why were there 2 attempts to commit the intent? Need to spend some quality time with the logs.

petermattis on 22 Aug 2017

Some more logs related to the intent resolution of the missing LHS meta key:

W170822 01:27:01.330235 5303315 storage/replica.go:2648  [n6,s6,r1/4:/{Min-System/}] context cancellation after 0.0s of attempting command ResolveIntent [/Meta2/Table/51/1/-3115670186029628967,/Min)
W170822 01:27:01.456051 5303470 storage/engine/mvcc.go:1857  [n6,s6,r1/4:/{Min-System/}] unable to find value for /Meta2/Table/51/1/-3115670186029628967 @ 1503365217.354887271,0
W170822 01:27:01.649816 5303603 storage/engine/mvcc.go:1857  [n6,s6,r1/4:/{Min-System/}] unable to find value for /Meta2/Table/51/1/-3115670186029628967 @ 1503365217.354887271,0
W170822 01:27:01.703917 5303603 storage/replica.go:2648  [n6,s6,r1/4:/{Min-System/}] context cancellation after 0.1s of attempting command ResolveIntent [/Meta2/Table/51/1/-3115670186029628967,/Min), ResolveIntent [/Meta2/Table/51/1/-3115487473449930676,/Min)

The logs also contain a whole bunch of messages about throttling intent resolution:

I170822 01:26:57.009090 4115469 util/stop/stopper.go:296  stopper throttling task from storage.intentResolve: resolving intents due to semaphore

petermattis on 22 Aug 2017

From the WAL on n6 I can see the batch which added the meta2 intents for the split:

PUT(0) : /Meta2/Table/51/1/-3115641400805916584/1503365217.354887271,0 PUT(0) : /Meta2/Table/51/1/-3115641400805916584 PUT(0) : /Meta2/Table/51/1/-3115670186029628967/1503365217.354887271,0 PUT(0) : /Meta2/Table/51/1/-3115670186029628967 PUT(0) : /Local/RangeID/1/r/RaftAppliedIndex PUT(0) : /Local/RangeID/1/r/LeaseAppliedIndex PUT(0) : /Local/RangeID/1/r/RangeStats

Notice we wrote both the MVCC metadata key and the version key at the desired timestamp.

Later in the WAL:

DELETE(0) : /Meta2/Table/51/1/-3115670186029628967/1503365217.354887271,0 DELETE(0) : /Meta2/Table/51/1/-3115670186029628967 PUT(0) : /Local/RangeID/1/r/AbortCache/"4c98937f-1988-44ad-af6a-6ab4f4e3a86d" PUT(0) : /Local/RangeID/1/r/RaftAppliedIndex PUT(0) : /Local/RangeID/1/r/LeaseAppliedIndex PUT(0) : /Local/RangeID/1/r/RangeStats

Here we deleted the LHS meta key for the split and added an abort-cache entry for the transaction. So the split txn definitely wrote the intents for the meta2 keys and then something triggered the deletion of the LHS key meta2 intent.

PS The rocksdb dump_wal command doesn't allow me to easily format values. I'm going to work on plumbing that in as well which will be needed to view the evolution of the transaction record for this split.

petermattis on 22 Aug 2017

Plumbing formatting of values into rocksdb dump_wal will take a bit of time, so instead I manually tracked down the transaction record changes for 4c98937f-1988-44ad-af6a-6ab4f4e3a86d and decoded the values (which were hex strings):

"split" id=4c98937f key=/Local/Range/Table/51/1/-3115670436260695454/RangeDescriptor rw=true pri=0.05985145 iso=SERIALIZABLE stat=PENDING epo=0 ts=1503365217.354887271,0 orig=1503365217.354887271,0 max=1503365217.354887271,0 wto=false rop=false seq=2
"split" id=4c98937f key=/Local/Range/Table/51/1/-3115670436260695454/RangeDescriptor rw=true pri=0.05985145 iso=SERIALIZABLE stat=PENDING epo=0 ts=1503365217.354887271,0 orig=1503365217.354887271,0 max=1503365217.354887271,0 wto=false rop=false seq=2
"split" id=4c98937f key=/Local/Range/Table/51/1/-3115670436260695454/RangeDescriptor rw=true pri=0.05985145 iso=SERIALIZABLE stat=ABORTED epo=0 ts=1503365217.354887271,0 orig=1503365217.354887271,0 max=1503365217.354887271,0 wto=false rop=false seq=2
"split" id=4c98937f key=/Local/Range/Table/51/1/-3115670436260695454/RangeDescriptor rw=true pri=0.05985145 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1503365217.354887271,0 orig=1503365217.354887271,0 max=1503365217.354887271,0 wto=false rop=false seq=2 int=8
DELETED

PENDING, PENDING, ABORTED, COMMITTED! That's not good.

We don't have timestamps associated with the WAL entries, but I can determine that the number of log entries between the ABORT and COMMIT is 298. This is relatively small which means that those log entries occurred temporally close together.

petermattis on 22 Aug 2017

The ABORT does not have any intents attached (no int=X suffix, cf. with the commit which has int=8). If the client itself sent the abort, there would be intents, at least in any code path through TxnCoordSender (including (tc *TxnCoordSender).tryAsyncAbort). I think that effectively rules out the case in which the aborted entry was written by EndTransaction. Instead, it seems more likely that this is a Push succeeding when it shouldn't.

tbg on 22 Aug 2017

(or rather, the following illegal interleaving:

Push evaluates
Commit evaluates
Push applies
Commit applies
)

tbg on 22 Aug 2017

What commit and push declare seems fine (just double checking), in line with the fact that we didn't trigger the assertion. I wonder if we have a bug in the command queue when the context gets cancelled. I'm thinking something like

push enters cmdq
push gets proposed
push's ctx expires
push prematurely leaves command queue
commit enters cmdq
push applies
commit evaluates, applies, and leaves cmdq

Will take a look.

tbg on 22 Aug 2017

Oh, good idea. I suspect #9448 might have caused this.

bdarnell on 22 Aug 2017

Sorry, s/command queue/proposal/. If we leave the command queue, we don't propose, do we? I'll spelunk a bit.

tbg on 22 Aug 2017

If we're canceled while in the command queue, we won't propose, but if we propose and are subsequently canceled, what happens? Prior to #9448, we'd spawn a goroutine to wait indefinitely. Now, it looks like we might remove the command from the command queue prematurely, although I'm still re-reading that diff to see what's going on.

bdarnell on 22 Aug 2017

The Raft applied index that is present in the batch containing the ABORT is 25. The Raft applied index that is present in the batch containing the COMMIT is 26.

petermattis on 22 Aug 2017

one place that looks concerning though I don't think it's at fault here is this:

    sort.Sort(reproposals)
    for _, p := range reproposals {
        log.Eventf(p.ctx, "re-submitting command %x to Raft: %s", p.idKey, reason)
        if err := r.submitProposalLocked(p); err != nil {
            delete(r.mu.proposals, p.idKey)
            p.finishRaftApplication(proposalResult{Err: roachpb.NewError(err), ProposalRetry: proposalErrorReproposing})
        }
    }

This is in the reproposals logic. When we repropose a command, it's because we think it might actually apply. If we catch an error here, we'll release the command from the cmdQ prematurely. I've however checked the submit path, and essentially raftGroup.Propose would have to return an error, which it doesn't unless there's a real problem.

tbg on 22 Aug 2017

@bdarnell what you're talking about is in beginCmds though, and we "give up" if our context does. I just checked with @nvanbenschoten and if the push makes it past beginCmd to propose, the Commit doesn't do anything shady with its dependency's context. Only a premature removal of the push from the command queue would get us into trouble. That seems like the command queue isn't at fault here, at least not directly, or at least not when our model only contains those two operations.

tbg on 22 Aug 2017

32148379 - /Local/RangeID/303124/u/RaftLog/logIndex:25 (this is the ABORT)
32149654 - /Local/RangeID/303124/u/RaftLog/logIndex:26 (this is the COMMIT)
32149785 - apply Raft log index 25
32151441 - apply Raft log index 26

The 321* numbers are the WAL sequence numbers. Unless I made a mistake, we evaluated the COMMIT before the ABORT applied.

petermattis on 22 Aug 2017

Another potential culprit:

        if !r.store.TestingKnobs().DisableRefreshReasonSnapshotApplied &&
            refreshReason == noReason {
            refreshReason = reasonSnapshotApplied
        }

reasonSnapshotApplied is special-cased in refreshProposalsLocked: when we see this reason, and the current LeaseAppliedIndex is higher than what our proposal can apply with, we realize that we can't be sure whether it actually already applied. Consequently we return an ambiguous result error.

However, refreshReason as set above is overwritten when leadership changes.

Again, don't think that's what happened here, as a leadership change usually comes with its own applied Raft command (which we don't seem to see here), plus the new lease holder wouldn't apply commands from the old lease. Perhaps life finds a way to still mess things up, but I doubt it happened here.

tbg on 22 Aug 2017

A comment on the earlier log messages:

W170822 01:27:01.330235 5303315 storage/replica.go:2648  [n6,s6,r1/4:/{Min-System/}] context cancellation after 0.0s of attempting command ResolveIntent [/Meta2/Table/51/1/-3115670186029628967,/Min)
W170822 01:27:01.456051 5303470 storage/engine/mvcc.go:1857  [n6,s6,r1/4:/{Min-System/}] unable to find value for /Meta2/Table/51/1/-3115670186029628967 @ 1503365217.354887271,0
W170822 01:27:01.649816 5303603 storage/engine/mvcc.go:1857  [n6,s6,r1/4:/{Min-System/}] unable to find value for /Meta2/Table/51/1/-3115670186029628967 @ 1503365217.354887271,0
W170822 01:27:01.703917 5303603 storage/replica.go:2648  [n6,s6,r1/4:/{Min-System/}] context cancellation after 0.1s of attempting command ResolveIntent [/Meta2/Table/51/1/-3115670186029628967,/Min), ResolveIntent [/Meta2/Table/51/1/-3115487473449930676,/Min)

These are the intent resolutions dispatched by COMMIT. The context timeouts are interesting, too - it looks like we had an expired context when the proposal finished. Interestingly, we made it past this part below for EndTransaction. I think thus that EndTransaction's context expired while the proposal was being applied (so we set ctxDone == nil, but had to stay in the loop). The proposal came back, and we call intent resolution which fires off the intents but then immediately stops listening, printing the error messages.

Unfortunately that doesn't strike me as relevant immediately, but it's possible that this context cancellation is important here (in that it could potentially allow the "Push" to sneak in early, but I don't see how).

case propResult := <-ch:
    if len(propResult.Intents) > 0 {
                        // THIS WAS CALLED
        r.store.intentResolver.processIntentsAsync(r, propResult.Intents)
    }
    return propResult.Reply, propResult.Err, propResult.ProposalRetry
case <-slowTimer.C:
    log.Warningf(ctx, "have been waiting %s for proposing command %s",
        base.SlowRequestThreshold, ba)
    r.store.metrics.SlowRaftRequests.Inc(1)
    defer r.store.metrics.SlowRaftRequests.Dec(1)

case <-ctxDone:
    // If our context was cancelled, return an AmbiguousResultError
    // if the command isn't already being executed and using our
    // context, in which case we expect it to finish soon. The
    // AmbiguousResultError indicates to caller that the command may
    // have executed.
    if tryAbandon() {
                        // THIS WAS NOT CALLED, PRESUMABLY
        log.Warningf(ctx, "context cancellation after %0.1fs of attempting command %s",
            timeutil.Since(startTime).Seconds(), ba)
        return nil, roachpb.NewError(roachpb.NewAmbiguousResultError(ctx.Err().Error())), proposalNoRetry
    }
                // THIS MIGHT HAVE BEEN CALLED
    ctxDone = nil

tbg on 22 Aug 2017

tryAbandon no longer tries to cancel the application of the command. See this comment:

    // Must not use `proposal` in the closure below as a proposal which is not
    // present in r.mu.proposals is no longer protected by the mutex. Abandoning
    // a command only abandons the associated context. As soon as we propose a
    // command to Raft, ownership passes to the "below Raft" machinery. In
    // particular, endCmds will be invoked when the command is applied. There are
    // a handful of cases where the command may not be applied (or even
    // processed): the process crashes or the local replica is removed from the
    // range.

petermattis on 22 Aug 2017

Sorry, I didn't mean to imply that tryAbandon was doing anything. My point was that tryAbandon returned false (in which case it never did anything, current or previous code), which means that the command was in the process of being applied at the time of the call. Just an additional datapoint.

tbg on 22 Aug 2017

I've been withholding logs from you:

I170822 01:26:57.354871 4116836 storage/replica_command.go:2703 [n5,s5,r303124/2:/Table/51/1/-31156{704…-414…}] initiating a split of this range at key /Table/51/1/-3115670186029628967 [r445488]
W170822 01:27:01.163409 4121351 storage/replica.go:2091 [n5,s5,r303124/2:/Table/51/1/-31156{704…-414…}] context deadline exceeded while in command queue: PushTxn [/Local/Range/Table/51/1/-3115670436260695454/RangeDescriptor,/Min)

petermattis on 22 Aug 2017

Is there a leaseholder change going on? The split seems to be initially orchestrated on n5/s5, but the ResolveIntent ctx cancellation message is happening on n6,s6 (which indicates that it was proposed there, which in turn indicates that the split was proposed there). Is this perhaps a lease-change problem?

tbg on 22 Aug 2017

(you could dissect the ReplicatedEvalResults in the raft log further. I'd be interested in seeing them in their full glory).

tbg on 22 Aug 2017

I170822 01:23:44.219061 266 storage/replica_proposal.go:453 [n5,s5,r303124/2:/Table/51/1/-3115{6704…-5297…}] new range lease repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0 following repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0
I170822 01:27:01.335976 239 storage/replica_proposal.go:453 [n5,s5,r303124/2:/Table/51/1/-31156{704…-414…}] new range lease repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0 following repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0

Those are the lease acquisition messages for that range. Note that the ResolveIntent cancellation would happen on a different store because those are for the meta2 keys which reside on r1.

petermattis on 22 Aug 2017

Ah, duh, thanks. The second message was triggered by our split, right? The timestamp matches up perfectly.

tbg on 22 Aug 2017

The split happened at 01:26:57.354871, the PushTxn was cancelled while in the command queue at 01:27:01.163409, and the lease was (re-)acquired at 01:27:01.335976.

petermattis on 22 Aug 2017

01:26:57.354871 is when the split was started. The commit definitely happened before 01:27:01.456051 (assuming clocks weren't shady). Do you have the corresponding lease message for the RHS of this split? That should be the precise timestamp the split applied (well, or as as close as we can get).

Something about that lease acquisition message at 01:27:01.335976 strikes me as odd. Why are we reacquiring an epoch-based lease when we already held the lease? I thought we didn't need to renew those. What am I missing?

tbg on 22 Aug 2017

There is no new range lease message for r445488. Doesn't the RHS of a split inherit the lease from the parent? This is the first message I can find for r445488 (other than initiating the split):

I170822 01:27:01.336215 239 storage/allocator_scorer.go:589 [replicate,n5,s5,r445488/2:/Table/51/1/-31156{701…-414…}] s5: should-rebalance(ranges-overfull): rangeCount=235345, mean=223949.17, overfull-threshold=235147

petermattis on 22 Aug 2017

We call splitPostApply, which in turn calls rightRng.leasePostApply:

    rightRng.mu.Lock()
    // Copy the minLeaseProposedTS from the LHS.
    rightRng.mu.minLeaseProposedTS = r.mu.minLeaseProposedTS
    rightLease := *rightRng.mu.state.Lease
    rightRng.mu.Unlock()
    r.mu.Unlock()
...
    // Invoke the leasePostApply method to ensure we properly initialize
    // the replica according to whether it holds the lease. This enables
    // the PushTxnQueue.
    rightRng.leasePostApply(ctx, rightLease)

which in turn runs

func (r *Replica) leasePostApply(ctx context.Context, newLease roachpb.Lease) {
    r.mu.Lock()
    replicaID := r.mu.replicaID
    prevLease := *r.mu.state.Lease
    r.mu.Unlock()

    iAmTheLeaseHolder := newLease.Replica.ReplicaID == replicaID
    leaseChangingHands := prevLease.Replica.StoreID != newLease.Replica.StoreID

    if iAmTheLeaseHolder {
        // Always log lease acquisition for epoch-based leases which are
        // infrequent.
        if newLease.Type() == roachpb.LeaseEpoch || (log.V(1) && leaseChangingHands) {
            log.Infof(ctx, "new range lease %s following %s", newLease, prevLease)
        }
    }

Assuming that the RHS has the lease, I would expect it to log. Maybe we're messing up something with the lease on splits? That might also (but not quite) explain why we're seeing a lease message for the original range where the lease follows itself. Something to check.

tbg on 22 Aug 2017

(the likely problem here would be that we do indeed initialize the lease to disk, but perhaps somehow it doesn't make it to rightRng's in-mem state at that point in the code). Probably I'm just confused, but better not to leave stones unturned.

tbg on 22 Aug 2017

Do you still want the ReplicatedEvalResult? For which log entries? The PushTxn(ABORT) and EndTransaction(COMMIT)?

petermattis on 22 Aug 2017

Yep. Assume they're isn't much else around, right? I'll take them for the intent resolutions too. But I'm just fishing.

tbg on 23 Aug 2017

Btw, I've been running a modified TestIncrement with artificially introduced context cancellations. No dice so far, unfortunately. Would also be surprised if it just broke like that.

tbg on 23 Aug 2017

PushTxn(ABORT): logindex:25.txt
EndTransaction(COMMIT): logindex:26.txt

Still tracking down the intent resolution Raft log entries.

petermattis on 23 Aug 2017

OK, so one incarnation of PushTxn timed out while it was in the command queue. There must have been at least one more attempt which went on to complete. Do we have any instances of context cancellation after XX of attempting command PushTxn in the logs?

tryAbandon() looks broken to me. It returns true unless the command is in the process of executing (in which case calling endCmds.done() would race with the completion of the command). Its caller, however, is treating a true return from tryAbandon as meaning "This command will never execute, remove it from the command queue".

I think tryAbandon needs to be much more conservative. In fact, I'm not sure if it can ever safely return true (or conversely, its return value can never be used in the way we're using it now).

bdarnell on 24 Aug 2017

[nevermind, very silly. Disregard!]

tbg on 24 Aug 2017

@bdarnell when we calltryAbandon() we've already proposed to raft and set endCmds = nil, so this code path looks fine to me. endCmds is only called by the raft goroutine (after applying the cmd) or by refreshProposalsLocked, both times through finishRaftSomethingSomething. I've looked at those a bunch, but nothing has jumped out at me.

tbg on 24 Aug 2017

I see. Yeah, that's not it.

bdarnell on 24 Aug 2017

I'm planning on adding some detailed logging around proposals and the contents of the command queue and try for another reproduction. Haven't done so yet, but hope to get this done and running today. Are there other possibilities for how we could have concurrent PushTxn and EndTransaction commands inflight? Perhaps something during lease transfers. There was a lease renewal around the time of the incident, but it wasn't a transfer.

petermattis on 24 Aug 2017

The lease was proposed at 8/21/2017, 13:01:44, well before the action happened. I don't see any indication of a lease transfer, and the Epoch for the renewed lease didn't increment. When can that even happen? And why don't we see the lease renewal message for the RHS when I expect it (https://github.com/cockroachdb/cockroach/issues/16266#issuecomment-324143229)?

Either way, I'd be surprised if this weren't about the command queue and premature cancellation somehow. I'll whip up some logging I think I'd find useful.

tbg on 24 Aug 2017

This is what I'm running with: https://github.com/cockroachdb/cockroach/commit/f8762ceed1b6416c8f236bf27c4f976de6cd6a6e

Easy enough to add more.

petermattis on 24 Aug 2017

Perhaps a log line in finishRaftApplication itself would be helpful to make sure we haven't missed one (and to roughly see when the command queue removal happens). Other than that, this looks good.

tbg on 24 Aug 2017

Ok. Looks like it is going to create ~500MB logs for each run.

petermattis on 24 Aug 2017

Update: tests are running on denim again. We've gone through 6 test runs so far. Log files are actually ~400MB per run. Hopefully we'll have a repro in the next day or two.

petermattis on 24 Aug 2017

These lease messages (going back to https://github.com/cockroachdb/cockroach/issues/16266#issuecomment-324134520) are strange:

I170822 01:23:44.219061 266 storage/replica_proposal.go:453 [n5,s5,r303124/2:/Table/51/1/-3115{6704…-5297…}] new range lease repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0 following repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0
I170822 01:27:01.335976 239 storage/replica_proposal.go:453 [n5,s5,r303124/2:/Table/51/1/-31156{704…-414…}] new range lease repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0 following repl=(n5,s5):2 start=1503363704.443276018,0 epo=1 pro=1503363704.443278218,0

These leases are identical and proposed at 01:01:44; the lease manages to apply twice more than 20 minutes later. This is especially surprising because leases are exempt from (some) re-proposals. In between the two lease applications, the range managed to split.

Looking into the ReplicatedEvalResults from https://github.com/cockroachdb/cockroach/issues/16266#issuecomment-324190415, the timestamps are out of order. The PushTxn in log index 25 has timestamp 1503365218331180962 (01:26:58) and the EndTransaction in log index 26 has timestamp 1503365217354887271 (01:26:57). I don't think that's directly relevant (this timestamp field only appears to be used for checking that we're not trying to write behind the GCThreshold), but it shows that either clocks were offset or something caused the EndTransaction to be delayed before it was proposed, allowing the PushTxn to get in first.

bdarnell on 24 Aug 2017

but it shows that either clocks were offset or something caused the EndTransaction to be delayed before it was proposed, allowing the PushTxn to get in first.

ReplicatedEvalResult.Timestamp is copied from BatchRequest.Timestamp. As far as I can tell, for transactional requests we use BatchRequest.Txn.Timestamp. So it isn't surprising that the EndTransaction request would have an early timestamp than the PushTxn, unless I'm misunderstanding something.

petermattis on 24 Aug 2017

It's not surprising in the sense that it shouldn't cause any problems, but it shows that we have this ordering of events:

Split txn calls Commit
Something else calls PushTxn
The two commits should collide in the command queue and the later one should be forced to wait.
PushTxn is proposed, indicating that it won the race to the command queue

So Commit was called before PushTxn (modulo clock offsets), but PushTxn had priority in the command queue. This may not be unusual (the two commands could be coming from different nodes), but it could point to something funny happening (e.g. a bug in the command queue).

bdarnell on 24 Aug 2017

Why did the commit have to happen before the push? The timestamp in the batch containing the EndTransaction is from when the transaction started, not when the EndTransaction was sent, right? I could be mistaken as I'm not terribly familiar with all of this.

petermattis on 24 Aug 2017

Regardless, something funky is going on as the PushTxn and EndTransaction should have collided in the command queue no matter what order they arrived in.

petermattis on 24 Aug 2017

Ah, right. If the EndTransaction is using its transaction's timestamp then it's expected to be lower than the push's timestamp.

bdarnell on 24 Aug 2017

We have another occurrence. Some relevant logs (the full logs are here):

I170825 05:53:32.120342 15224381 storage/replica.go:2028 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 3 overlapping requests
I170825 05:53:33.204211 15232755 storage/replica.go:2028 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 EndTxn: waiting for 5 overlapping requests
I170825 05:53:33.205214 15224381 storage/replica.go:2028 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 2 overlapping requests
I170825 05:53:34.787416 15237224 storage/replica.go:2028 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 2 overlapping requests
W170825 05:53:35.018087 15224381 storage/replica.go:2111 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] context canceled while in command queue: PushTxn [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor,/Min)
I170825 05:53:35.018382 15237224 storage/replica.go:3075 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] proposing command 0ba12e830838a78e: 1 PushTxn
I170825 05:53:35.444466 15232755 storage/replica.go:3075 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] proposing command 18c3bd3d47d49b9b: 1 EndTxn
I170825 05:53:36.015515 242 storage/replica.go:4203 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] processing command 0ba12e830838a78e: maxLeaseIndex=5
I170825 05:53:36.015575 242 storage/replica_proposal.go:121 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] finish proposal 0ba12e830838a78e: 1 PushTxn
I170825 05:53:36.191612 259 storage/replica.go:4203 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] processing command 18c3bd3d47d49b9b: maxLeaseIndex=6
I170825 05:53:36.282650 259 storage/replica_proposal.go:457 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] new range lease repl=(n6,s6):1 start=1503638103.940207229,0 epo=1 pro=1503638103.940208829,0 following repl=(n6,s6):1 start=1503638103.940207229,0 epo=1 pro=1503638103.940208829,0
I170825 05:53:36.286820 259 storage/replica_proposal.go:121 [n6,s6,r447049/1:/Table/51/1/1561{1869…-2608…}] finish proposal 18c3bd3d47d49b9b: 1 EndTxn

Notice that the first PushTxn goes through beginCmds twice (goroutine ID 15224381): once at 05:53:32.120342 and then at 05:53:33.205214. That's curious because these are all of the log messages for r447049. At 05:53:35.018087, that PushTxn is cancelled while in the command queue, at which point we unblock both the other PushTxn and the EndTxn which propose concurrently and badness results. Seems like the scenario is:

PushTxn 1 enters command queue
EndTxn enters command queue
PushTxn 2 enters command queue
PushTxn 1 is cancelled while in command queue
PushTxn 2 and EndTxn are both allowed to proceed

Paging @nvanbenschoten who has touched the command queue code recently. Seems like the above should be enough to try and reproduce in a test. Also, see the full logs linked above which contain the contents of the command queue and the spans being waited on.

petermattis on 25 Aug 2017

The initial PushTxn going through beginCmds twice is very strange. I'm not quite sure what to make of that. If the PushTxn was actually inserted twice into the CommandQueue then that could lead to issues like this during cancellation, but I have no clue how that could happen.

Did you see any "waited %s for overlapping requests" logs? It doesn't look like they're in the dump you posted.

If there's nothing going on with multiple insertions of the PushTxn 1 command, then something must be happening with PushTxn 2 not seeing EndTxn as a dependency. When adding in #9448, one thing I was a little worried about was missing strange interactions it might have with #14342. Adding in #17436 made me feel better about that though, and I don't think it's coming into play here because PushTxn and EndTxn are both writes so #14342 wouldn't even apply.

nvanbenschoten on 25 Aug 2017

The waited %s for overlapping requests is a log.Eventf so it didn't get sent to the log.

petermattis on 25 Aug 2017

@nvanbenschoten The synchronization is subtle between trimming the already waited for cmd.prereqs during context cancellation and looking at cmd.cancelled when cmd.pending is closed. I haven't found anything wrong with it, just observing.

petermattis on 25 Aug 2017

I think I may see an issue, although I haven't figure out exactly how it could cause this. When we set cmd.cancelled = true, I think we need to set it to true for all 4 cmds in newCmds.

nvanbenschoten on 25 Aug 2017

Something like:

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index 9534e4cc9..6957af958 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -2098,13 +2098,12 @@ func (r *Replica) beginCmds(
                                                        newCmd.prereqs[j] = nil
                                                }
                                                newCmd.prereqs = newCmd.prereqs[i:]
-                                               newCmd.cancelled = true

                                                // Remove the command from the command queue immediately. Dependents will
                                                // transfer transitive dependencies when they try to block on this command.
                                                // New commands that would have established a dependency on this command
                                                // will never see it, which is fine.
-                                               r.removeCmdsFromCommandQueue(newCmds)
+                                               r.cancelCmdsInCommandQueue(newCmds)
                                                return nil, err
                                        case <-r.store.stopper.ShouldQuiesce():
                                                // While shutting down, commands may have been added to the
@@ -2148,6 +2147,15 @@ func (r *Replica) beginCmds(
        return ec, nil
 }

+func (r *Replica) cancelCmdsInCommandQueue(cmds batchCmdSet) {
+       for _, accessCmds := range cmds {
+               for scope, cmd := range accessCmds {
+                       cmd.cancelled = true
+               }
+       }
+       r.removeCmdsFromCommandQueue(cmds)
+}
+

Without this, dependents will only transfer transitive dependencies of cancelled commands for the access/scope combination that was blocking in beginCmds when the context was cancelled.

EDIT: This means that transitive dependencies for any access/scope combinations that were not already looked at when the context was cancelled would have been lost.

nvanbenschoten on 25 Aug 2017

Oh yeah, that would do it. There is a lot of logic related to cancellation in beginCmds that is probably not being tested adequately.

petermattis on 25 Aug 2017

Well, I still don't see exactly how that would do it, because the second PushTxn should still have been waiting on the EndTxn.

EDIT: unless there were other cancellations as well.

nvanbenschoten on 25 Aug 2017

Yeah, I think more testing is needed around local keys. The problem was that the change, and specifically these tests, were added before propEvalKV, and clearly I missed some of the subtleties of multiple *cmds per BatchRequest when rebasing and migrating the change over.

nvanbenschoten on 25 Aug 2017

I'll get a fix out for this issue today.

nvanbenschoten on 25 Aug 2017

Well, I still don't see exactly how that would do it, because the second PushTxn should still have been waiting on the EndTxn.

Ah, good point.

EDIT: unless there were other cancellations as well.

I believe there was a HeartbeatTxn occurring as well. Notice the 7 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc") entry in the command queue when the first PushTxn arrived. My branch is only doing extra logging for PushTxn and EndTransaction(split-trigger) requests.

petermattis on 25 Aug 2017

Here's what I think happened:

PushTxn 1 enters the command queue waiting for Unknown
EndTxn enters the command queue waiting for PushTxn 1 and Unknown
PushTxn 1 re-enters the command queue waiting for EndTxn
HeartbeatTxn enters the command queue waiting for PushTxn 1
PushTxn 2 enters the command queue waiting for HeartbatTxn 1
PushTxn 1 cancels
HeartbeatTxn cancels
PushTxn 2 proceeds

I'm not precisely sure on the ordering and cancellation here, but not that HeartbeatTxn has a read on the txn abort-cache key while PushTxn and EndTxn have a write on that key. All 3 operations have a write on the transaction key. I bet we can find some ordering of cancellations that tickles the cmd.cancelled bug.

petermattis on 25 Aug 2017

So how about this as a possible scenario:

PushTxn 1 enters command queue

push

EndTxn enters command queue

push
  |
end-txn

Heartbeat enters command queue

push
  |
end-txn
  |
heartbeat

PushTxn 2 enters command queue

push
  |---------
end-txn    |
  |        |
heartbeat  |
  |        |
push -------

Heartbeat is cancelled

push
  |---------
end-txn    |
           |
push -------

PushTxn 1 is cancelled while in command queue

end-txn    |
           |
push -------

PushTxn 2 and EndTxn are both allowed to proceed

nvanbenschoten on 25 Aug 2017

Heh, our messages crossed. Your diagramming skills are much better than mine. I think it is time to try and write a test the tickles this condition. I'm offline for a few hours starting now. Maybe this will be magically fixed when I get back.

petermattis on 25 Aug 2017

The command queue never ceases to amaze. Glad we (seem to have) figured that one out.

tbg on 25 Aug 2017

@nvanbenschoten When "PushTxn 2 enters command queue", how did it end up with an edge pointing to PushTxn1 but not EndTxn? I haven't been able to figure out a way that PushTxn2 would end up depending on something that EndTxn1 depended on instead of depending on EndTxn1 itself.

We have one more puzzle piece to play with: A regular write that is a part of a transaction reads that transaction's abort cache key. This may give us a way to construct the graph we need since EndTxn would have to wait for a regular read (because this is a split) but Heartbeat wouldn't. But I still haven't been able to construct the problematic scenario yet.

I'm going to go back to the question of PushTxn1 going through beginCmds twice, since we don't have a satisfactory answer for that and I could see all kinds of insanity resulting if the same command went through this cycle twice.

bdarnell on 28 Aug 2017

@bdarnell I was wondering about both of those issues and don't have a good answer. My guess was that a pair of ResolveIntent requests allowed it to end up with an edge around the EndTxn while still creating dependencies on the AbortCache key, but I wasn't sure why the EndTxn wouldn't also touch the same intent.

Now I think I stumbled onto a bigger issue. I don't think the transfer of command dependencies was working correctly with the coveringOptimization. It looks like we're only setting cancelled on the parent cmd, not the children. This means that in cases where commands had dependencies on expanded child cmds, they would never see the cancelled flag at all once the child's pending chan was closed. A test I'm playing with now is showing that without fixing this, we could cause this issue with only an EndTxn and two PushTxns.

nvanbenschoten on 28 Aug 2017

The covering optimization is disabled in the local command queue (and everything we're looking at in this issue is in the local command queue). However, that might be the key to a separate issue: #17815.

bdarnell on 28 Aug 2017

Ah you're right, the covering optimization is disabled in the local command queue. However, I don't actually think this issue requires us to use the covering optimization, I think it just requires that we treat a parent *cmd as a handle to all child cmds, which we do regardless. The coveringOptimization flag determines whether we initially insert this *cmd into the interval tree or whether we initially insert all the children, but it won't affect how beginCmds interacts with batches that create multiple cmds.

nvanbenschoten on 28 Aug 2017

Here's what I see poring through the logs:

I170825 05:53:32.120342 15224381 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 3 overlapping requests
  local(read-write):
    [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
    [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
command-queue(global):
command-queue(local):
  7 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  9 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  13 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  10 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  12 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  8 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

The first PushTxn is assigned ids 12 and 13 in the local command queue. There are already two operations ahead of it: IDs 7 and 8 are from a write to the (new RHS) range descriptor, and 9 and 10 look like a heartbeat for this transaction. (In "waiting for 3 requests", the heartbeat counts double because its reads and writes are counted separately)

I170825 05:53:33.204211 15232755 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 EndTxn: waiting for 5 overlapping requests
[snip]
command-queue(local):
  14 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  13 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  25 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  27 [/Local/RangeID/483671/r"",/Local/RangeID/483671/s"")
  28 [/Local/RangeID/483671/u"",/Local/RangeID/483671/v"")
  29 [/Local/RangeID/483671/u/RangeLastReplicaGCTimestamp)
  26 [/Local/Range/Table/51/1/1561186976538960094,/Local/Range/???)
  23 [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  12 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  15 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  22 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  24 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

By the time EndTxn comes around, we see a second heartbeat, with ids 14 and 15, then the EndTxn's writes get ids 22-29 (the reads are missing here but show up later). All of the things that the push was waiting on have cleared the command queue; it should be able to proceed now.

This is waiting on the push and the second heartbeat (2x); I think to get to 5 commands the first heartbeat had to finish in between the prereq computation and this log message.

I170825 05:53:33.205214 15224381 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 2 overlapping requests
[snip]
command-queue(local):
  18 readonly [/Local/RangeID/447049/r"",/Local/RangeID/447049/s"")
  20 readonly [/Local/RangeID/447049/r/AbortCache/"00000000-0000-0000-0000-000000000000",/Local/RangeID/447049/r/AbortCache/"ffffffff-ffff-ffff-ffff-ffffffffffff")
  14 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  19 readonly [/Local/RangeID/447049/u/RangeLastReplicaGCTimestamp)
  17 readonly [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  25 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  32 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  27 [/Local/RangeID/483671/r"",/Local/RangeID/483671/s"")
  28 [/Local/RangeID/483671/u"",/Local/RangeID/483671/v"")
  29 [/Local/RangeID/483671/u/RangeLastReplicaGCTimestamp)
  26 [/Local/Range/Table/51/1/1561186976538960094,/Local/Range/???)
  23 [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  15 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  22 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  31 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  24 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

And now we see the same goroutine entering the command queue for a PushTxn again (I'll call this second incarnation of the first push Push1.2). I believe this means that after the preceding log block, the PushTxn was allowed to proceed, but hit a retryable error somewhere before it was proposed to raft. I'm not sure what that error could be, but it appears to have done the right thing in the command queue: the first incarnation of the command has been removed (so the 14-15 heartbeat should be executing now). The new push has ids 31-32. We also see IDs 17-22 showing up with the read-only portion of the EndTransaction. They weren't there before (maybe the covering optimization at work?) The 2 commands Push1.2 is waiting on are the read and write sides of the EndTransaction.

I170825 05:53:34.787416 15237224 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 2 overlapping requests
[snip]
command-queue(local):
  18 readonly [/Local/RangeID/447049/r"",/Local/RangeID/447049/s"")
  20 readonly [/Local/RangeID/447049/r/AbortCache/"00000000-0000-0000-0000-000000000000",/Local/RangeID/447049/r/AbortCache/"ffffffff-ffff-ffff-ffff-ffffffffffff")
  14 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  19 readonly [/Local/RangeID/447049/u/RangeLastReplicaGCTimestamp)
  17 readonly [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  25 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  32 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  35 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  27 [/Local/RangeID/483671/r"",/Local/RangeID/483671/s"")
  28 [/Local/RangeID/483671/u"",/Local/RangeID/483671/v"")
  29 [/Local/RangeID/483671/u/RangeLastReplicaGCTimestamp)
  26 [/Local/Range/Table/51/1/1561186976538960094,/Local/Range/???)
  23 [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  15 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  22 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  31 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  34 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  24 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

And now we get to Push2, with IDs 34-35. It's hard to tell from this which 2 commands it's waiting on. One is definitely Push1.2. But what's the other? The candidates are the second heartbeat or the EndTransaction (each of which have both read and write halves). Based on what happens next I think it has to be the heartbeat (probably the read half due to some of the timestamp-related logic in the command queue).

W170825 05:53:35.018087 15224381 storage/replica.go:2111  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] context canceled while in command queue: PushTxn [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor,/Min)
I170825 05:53:35.018382 15237224 storage/replica.go:3075  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] proposing command 0ba12e830838a78e: 1 PushTxn
I170825 05:53:35.444466 15232755 storage/replica.go:3075  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] proposing command 18c3bd3d47d49b9b: 1 EndTxn

And now Push1.2 is canceled, and both EndTxn and Push2 are allowed to proceed (we can infer from this that Heartbeat2 also finished).

So the full sequence of commands is put, heartbeat, push, heartbeat, commit, push, push. And I think (given the coveringOptimization hypothesis), that can be reduced to heartbeat (or anything that stalls the commit), commit, push, push.

bdarnell on 28 Aug 2017

It looks like we're only setting cancelled on the parent cmd, not the children.

It's worse than that: we're only setting prereqs on the parent cmd, not the children. And when the covering optimization is disabled, the prereqs themselves are always children, so prereqs don't transfer more than one hop. So this definitely explains it.

We need to better encapsulate the cmd struct. Right now there are two distinct subtypes of cmd (parents and leaves) as used in command_queue.go, but replica.go is oblivious to this. Conversely, cmd carries the canceled and prereqs field which are only touched in replica.go.

This makes me nervous about our test coverage here. Should we fix this or consider reverting #9448 until post 1.1?

bdarnell on 28 Aug 2017

we're only setting prereqs on the parent cmd, not the children

Exactly, so right now we aren't properly treating cancellation of BatchRequests that create more than one *cmd for any access/scope combination. I have this fixed and am working on improving the test coverage, which was lacking because we were only testing with DeleteRangeRequests. I don't think we'll need to revert #9448, but your judgement on this is better than mine.

nvanbenschoten on 28 Aug 2017

And when the covering optimization is disabled, the prereqs themselves are always children, so prereqs don't transfer more than one hop

That's not quite accurate. The key is that this will affect all BatchRequests that contain more than one span within a certain access/scope combination.

nvanbenschoten on 28 Aug 2017

That's not quite accurate. The key is that this will affect all BatchRequests that contain more than one span within a certain access/scope combination.

Right, but this is true of all PushTxn and EndTxn requests so all usage patterns except single KV ops can be affected by this.

bdarnell on 28 Aug 2017

Now that it looks like we fixed the consistency issue here, the rest of this issue can be moved to 1.2, right?

nvanbenschoten on 29 Aug 2017

Yeah. Or even better, retitle this to the consistency issue that took it over, close it, and make a new issue to lift the restriction in 1.2.

bdarnell on 30 Aug 2017

Fixed by #17939.

Meta2 splitting issue moved to #18032.

nvanbenschoten on 30 Aug 2017

Cockroach: storage: transitive prerequisites are not always transferred on command cancellation

Most helpful comment

All 150 comments

>

Related issues