Cockroach: roachtest: kv/contention/nodes=4 failed

Created on 24 Mar 2019 · 56 comments · Source: cockroachdb/cockroach

SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog

The test failed on release-19.1:
    ts_util.go:89,kv.go:186,cluster.go:1605,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1626,kv.go:189,test.go:1214: Goexit() was called

C-test-failure O-roachtest O-robot S-3

Most helpful comment

I think I finally have a handle on what's going on here. The TL;DR is that we're dealing with a bug in MVCC that's being tickled by a bug in DistSender, which in turn is being tickled by a small change in behavior due to parallel commits. Fixing the bug in DistSender is straightforward but insufficient, as I still don't think we can actually enforce a property that MVCC is relying on for correctness - so we'll want to fix both issues.

What @jordanlewis and I were able to see repeatedly is that a transaction performed a Scan on the primary key of a table during an UPSERT, and then issued the following three requests in a batch (sketched after the list):

  • Put to update the primary key
  • Delete to delete the old secondary index key
  • CPut to create the new secondary index key
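
For concreteness, here's a rough sketch of the shape of that batch through the KV client API (the function and argument names are illustrative; the real keys and values are produced by the SQL layer):

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// upsertBatch sketches the three-request batch described above.
func upsertBatch(
	ctx context.Context, txn *client.Txn,
	primaryKey, oldSecondaryKey, newSecondaryKey roachpb.Key,
	rowVal, indexVal []byte,
) error {
	b := txn.NewBatch()
	b.Put(primaryKey, rowVal)              // update the primary key
	b.Del(oldSecondaryKey)                 // delete the old secondary index key
	b.CPut(newSecondaryKey, indexVal, nil) // create the new secondary index key, expecting no existing value
	return txn.Run(ctx, b)
}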

The CPut expected an existing key-value of nil for the secondary index key because it had observed a different secondary index value in the row read from the primary index. Instead, it observed a non-nil value at the secondary index key. This resulted in a ConditionFailedError and eventually a "unique constraint violation". The fact that the CPut observed a non-nil value in the secondary index is surprising. We weren't seeing the transaction refresh, so the read portion of the CPut was reading at the same timestamp as the initial Scan. So why weren't the primary and secondary indexes lining up? Were we not seeing a consistent snapshot when reading at the same timestamp?

Well... actually no we weren't. I added logging around ScanRequest evaluation and ConditionalPutRequest evaluation and was able to see that for the transaction that hit the issue, the result from the ScanRequest's MVCC scan did not line up with the result of the ConditionalPutRequest scan. The results also didn't have the same timestamp, which we would expect if both had been written by the same transaction. To back this up, I could see the intents for the corresponding values being committed by different transactions. During this time I also noticed that every transaction that hit this issue was in its second epoch. In other words, it had restarted once. This will come up later.

Finally, a big clue jumped out. The result of the ScanRequest had a higher timestamp than the timestamp that the transaction was reading at. This was bizarre. A transaction should only ever be able to see a value above its read timestamp if that value is its own intent. But this wasn't the case. Not only could I see the value being resolved as part of another transaction, but the value also had a higher timestamp than our transaction's own write timestamp. Additionally, the transaction hadn't even written any intents in the current epoch, and intents from previous epochs should be ignored. Digging into MVCCScan led to the first bug:

Bug 1

When performing an MVCC scan, a transaction should read values from the perspective of its read timestamp. The only case where this is not true is when a transaction observes its own intent, at which point it can read the intent regardless of its timestamp. This allows a transaction to continue reading its own writes even as its write timestamp drifts away from its read timestamp. This logic is all implemented in mvccScanner::getAndAdvance. There are a few interesting cases to handle in that method:

  1. transactions ignore other transactions' intents at higher timestamps: see here.
  2. transactions handle other transactions' intents at lower timestamps by kicking off conflict resolution: see here.
  3. transactions can read their own intents at their current epoch as if they were committed (i.e. read-your-writes): see here.
  4. transactions throw errors if they see intents from future epochs: see here.
  5. transactions ignore intents from previous epochs and scan to the MVCC value directly after the intent: see here.

Anyone see a problem there?
...

It turns out that the last condition is making a subtle assumption that a transaction at epoch E will never write an intent at a timestamp higher than the read timestamp at epoch E+1. If this is ever untrue then "scanning to the MVCC value directly after the intent" is incorrect because that committed value might still have too large of a timestamp. If this is the case then the function will return a committed value with a timestamp above the read timestamp of the reader.

Through some dumb luck, it seems like this will very rarely cause issues for us right now due to a combination of only supporting a single intent-per-key and always starting new transaction epochs with a read timestamp equal to the best-known write timestamp of the previous epoch. Intent pushes allow the timestamp of an intent at epoch E to go above the read timestamp of epoch E+1, but luckily the single intent-per-key restriction protects us there because no new MVCC values will be able to slip in between the old and new intent timestamp.

Still, I think there are cases where this is broken and could cause issues. For instance, a Put could be replayed from an earlier epoch after another transaction came in and committed a value. The Put would be pushed above the committed value, resulting in a violation of the assumption and allowing the transaction at a higher epoch to come along, ignore the Put from the earlier epoch, and end up reading the committed value even if it is at a higher timestamp than its read timestamp.
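
To make that concrete, here's a hypothetical timeline consistent with the scenario above:

txn A, epoch 0: writes an intent on k at ts=10
txn A restarts: epoch 1 begins with read timestamp ts=12
txn B:          commits a value on k at ts=14
txn A, epoch 0: the old Put replays, is pushed above B's value, and
                rewrites the epoch-0 intent at ts=15
txn A, epoch 1: a scan of k (read ts=12) ignores the epoch-0 intent,
                seeks to the version "directly after the intent", and
                returns B's committed value at ts=14 -- above its own
                read timestamp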

The solution here is to replace this seek with a seek to

min(PrevTimestamp(ToDBTimestamp(meta_.timestamp())), timestamp_)

If the transaction doesn't want to see anything above timestamp_ and is deliberately ignoring intents from old epochs, we shouldn't seek to versions above timestamp_. The same change will be needed here.
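
In Go terms, the new seek target amounts to the following (a sketch of the proposed fix; the actual change lives in the C++ mvccScanner):

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// seekTSBelowIgnoredIntent returns the version to seek to after skipping
// our own intent from a previous epoch: directly below the intent's
// timestamp, but never above the scanner's read timestamp.
func seekTSBelowIgnoredIntent(intentTS, readTS hlc.Timestamp) hlc.Timestamp {
	seek := intentTS.Prev() // the version directly below the old intent
	if readTS.Less(seek) {
		seek = readTS // don't surface committed values above the read timestamp
	}
	return seek
}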

With that change, the issue goes away. But what caused this? That question leads to the second bug.

Bug 2

It still seemed strange to me that the transaction was being restarted with a read timestamp below its previous write timestamp. I added instrumentation in evaluateBatch and could see requests sent by the previous epoch of the transaction using the write timestamp of the problematic intent. These write timestamps were being incremented because requests were running into committed values at higher timestamps and hitting WriteTooOldErrors. Remember that a transaction will continue to write even if it hits a WriteTooOld error. Instead of returning the error immediately, it will swallow it and set the WriteTooOld flag on its proto instead.
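
That swallow-and-flag behavior amounts to roughly the following (a simplified sketch; the real handling is spread across MVCC evaluation and the txn coordinator):

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// handleWriteTooOld sketches what happens when a write lands below an
// existing committed value: the write timestamp is bumped past it and the
// WriteTooOld flag is set on the proto, but evaluation continues.
func handleWriteTooOld(txn *roachpb.Transaction, wtoErr *roachpb.WriteTooOldError) {
	txn.Timestamp.Forward(wtoErr.ActualTimestamp) // bump the write timestamp
	txn.WriteTooOld = true // the retry error surfaces later, at commit time
	// The WriteTooOldError itself is swallowed rather than returned.
}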

So I could see writes at the higher timestamp. However, instrumentation inside of PrepareTransactionForRetry showed that the transaction proto being used to create the new proto at a higher epoch after a TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD) error had a lower write timestamp than I was seeing in evaluateBatch. So the information about writes at the higher timestamp was being lost somewhere.

I tracked this down to DistSender's divideAndSendBatchToRanges method. The method's purpose is to fan requests out to ranges and then combine their responses. It should be the one combining the information about write timestamps that may have been incremented during the evaluation of requests on different ranges. This is supposed to happen around here. In this case, one of the requests was returning a TransactionRetryError while the others were simply returning transaction protos with updated write timestamps. Combining all of this information should be handled here, but there's a subtle bug in this code. We stop updating the BatchResponse's transaction proto when collecting responses if we see an error. So if the TransactionRetryError arrives before the other responses, the updates from those responses will be lost.

The fix here is to stop losing information when collecting responses. The easiest way to do that is with a change like this (though the actual fix will look somewhat different):

diff --git a/pkg/kv/dist_sender.go b/pkg/kv/dist_sender.go
index 548e7ec557..83f0ae10ac 100644
--- a/pkg/kv/dist_sender.go
+++ b/pkg/kv/dist_sender.go
@@ -1161,6 +1161,8 @@ func (ds *DistSender) divideAndSendBatchToRanges(
                                if err := br.Combine(resp.reply, resp.positions); err != nil {
                                        pErr = roachpb.NewError(err)
                                }
+                       } else {
+                               pErr.UpdateTxn(resp.reply.Txn)
                        }
                }

This also fixes the issue, even without the MVCC fix.

Why did this start happening?

So what caused all of this? These two bugs have been lingering for a while, but something changed recently to cause them to combine in such a disastrous way.

The answer to that is parallel commits, but not through any fault of its own. Parallel commits allows an EndTransactionRequest to end up in a partial batch with one of the writes in the transaction's first epoch. Where before the writes in the first epoch would be issued before the EndTransaction, all hit WriteTooOld errors, and all swallow them before being combined at the client, now one of the writes hits a WriteTooOld error that is immediately converted to a TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD) during the EndTransaction evaluation before being returned to the client. This allowed the second bug to lose the information about the increased write timestamps carried by the other partial batches issued concurrently with the batch containing the EndTransaction request. This, in turn, led to a transaction restart where the new epoch's read timestamp was below the timestamp of a write from the previous epoch. That allowed the first bug to return the incorrect value during the UPSERT's Scan, which eventually led to the inconsistent snapshot and, in turn, the duplicate key value error we've been seeing.

All 56 comments

spent 5.263158% of time below target of ... 5.000000%. We're close enough to the target that this is probably just a low-probability flake. I'll leave this as is and see if we fail over the next few days. If not, I'll bump the target threshold slightly, as it was chosen arbitrarily.
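
For reference, the failing check amounts to something like this (a simplified sketch of the ts_util.go logic; variable names are assumptions):

// Fail if more than 5% of per-second samples fall below the 10 txn/s target.
below := 0
for _, rate := range perSecondTxnRates {
	if rate < 10.0 {
		below++
	}
}
if frac := float64(below) / float64(len(perSecondTxnRates)); frac > 0.05 {
	t.Fatalf("spent %f%% of time below target", frac*100)
}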

SHA: https://github.com/cockroachdb/cockroach/commits/a1e6e9decc9dec15a32bbb6d30efc67ca45a532a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204585&tab=buildLog

The test failed on release-19.1:
    ts_util.go:89,kv.go:186,cluster.go:1605,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1626,kv.go:189,test.go:1216: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/83de585d331b05a4aa02a65b353bed6bf829b696

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1247383&tab=buildLog

The test failed on master:
    ts_util.go:89,kv.go:186,cluster.go:1667,errgroup.go:57: spent 14.035088% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1688,kv.go:189,test.go:1237: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/ec4728ae986b46d4f57009233b86971198b275ed

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255121&tab=buildLog

The test failed on master:
    ts_util.go:89,kv.go:186,cluster.go:1667,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1688,kv.go:189,test.go:1237: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
    ts_util.go:89,kv.go:186,cluster.go:1761,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1782,kv.go:189,test.go:1245: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 10.344828% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1835,kv.go:189,test.go:1253: Goexit() was called

The increased flakiness looks somewhat correlated with https://github.com/cockroachdb/cockroach/pull/36748. I wonder if there are cases where we were relying on transaction liveness expiring to get stuck transactions moving again. I'm going to run this with a huge txn liveness expiration and see if things get even worse.

That did the trick. I can clearly see transactions waiting on other transactions for 30+ seconds after bumping TxnLivenessHeartbeatMultiplier to 30. I think it's likely that a transaction is failing to roll back its transaction record.

I see the transactions that are failing to roll back their transaction records hitting errors like
TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): "sql txn" id=c4215f04 key=/Table/53/1/-7656810924711545548/0 rw=true pri=0.01158742 stat=PENDING epo=0 ts=1556562081.872413329,1 orig=1556562051.258792452,0 max=1556562051.260084221,0 wto=true seq=13

So actually that doesn't say very much.

On the /debug/requests page I saw pushers getting stuck in the contentionQueue and repeatedly hitting TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) errors every 100ms (after waiting for cycle detection to be ready). This makes sense because contentionQueue.add doesn't return an error, so transactions can fall into loops where they hit an intent, try to push, fail to push due to an error, and repeat. I added a few lines to plumb the error back up through the IntentResolver and things immediately got a lot better. I no longer see any long stalls in the workload, even with the 30s TxnLivenessThreshold.
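
The plumbing change amounts to something like this (signatures here are hypothetical; the real contentionQueue API differs):

// Before: contentionQueue.add swallowed push errors, so the waiter looped.
// After: the error is returned so the waiter can give up immediately.
if pErr := ir.contentionQ.add(ctx, wiErr); pErr != nil {
	// e.g. ABORT_REASON_PUSHER_ABORTED: propagate to the client instead of
	// retrying the push every 100ms.
	return pErr
}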

I'm going to make the following changes:

  1. propagate errors from the contentionQueue instead of swallowing them
  2. tie TxnLivenessHeartbeatMultiplier to an environment variable and set it to 60s in this roachtest (sketched below). We should never need to wait on a transaction's liveness to expire when all nodes are live, so the low timeout was only serving to hide bugs like this.
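
A minimal sketch of change (2), assuming the existing envutil helper and a hypothetical variable name:

import "github.com/cockroachdb/cockroach/pkg/util/envutil"

// TxnLivenessHeartbeatMultiplier controls how many heartbeat intervals a
// pusher waits before treating a PENDING pushee as expired. The default
// shown here is illustrative.
var TxnLivenessHeartbeatMultiplier = envutil.EnvOrDefaultInt(
	"COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER", 5)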

Another option is to have a cluster setting/env var that disables transaction expiration entirely, so that a failure to roll a transaction back causes everyone else to get stuck. The benefit of this is that it would be very loud.

A compromise would be to have a knob that allows us to set TxnLivenessThreshold to some high value and another to instruct PushTxn to fatal if we ever stumble upon an expired PENDING transaction. The nice part about this is that it would be equally as loud in roachtests as the first option but wouldn't cause the tests to hang forever. @andreimatei any opinion on this?
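
Concretely, the PushTxn side of that knob might look something like this (the knob name is an assumption):

// In PushTxn evaluation: with the knob set, treat an expired PENDING
// pushee as evidence of a rollback/cleanup bug and crash loudly instead of
// quietly aborting it.
if knobs.FatalOnExpiredPendingTxn &&
	pushee.Status == roachpb.PENDING && txnwait.IsExpired(now, &pushee) {
	log.Fatalf(ctx, "found expired PENDING transaction: %s", pushee)
}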

SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1835,kv.go:189,test.go:1253: Goexit() was called

A compromise would be to have a knob that allows us to set TxnLivenessThreshold to some high value and another to instruct PushTxn to fatal if we ever stumble upon an expired PENDING transaction. The nice part about this is that it would be equally as loud in roachtests as the first option but wouldn't cause the tests to hang forever. @andreimatei any opinion on this?

I like it.

SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1835,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/ff969dd6cb0e087327f0b210f728f588b2f5aeb0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1270738&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 22.807018% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1835,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/856ba9108f112f85d406bbe88d2208651859336e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1274175&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 15.789474% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/23155799e92e54915ae66259d06a630e981afbeb

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1277061&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/8abb47a1c9795c1463183bc44e776b054bece682

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279683&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 8.771930% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/048bdc163fcb470d4e749fcad482cf2671c29fb1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281656&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/45e15e05abff25e099ca59f4c5cb40a6cf695e6d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1285294&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/c25518b4e9a723d8de0dba30a95ce0ade7963aed

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1286188&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/ba5c092a726134b73e789c2047f7ec151be7c1a1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288263&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 6.896552% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/7b2651400b2003d0a381cba9dbfc0b7bc0dfee00

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1293898&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 8.771930% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/58c567a325056033b326cb9c4ed9ba490e8956da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1296592&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/9671342fead0509bec0913bae4ae1f244660788e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1298500&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
    cluster.go:1474,kv.go:179,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1298500-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        I190521 10:37:49.444190 1 workload/workload.go:562  starting 4 splits
        Error: ALTER TABLE kv SPLIT AT VALUES (-3689348814741909504): pq: splits would be immediately discarded by merge queue; disable the merge queue first by running 'SET CLUSTER SETTING kv.range_merge.queue_enabled = false'
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.231.83.24_2019-05-21T10:37:48Z: exit status 1
        : exit status 1
    cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/7009f8750d5c3af32d5c43011869048ea7a311ae

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1300930&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:89,kv.go:186,cluster.go:1848,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1869,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/c9301cf71ea69da451fe5e5ba2c3074a4fe53831

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1303699&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1303699-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
            7.9  17179.9  31138.5  66572.0 write
          10m57s        0          163.1           94.5      6.0  16643.0  18253.6  32212.3 write
          10m58s        0            0.0           94.4      0.0      0.0      0.0      0.0 write
          10m59s        0            0.0           94.2      0.0      0.0      0.0      0.0 write
           11m0s        0            0.0           94.1      0.0      0.0      0.0      0.0 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
           11m1s        0           84.0           94.1      7.1   4831.8  22548.6  94489.3 write
        Error: ERROR: TransactionStatusError: programming error: found COMMITTED record for prevented implicit commit: id=add05534 key=/Table/53/1/-3313874327987903458/0 rw=true pri=0.13424172 stat=COMMITTED epo=1 ts=1558615331.690889202,1 orig=1558615326.639046519,2 max=0.000000000,0 wto=false seq=13 int=14 (REASON_UNKNOWN) (SQLSTATE XXUUU)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.237.56.114_2019-05-23T12:31:09Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/630a6e9cb3771912cd138f9aa3bea1f0ca9fa7c9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1306250&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1306250-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        I190524 11:43:27.549556 1 workload/workload.go:578  starting 4 splits
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
              1s        0          396.2          604.8    113.2    570.4    805.3    906.0 write
              2s        0          534.1          569.4     79.7    805.3   1140.9   1811.9 write
              3s        0          572.9          570.6     56.6    939.5   1610.6   2415.9 write
              4s        0          534.1          561.5     32.5    973.1   1744.8   3221.2 write
        Error: ERROR: duplicate key value (v)=('\x5c') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.250.48_2019-05-24T11:43:26Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

found COMMITTED record for prevented implicit commit

This is fixed by https://github.com/cockroachdb/cockroach/pull/37784.

duplicate key value (v)=('\x5c') violates unique constraint "kv_v_idx"

I'm not sure what to make of this. I need to dig in.

duplicate key value (v)=('\x5c') violates unique constraint "kv_v_idx"

I can reproduce this fairly easily, even with parallel commits disabled, so it's not that.

I can reproduce at least as far back as 7009f8750d5c3af32d5c43011869048ea7a311ae. Did something in the workload change?

I'm able to repro this immediately by running ./workload run kv --init --secondary-index --cycle-length=10 --batch=5 against a one-node cluster.

@jordanlewis could I assign you for triage? It's not clear that this is an issue in Core and I'm out for the next week so I want to make sure I get to the issues that definitely are.

Sure.

SHA: https://github.com/cockroachdb/cockroach/commits/fc7e48295cd05f94fd2883498d96d91ad538e559

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308263&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308263-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:

             12s        0          554.0          529.5     19.9    570.4   3892.3   9126.8 write
             13s        0          510.1          528.0     21.0    486.5   2818.6  11811.2 write
             14s        0          186.0          503.6     17.8    486.5    637.5   6710.9 write
             15s        0            8.0          470.5      5.5   3087.0   3087.0   3087.0 write
             16s        0            8.0          441.6      7.1  10737.4  10737.4  10737.4 write
             17s        0            3.0          415.8   8589.9  12348.0  12348.0  12348.0 write
             18s        0           13.0          393.4      6.0   9126.8  11811.2  11811.2 write
             19s        0          167.9          381.6      9.4   5368.7   9663.7  13421.8 write
             20s        0           89.0          366.9      8.1   6710.9  13958.6  15032.4 write
        Error: ERROR: duplicate key value (v)=('\xf0') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.34.56_2019-05-25T11:59:00Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/db9c1217a6967fcac2d135cf0f24a4265dc76d77

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1310296&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1310296-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        ite
             18s        0           44.0          425.8     10.0   4563.4  11274.3  11274.3 write
             19s        0          239.9          416.0     13.1   1946.2  10200.5  13958.6 write
             20s        0          396.1          415.0     12.1   1744.8   3623.9  15032.4 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
             21s        0           68.0          398.5     12.6    637.5   1744.8   2415.9 write
             22s        0           27.0          381.6     12.1   9126.8  15032.4  15032.4 write
             23s        0           24.0          366.0     13.1   2952.8   6979.3   6979.3 write
             24s        0            0.0          350.8      0.0      0.0      0.0      0.0 write
             25s        0           65.0          339.4     11.0   4563.4  18253.6  19327.4 write
        Error: ERROR: duplicate key value (v)=('\xf8') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.243.249.205_2019-05-26T10:48:17Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

The problem here is that the workload is producing statements with multiple instances of the same key in a single UPSERT statement, seen here:

I190528 04:11:03.358978 632563 sql/exec_log.go:162  [n1,client=[::1]:59428,user=root] 182 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8), ($9, $10)" {$1:"1464797858451214926", $2:"'\\xd2'", $3:"-5499256992642714345", $4:"'\\xc2'", $5:"8989193896465373993", $6:"'\\x7b'", $7:"-8472975028287664194", $8:"'\\xeb'", $9:"1464797858451214926", $10:"'\\x96'"} 2.006 0 "duplicate key value (k)=(1464797858451214926) violates unique constraint \"primary\""

Note that the key 1464797858451214926 appears twice in the statement. I filed #37880 to investigate this behavior change in the CBO.

It's still unknown why the workload produces an error message about v and not k. But I think it's probably the same root cause.

SHA: https://github.com/cockroachdb/cockroach/commits/c280de40c2bcab93c41fe82bef8353a5ecd95ac4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1311970&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1311970-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        rite
           2m54s        0          728.9          118.9      8.9   1073.7   3892.3  60129.5 write
           2m55s        0          104.1          118.8      5.5   1040.2   2281.7  13421.8 write
           2m56s        0          287.9          119.8      7.3   2013.3   3758.1  33286.0 write
           2m57s        0            0.0          119.1      0.0      0.0      0.0      0.0 write
           2m58s        0            0.0          118.4      0.0      0.0      0.0      0.0 write
           2m59s        0           12.0          117.8      5.2   3623.9   3892.3   3892.3 write
            3m0s        0            0.0          117.2      0.0      0.0      0.0      0.0 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
            3m1s        0           84.0          117.0      6.8   6442.5   7516.2  60129.5 write
        Error: ERROR: duplicate key value (v)=('\xef') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.93.213_2019-05-28T13:08:00Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/8f42e0d9948256af8b3e1994d514314ba1718c48

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1315162&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1315162-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
         6m31s        0          108.0           83.8      7.6  12884.9  14495.5  77309.4 write
           6m32s        0            0.0           83.6      0.0      0.0      0.0      0.0 write
           6m33s        0          218.9           83.9      7.1   9663.7  16643.0  73014.4 write
           6m34s        0          273.0           84.4      7.1   3087.0  10200.5  10200.5 write
           6m35s        0            0.0           84.2      0.0      0.0      0.0      0.0 write
           6m36s        0            0.0           84.0      0.0      0.0      0.0      0.0 write
           6m37s        0            0.0           83.8      0.0      0.0      0.0      0.0 write
           6m38s        0           20.0           83.6      7.6  15032.4  20401.1  20401.1 write
           6m39s        0            0.0           83.4      0.0      0.0      0.0      0.0 write
        Error: ERROR: duplicate key value (v)=('\x49') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.108.232_2019-05-30T10:47:59Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/af1bf3b0b16ec1913cfa945cb2ce76cc45f2d66a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1318685&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,kv.go:179,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1318685-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        I190601 12:42:51.194842 1 workload/workload.go:578  starting 4 splits
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
              1s        0          448.8          646.9    104.9    486.5    738.2    872.4 write
        Error: ERROR: duplicate key value (v)=('\xdb') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.214.58_2019-06-01T12:42:49Z: exit status 1
        : exit status 1
    cluster.go:1875,kv.go:189,test.go:1251: Goexit() was called

@tbg, are you aware of this failure mode?

        Error: ERROR: TransactionStatusError: programming error: found COMMITTED record for prevented implicit commit: id=add05534 key=/Table/53/1/-3313874327987903458/0 rw=true pri=0.13424172 stat=COMMITTED epo=1 ts=1558615331.690889202,1 orig=1558615326.639046519,2 max=0.000000000,0 wto=false seq=13 int=14 (REASON_UNKNOWN) (SQLSTATE XXUUU)

The fix for the latest problem is to change kv to not include duplicate keys in the UPSERT. The problem I pasted looks more serious.
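
A sketch of that workload-side fix (hypothetical; the actual kv generator code differs):

// Draw keys without replacement within a batch so a single UPSERT never
// contains the same key twice.
seen := make(map[int64]struct{}, batchSize)
for len(seen) < batchSize {
	k := rng.Int63n(cycleLength)
	if _, ok := seen[k]; ok {
		continue // already in this batch; draw again
	}
	seen[k] = struct{}{}
}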

I'm actually no longer sure the duplicate key thing is the problem - that's just a problem in one of Nathan's repro cases. Still looking into this latest failure.

@jordanlewis see https://github.com/cockroachdb/cockroach/pull/37784 -- sorry should've pointed that out.

I've captured an example instance of an UPSERT that fails spuriously, and verified that it has no duplicate keys:

I190604 22:48:47.917258 2759 sql/exec_log.go:162  [n1,client=[::1]:50288,user=root] 25282 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x1a'", $3:"866343863770607923", $4:"'\\xac'", $5:"8481621241235572951", $6:"'\\x3c'", $7:"-4578397311671209003", $8:"'\\x52'"} 33021.310 0 "duplicate key value (v)=('\\xac') violates unique constraint \"kv_v_idx\""

Here's the stack trace of where the "unique constraint violation" (in quotes because in reality it is not a unique constraint violation) is coming from:

goroutine 2759 [running]:
runtime/debug.Stack(0xc00a44ef80, 0x4b47fd1, 0x76701e0)
        /usr/local/Cellar/go/1.12.4/libexec/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /usr/local/Cellar/go/1.12.4/libexec/src/runtime/debug/stack.go:16 +0x22
github.com/cockroachdb/cockroach/pkg/sql/row.NewUniquenessConstraintViolationError(0x76df9e0, 0xc0078bcfc0, 0xc00b43a400, 0xc0061d4f00, 0x12, 0x20, 0xc0078142a0, 0xc00a44f3a0, 0x4af9112)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/errors.go:138 +0x65e
github.com/cockroachdb/cockroach/pkg/sql/row.ConvertBatchError(0x76df9e0, 0xc0078bcfc0, 0xc00b43a400, 0xc00e6bb180, 0x766ef60, 0xc013a3bf90)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/errors.go:66 +0x1e2
github.com/cockroachdb/cockroach/pkg/sql.(*tableWriterBase).finalize(0xc0009de000, 0x76df9e0, 0xc0078bcfc0, 0xc00b43a400, 0xc00fec06c0, 0xc010208b60)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter.go:168 +0x94
github.com/cockroachdb/cockroach/pkg/sql.(*tableUpserterBase).finalize(0xc0009de000, 0x76df9e0, 0xc0078bcfc0, 0xc00fec0600, 0x0, 0x0, 0x0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter_upsert.go:168 +0x49
github.com/cockroachdb/cockroach/pkg/sql.(*upsertNode).BatchedNext(0xc011712d20, 0x76df9e0, 0xc0078bcfc0, 0xc00fec06c0, 0xc00b5cb950, 0x6aa1500, 0xc000584360, 0xc00a44f5a0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/upsert.go:347 +0x29e
github.com/cockroachdb/cockroach/pkg/sql.(*rowCountNode).startExec(0xc0086cc160, 0x76df9e0, 0xc0078bcfc0, 0xc00fec06c0, 0xc00b5cb950, 0x0, 0xc00a44f5e8)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_batch.go:173 +0xcd
github.com/cockroachdb/cockroach/pkg/sql.startExec.func2(0x6ef5a80, 0x5, 0x76e16e0, 0xc0086cc160, 0xc00a2f57a0, 0xc00a44f5f8)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:488 +0x55
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal.func1(0xc009026300, 0x6ef5a80, 0x5, 0x76e16e0, 0xc0086cc160)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:146 +0x5d
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal(0xc009026300, 0x76e16e0, 0xc0086cc160, 0x6ef5a80, 0x5)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:628 +0x2a8
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visit(0xc009026300, 0x76e16e0, 0xc0086cc160, 0x0, 0x60)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:113 +0x8d
github.com/cockroachdb/cockroach/pkg/sql.walkPlan(...)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:77
github.com/cockroachdb/cockroach/pkg/sql.startExec(0x76df9e0, 0xc0078bcfc0, 0xc00fec06c0, 0xc00b5cb950, 0x76e16e0, 0xc0086cc160, 0x1, 0x970de60)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:491 +0x286
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start(0xc010274a00, 0x76df9e0, 0xc0078bcfc0, 0x970de60, 0x69edec0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:121 +0xd6
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc010274a00, 0x76df9e0, 0xc0078bcfc0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:795 +0x52
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc0144041e0, 0x76df9e0, 0xc0078bcfc0, 0x706b7e0, 0x0, 0x0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:626 +0x1ec
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc00011de00, 0xc00919fce0, 0xc00ff045a0, 0xc00a450868, 0xc013930840, 0xc00b5cba28, 0x0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:253 +0x8fe
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc00011de00, 0x76df9e0, 0xc0078bc810, 0xc00b5cba28, 0xc00919fce0, 0xc00ff045a0, 0x76e16e0, 0xc0086cc160, 0xc013930840)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:841 +0x1fb
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc00b5cb600, 0x76df9e0, 0xc0078bc810, 0xc00b5cb950, 0x2, 0xeb7a3f0, 0xc0089e3d90, 0xc00a450a00, 0x0, 0x0)
        /Users/jordan/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:903 +0x284

Immediately preceding the failure are several writes of the same keys, but different values:

I190604 22:48:47.786118 3167 sql/exec_log.go:162  [n1,client=[::1]:50325,user=root] 25274 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x58'", $3:"866343863770607923", $4:"'\\xe3'", $5:"8481621241235572951", $6:"'\\x1e'", $7:"-4578397311671209003", $8:"'\\x36'"} 129506.128 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=44d71960 key=/Table/3840/1/-8194475532289442352/0 rw=true pri=0.13950700 stat=PENDING epo=0 ts=1559688522.860037000,6 orig=1559688521.088864000,0 max=1559688521.089188000,0 wto=true seq=13"
I190604 22:48:47.788088 3167 sql/exec_log.go:162  [n1,client=[::1]:50325,user=root] 25275 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x58'", $3:"866343863770607923", $4:"'\\xe3'", $5:"8481621241235572951", $6:"'\\x1e'", $7:"-4578397311671209003", $8:"'\\x36'"} 129508.100 0 "TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY): \"sql txn\" id=44d71960 key=/Table/3840/1/-8194475532289442352/0 rw=true pri=0.13950700 stat=PENDING epo=1 ts=1559688523.071164000,1 orig=1559688522.860037000,6 max=1559688521.089188000,0 wto=false seq=13"
I190604 22:48:47.795567 2692 sql/exec_log.go:162  [n1,client=[::1]:50271,user=root] 25276 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x26'", $3:"866343863770607923", $4:"'\\xc3'", $5:"8481621241235572951", $6:"'\\xec'", $7:"-4578397311671209003", $8:"'\\xcc'"} 161694.515 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=a8b3c642 key=/Table/3840/1/-8194475532289442352/0 rw=true pri=0.13950695 stat=PENDING epo=0 ts=1559688522.860037000,6 orig=1559688521.189769000,0 max=1559688521.190152000,0 wto=true seq=13"
I190604 22:48:47.804952 2692 sql/exec_log.go:162  [n1,client=[::1]:50271,user=root] 25277 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x26'", $3:"866343863770607923", $4:"'\\xc3'", $5:"8481621241235572951", $6:"'\\xec'", $7:"-4578397311671209003", $8:"'\\xcc'"} 161703.902 4 ""
I190604 22:48:47.885090 3431 sql/exec_log.go:162  [n1,client=[::1]:50344,user=root] 25278 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-7656810924711545548", $2:"'\\x09'", $3:"855793586072960013", $4:"'\\x09'", $5:"1017287173965809355", $6:"'\\xa1'", $7:"-6232585356493585104", $8:"'\\xaf'"} 70823.755 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=2e2849a8 key=/Table/3840/1/-7656810924711545548/0 rw=true pri=0.11333670 stat=PENDING epo=0 ts=1559688507.625218000,1 orig=1559688507.333902000,0 max=1559688507.334513000,1 wto=true seq=13"
I190604 22:48:47.906688 3299 sql/exec_log.go:162  [n1,client=[::1]:50311,user=root] 25279 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x54'", $3:"866343863770607923", $4:"'\\xd3'", $5:"8481621241235572951", $6:"'\\xd0'", $7:"-4578397311671209003", $8:"'\\xf2'"} 123737.444 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=0ce80ca2 key=/Table/3840/1/-8194475532289442352/0 rw=true pri=0.13950691 stat=PENDING epo=0 ts=1559688522.860037000,6 orig=1559688521.197242000,0 max=1559688521.197667000,0 wto=true seq=13"
I190604 22:48:47.908636 3299 sql/exec_log.go:162  [n1,client=[::1]:50311,user=root] 25280 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x54'", $3:"866343863770607923", $4:"'\\xd3'", $5:"8481621241235572951", $6:"'\\xd0'", $7:"-4578397311671209003", $8:"'\\xf2'"} 123739.403 0 "TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): \"sql txn\" id=0ce80ca2 key=/Table/3840/1/-8194475532289442352/0 rw=true pri=0.13950691 stat=PENDING epo=1 ts=1559688527.805226000,1 orig=1559688522.860037000,6 max=1559688521.197667000,0 wto=true seq=13"
I190604 22:48:47.915600 2759 sql/exec_log.go:162  [n1,client=[::1]:50288,user=root] 25281 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-8194475532289442352", $2:"'\\x1a'", $3:"866343863770607923", $4:"'\\xac'", $5:"8481621241235572951", $6:"'\\x3c'", $7:"-4578397311671209003", $8:"'\\x52'"} 33019.643 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=2036fa06 key=/Table/3840/1/-8194475532289442352/0 rw=true pri=0.13950695 stat=PENDING epo=0 ts=1559688522.860037000,6 orig=1559688521.226186000,0 max=1559688521.226484000,0 wto=true seq=13"

(Separately, it seems unlikely that the workload is doing what we want here: upon encountering an error, it should either retry the same statement or issue a new one, not modify the values of the one it just tried.)

Still digging in more.

Here's the trace of KV operations performed for one of these UPSERT operations:

  2019-06-04 23:55:27.56581+00:00  | 00:00:00.001373 | fetched: /kv/primary/-6249445251639448483/v -> /'\x0834'                                                                                                              | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565816+00:00 | 00:00:00.001379 | fetched: /kv/primary/-2520054918142495237/v -> /'\x02'                                                                                                                | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565821+00:00 | 00:00:00.001384 | fetched: /kv/primary/7230858377498478967/v -> /'\x2130'                                                                                                               | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565824+00:00 | 00:00:00.001387 | fetched: /kv/primary/2375599334895466311/v -> /'\x2431'                                                                                                               | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565867+00:00 | 00:00:00.00143  | Put /Table/3842/1/-6249445251639448483/0 -> /TUPLE/2:2:Bytes/84                                                                                                       | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565887+00:00 | 00:00:00.00145  | Del /Table/3842/2/"\b4"/-6249445251639448483/0                                                                                                                        | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565896+00:00 | 00:00:00.001459 | CPut /Table/3842/2/"84"/-6249445251639448483/0 -> /BYTES/ (expecting does not exist)                                                                                  | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565924+00:00 | 00:00:00.001487 | Put /Table/3842/1/-2520054918142495237/0 -> /TUPLE/2:2:Bytes/02                                                                                                       | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565933+00:00 | 00:00:00.001496 | Del /Table/3842/2/"\x02"/-2520054918142495237/0                                                                                                                       | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565936+00:00 | 00:00:00.001499 | CPut /Table/3842/2/"02"/-2520054918142495237/0 -> /BYTES/ (expecting does not exist)                                                                                  | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565958+00:00 | 00:00:00.001521 | Put /Table/3842/1/7230858377498478967/0 -> /TUPLE/2:2:Bytes/10                                                                                                        | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565961+00:00 | 00:00:00.001524 | Del /Table/3842/2/"!0"/7230858377498478967/0                                                                                                                          | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565967+00:00 | 00:00:00.00153  | CPut /Table/3842/2/"10"/7230858377498478967/0 -> /BYTES/ (expecting does not exist)                                                                                   | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565977+00:00 | 00:00:00.00154  | Put /Table/3842/1/2375599334895466311/0 -> /TUPLE/2:2:Bytes/41                                                                                                        | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.56598+00:00  | 00:00:00.001543 | Del /Table/3842/2/"$1"/2375599334895466311/0                                                                                                                          | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6
  2019-06-04 23:55:27.565982+00:00 | 00:00:00.001545 | CPut /Table/3842/2/"41"/2375599334895466311/0 -> /BYTES/ (expecting does not exist)                                                                                   | [n1,client=[::1]:61224,user=root]              |          | flow                                 |    6

The "uniqueness constraint violation" we're experiencing is a result of one of the CPut operations failing because something did in fact exist.

I added logging to print the failing request and response, and discovered that the failure is caused by the CPut finding, instead of the expected empty value at the key it's writing to, the very value that it wants to write:

Request: header:<key:"\367\017\010\212\022\r\000\001\200\305\330S\237&\323|X\210" sequence:9 > value:<raw_bytes:"\270G\264Z\003" timestamp:<> > allow_if_does_not_exist:true
Result: unexpected value: raw_bytes:"\270G\264Z\003" timestamp:<wall_time:1559693454441408000 logical:1 >

This shouldn't ever happen from SQL's perspective. At this point, the value in the secondary index for this key should either be empty (if all went well) or something different (if SQL had a bug and didn't delete an old index entry after the indexed value changed).
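
To make the failure mode concrete, here is a minimal sketch of the condition a CPut enforces. The names (checkCPutCondition, conditionFailedErr) are hypothetical; the real evaluation lives in the MVCC layer, and this is not the actual API:

package main

import (
	"bytes"
	"fmt"
)

// conditionFailedErr mirrors the shape of a ConditionFailedError: it
// carries the actual value found at the key.
type conditionFailedErr struct{ actual []byte }

func (e *conditionFailedErr) Error() string {
	return fmt.Sprintf("unexpected value: %q", e.actual)
}

// checkCPutCondition returns nil only if the existing value matches the
// expectation. expVal == nil means "expect the key to be absent", and
// allowIfDoesNotExist additionally tolerates a missing key.
func checkCPutCondition(existVal, expVal []byte, allowIfDoesNotExist bool) error {
	switch {
	case existVal == nil && (expVal == nil || allowIfDoesNotExist):
		return nil
	case existVal != nil && expVal != nil && bytes.Equal(existVal, expVal):
		return nil
	default:
		return &conditionFailedErr{actual: existVal}
	}
}

func main() {
	// The case above: SQL expected the secondary-index key to be absent,
	// but the read surfaced the very bytes the CPut was about to write.
	fmt.Println(checkCPutCondition([]byte("\xb8G\xb4Z\x03"), nil, false))
}

Since the expectation here was absence, the only way to trip the condition is for the key to already hold a value, which is exactly what the logging shows.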

Note further that this request itself got retried several times. Here are all of the retries from the log:

I190605 00:10:53.016837 652041 sql/exec_log.go:162  [n1,client=[::1]:56020,user=root] 106590 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-4221407084484719214", $2:"'\\xcd'", $3:"-1252498265032407347", $4:"'\\xce'", $5:"-4190507510251946920", $6:"'\\x0d'", $7:"6722673737277716797", $8:"'\\x0f'"} 22659.224 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=b21d5eeb key=/Table/3848/1/-4221407084484719214/0 rw=true pri=0.06546453 stat=PENDING epo=0 ts=1559693452.049776000,1 orig=1559693450.528623000,0 max=1559693450.529240000,0 wto=true seq=13"
I190605 00:10:53.017908 652041 sql/exec_log.go:162  [n1,client=[::1]:56020,user=root] 106591 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-4221407084484719214", $2:"'\\xcd'", $3:"-1252498265032407347", $4:"'\\xce'", $5:"-4190507510251946920", $6:"'\\x0d'", $7:"6722673737277716797", $8:"'\\x0f'"} 22660.297 0 "TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORT_SPAN): \"sql txn\" id=b21d5eeb key=/Table/3848/1/-4221407084484719214/0 rw=true pri=0.06546453 stat=ABORTED epo=1 ts=1559693452.049776000,1 orig=1559693452.049776000,1 max=1559693450.529240000,0 wto=false seq=0"
I190605 00:10:54.329344 652041 sql/exec_log.go:162  [n1,client=[::1]:56020,user=root] 106630 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-4221407084484719214", $2:"'\\xcd'", $3:"-1252498265032407347", $4:"'\\xce'", $5:"-4190507510251946920", $6:"'\\x0d'", $7:"6722673737277716797", $8:"'\\x0f'"} 23971.746 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=6b70e947 key=/Table/3848/1/-4221407084484719214/0 rw=true pri=0.06546453 stat=PENDING epo=0 ts=1559693453.879009000,1 orig=1559693453.017784000,0 max=1559693453.018321000,0 wto=true seq=13"
I190605 00:10:54.329869 652041 sql/exec_log.go:162  [n1,client=[::1]:56020,user=root] 106631 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-4221407084484719214", $2:"'\\xcd'", $3:"-1252498265032407347", $4:"'\\xce'", $5:"-4190507510251946920", $6:"'\\x0d'", $7:"6722673737277716797", $8:"'\\x0f'"} 23972.271 0 "TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORT_SPAN): \"sql txn\" id=6b70e947 key=/Table/3848/1/-4221407084484719214/0 rw=true pri=0.11085267 stat=ABORTED epo=1 ts=1559693453.879009000,1 orig=1559693453.879009000,1 max=1559693453.018321000,0 wto=false seq=0"
I190605 00:10:54.856972 652041 sql/exec_log.go:162  [n1,client=[::1]:56020,user=root] 106765 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-4221407084484719214", $2:"'\\xcd'", $3:"-1252498265032407347", $4:"'\\xce'", $5:"-4190507510251946920", $6:"'\\x0d'", $7:"6722673737277716797", $8:"'\\x0f'"} 24499.373 0 "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD): \"sql txn\" id=4cfeb7d5 key=/Table/3848/1/-4221407084484719214/0 rw=true pri=0.11085267 stat=PENDING epo=0 ts=1559693454.541409000,1 orig=1559693454.329810000,0 max=1559693454.330092000,0 wto=true seq=13"
I190605 00:10:54.879232 652041 sql/exec_log.go:162  [n1,client=[::1]:56020,user=root] 106773 exec "" {} "UPSERT INTO kv(k, v) VALUES ($1, $2), ($3, $4), ($5, $6), ($7, $8)" {$1:"-4221407084484719214", $2:"'\\xcd'", $3:"-1252498265032407347", $4:"'\\xce'", $5:"-4190507510251946920", $6:"'\\x0d'", $7:"6722673737277716797", $8:"'\\x0f'"} 24521.636 0 "duplicate key value (v)=('\\x0d') violates unique constraint \"kv_v_idx\""

Could this mean that a CPut from a transaction that got retried didn't get properly cleaned up from the perspective of SQL?

SHA: https://github.com/cockroachdb/cockroach/commits/8e7ef35a8e4169ec63dc5a4df963d8b31a3d5b61

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1324151&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1513,kv.go:176,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1324151-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
              7s        0          310.9          583.4     11.5   2550.1   5100.3   5905.6 write
              8s        0          560.0          580.5     14.7   1208.0   5100.3   6174.0 write
              9s        0          557.0          577.9     11.5   1208.0   4831.8   7784.6 write
             10s        0          262.1          546.3     10.5   1140.9   5368.7   7247.8 write
             11s        0          370.0          530.3     10.0   2013.3   5368.7   6979.3 write
             12s        0          539.0          531.0     13.6   1476.4   4563.4  10200.5 write
             13s        0          678.0          542.3     14.2    771.8   2281.7   8589.9 write
             14s        0          372.0          530.1     13.6    637.5   1073.7  11811.2 write
             15s        0          316.1          515.9      9.4   1140.9   2080.4   2550.1 write
        Error: ERROR: duplicate key value (v)=('\x4c') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.90.185_2019-06-05T11:00:51Z: exit status 1
        : exit status 1
    cluster.go:1872,kv.go:186,test.go:1248: Goexit() was called

I think I finally have a hold on what's going on here. The TL;DR is that we're dealing with a bug in MVCC that's being tickled by a bug in DistSender that's being tickled by a small change in behavior due to parallel commits. Fixing the bug in DistSender is straightforward but insufficient, as I still don't think we can actually enforce a property that MVCC is relying on for correctness - so we'll want to fix both issues.

What @jordanlewis and I were able to see repeatedly is that a transaction performed a Scan on the primary key of a table during an UPSERT, and then issued the following three requests in a batch:

  • Put to update the primary key
  • Delete to delete the old secondary index key
  • CPut to create the new secondary index key

The CPut expected an existing key-value of nil for the secondary index key because it observed a different secondary index value in the primary key. Instead, it observed a non-nil value in the secondary index key. This resulted in a ConditionFailedError and eventually a "unique constraint violation". The fact that the CPut observed a non-nil key in the secondary index is surprising. We weren't seeing the transaction refresh, so the read portion of the CPut was reading at the same time as the initial Scan. So why weren't the primary and secondary indexes lining up? Were we not seeing a consistent snapshot when reading at the same timestamp?

Well... actually no we weren't. I added logging around ScanRequest evaluation and ConditionalPutRequest evaluation and was able to see that for the transaction that hit the issue, the result from the ScanRequest's MVCC scan did not line up with the result of the ConditionalPutRequest scan. The results also didn't have the same timestamp, which we would expect if both had been written by the same transaction. To back this up, I could see the intents for the corresponding values being committed by different transactions. During this time I also noticed that every transaction that hit this issue was in its second epoch. In other words, it had restarted once. This will come up later.

Finally, a big clue jumped out. The result of the ScanRequest had a higher timestamp than the timestamp that the transaction was reading at. This was bizarre. A transaction should only ever be able to see a value above its read timestamp if that value is its own intent. But this wasn't the case. Not only could I see the value being resolved as part of another transaction, but the value also had a higher timestamp than our transaction's own write timestamp. Additionally, the transaction hadn't even written any intents in the current epoch, and intents from previous epochs should be ignored. Digging into MVCCScan led to the first bug:

Bug 1

When performing an MVCC scan, a transaction should read values from the perspective of its read timestamp. The only case where this is not true is when a transaction observes its own intent, at which point it can read the intent regardless of its timestamp. This allows a transaction to continue reading its own writes even as its write timestamp drifts away from its read timestamp. This logic is all implemented in mvccScanner::getAndAdvance. There are a few interesting cases to handle in that method:

  1. transactions ignore other transactions' intents at higher timestamps: see here.
  2. transactions handle other transactions' intents at lower timestamps by kicking off conflict resolution: see here.
  3. transactions can read their own intents at their current epoch as if they were committed (i.e. read-your-writes): see here.
  4. transactions throw errors if they see intents from future epochs: see here.
  5. transactions ignore intents from previous epochs and scan to the MVCC value directly after the intent: see here.

Anyone see a problem there?
...

It turns out that the last condition makes a subtle assumption: a transaction at epoch E will never write an intent at a timestamp higher than the read timestamp at epoch E+1. If this is ever untrue, then "scanning to the MVCC value directly after the intent" is incorrect, because that committed value might still have too high a timestamp. In that case, the function will return a committed value with a timestamp above the reader's read timestamp.

Through some dumb luck, it seems this will very rarely cause issues for us right now, due to a combination of only supporting a single intent per key and always starting new transaction epochs with a read timestamp equal to the best-known write timestamp of the previous epoch. Intent pushes allow the timestamp of an intent at epoch E to go above the read timestamp of epoch E+1, but luckily the single-intent-per-key restriction protects us there, because no new MVCC values can slip in between the old and new intent timestamps.

Still, I think there are cases where this is broken and could cause issues. For instance, a Put could be replayed from an earlier epoch after another transaction came in and committed a value. The Put would be pushed above the committed value, resulting in a violation of the assumption and allowing the transaction at a higher epoch to come along, ignore the Put from the earlier epoch, and end up reading the committed value even if it is at a higher timestamp than its read timestamp.

The solution here is to replace this seek with a seek to

min(PrevTimestamp(ToDBTimestamp(meta_.timestamp())), timestamp_)

If the transaction doesn't want to see anything above timestamp_ and is deliberately ignoring intents from old epochs, we shouldn't seek to versions above timestamp_. The same change will be needed here.
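
In Go-flavored pseudocode, the corrected seek target looks like the sketch below. This uses a minimal stand-in timestamp type; the real change is in the C++ mvccScanner:

package main

import "fmt"

// ts is a minimal stand-in for an MVCC timestamp (wall time only; the
// real hlc.Timestamp also carries a logical component).
type ts int64

func (t ts) prev() ts { return t - 1 }

// seekAfterOldEpochIntent picks the version to seek to after ignoring an
// intent from a previous epoch of our own transaction.
func seekAfterOldEpochIntent(intentTS, readTS ts) ts {
	target := intentTS.prev() // old behavior: just below the intent
	if target > readTS {
		target = readTS // fix: never seek above the read timestamp
	}
	return target
}

func main() {
	// An old-epoch intent pushed to t=10 with a reader at t=7: the old
	// target (9) could surface a committed value at 8 or 9, above the
	// reader's snapshot.
	fmt.Println(seekAfterOldEpochIntent(10, 7)) // 7
}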

With that change, the issue goes away. But what caused this? That question leads to the second bug.

Bug 2

It still seemed strange to me that the transaction was being restarted with a read timestamp below its previous write timestamp. I added instrumentation in evaluateBatch and could see requests sent by the previous epoch of the transaction using the write timestamp of the problematic intent. These write timestamps were being incremented because requests were running into committed values at higher timestamps and hitting WriteTooOldErrors. Remember that a transaction will continue to write even if it hits a WriteTooOld error. Instead of returning the error immediately, it swallows it and sets the WriteTooOld flag on its proto.
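
As a reminder of those mechanics, here's a minimal sketch of that deferred-error behavior, using simplified stand-ins for the proto fields rather than the actual roachpb types:

package main

import "fmt"

// txnProto is a simplified stand-in for the transaction proto fields
// involved here.
type txnProto struct {
	WriteTimestamp int64
	WriteTooOld    bool
}

// onWriteTooOld is what a write does when it lands below an existing
// committed value: bump the write timestamp past it and defer the error
// by setting the flag, rather than returning it immediately.
func onWriteTooOld(txn *txnProto, existingTS int64) {
	if existingTS >= txn.WriteTimestamp {
		txn.WriteTimestamp = existingTS + 1
	}
	txn.WriteTooOld = true // surfaced later as RETRY_WRITE_TOO_OLD
}

func main() {
	txn := &txnProto{WriteTimestamp: 5}
	onWriteTooOld(txn, 8)
	fmt.Printf("%+v\n", txn) // &{WriteTimestamp:9 WriteTooOld:true}
}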

So I could see writes at the higher timestamp. However, instrumentation inside of PrepareTransactionForRetry showed that the transaction proto being used to create the new proto at a higher epoch after a TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD) error had a lower write timestamp than I was seeing in evaluateBatch. So the information about writes at the higher timestamp was being lost somewhere.

I tracked this down to DistSender's divideAndSendBatchToRanges method. The method's purpose is to fan requests out to ranges and then combine their responses. It should be the one combining the information about write timestamps that may have been incremented during the evaluation of requests on different ranges. This is supposed to happen around here. In this case, one of the requests was returning a TransactionRetryError while the others were simply returning transaction protos with updated write timestamps. Combining all of this information should be handled here, but there's a subtle bug in this code. When collecting responses, we stop updating the BatchResponse's transaction proto as soon as we see an error. So if the TransactionRetryError returns before the other responses, the updates from the other responses will be lost.

The fix here is to stop losing information when collecting responses. The easiest way to do that is with a change like this (though the actual fix will look somewhat different):

diff --git a/pkg/kv/dist_sender.go b/pkg/kv/dist_sender.go
index 548e7ec557..83f0ae10ac 100644
--- a/pkg/kv/dist_sender.go
+++ b/pkg/kv/dist_sender.go
@@ -1161,6 +1161,8 @@ func (ds *DistSender) divideAndSendBatchToRanges(
                                if err := br.Combine(resp.reply, resp.positions); err != nil {
                                        pErr = roachpb.NewError(err)
                                }
+                       } else {
+                               pErr.UpdateTxn(resp.reply.Txn)
                        }
                }

This also fixes the issue, even without the MVCC fix.
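
To see why response ordering matters, here's a toy sketch of the collection step (hypothetical types, not DistSender itself). It shows the fixed behavior; the buggy version stopped folding in timestamps once any response carried an error:

package main

import "fmt"

// resp is a toy stand-in for a per-range response: the write timestamp
// its txn proto carries, and whether it also carried an error.
type resp struct {
	txnTS  int64
	hasErr bool
}

// combineWriteTS folds every response's write timestamp into the result,
// error or not. The bug skipped this update on the error path, so a
// timestamp bump from a concurrent partial batch could be lost.
func combineWriteTS(resps []resp) int64 {
	var maxTS int64
	for _, r := range resps {
		if r.txnTS > maxTS {
			maxTS = r.txnTS
		}
	}
	return maxTS
}

func main() {
	// One partial batch returns a retry error with the txn at ts=5 while
	// another bumped the txn to ts=9; dropping the 9 restarts the txn at
	// too low a read timestamp, arming the first bug.
	fmt.Println(combineWriteTS([]resp{{txnTS: 5, hasErr: true}, {txnTS: 9}})) // 9
}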

Why did this start happening?

So what caused all of this? These two bugs have been lingering for a while, but something changed recently to cause them to combine in such a disastrous way.

The answer to that is parallel commits, but not by any fault of its own. Parallel commits allows an EndTransactionRequest to end up in a partial batch with one of the writes in the transaction's first epoch. Before, the writes in the first epoch would be issued ahead of the EndTransaction, would all hit WriteTooOld errors, and would all swallow them before being combined at the client. Now, one of the writes hits a WriteTooOld error that is immediately converted to a TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD) during EndTransaction evaluation before being returned to the client. This allowed the second bug to lose the information about increased write timestamps from the other partial batches issued concurrently with the batch containing the EndTransaction request. This, in turn, led to a transaction restart where the new epoch's read timestamp was below the timestamp of a write from the previous epoch. That allowed the first bug to return the incorrect value during the UPSERT's Scan, which eventually led to the inconsistent snapshot and, in turn, the duplicate key value error we've been seeing.

Great work! And I don't mean only the sleuthing, I also mean the very digestible presentation.

SHA: https://github.com/cockroachdb/cockroach/commits/b83798cadfee6447d565688b58657843741f8a45

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1328389&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1513,kv.go:176,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1328389-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        I190607 10:20:36.009121 1 workload/workload.go:576  starting 4 splits
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
              1s        0          522.6          744.8     92.3    419.4    604.0    771.8 write
              2s        0          702.9          723.9     71.3    637.5   1040.2   1946.2 write
              3s        0          651.0          699.6     50.3    906.0   1610.6   2281.7 write
              4s        0          630.1          682.2     33.6    704.6   1409.3   2684.4 write
              5s        0          466.0          639.0     18.9    805.3   2147.5   3355.4 write
        Error: ERROR: duplicate key value (v)=('\x1d') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.237.56.114_2019-06-07T10:20:34Z: exit status 1
        : exit status 1
    cluster.go:1872,kv.go:186,test.go:1248: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/33f9a3c2c5a0d2b756a8827daa74285b1aece401

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1330334&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1513,kv.go:176,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1330334-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        ite
             20s        0           12.0          321.7      7.3   8053.1  16643.0  16643.0 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
             21s        0          215.9          316.7      8.4   3221.2  12884.9  13421.8 write
             22s        0          427.9          321.8      8.4   2952.8   3623.9  20401.1 write
             23s        0          112.0          312.6      8.9    973.1   3892.3  20401.1 write
             24s        0          128.0          305.0      9.4   1946.2   3087.0  16643.0 write
             25s        0            0.0          292.8      0.0      0.0      0.0      0.0 write
             26s        0          271.9          292.0      9.4   4563.4   5100.3  18253.6 write
             27s        0          112.0          285.3      8.9   5100.3  21474.8  24696.1 write
        Error: ERROR: duplicate key value (v)=('\x79') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.231.210.155_2019-06-08T10:31:34Z: exit status 1
        : exit status 1
    cluster.go:1872,kv.go:186,test.go:1248: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/de90b5d84c9e5d2ef562ebfa3d4a1bdc0e834e10

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1332319&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1513,kv.go:176,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1332319-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        rite
           2m54s        0            0.0          151.0      0.0      0.0      0.0      0.0 write
           2m55s        0           12.0          150.2      5.8   6442.5  42949.7  42949.7 write
           2m56s        0            0.0          149.4      0.0      0.0      0.0      0.0 write
           2m57s        0            0.0          148.5      0.0      0.0      0.0      0.0 write
           2m58s        0            0.0          147.7      0.0      0.0      0.0      0.0 write
           2m59s        0            0.0          146.9      0.0      0.0      0.0      0.0 write
            3m0s        0            0.0          146.1      0.0      0.0      0.0      0.0 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
            3m1s        0            8.0          145.3     11.0  49392.1  49392.1  49392.1 write
        Error: ERROR: duplicate key value (v)=('\xe5') violates unique constraint "kv_v_idx" (SQLSTATE 23505)
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.243.180.44_2019-06-09T10:40:28Z: exit status 1
        : exit status 1
    cluster.go:1872,kv.go:186,test.go:1248: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
    cluster.go:1511,kv.go:188,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-kv-contention-nodes-4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
             0.0 write
          52m56s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
          52m57s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
          52m58s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
          52m59s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
           53m0s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
           53m1s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
           53m2s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
           53m3s        0          116.0           37.1      6.0  32212.3  42949.7 103079.2 write
           53m4s        0            0.0           37.1      0.0      0.0      0.0      0.0 write
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.224.186.87_2019-06-28T05:47:07Z: exit status 255
        : exit status 1
    cluster.go:1870,kv.go:198,test.go:1249: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:84,kv.go:195,cluster.go:1849,errgroup.go:57: spent 26.315789% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1870,kv.go:198,test.go:1249: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
    ts_util.go:84,kv.go:195,cluster.go:1849,errgroup.go:57: spent 24.561404% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
    cluster.go:1870,kv.go:198,test.go:1249: Goexit() was called

SHA: https://github.com/cockroachdb/cockroach/commits/4878cb3e960dc26f0946e527e1836f27b3304c97

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1373334&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190704-1373334/kv/contention/nodes=4/run_1
    cluster.go:1724,kv.go:188,cluster.go:2067,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562220659-21-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
        rite
          14m56s        0            0.0           67.6      0.0      0.0      0.0      0.0 write
          14m57s        0            0.0           67.5      0.0      0.0      0.0      0.0 write
          14m58s        0            0.0           67.4      0.0      0.0      0.0      0.0 write
          14m59s        0            0.0           67.4      0.0      0.0      0.0      0.0 write
           15m0s        0            0.0           67.3      0.0      0.0      0.0      0.0 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
           15m1s        0            0.0           67.2      0.0      0.0      0.0      0.0 write
           15m2s        0            0.0           67.1      0.0      0.0      0.0      0.0 write
           15m3s        0            0.0           67.1      0.0      0.0      0.0      0.0 write
           15m4s        0          103.0           67.1      7.9  36507.2  40802.2  85899.3 write
        Error: EOF
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.229.126.184_2019-07-04T10:54:08Z: exit status 1
        : exit status 1
    cluster.go:2088,kv.go:198,test_runner.go:681: unexpected node event: 4: dead

SHA: https://github.com/cockroachdb/cockroach/commits/36b74f0373b040aa873d4c23b415d94deba88378

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1376442&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190708-1376442/kv/contention/nodes=4/run_1
    cluster.go:1724,kv.go:188,cluster.go:2067,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562566386-19-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
          56m30s        0          224.0           75.4      7.9   7516.2  13958.6  90194.3 write
          56m31s        0          160.0           75.5     10.0   2147.5   9126.8  45097.2 write
          56m32s        0           92.0           75.5      9.4   4160.7  10200.5  53687.1 write
          56m33s        0           12.0           75.5      4.7     12.1     12.6     12.6 write
          56m34s        0            0.0           75.4      0.0      0.0      0.0      0.0 write
          56m35s        0            0.0           75.4      0.0      0.0      0.0      0.0 write
          56m36s        0            0.0           75.4      0.0      0.0      0.0      0.0 write
          56m37s        0            0.0           75.4      0.0      0.0      0.0      0.0 write
          56m38s        0           24.0           75.4      8.4   9663.7  60129.5  60129.5 write
          56m39s        0          208.0           75.4      8.4   9126.8  17179.9  98784.2 write
        Error: EOF
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.167.143_2019-07-08T10:56:45Z: exit status 1
        : exit status 1
    cluster.go:2088,kv.go:198,test_runner.go:681: unexpected node event: 2: dead
