Cockroach: tests: possible regression on a Jepsen test (new write skews)

Created on 18 Jul 2016  ·  65 Comments  ·  Source: cockroachdb/cockroach

So I am re-running the tests after a hiatus of 3 months. One test that used to pass and is now failing is the monotonic test, which runs `insert into test(ts, val) select cluster_logical_timestamp(), max(val)+1 from test` concurrently from different clients and expects both values 1) _to increase in the same order_ and 2) _to be unique_.
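
For concreteness, here is a minimal Go sketch of the client loop; the connection string, database name, and use of database/sql with the lib/pq driver are illustrative assumptions (the actual Jepsen harness is written in Clojure), but the statement is the one described above:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Connection details are assumptions for illustration.
	db, err := sql.Open("postgres",
		"postgres://root@localhost:26257/jepsen?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Several clients run this loop concurrently. Under SERIALIZABLE
	// isolation no two executions may read the same max(val), so the
	// inserted val values must be unique and increase along with ts.
	for {
		if _, err := db.Exec(`INSERT INTO test(ts, val)
			SELECT cluster_logical_timestamp(), max(val)+1 FROM test`); err != nil {
			log.Print(err) // serialization retries are expected
		}
	}
}
```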

This used to work without errors, but today I found multiple instances where it no longer holds: we now have duplicate values in the val column (which is incremented via max):

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160718T211852#20160718T211852

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160718T205708#20160718T205708

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160718T202608#20160718T202608

C-bug C-investigation

The logs reveal that in all 3 cases there was a timeout upon adding the 2nd entry to the table, and the transaction commit was not canceled properly (i.e. the db is put in a state where a write skew is allowed to happen).

Suspicious log entry found at:

http://jepsen.cockroachdb.com/cgi-bin/display.py?path=cockroachdb-monotonic%3Ablank%2F20160718T213531.000Z&merge-logs=1
on node 3 at timestamp 160718-21:36:03.525584

n3l E   160718-21:36:03.525584  internal/client/txn.go:371  failure aborting transaction: internal/client/txn.go:602: attempting to use transaction with wrong status or finalized: ABORTED; abort caused by: retry txn "sql/executor.go:464 sql txn" id=21eda7c9 key=/Table/51/1/159931968781975556/0 rw=true pri=0.02227686 iso=SERIALIZABLE stat=PENDING epo=1 ts=1468877763.508680691,1 orig=1468877763.508680691,1 max=1468877763.749513034,0 wto=false rop=false

cc @tschottdorf, would welcome your comments.

Hmm, that looks serious. We certainly haven't regressed to that bug we had once where the JDBC driver wasn't properly leaving the transactional context (thus mixing multiple txns in one), right?

What do you mean by

and the transaction commit was not canceled properly (i.e. the db is put in a state where a write skew is allowed to happen).

What's the mechanism you suspect here? It looks like the abort failed because the transaction is already considered aborted. As long as the client never tries to commit it later (and it seems it doesn't) there _should_ not be an issue.

What I would do in the absence of a more concrete theory is to log what is actually being written where. You could instrument the code (a sketch follows the list) so that it

  • only logs for the relevant SQL transactions (peek into txn.Name)
  • prefixes all messages with txnID, OrigTimestamp, Timestamp, Seq, Epoch
  • logs on tsCache hits
  • pretty prints keys _and_ values read and written (I think we added that recently). I think the workload only uses Scan (to read the Max) and Put.
  • prints all calls to EndTransaction (with their result) and MVCCResolveIntent (in particular which txns are committed, moved, or aborted, and at what timestamp).
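
A sketch of what such instrumentation could look like; TxnMeta and logTxn are made-up stand-ins (the real transaction proto is roachpb.Transaction), shown only to pin down the prefix format:

```go
package main

import (
	"fmt"
	"strings"
)

// TxnMeta is a stand-in for the fields of interest on the real
// transaction proto; it keeps the sketch self-contained.
type TxnMeta struct {
	Name, ID   string
	Orig, Ts   string // OrigTimestamp and Timestamp, pre-rendered
	Seq, Epoch int
}

// logTxn drops everything that is not one of the SQL transactions under
// investigation and prefixes each message with the identifying fields.
func logTxn(txn TxnMeta, format string, args ...interface{}) {
	if !strings.Contains(txn.Name, "sql txn") {
		return // peek into txn.Name: skip unrelated system txns
	}
	prefix := fmt.Sprintf("txn=%s orig=%s ts=%s seq=%d epo=%d ",
		txn.ID, txn.Orig, txn.Ts, txn.Seq, txn.Epoch)
	fmt.Println(prefix + fmt.Sprintf(format, args...))
}

func main() {
	txn := TxnMeta{Name: "sql/executor.go:464 sql txn", ID: "279448e7",
		Orig: "1468947281.327099130,0", Ts: "1468947281.327099130,0", Seq: 2}
	logTxn(txn, "BATCH %s", "BEGIN")
}
```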

From the unified logs, we should be able to get a good idea and grep out the two txns which actually cause trouble.

It can't hurt to run this on a 3-node cluster only.

I would run TestConcurrentIncrements to play with the instrumentation format (shorter feedback cycle).

There's also a code path in EndTransaction in which we return a txn with state ABORTED and start removing intents, but leave the underlying proto PENDING. I think that might be dangerous, but the symptom would be losing written values, and that doesn't seem to be happening here (but I wanted to mention it).

Sorry I can't be of more concrete help at this point.

Making some progress; it's not looking good...

The error message "failure aborting transaction" was a red herring. In these test results:
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T152418#20160719T152418
it does not occur; however, there is still a duplicate value inserted.

For this test (and onwards) I added txn / statement logging, so we can also rule out that the txn is being reused for conflicting statements. What you see in the combined log:

http://jepsen.cockroachdb.com/cgi-bin/display.py?path=cockroachdb-monotonic%3Ablank%2F20160719T152418.000Z&merge-logs=1

is one transaction issued on node 1:

n1l W   160719-15:24:48.298572  sql/executor.go:1024    WOO: "1468941888.298549607,0" "BEGIN TRANSACTION" // map[]
n1l W   160719-15:24:48.298799  sql/executor.go:1024    WOO: "1468941888.298549607,0" "SELECT max(val) AS m FROM mono" // map[]
n1l W   160719-15:24:48.301099  sql/executor.go:1024    WOO: "1468941888.298549607,0" "SELECT * FROM mono WHERE val = $1" // map["1":"0"]
n1l W   160719-15:24:48.303086  sql/executor.go:1024    WOO: "1468941888.298549607,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n1l W   160719-15:24:48.321469  sql/executor.go:1024    WOO: "1468941888.298549607,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["3":"1" "4":"0" "1":"1" "2":"'14689418882985496070000000000'"]

and another on node 4, started earlier but where the SELECT that determines the new value is issued after the transaction on node 1 is committed:

n4l W   160719-15:24:47.016905  sql/executor.go:1024    WOO: "1468941887.016816079,0" "BEGIN TRANSACTION" // map[]
[ notice the timestamp for the BEGIN statement above ]
n4l W   160719-15:24:48.606502  sql/executor.go:1024    WOO: "1468941887.016816079,0" "SELECT max(val) AS m FROM mono" // map[]
n4l W   160719-15:24:48.712655  sql/executor.go:1024    WOO: "1468941887.016816079,0" "SELECT * FROM mono WHERE val = $1" // map["1":"0"]
n4l W   160719-15:24:48.714267  sql/executor.go:1024    WOO: "1468941887.016816079,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n4l W   160719-15:24:48.731438  sql/executor.go:1024    WOO: "1468941887.016816079,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["1":"1" "2":"'14689418870168160790000000000'" "3":"4" "4":"0"]

NB: the number printed after "WOO" is the txn logical timestamp which I use as txn ID.

Any clock offsets? Run with an offset of 1sec just to be safe.
The above gives us a good first idea of the failure mode, but I think you're going to have to drop down and instrument KV as I suggested above. In particular, we're going to care about updates to and queries of the timestamp cache. In the above history, the second transaction (the one that started early but finished late) should have failed when trying to insert, since the first txn's SELECT max(val) translates into a scan of "everything which sorts after val" at its higher timestamp, which should enter the timestamp cache. This in turn should prevent the second txn's write into that key range.

One step further, this time using the following test:
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T165412#20160719T165412

We're looking at the txns that insert the value "4" into the table.

First transaction, on node 4:

n4l W   160719-16:54:41.327122  sql/executor.go:1024    WOO: "1468947281.327099130,0" "BEGIN TRANSACTION" // map[]
n4l W   160719-16:54:41.327354  sql/executor.go:1024    WOO: "1468947281.327099130,0" "SELECT max(val) AS m FROM mono" // map[]
n4l W   160719-16:54:41.327475  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947281.327099130,0 max=0.000000000,0 wto=false rop=false" seq=0 BATCH BEGIN
n4l W   160719-16:54:41.327563  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947281.327099130,0 max=0.000000000,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-16:54:41.327592  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947281.327099130,0 max=0.000000000,0 wto=false rop=false" BATCH END
n4l W   160719-16:54:41.328008  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH PROCESSED err = ""
n4l W   160719-16:54:41.328071  sql/executor.go:1076    WOO: "1468947281.327099130,0" SELECT VAL: 3
n4l W   160719-16:54:41.329021  sql/executor.go:1024    WOO: "1468947281.327099130,0" "SELECT * FROM mono WHERE val = $1" // map["1":"3"]
n4l W   160719-16:54:41.329237  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" seq=2 BATCH BEGIN
n4l W   160719-16:54:41.329301  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-16:54:41.329332  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH END
n4l W   160719-16:54:41.329557  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH PROCESSED err = ""
n4l W   160719-16:54:41.329590  sql/executor.go:1076    WOO: "1468947281.327099130,0" SELECT VAL: 3
n4l W   160719-16:54:41.330539  sql/executor.go:1024    WOO: "1468947281.327099130,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n4l W   160719-16:54:41.345932  sql/executor.go:1024    WOO: "1468947281.327099130,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["1":"4" "2":"'14689472813270991300000000000'" "3":"4" "4":"0"]
n4l W   160719-16:54:41.346106  sql/executor.go:1076    WOO: "1468947281.327099130,0" SELECT VAL: 4
n4l W   160719-16:54:41.346156  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" seq=3 BATCH BEGIN
n4l W   160719-16:54:41.346214  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" // method 8, header {/Table/51/1/160159764794638338/0 /Min}, request header:
n4l W   160719-16:54:41.346315  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" // method 2, header {/Table/51/1/160159764794638338/0 /Min}, request header: value: > blind:false
n4l W   160719-16:54:41.346399  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" // method 2, header {/Table/51/1/160159764794638338/1/1 /Min}, request header: value: > blind:false
n4l W   160719-16:54:41.346422  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Min rw=false pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH END
n4l W   160719-16:54:41.347672  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Table/51/1/160159764794638338/0 rw=true pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH PROCESSED err = ""
n4l W   160719-16:54:41.348551  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Table/51/1/160159764794638338/0 rw=true pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" seq=4 BATCH BEGIN
n4l W   160719-16:54:41.348611  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Table/51/1/160159764794638338/0 rw=true pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" // method 9, header {/Min /Min}, request header:<> commit:true
n4l W   160719-16:54:41.348665  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Table/51/1/160159764794638338/0 rw=true pri=0.08994397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH END
n4l W   160719-16:54:41.349824  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=279448e7 key=/Table/51/1/160159764794638338/0 rw=true pri=0.08994397 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1468947281.327099130,0 orig=1468947281.327099130,0 max=1468947281.577099130,0 wto=false rop=false" BATCH PROCESSED err = ""

so far, so good.

In the meantime, a txn on node 2 started earlier comes in and conflicts:

n2l W   160719-16:54:41.037807  sql/executor.go:1024    WOO: "1468947281.037784939,0" "BEGIN TRANSACTION" // map[]
[ that was earlier ]
n2l W   160719-16:54:41.820466  sql/executor.go:1024    WOO: "1468947281.037784939,0" "SELECT max(val) AS m FROM mono" // map[]
n2l W   160719-16:54:41.820575  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947281.037784939,0 max=0.000000000,0 wto=false rop=false" seq=0 BATCH BEGIN
n2l W   160719-16:54:41.820685  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947281.037784939,0 max=0.000000000,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n2l W   160719-16:54:41.820713  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947281.037784939,0 max=0.000000000,0 wto=false rop=false" BATCH END
n2l W   160719-16:54:41.822485  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" BATCH PROCESSED err = ""
n2l W   160719-16:54:41.822532  sql/executor.go:1076    WOO: "1468947281.037784939,0" SELECT VAL: 3
n2l W   160719-16:54:41.823668  sql/executor.go:1024    WOO: "1468947281.037784939,0" "SELECT * FROM mono WHERE val = $1" // map["1":"3"]
n2l W   160719-16:54:41.823763  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" seq=2 BATCH BEGIN
n2l W   160719-16:54:41.823819  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n2l W   160719-16:54:41.823852  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" BATCH END
n2l W   160719-16:54:41.824678  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" BATCH PROCESSED err = ""
n2l W   160719-16:54:41.824709  sql/executor.go:1076    WOO: "1468947281.037784939,0" SELECT VAL: 3
n2l W   160719-16:54:41.825767  sql/executor.go:1024    WOO: "1468947281.037784939,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n2l W   160719-16:54:41.841830  sql/executor.go:1024    WOO: "1468947281.037784939,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["1":"4" "2":"'14689472810377849390000000000'" "3":"2" "4":"0"]
n2l W   160719-16:54:41.841973  sql/executor.go:1076    WOO: "1468947281.037784939,0" SELECT VAL: 4
n2l W   160719-16:54:41.842036  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" seq=3 BATCH BEGIN
n2l W   160719-16:54:41.842122  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" // method 8, header {/Table/51/1/160159766419537923/0 /Min}, request header:
n2l W   160719-16:54:41.842228  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" // method 2, header {/Table/51/1/160159766419537923/0 /Min}, request header: value: > blind:false
n2l W   160719-16:54:41.842297  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" // method 2, header {/Table/51/1/160159766419537923/1/1 /Min}, request header: value: > blind:false
n2l W   160719-16:54:41.842323  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Min rw=false pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.037784939,0 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" BATCH END
n2l W   160719-16:54:41.844269  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.757952253,1 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" BATCH PROCESSED err = ""
n2l W   160719-16:54:41.845078  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.757952253,1 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" seq=4 BATCH BEGIN
n2l W   160719-16:54:41.845212  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.757952253,1 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" // method 9, header {/Min /Min}, request header:<> commit:true
n2l W   160719-16:54:41.845254  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.02744151 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947281.757952253,1 orig=1468947281.037784939,0 max=1468947281.287784939,0 wto=false rop=false" BATCH END
n2l W   160719-16:54:41.848140  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.03889604 iso=SERIALIZABLE stat=PENDING epo=1 ts=1468947281.757952253,1 orig=1468947281.757952253,1 max=1468947281.287784939,0 wto=false rop=false" BATCH PROCESSED err = "retry txn \"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.03889604 iso=SERIALIZABLE stat=PENDING epo=1 ts=1468947281.757952253,1 orig=1468947281.757952253,1 max=1468947281.287784939,0 wto=false rop=false"
n2l W   160719-16:54:41.848189  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.03889604 iso=SERIALIZABLE stat=PENDING epo=1 ts=1468947281.757952253,1 orig=1468947281.757952253,1 max=1468947281.287784939,0 wto=false rop=false" seq=4 BATCH BEGIN
n2l W   160719-16:54:41.848237  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.03889604 iso=SERIALIZABLE stat=PENDING epo=1 ts=1468947281.757952253,1 orig=1468947281.757952253,1 max=1468947281.287784939,0 wto=false rop=false" // method 9, header {/Min /Min}, request header:<> commit:false
n2l W   160719-16:54:41.848266  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.03889604 iso=SERIALIZABLE stat=PENDING epo=1 ts=1468947281.757952253,1 orig=1468947281.757952253,1 max=1468947281.287784939,0 wto=false rop=false" BATCH END
n2l W   160719-16:54:41.849831  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=6cc81695 key=/Table/51/1/160159766419537923/0 rw=true pri=0.03889604 iso=SERIALIZABLE stat=ABORTED epo=1 ts=1468947281.757952253,1 orig=1468947281.757952253,1 max=1468947281.287784939,0 wto=false rop=false" BATCH PROCESSED err = ""

in other words, conflict, abort, everything is fine.

Meanwhile, another txn on node 3, also started earlier, comes in:

n3l W   160719-16:54:40.548364  sql/executor.go:1024    WOO: "1468947280.548342801,0" "BEGIN TRANSACTION" // map[]
[ that was early ]
n3l W   160719-16:54:42.157050  sql/executor.go:1024    WOO: "1468947280.548342801,0" "SELECT max(val) AS m FROM mono" // map[]
n3l W   160719-16:54:42.157222  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947280.548342801,0 max=0.000000000,0 wto=false rop=false" seq=0 BATCH BEGIN
n3l W   160719-16:54:42.157357  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947280.548342801,0 max=0.000000000,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n3l W   160719-16:54:42.157405  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id= key=/Min rw=false pri=0.00000000 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,0 orig=1468947280.548342801,0 max=0.000000000,0 wto=false rop=false" BATCH END
n3l W   160719-16:54:42.160407  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH PROCESSED err = ""
n3l W   160719-16:54:42.160475  sql/executor.go:1076    WOO: "1468947280.548342801,0" SELECT VAL: 3
[ this value should not be visible here. We really want this to fail. ]
n3l W   160719-16:54:42.161783  sql/executor.go:1024    WOO: "1468947280.548342801,0" "SELECT * FROM mono WHERE val = $1" // map["1":"3"]
n3l W   160719-16:54:42.161935  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" seq=2 BATCH BEGIN
n3l W   160719-16:54:42.162035  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n3l W   160719-16:54:42.162082  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH END
n3l W   160719-16:54:42.163220  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH PROCESSED err = ""
n3l W   160719-16:54:42.163268  sql/executor.go:1076    WOO: "1468947280.548342801,0" SELECT VAL: 3
n3l W   160719-16:54:42.164263  sql/executor.go:1024    WOO: "1468947280.548342801,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n3l W   160719-16:54:42.180494  sql/executor.go:1024    WOO: "1468947280.548342801,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["3":"3" "4":"0" "1":"4" "2":"'14689472805483428010000000000'"]
n3l W   160719-16:54:42.180642  sql/executor.go:1076    WOO: "1468947280.548342801,0" SELECT VAL: 4
n3l W   160719-16:54:42.180678  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" seq=3 BATCH BEGIN
n3l W   160719-16:54:42.180747  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" // method 8, header {/Table/51/1/160159767529324549/0 /Min}, request header:
n3l W   160719-16:54:42.180852  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" // method 2, header {/Table/51/1/160159767529324549/0 /Min}, request header: value: > blind:false
n3l W   160719-16:54:42.180917  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" // method 2, header {/Table/51/1/160159767529324549/1/1 /Min}, request header: value: > blind:false
n3l W   160719-16:54:42.180942  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Min rw=false pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH END
n3l W   160719-16:54:42.190326  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Table/51/1/160159767529324549/0 rw=true pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH PROCESSED err = ""
n3l W   160719-16:54:42.191181  internal/client/txn.go:672  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Table/51/1/160159767529324549/0 rw=true pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" seq=4 BATCH BEGIN
n3l W   160719-16:54:42.191323  internal/client/txn.go:675  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Table/51/1/160159767529324549/0 rw=true pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" // method 9, header {/Min /Min}, request header:<> commit:true
n3l W   160719-16:54:42.191353  internal/client/txn.go:677  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Table/51/1/160159767529324549/0 rw=true pri=0.00278018 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH END
n3l W   160719-16:54:42.193055  internal/client/txn.go:683  WAA: "\"sql/executor.go:464 sql txn\" id=964834a0 key=/Table/51/1/160159767529324549/0 rw=true pri=0.00278018 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1468947280.548342801,0 orig=1468947280.548342801,0 max=1468947280.798342801,0 wto=false rop=false" BATCH PROCESSED err = ""
[ no error: boom! ]

These are excerpts from the txn logging.
In the middle (interleaved with this) we can see a number of splits and replica traffic. Needs to be investigated further.

Maybe we messed up something when the lease changes. We also had an issue at some point where, during splits, we didn't copy the read timestamp cache synchronously (but we fixed that when readOnlyCmdMu was introduced). Why are you logging in internal/client/txn.go and not in storage/replica.go (as suggested above)? This is still too high-level. In the end, all that will be relevant (hopefully) are timestamp cache additions and queries. At least that's where I'd hope to see the anomaly.

One layer down, this time with:
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T184032#20160719T184032

Nodes 1 and 5 insert a duplicate value.

First txn on node 5:

n5l W   160719-18:40:56.406871  sql/executor.go:1024    WOO: "1468953656.406848858,0" "BEGIN TRANSACTION" // map[]
n5l W   160719-18:40:56.407095  sql/executor.go:1024    WOO: "1468953656.406848858,0" "SELECT max(val) AS m FROM mono" // map[]
n5l W   160719-18:40:56.407406  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" seq=2 BATCH BEGIN
n5l W   160719-18:40:56.407480  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n5l W   160719-18:40:56.407504  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.407655  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH PROCESSED err = ""
n5l W   160719-18:40:56.407741  sql/executor.go:1076    WOO: "1468953656.406848858,0" SELECT VAL: 1
n5l W   160719-18:40:56.408904  sql/executor.go:1024    WOO: "1468953656.406848858,0" "SELECT * FROM mono WHERE val = $1" // map["1":"1"]
n5l W   160719-18:40:56.409070  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" seq=3 BATCH BEGIN
n5l W   160719-18:40:56.409136  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n5l W   160719-18:40:56.409165  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.409264  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Min rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH PROCESSED err = ""
n5l W   160719-18:40:56.409339  sql/executor.go:1076    WOO: "1468953656.406848858,0" SELECT VAL: 1
n5l W   160719-18:40:56.410598  sql/executor.go:1024    WOO: "1468953656.406848858,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n5l W   160719-18:40:56.427124  sql/executor.go:1024    WOO: "1468953656.406848858,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["1":"2" "2":"'14689536564068488580000000000'" "3":"5" "4":"0"]
n5l W   160719-18:40:56.427249  sql/executor.go:1076    WOO: "1468953656.406848858,0" SELECT VAL: 2
n5l W   160719-18:40:56.427402  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" seq=4 BATCH BEGIN
n5l W   160719-18:40:56.427452  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 8, header {/Table/51/1/160180654660648963/0 /Min}, request header:
n5l W   160719-18:40:56.427538  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 2, header {/Table/51/1/160180654660648963/0 /Min}, request header: value: > blind:false
n5l W   160719-18:40:56.427607  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 2, header {/Table/51/1/160180654660648963/1/1 /Min}, request header: value: > blind:false
n5l W   160719-18:40:56.427654  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.427723  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" START
n5l W   160719-18:40:56.427757  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" moved forward to "1468953651.733717078,156" due to read
n5l W   160719-18:40:56.427792  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" moved forward to "1468953651.733717078,156" due to read

[ this message "moved forward" just reports tsCache.Next(); in this case the Next ts is actually in the past, so nothing happens ]

n5l W   160719-18:40:56.427851  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" DONE
n5l W   160719-18:40:56.429492  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH PROCESSED err = ""
n5l W   160719-18:40:56.430376  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" seq=5 BATCH BEGIN
n5l W   160719-18:40:56.430462  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 9, header {/Table/51/1/160180654660648963/0 /Min}, request header: commit:true intent_spans: intent_spans:
n5l W   160719-18:40:56.430490  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.430545  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" START
n5l W   160719-18:40:56.430571  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" DONE
n5l W   160719-18:40:56.431636  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" BATCH PROCESSED err = ""

(The UTS messages are from applyTimestampCache)

Meanwhile on node 1 with a txn started earlier:

n1l W   160719-18:40:55.913501  sql/executor.go:1024    WOO: "1468953655.913479564,0" "BEGIN TRANSACTION" // map[]
[ that was earlier ]

[ the txn batch is sent to node 4 ]

n4l W   160719-18:40:56.652595  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n4l W   160719-18:40:56.652770  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-18:40:56.652812  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n1l W   160719-18:40:56.652853  sql/executor.go:1024    WOO: "1468953655.913479564,0" "SELECT max(val) AS m FROM mono" // map[]
n4l W   160719-18:40:56.652881  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = "range 21: replica node_id:5 store_id:5 replica_id:4 not lease holder; node_id:3 store_id:3 replica_id:3 is"
n3l W   160719-18:40:56.653593  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n3l W   160719-18:40:56.653701  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n3l W   160719-18:40:56.653725  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n3l W   160719-18:40:56.653782  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = "range 21: replica node_id:2 store_id:2 replica_id:2 not lease holder; node_id:3 store_id:3 replica_id:3 is"
n5l W   160719-18:40:56.655254  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n5l W   160719-18:40:56.655405  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n5l W   160719-18:40:56.655432  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.655529  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = "key range /Table/51/1-/Table/51/2 outside of bounds of range /Table/50-/Table/51"
n4l W   160719-18:40:56.657021  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n4l W   160719-18:40:56.657146  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-18:40:56.657183  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.675852  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""
n1l W   160719-18:40:56.677136  sql/executor.go:1076    WOO: "1468953655.913479564,0" SELECT VAL: 1

[ uh-oh ]

n4l W   160719-18:40:56.678216  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=3 BATCH BEGIN
n4l W   160719-18:40:56.678339  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n1l W   160719-18:40:56.678345  sql/executor.go:1024    WOO: "1468953655.913479564,0" "SELECT * FROM mono WHERE val = $1" // map["1":"1"]
n4l W   160719-18:40:56.678398  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.679065  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""
n1l W   160719-18:40:56.680221  sql/executor.go:1076    WOO: "1468953655.913479564,0" SELECT VAL: 1
n1l W   160719-18:40:56.681221  sql/executor.go:1024    WOO: "1468953655.913479564,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n1l W   160719-18:40:56.699332  sql/executor.go:1024    WOO: "1468953655.913479564,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["1":"2" "2":"'14689536559134795640000000000'" "3":"1" "4":"0"]
n4l W   160719-18:40:56.699431  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=4 BATCH BEGIN
n1l W   160719-18:40:56.699452  sql/executor.go:1076    WOO: "1468953655.913479564,0" SELECT VAL: 2
n4l W   160719-18:40:56.699593  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 8, header {/Table/51/1/160180655552626689/0 /Min}, request header:
n4l W   160719-18:40:56.699675  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655552626689/0 /Min}, request header: value: > blind:false
n4l W   160719-18:40:56.699747  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655552626689/1/1 /Min}, request header: value: > blind:false
n4l W   160719-18:40:56.699789  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.699892  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" START
n4l W   160719-18:40:56.699945  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" moved forward to "1468953647.649568444,1" due to read
n4l W   160719-18:40:56.699973  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" moved forward to "1468953647.649568444,1" due to read

[ this message "moved forward" just reports tsCache.Next(); in this case the Next ts is actually in the past, so nothing happens ]

n4l W   160719-18:40:56.700011  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" DONE
n4l W   160719-18:40:56.701315  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""
n4l W   160719-18:40:56.702928  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=5 BATCH BEGIN
n4l W   160719-18:40:56.703065  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 9, header {/Table/51/1/160180655552626689/0 /Min}, request header: commit:true intent_spans: intent_spans:
n4l W   160719-18:40:56.703096  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.703164  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" START
n4l W   160719-18:40:56.703198  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" DONE
n4l W   160719-18:40:56.704551  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""

[ no error: boom! ]

Perhaps interestingly, the issue tends to disappear if I run only 3 nodes.

Ok, the logging I put in there is misleading. The code in applyTimestampCache that says "moved forward" logs before the call to Forward, and in both instances the new ts is actually lower than the one in the txn. For example:

n5l W   160719-18:40:56.427757  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" moved forward to "1468953651.733717078,156" due to read
n5l W   160719-18:40:56.427792  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=false pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" moved forward to "1468953651.733717078,156" due to read

the tsCache says that its next is 1468953651.733717078,156 whereas the txn has seen 1468953656 already (a later ts).
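
For reference, Forward only ever ratchets a timestamp upward, so "moving forward" to a lower timestamp is a no-op; a toy version (the real method is hlc.Timestamp.Forward) makes that explicit:

```go
package main

import "fmt"

// ts mimics an HLC timestamp: wall nanoseconds plus a logical tick.
type ts struct{ wall, logical int64 }

func (t ts) less(o ts) bool {
	return t.wall < o.wall || (t.wall == o.wall && t.logical < o.logical)
}

// forward returns the later of the two timestamps; it never goes back.
func forward(t, o ts) ts {
	if t.less(o) {
		return o
	}
	return t
}

func main() {
	txnTS := ts{1468953656406848858, 0}       // the txn's current ts
	cacheNext := ts{1468953651733717078, 156} // tsCache's Next() from the log
	// The cache's ts is older, so the txn ts is unchanged: the log line
	// prints before this call and is therefore misleading.
	fmt.Println(forward(txnTS, cacheNext) == txnTS) // true
}
```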

Wait, there is something strange here. In both cases the txn timestamp is larger than the tsCache's idea of what the Next() ts should be. How can that be? Isn't the invariant of the tsCache that its Next() is always greater than all the timestamps seen so far, including those carried by Transaction objects?

If the bug requires more than three nodes, that suggests something related to rebalancing (since a three-node cluster never rebalances after the initial replication is complete).

I am running the same test again (same parameters, same logging) on CockroachDB f73f125327569ec6c996c5a323f4283c0c290d74 from March 24; that is _after_ Nathan's big rework of the timestamp cache. After multiple rounds I am rather confident the error does not reproduce there.

Will now try with some intermediate points. If you have particular commits in mind let me know.

Now running against 8a69c525a39e03412d0447ce78255fed557f2818 from April 19th, after Pete's optimization of tsCache entries. Can't seem to reproduce there either.

Wait, there is something strange here. In both cases the txn timestamp is larger than the tsCache's idea of what the Next() ts should be. How can that be? Isn't the invariant of the tsCache that its Next() is always greater than all the timestamps seen so far, including those carried by Transaction objects?

No, the invariant is that after a key (or range) has been read at ts1, a lookup for it will never return anything less than that. That is, a violation would be scanning the range at ts1 and then later doing anything that mutates that range at <=ts1.

Now running against 8a69c52 from April 19th, after Pete's optimization of tsCache entries. Can't seem to reproduce there either.

Good job tracking this down so far. Bisecting seems reasonable at this point. I'm sorry I haven't been of much help yet, but it's really difficult looking at this test from afar (the problem with ad-hoc instrumentation is that there's only one person who has done the ramp-up :-) ). Let me know if I can assist in some way.

Running against 549d9b575e06921fa96b6ff4881ea348d8b6d00c from June 24th, can't seem to repro there either.

Running on e67fbb82415a7ee18740c34850733c936364a005 from July 12th; test fails:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T202500#20160719T202500

Running on 8ce6033625109d1599cdd1652b9736b24ed0e393 from July 8th. Seems fine.

Running on 2c436c89768e48c5a10686c372c16c2eb4aaa683 from July 11th. Test fails:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T205352#20160719T205352

Running on 9e54183215d453d48908e6d9008f11f0757776f8 from July 11th. Inconclusive. Seems fine but most test clients time out before they can do anything interesting.

Running on 4f4f5e8ed12212d96cfbcc582ec6dc626725f66d from July 11th. Repros again:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T211712#20160719T211712

Looked through the commits in between. Nothing rings a lot of bells; 4f4f5e8ed12212d96cfbcc582ec6dc626725f66d (lease transfers) rings the most of them, but you're probably running an older version right now, so we'll soon know. If you get a failure now, try 4f4f5e8^.

341863708c514126c05362de504ced83c6e4f0fe definitely repros:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T215330#20160719T215330

... and e3a813a5f806e0d45c3414b105af9ccfb20d4f4a most likely doesn't repro.

[edit, 1 day after: it does]

👍 I'll take a look first thing tomorrow.

Scratch the last result. I let the latest batch run overnight and there's a repro on e3a813a5f806e0d45c3414b105af9ccfb20d4f4a now:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160719T222148#20160719T222148

Ok. Back to binary search further in the past?
BTW, I'm trying to replicate that test you're running in our acceptance tests.

Repro with 91f85839fea5411894b123d113096551ba201693

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160720T115603#20160720T115603

Now digging further into the past.

Repro with 4a0f57e44273970ca6d9d94e68d57267c9a04129 from June 29th

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160720T120319#20160720T120319

Can't reproduce with 7990ffe52c1f9e8727ec1fed609c662b028fdce9 from June 15th

Can't reproduce with 08acf61c26575f7c3aa282c08f3ef2125ee22b86 from June 22nd

Ok what we know:

  • (certain) issue appeared between sometime in March and June 29th
  • (certain) the duplicate record _always_ happens near the beginning of the test, somewhere in the first 5 records inserted, when eager replication is enabled
  • (certain) can't repro / much harder with 30 clients connected (results now with 5 clients, 1 per node)
  • (uncertain) can't repro with 3 nodes (now 5)

Hmm. Nothing stands out to me.

15e0fca..4a0f57e is a pretty narrow range. There are hardly any non-test changes in there. Could it be something innocuous-looking like #7527? That change didn't just add logging, it added a new read operation to the ChangeReplicas transaction (before the first write).

[edit 1 day after - the range is bigger]

I also don't trust the lower boundary too much. But running that change can't hurt.

Haven't managed to repro using https://github.com/cockroachdb/cockroach/pull/7941. Pointers on potential differences welcome.

"Good" news, repro in 15e0fca from June 29 after 50 runs:
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160720T140032#20160720T140032

08acf61c26575f7c3aa282c08f3ef2125ee22b86 from June 22nd doesn't repro in 50 runs

35d160e94bcdaff89d7e8ac2eba982ad57e2554a from June 24th fails after 40ish runs:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160720T171024#20160720T171024

96c7b9bddf204168c29a7e8f6ed66ce5dc86c89f from June 23rd fails after 10 runs:

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160720T173428#20160720T173428

83b9d0cde6059f47b6c990a470dc82408dfe3dbc from June 22nd doesn't seem to repro either

4839efeea54c1c60bd4b5ff2b8c5b7f91707a0c6 from June 23rd fails after 10 runs:
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160720T185002#20160720T185002

2c367918072f47fd07748336d557e7747730ba23 from June 22nd doesn't seem to repro at 30 runs.
As I am now suspecting 4839efeea54c1c60bd4b5ff2b8c5b7f91707a0c6, I will go straight to 646ce6c8b5418ec0298812b7b51dd09e8ce0353a and let that run longer.

So I couldn't find a repro with 646ce6c8b5418ec0298812b7b51dd09e8ce0353a and I thus talked to @petermattis about this.
The working theory is that the issue occurs when transactions hit the db concurrently with range replications. The changes in 4839efeea54c1c60bd4b5ff2b8c5b7f91707a0c6 do not introduce a new bug but rather accelerate the first occurrence of the issue, by ensuring that the first replications happen soon after the db is started. If the theory is correct, then 646ce6c8b5418ec0298812b7b51dd09e8ce0353a should still exhibit the issue, but only with either a longer run or more frequent range scans.

So I am doing both: running 646ce6c8b5418ec0298812b7b51dd09e8ce0353a with a 2-minute test and COCKROACH_SCAN_MAX_IDLE_TIME=10ms.

[ edit - this didn't reveal a repro after 20 runs ]

Another suggestion from @petermattis: take 4839efeea54c1c60bd4b5ff2b8c5b7f91707a0c6, where we had an easy repro, remove the 4 lines of code responsible for eager replication, and see if the repro disappears. If it does, we gain more certainty that replication is part of the issue.

Actually, I suggested taking master and commenting out the eager replication in splitTrigger. I believe it is now 3 lines that need to be disabled. But doing this on 4839efe also makes sense.

ok will do that next

So 4839efeea54c1c60bd4b5ff2b8c5b7f91707a0c6 with the 4 lines removed doesn't readily reproduce (not as readily as earlier anyway: no failure in 50 runs). Will now try with master too.

So when patching master manually to disable eager replication, the easy repro disappears (none in 50 runs). Interesting. Updating the list of known knowns:

  • (certain) the duplicate record _always_ happens near the beginning of the test, somewhere in the first 5 records inserted, _when eager replication is enabled_
  • (certain) issue appeared sometime since March; OR it was already there but hidden well enough that our Jepsen tests did not encounter it
  • (certain) can't repro / much harder with 30 clients connected (results now with 5 clients, 1 per node)
  • (uncertain) can't repro with 3 nodes (now 5)
  • (certain) much harder to repro when eager replication is disabled

Very curious. I think we can agree that eager replication itself is not the issue, but is simply revealing an underlying bug. I recall that your test doesn't run for that long. I'm curious to know if the bug occurs when an operation is being performed concurrently with replication. Can you try delaying the time between when the cluster is started and when jepsen starts doing its thing? Delaying for 20 seconds should be sufficient for the cluster to reach a fully replicated state.
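
For concreteness, the suggested experiment could be driven like this - a minimal sketch where runClient stands in for one round of the monotonic workload (see the database/sql sketch further down); the 20s delay, 5 clients, and 2-minute duration are the figures already mentioned in this thread.

package main

import (
	"sync"
	"time"
)

// runClient stands in for one round of the monotonic workload (see the
// database/sql sketch further down the thread).
func runClient(node int) {}

func main() {
	// Give the cluster time to reach a fully replicated state first,
	// per the suggestion above.
	time.Sleep(20 * time.Second)

	var wg sync.WaitGroup
	for node := 1; node <= 5; node++ { // 5 clients, 1 per node, as in the runs above
		wg.Add(1)
		go func(node int) {
			defer wg.Done()
			deadline := time.Now().Add(2 * time.Minute) // test length used above
			for time.Now().Before(deadline) {
				runClient(node)
			}
		}(node)
	}
	wg.Wait()
}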

Ok now I am looking back at the initial repro on master which I extracted yesterday: https://github.com/cockroachdb/cockroach/issues/7899#issuecomment-233731702

And looking at the db log the "old-fashioned way".

We have 3 txns: the first on node 5, then node 2, then node 1, in commit order. The error is that the txn on node 1 inserts a duplicate of the value written by the one on node 5. The actual ordering of events is:

  1. txn begins on node 1, rather early: n1l W 160719-18:40:55.913501 sql/executor.go:1024 WOO: "1468953655.913479564,0" "BEGIN TRANSACTION" // map[]
  2. txn begins on node 5, somewhat later: n5l W 160719-18:40:56.406871 sql/executor.go:1024 WOO: "1468953656.406848858,0" "BEGIN TRANSACTION" // map[]
  3. data updated on node 5
  4. txn commits on node 5
  5. txn begins on node 2
  6. stuff happens, including replica from node 5 to node 1, and update by node 2's txn (details below)
  7. data updated on node 1, inserting duplicate value
  8. txn commits on node 1 (without error)

During steps #1-#4 there is no inter-node traffic logged whatsoever. The txn received on node 5 is processed fully locally.
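
For readers following the raw logs: this is the per-client transaction as reconstructed from the WOO lines below, as a minimal sketch using Go's database/sql. The SQL strings are lifted from the logs; the driver, connection string, and error handling are my assumptions.

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumption: any pgwire driver works against CockroachDB
)

// oneRound reconstructs one client round from the WOO log lines: read the
// current max, double-check the row holding it, grab the cluster's logical
// timestamp, then insert max+1.
func oneRound(db *sql.DB, node int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	var max sql.NullInt64
	if err := tx.QueryRow(`SELECT max(val) AS m FROM mono`).Scan(&max); err != nil {
		return err
	}
	next := max.Int64 + 1 // NULL (empty table) scans as 0

	// The test double-checks the row holding the current max, as seen in the logs.
	rows, err := tx.Query(`SELECT * FROM mono WHERE val = $1`, max.Int64)
	if err != nil {
		return err
	}
	rows.Close()

	var sts string
	if err := tx.QueryRow(
		`SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts`,
	).Scan(&sts); err != nil {
		return err
	}

	if _, err := tx.Exec(
		`INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *`,
		next, sts, node, 0,
	); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	// Connection string is an assumption; each Jepsen client talks to its own node.
	db, err := sql.Open("postgres", "postgres://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if err := oneRound(db, 1); err != nil {
		log.Println(err)
	}
}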

So I want to focus on the time afterwards. What do we see?

  • txn on node 5 commits:
n5l W   160719-18:40:56.430376  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" seq=5 BATCH BEGIN
n5l W   160719-18:40:56.430462  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.406848858,0 orig=1468953656.406848858,0 max=1468953656.406848858,0 wto=false rop=false" // method 9, header {/Table/51/1/160180654660648963/0 /Min}, request header: commit:true intent_spans: intent_spans:
n5l W   160719-18:40:56.430490  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=f14f1bab key=/Table/51/1/160180654660648963/0 rw=true pri=0.02747658 is

(method 9 = EndTransaction; in the batches below, the other method numbers appear to be 6 = Scan, 2 = ConditionalPut, and 8 = BeginTransaction)

  • follow-up txn starts on node 2:
n2l W   160719-18:40:56.578968  sql/executor.go:1024    WOO: "1468953656.578940009,0" "BEGIN TRANSACTION" // map[]
  • stuff happens on node 3, probably unrelated:
n3l I   160719-18:40:56.582080  storage/replica_command.go:1658 range 1: replicaID 3 transfer raft leadership to replicaID 2
  • node 2 gets new range lease?
n2l I   160719-18:40:56.582813  storage/replica_command.go:1642 store=4:4 ([]=cockroach-data) range=1 [/Min-/Table/11): new range lease replica {4 4 2} 2016-07-19 18:40:56.486912901 +0000 UTC 4.843687426s following replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 408042h40m56.486912901s [physicalTime=2016-07-19 18:40:56.582676207 +0000 UTC]
  • node 1 fails to get lease:
n1l E   160719-18:40:56.588422  storage/queue.go:492    [split] on store=1:1 ([]=cockroach-data) range=21 [/Table/50-/Max): store=1:1 ([]=cockroach-data) range=21 [/Table/50-/Max): could not obtain lease: range 21 was not found
  • node 2 issues its 1st SQL unindexed MAX query (it will be fanned out below to nodes 3, 4, and 5, which hold the data)
n2l W   160719-18:40:56.588551  sql/executor.go:1024    WOO: "1468953656.578940009,0" "SELECT max(val) AS m FROM mono" // map[]
  • node 5 (where the update happened earlier) splits its range:
 n5l    I   160719-18:40:56.588573  storage/split_queue.go:100  splitting store=3:3 ([]=cockroach-data) range=21 [/Table/50-/Max) at keys [/Table/51/0]
  • nodes 3 and 4 perform the query on behalf of node 2:
n4l W   160719-18:40:56.593554  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.593469896,0 wto=false rop=false" seq=2 BATCH BEGIN
n4l W   160719-18:40:56.593725  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.593469896,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-18:40:56.593769  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.593469896,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.593863  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.593469896,0 wto=false rop=false" BATCH PROCESSED err = "range 21: replica node_id:5 store_id:5 replica_id:4 not lease holder; node_id:3 store_id:3 replica_id:3 is"
n3l W   160719-18:40:56.596140  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.596021233,0 wto=false rop=false" seq=2 BATCH BEGIN
n3l W   160719-18:40:56.596363  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.596021233,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0

(with a split ongoing on node 5:)

n5l I   160719-18:40:56.596371  storage/replica_command.go:2215 store=3:3 ([]=cockroach-data) range=21 [/Table/50-/Max): initiating a split of this range at key /Table/51

(continuing on nodes 3 & 4 & 5 on behalf of 2)

n3l W   160719-18:40:56.596422  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.596021233,0 wto=false rop=false" BATCH END
n3l W   160719-18:40:56.596522  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.596021233,0 wto=false rop=false" BATCH PROCESSED err = "range 21: replica node_id:2 store_id:2 replica_id:2 not lease holder; node_id:3 store_id:3 replica_id:3 is"
n5l W   160719-18:40:56.598048  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" seq=2 BATCH BEGIN
n5l W   160719-18:40:56.598196  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n5l W   160719-18:40:56.598223  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.598403  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" BATCH PROCESSED err = ""
  • node 2 issues 2nd SQL query:
n2l W   160719-18:40:56.599333  sql/executor.go:1024    WOO: "1468953656.578940009,0" "SELECT * FROM mono WHERE val = $1" // map["1":"2"]
  • node 5 issues split:
 n5l    I   160719-18:40:56.600125  storage/log.go:84   Range Event: "split", range: 21, info: {"UpdatedDesc":{"range_id":21,"start_key":"ug==","end_key":"uw==","replicas":[{"node_id":5,"store_id":5,"replica_id":4},{"node_id":2,"store_id":2,"replica_id":2},{"node_id":3,"store_id":3,"replica_id":3}],"next_replica_id":5},"NewDesc":{"range_id":31,"start_key":"uw==","end_key":"//8=","replicas":[{"node_id":5,"store_id":5,"replica_id":1},{"node_id":2,"store_id":2,"replica_id":2},{"node_id":3,"store_id":3,"replica_id":3}],"next_replica_id":4}}
  • node 2 continues its txn, with support from node 5:
n2l W   160719-18:40:56.600672  sql/executor.go:1076    WOO: "1468953656.578940009,0" SELECT VAL: 2
n5l W   160719-18:40:56.600724  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" seq=3 BATCH BEGIN
n5l W   160719-18:40:56.600867  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n5l W   160719-18:40:56.600917  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.601045  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Min rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" BATCH PROCESSED err = ""
n2l W   160719-18:40:56.601584  sql/executor.go:1024    WOO: "1468953656.578940009,0" "SELECT cluster_logical_timestamp() * CAST(10000000000 AS DECIMAL) AS ts" // map[]
n2l W   160719-18:40:56.618217  sql/executor.go:1024    WOO: "1468953656.578940009,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["4":"0" "1":"3" "2":"'14689536565789400090000000000'" "3":"2"]
n2l W   160719-18:40:56.618328  sql/executor.go:1076    WOO: "1468953656.578940009,0" SELECT VAL: 3
n5l W   160719-18:40:56.619554  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" seq=4 BATCH BEGIN
n5l W   160719-18:40:56.619692  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" // method 8, header {/Table/51/1/160180655286812676/0 /Min}, request header:
n5l W   160719-18:40:56.619765  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655286812676/0 /Min}, request header: value: > blind:false
n5l W   160719-18:40:56.619830  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655286812676/1/1 /Min}, request header: value: > blind:false
n5l W   160719-18:40:56.619862  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.619943  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" START
n5l W   160719-18:40:56.619988  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" moved forward to "1468953651.733717078,156" due to read
n5l W   160719-18:40:56.620015  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" moved forward to "1468953651.733717078,156" due to read
n5l W   160719-18:40:56.620065  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" DONE
n5l W   160719-18:40:56.621461  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.597950119,0 wto=false rop=false" BATCH PROCESSED err = "key range /Table/51/1/160180655286812676/0-/Min outside of bounds of range /Table/50-/Table/51"

(node 5 has said "no can do" to node 2? now node 4 takes over, still working on node 2's txn)

n4l W   160719-18:40:56.622857  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" seq=4 BATCH BEGIN
n4l W   160719-18:40:56.622995  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" // method 8, header {/Table/51/1/160180655286812676/0 /Min}, request header:
n4l W   160719-18:40:56.623075  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655286812676/0 /Min}, request header: value: > blind:false
n4l W   160719-18:40:56.623127  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655286812676/1/1 /Min}, request header: value: > blind:false
n4l W   160719-18:40:56.623192  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" BATCH END
  • and now the txn on node 1 (the one that will cause the error) wakes up and queries its data, which will be handled by nodes 3, 4 and 5:
n1l W   160719-18:40:56.652853  sql/executor.go:1024    WOO: "1468953655.913479564,0" "SELECT max(val) AS m FROM mono" // map[]
n4l W   160719-18:40:56.652595  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n4l W   160719-18:40:56.652770  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-18:40:56.652812  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.652881  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = "range 21: replica node_id:5 store_id:5 replica_id:4 not lease holder; node_id:3 store_id:3 replica_id:3 is"

(node 4 working for node 1 says "talk to node 3" - ??)

n3l W   160719-18:40:56.653593  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n3l W   160719-18:40:56.653701  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n3l W   160719-18:40:56.653725  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n3l W   160719-18:40:56.653782  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = "range 21: replica node_id:2 store_id:2 replica_id:2 not lease holder; node_id:3 store_id:3 replica_id:3 is"

(node 3 working for node 1 says "talk to node 3" - ??)

n5l W   160719-18:40:56.655254  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n5l W   160719-18:40:56.655405  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n5l W   160719-18:40:56.655432  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n5l W   160719-18:40:56.655529  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = "key range /Table/51/1-/Table/51/2 outside of bounds of range /Table/50-/Table/51"

(node 5 working for node 1 confused?)
(node 4 working for node 1 follows up:)

n4l W   160719-18:40:56.657021  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=2 BATCH BEGIN
n4l W   160719-18:40:56.657146  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-18:40:56.657183  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
  • node 4 gets new range lease replica:
n4l I   160719-18:40:56.671971  storage/replica_command.go:1642 store=5:5 ([]=cockroach-data) range=31 [/Table/51-/Max): new range lease replica {5 5 1} 1970-01-01 00:00:00 +0000 UTC 408042h41m1.373226531s following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0 [physicalTime=2016-07-19 18:40:56.671853065 +0000 UTC]
  • node 5 announces leadership transfer:
n5l I   160719-18:40:56.672140  storage/replica_command.go:1658 range 31: replicaID 3 transfer raft leadership to replicaID 1
  • node 4 continues the work on behalf of node 2's txn started earlier:
n4l W   160719-18:40:56.672183  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" START
n4l W   160719-18:40:56.672411  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" moved forward to "1468953647.649568444,1" due to read
n4l W   160719-18:40:56.672449  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" moved forward to "1468953647.649568444,1" due to read
n4l W   160719-18:40:56.672486  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" DONE
n4l W   160719-18:40:56.675628  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=false pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" BATCH PROCESSED err = ""
n4l W   160719-18:40:56.675852  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""
  • node 1 observes the old value (prior to node 5's initial txn, which has since committed, and not seeing node 2's txn, which is not yet committed - this is incorrect!)
n1l W   160719-18:40:56.677136  sql/executor.go:1076    WOO: "1468953655.913479564,0" SELECT VAL: 1
  • node 4 continues the txn commit on behalf of node 2:
n4l W   160719-18:40:56.677423  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=true pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" seq=5 BATCH BEGIN
n4l W   160719-18:40:56.677669  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=true pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" // method 9, header {/Table/51/1/160180655286812676/0 /Min}, request header: commit:true intent_spans: intent_spans:
n4l W   160719-18:40:56.677715  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=true pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.677767  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=true pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" START
n4l W   160719-18:40:56.677816  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=true pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" DONE

(it's received EndTransaction but the batch has not completed yet at this point)

  • node 4 continues the txn lookup on behalf of node 1 (the txn that's going to cause the error)
n4l W   160719-18:40:56.678216  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=3 BATCH BEGIN
n4l W   160719-18:40:56.678339  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 6, header {/Table/51/1 /Table/51/2}, request header: max_results:0
n4l W   160719-18:40:56.678398  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
  • node 1 issues its next SQL query:
n1l W   160719-18:40:56.678345  sql/executor.go:1024    WOO: "1468953655.913479564,0" "SELECT * FROM mono WHERE val = $1" // map["1":"1"]
  • node 4 completes the txn commit on behalf of node 2:
n4l W   160719-18:40:56.678891  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=e87f5219 key=/Table/51/1/160180655286812676/0 rw=true pri=0.01961454 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953656.578940009,0 orig=1468953656.578940009,0 max=1468953656.622774118,0 wto=false rop=false" BATCH PROCESSED err = ""
  • node 4 completes the scan batch on behalf of node 1:
n4l W   160719-18:40:56.679065  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Min rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""
  • node 1 issues its (wrong) INSERT, and gets it to node 4, up to and including successful final commit:
n1l W   160719-18:40:56.699332  sql/executor.go:1024    WOO: "1468953655.913479564,0" "INSERT INTO mono(val, sts, node, tb) VALUES ($1, $2, $3, $4) RETURNING *" // map["1":"2" "2":"'14689536559134795640000000000'" "3":"1" "4":"0"]
n4l W   160719-18:40:56.699431  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=4 BATCH BEGIN

n4l W   160719-18:40:56.699593  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 8, header {/Table/51/1/160180655552626689/0 /Min}, request header:
n4l W   160719-18:40:56.699675  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655552626689/0 /Min}, request header: value: > blind:false
n4l W   160719-18:40:56.699747  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 2, header {/Table/51/1/160180655552626689/1/1 /Min}, request header: value: > blind:false
n4l W   160719-18:40:56.699789  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.699892  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" START
n4l W   160719-18:40:56.699945  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" moved forward to "1468953647.649568444,1" due to read
n4l W   160719-18:40:56.699973  storage/replica.go:1113 UTS: batch txn "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" moved forward to "1468953647.649568444,1" due to read
n4l W   160719-18:40:56.700011  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" DONE
n4l W   160719-18:40:56.701315  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=false pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""
n4l W   160719-18:40:56.702928  storage/replica.go:834  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" seq=5 BATCH BEGIN
n4l W   160719-18:40:56.703065  storage/replica.go:837  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" // method 9, header {/Table/51/1/160180655552626689/0 /Min}, request header: commit:true intent_spans: intent_spans:
n4l W   160719-18:40:56.703096  storage/replica.go:839  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH END
n4l W   160719-18:40:56.703164  storage/replica.go:1069 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" START
n4l W   160719-18:40:56.703198  storage/replica.go:1140 UTS: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" DONE
n4l W   160719-18:40:56.704551  storage/replica.go:874  WAA: "\"sql/executor.go:464 sql txn\" id=3153959b key=/Table/51/1/160180655552626689/0 rw=true pri=0.01521739 iso=SERIALIZABLE stat=PENDING epo=0 ts=1468953655.913479564,0 orig=1468953655.913479564,0 max=1468953656.163479564,0 wto=false rop=false" BATCH PROCESSED err = ""

The plot thickens:

  • commit 2ed649991650c4e634055700a9598a058acd0e89 from yesterday reproduces the issue readily; meanwhile
  • today's commit 1c04b7425b0bca877e6429e1960479ca55ce3668 doesn't!

We're looking at the following commit range:

1c04b7425b0bca877e6429e1960479ca55ce3668 Merge pull request #6309 from dt/fkrfc _[ no clear repros here ]_
68b1b3917ae273a4fd493dce768014e2febb3fc8 Merge pull request #7929 from dt/jdate
040e3b9d25ba73ab2856d7133b53493d25b66ef7 Merge pull request #7895 from mjibson/marc/plaintext_metrics
1dce2a58f0077b32c994276bbb422cdcb6f2db2e Merge pull request #7921 from mrtracy/mtracy/stats_for_span
c22088904954688fe0dbd16d68eab5e7cb36d9fa Merge pull request #7946 from petermattis/pmattis/test-gossip-first-range
ad9452ca5aa8b9a6cbb21b376bac1233d494add6 Merge pull request #7917 from RaduBerinde/testcluster-server-dep
a7641a722667279f5ceeb375cf6aa25cca3bc3da Merge pull request #7943 from BramGruneir/rebalance
ff8b4d381aefaf9850fd8b9413dfd14def2d6e61 Merge pull request #7926 from d4l3k/allow-go1.7rc2
ec007326db720f9028f793438ea40a6ce1dc4da1 Merge pull request #7904 from arjunravinarayan/replica_desc_refactor
4e391fbd9fb34424e0b8629882156b239792d718 Merge pull request #7905 from arjunravinarayan/ensure_nonoverlapping_ranges
80110a30ce18aa5f8a6641e34021cd478d286d5d Merge pull request #7930 from vivekmenezes/vivek/acquirewait
1a4b6d98ad5498ff63fc67a26fdaaff413e2b202 Merge pull request #7938 from petermattis/pmattis/rocksdb-batch-iterator
2fdaa7b16047206c0a793803e000746ee6a396c6 Merge pull request #7939 from petermattis/pmattis/gossip-first-range-redux _[ no clear repro here either ]_
6f3448d682f47b6ac90dbf637fb51f4e1ecfd48a Merge pull request #7937 from tlvenn/patch-1
441a425b2ae012e47643afc2746f9ba9dced0a01 Merge pull request #7924 from petermattis/pmattis/bump-goimports _[ no clear repro here either ]_
3ec14e777e064ffaca56f4caaa263c814dc69694 Merge pull request #7918 from BramGruneir/rebalance
e79f135e26a58a70aaef712bab75b062386e83fb Merge pull request #7903 from cockroachdb/spencerkimball/reverse-split-order _[ repros found here ]_
e3c681da26c6cd97734c26b7bb17952388bd35c8 Merge pull request #7901 from d4l3k/queue-ids
f73ad1ea0ef4647671baead2c04719dfc40f4aec Merge pull request #7909 from tschottdorf/acceptance-ping
c4b66ea244a1484a8e42572d9f71d81e8b8bb924 Merge pull request #7920 from cuongdo/grpc-gateway-rename _[ repros found here ]_
0642095b621b73b4896bbce4dc9beec36ff800af Merge pull request #7919 from dt/eqdate
30649b8f0d5964d627b903f6c3c5de7728bcaa2b Merge pull request #7914 from tschottdorf/log-declutter
b4104bcfafb147afe3de73d74ae5b68b86f21b93 Merge pull request #7888 from BramGruneir/queue
310652d49abc8c3f9cbcc89ffc3e26eff7241b6f Merge pull request #7913 from petermattis/pmattis/debug-sstables
2e93560c5747ab831ed84fea37692dce398ee75e Merge pull request #7910 from tschottdorf/leaktest-panic
2ed649991650c4e634055700a9598a058acd0e89 Merge pull request #7877 from tschottdorf/no-retry-fiddling _[ repros found here ]_

So here's another area of interest:

441a425 Merge pull request #7924 from petermattis/pmattis/bump-goimports [ no clear repro here either; this is a newer commit ]
3ec14e7 Merge pull request #7918 from BramGruneir/rebalance
e79f135 Merge pull request #7903 from cockroachdb/spencerkimball/reverse-split-order [ repros found here; this is an older commit ]

What is in #7918 that is significant for this test?

Let me blurt out a theory which I think has decent potential to explain all of this: when a range splits, the right-hand-side starts out without a lease. That means that _any replica_ can be the first lease holder. Now think about the timestamp cache - we do copy it in the split trigger. But only on the lease holder is the timestamp cache in fact relevant - the followers never even update it! So for all intents and purposes, the first lease holder of the RHS could have an empty tsCache at the time it processes the RequestLeaseRequest which grants it the lease. That request in itself does bump the timestamp cache's low water mark (since going from no holder to a holder is a transfer), but it bumps it based on the "last lease" (which is just the empty proto). So it doesn't actually initialize its timestamp cache.

From now on it's clear how we get stale reads: the pre-split lease holder serves some reads for the RHS, then the split happens, the lease for the RHS goes to another node, and that node doesn't have its tsCache adjusted for the first reads served. I hope that should explain everything - going to whip up a unit-test repro and a fix (which is going to be a one-liner).
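
To make the mechanism concrete, here is the theory as a self-contained toy model (integer timestamps, one cache entry per lease holder - not the real code): the pre-split lease holder serves a read for a RHS key and ratchets its cache; after the split, the RHS lease lands on a replica whose cache never saw that read, and a write below the read timestamp sails through.

package main

import "fmt"

// replica is a toy lease holder with the single piece of state that matters
// here: the timestamp-cache low water mark.
type replica struct {
	name     string
	lowWater int64 // writes at or below this get pushed above it
}

func (r *replica) serveRead(ts int64) {
	if ts > r.lowWater {
		r.lowWater = ts // reads ratchet the cache, but only on the lease holder
	}
}

func (r *replica) serveWrite(ts int64) int64 {
	if ts <= r.lowWater {
		return r.lowWater + 1 // pushed: cannot write under a served read
	}
	return ts // accepted as-is
}

func main() {
	lhsHolder := &replica{name: "pre-split holder"}
	rhsHolder := &replica{name: "post-split holder"} // empty tsCache: low water 0

	lhsHolder.serveRead(100) // read of a RHS key, served before the split

	// After the split, the RHS lease lands on rhsHolder, whose cache never
	// saw that read. A write at ts=90 sails through...
	fmt.Println(rhsHolder.serveWrite(90)) // 90 - stale write accepted: the anomaly

	// ...whereas the pre-split holder would have pushed it.
	fmt.Println(lhsHolder.serveWrite(90)) // 101
}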

@knz, try a repro with this diff applied (on current master):

diff --git a/storage/replica_command.go b/storage/replica_command.go
index ed8bc69..60ad046 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -1638,9 +1638,11 @@ func (r *Replica) applyNewLeaseLocked(
                        // requests, this is kosher). This means that we don't use the old
                        // lease's expiration but instead use the new lease's start to initialize
                        // the timestamp cache low water.
+                       low := lease.Start
+                       low.Forward(r.store.Clock().Now())
                        log.Infof("%s: new range lease %s following %s [physicalTime=%s]",
                                r, lease, prevLease, r.store.Clock().PhysicalTime())
-                       r.mu.tsCache.SetLowWater(lease.Start)
+                       r.mu.tsCache.SetLowWater(low)

                        // Gossip the first range whenever its lease is acquired. We check to
                        // make sure the lease is active so that a trailing replica won't process

This isn't the right fix, but it _should_ prevent what I wrote above. I hope that with that applied, you'll be unable to get a repro which would lend cred to my theory above.

I hadn't thought about what we do with the range lease on splits before. This scenario does sound suspicious. Can we copy the old lease to the new range on splitting, or is that infeasible?


Yes, but before looking for solutions (which are easy) I want to make sure this is really it and that we can reproduce it in a unit test setting (trying that now).

Definitely. Seems like it should be easy to reproduce.


So I tried your patch on top of 2ed649991650c4e634055700a9598a058acd0e89, where I earlier had a reliable repro once every 4 runs. Now with your patch I can't repro after 25 runs.

Is there some reason we can't pre-initialize the RHS's lease to be held by the same node as pre-split? That was the actual expectation, and it got lost in the integration between Raft and the Replica/Split machinery.


No, there is no reason and that's the fix I'm looking at right now.
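
For the record, a minimal sketch of that direction (field and method names are my assumptions based on the roachpb protos of this era; this is not the actual change in #7955):

package storagefix // sketch only

import (
	"fmt"

	"github.com/cockroachdb/cockroach/roachpb"
)

// leaseForRHS derives the RHS's initial lease from the LHS lease at split
// time, so the pre-split lease holder starts out holding the RHS lease too.
func leaseForRHS(lhsLease roachpb.Lease, rhsDesc roachpb.RangeDescriptor) (roachpb.Lease, error) {
	rhsLease := lhsLease // same holder, same start
	// Replica IDs differ across the split, so point the lease at the RHS
	// replica living on the same store as the LHS lease holder.
	replica, ok := rhsDesc.GetReplicaDescriptor(lhsLease.Replica.StoreID)
	if !ok {
		return roachpb.Lease{}, fmt.Errorf("lease holder %v not in RHS descriptor", lhsLease.Replica)
	}
	rhsLease.Replica = replica
	return rhsLease, nil
}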

(Will go to https://github.com/cockroachdb/cockroach/pull/7955, right next to the regression test)
