Candidate SHA: 250f4c36de2b88eff443cf9be9cd5d2759312c88
Deployment status: http://mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster
Older: http://mjibson-release-v1920-beta20190930-0001.roachprod.crdb.io:26258/
Even older: http://mjibson-release-v1920-alpha20190805-0001.roachprod.crdb.io:26258/
Release qualification:
Nightly Suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite
Old nightly suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1511951&buildTypeId=Cockroach_Nightlies_NightlySuite
Even older nightly suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1509990&buildTypeId=Cockroach_Nightlies_NightlySuite
Prep date: 2019-09-30
Candidate SHA: above; notify #release-process of the SHA.
Deployment status: above (clusters).
Nightly Suite: above (link to the nightly TeamCity job).
Release date: 2019-10-02
For production or stable releases in the latest major release series
[ ] Update docs
Changing the SHA to cf5c2bd2372e633d2f63e08e5bffca7c2a7ec59f from 77f26d185efb436aaac88243de19a27caa5da9b6 to pick up the fix for #41145.
Test clusters are still running 77f26d1.
Failed roachtest nightlies (https://teamcity.cockroachdb.com/viewLog.html?buildId=1509990&buildTypeId=Cockroach_Nightlies_NightlySuite):
Edit 2019-10-01: this list is replaced by the one below https://github.com/cockroachdb/cockroach/issues/41128#issuecomment-536917628
This release is canceled due to some new bugs. I'm going to start a new roachprod cluster comparison with master as of right now. On Monday the release manager and others will decide what to do. I'll post links to those test clusters here.
Latency on the master cluster is trending upward; on the previous release's cluster it stays below 20ms. Read ops are similar on both.

This is true for most metrics that are related to SQL or storage.

For comparison, the same graph from the 0805 release cluster.
Starting a nightly suite for 250f4c36de2b88eff443cf9be9cd5d2759312c88 at https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite
Current status: we are going to try to release 250f4c36de2b88eff443cf9be9cd5d2759312c88 on Wednesday (Oct 2).
(sorry wrong button)
Regarding the 3rd issue on the list: @lucy-zhang found that VALIDATE CONSTRAINT running concurrently with TPC-C 1K reveals invalid FK relations. The output is OK when TPC-C is not running concurrently, or when running lighter TPC-C workloads. This suggests an isolation problem; see the repro sketch after the exchange below.
I am tempted to interpret this as a real-world instance of #41173, @andreimatei what do you think?
from @lucy-zhang offline:
I'm not sure if Andrei's issue is the same thing. The rows that are supposed to be missing are in tpcc.warehouse, which we never update (AFAIK) after we restore the fixtures, so even if the read timestamp were slightly behind for some parts of the reads, it shouldn't matter.
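For context, a minimal sketch of the kind of reproduction described above, assuming a local cluster with the tpcc fixtures loaded. The connection string, the toy background writer, and the constraint name fk_d_w_id are placeholders; the actual failure was observed under the full TPC-C 1K workload, which this does not replicate.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres wire driver; CockroachDB speaks the same protocol
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Background writer: keep the cluster busy so VALIDATE CONSTRAINT runs
	// against a moving data set. The real failure was seen under TPC-C 1K,
	// not this toy loop.
	go func() {
		for {
			if _, err := db.Exec(`UPDATE district SET d_ytd = d_ytd + 1 WHERE d_w_id = 1 AND d_id = 1`); err != nil {
				log.Printf("writer: %v", err)
			}
		}
	}()

	// Repeatedly validate a foreign-key constraint. A spurious validation
	// failure, with no actual orphan rows present, would match the symptom
	// described above. The constraint name fk_d_w_id is hypothetical.
	for i := 0; i < 100; i++ {
		if _, err := db.Exec(`ALTER TABLE district VALIDATE CONSTRAINT fk_d_w_id`); err != nil {
			log.Printf("validate attempt %d: %v", i, err)
		}
		time.Sleep(time.Second)
	}
}
```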
Failed nightlies as of 2019-10-01:
Note that at 10.00am CEST the test suite is still running (19 hours after starting). There may be more failures incoming.
Edit: at 1pm CEST it's still running.
Edit: at 3.30pm CEST it finished running, and one additional issue came in (tpccbench/chaos/partition, added to the list above).
For reference, the following failures from the original list on 2019-09-27 are not there any more:
The following are new:
My analysis of the issues (same order as above):
- The query completed before it could be canceled. I have not looked at the test source, but it may be a bug in the test rather than a bug in CockroachDB.
- The clearrange test does not even get to run. Do we have confidence that the code the test is checking is otherwise OK? @dt please provide input. If the clearrange test has not run for a long time, that would be a release blocker.
- 373 tests failed unexpectedly instead of the usual NN tests succeeded unexpectedly. This is likely to mark a regression.
- import, not about clearrange.
- distsql=off.

Regarding this:
#37259: inter-node network mishaps cause a tpcc run to fail.
My understanding: the test uses toxiproxy between nodes to introduce network partitions, and then asserts that there is no goroutine peak in the server. However, the partition also causes legitimate SQL errors in clients, and that makes the test abort in error, in a way that's not relevant to the primary thing being tested. @ajwerner do you agree with my analysis?
Answer from @ajwerner:
That initial analysis seems sound.
(thus classifying the issue as not a release blocker)
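For reference, a minimal sketch of the goroutine-spike check the analysis above refers to, assuming the count can be sampled from a node's /debug/pprof/goroutine endpoint on the HTTP port. The node address, spike threshold, and polling loop are illustrative; the real roachtest also drives toxiproxy partitions and the TPC-C workload, which this omits.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

// goroutineCount scrapes the Go pprof goroutine profile exposed on the
// node's HTTP port and parses the total from its first line, e.g.
// "goroutine profile: total 412".
func goroutineCount(node string) (int, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/debug/pprof/goroutine?debug=1", node))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	line, err := bufio.NewReader(resp.Body).ReadString('\n')
	if err != nil {
		return 0, err
	}
	var n int
	if _, err := fmt.Sscanf(strings.TrimSpace(line), "goroutine profile: total %d", &n); err != nil {
		return 0, err
	}
	return n, nil
}

func main() {
	const node = "localhost:8080" // HTTP/admin port of one node; placeholder
	baseline, err := goroutineCount(node)
	if err != nil {
		log.Fatal(err)
	}
	// While partitions are being introduced elsewhere, watch for a spike.
	for i := 0; i < 60; i++ {
		n, err := goroutineCount(node)
		if err != nil {
			log.Printf("sample failed (node may be partitioned): %v", err)
		} else if n > 4*baseline {
			log.Fatalf("goroutine spike: %d vs baseline %d", n, baseline)
		}
		time.Sleep(5 * time.Second)
	}
	fmt.Println("no goroutine spike observed")
}
```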
Signing off on the cancel test. It's a testing error caused by the operation in question finishing too quickly.
Regarding #40935, an update from @lucy-zhang:
The current update is that @jordan and I have tried to reproduce this by turning loadgen back on for one of the clusters where I got this test failure and running that select query manually, and neither of us has seen it again.
So it looks like there is some additional state necessary to repro aside from "heavy load" (possibly related to the early state of the cluster?).
So this does seem rarer than I first expected.
Separately, @andy-kimball states, with agreement from @bdarnell:
If this repros only under heavy stress, and we have seen it nowhere else, I don't think it should be classified as a beta blocker. The bar should be very high now. I'd only block beta for "high severity" (which this is) and "common" (which this isn't).
So I'm checking this as signed off by lucy, jordan and andy.
The Hibernate issues (#40538) are not a beta blocker. The tests are concerningly flaky though, and we will continue investigating this during the rest of the release period. The issues stem from the Hibernate tests occasionally being unable to connect to the DB.
I checked off the sqlsmith failure because it should never block a release.
There's discussion on whether to adopt the latest changes that improve on #41206 (perf regression).
If we bump the SHA, here is the diff:
New features:
Bug fixes:
Perf:
Polish:
Input from @awoods187 and @bdarnell: continue with the same SHA.
Note I have checked the cluster health at http://mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster
The cluster displays a performance anomaly for the period up to and including Sept 30.
Then yesterday (Oct 1) Nathan uploaded a new binary with the fix from #41220 (and other fixes), which demonstrates that the anomaly disappears. Note, however, that these fixes are not present in today's release.
@dt about the clearrange tests:
I’m fine with clearrange failures for now — we’re investigating but AFAIK, it is just slow, not wrong.
(considering this as sign-off)
@dt about the restore2TB test:
the RESTORE one I took a very cursory look at and didn’t see what killed it, just exit 255
might have OOMed I guess — i wonder if it was the rocks logging leak? I donno. I’m fine signing off on that too I guess.
@irfansharif and @andreimatei do not have anything to say on the remaining roachtest failure.
So I did go and investigate the log files. I am not seeing errors in the CockroachDB logs themselves other than the expected "cannot connect to node" (the node being shut down under chaos).
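For reference, a minimal sketch of that kind of log scan, assuming the plain-text crdb-v1 log format in which error lines start with an "E" severity prefix (e.g. "E190930 ..."). The logs/*/cockroach.log path and the "cannot connect to node" filter string are illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

func main() {
	// Error-severity lines in crdb-v1 logs: an "E" followed by a yymmdd date.
	errLine := regexp.MustCompile(`^E\d{6} `)
	paths, err := filepath.Glob("logs/*/cockroach.log") // placeholder layout
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			log.Fatal(err)
		}
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			line := sc.Text()
			// Skip the expected connectivity errors from the node taken down under chaos.
			if errLine.MatchString(line) && !strings.Contains(line, "cannot connect to node") {
				fmt.Printf("%s: %s\n", p, line)
			}
		}
		f.Close()
	}
}
```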
@bdarnell says "go"
The version is tagged, the binaries are uploaded, the docker image works, so from engineering the release is ready to go.
The release notes PR is here: https://github.com/cockroachdb/docs/pull/5250
Handing this off to docs.