Candidate SHA: f990441079c686f9eec32d80044c719175a2bee5
Deployment status: Qualifying
Qualification Suite: #167 - started 8/24, 8:46p
Nightly Suite: #2327 - started 8/24, 8:48p
Admin UI for Qualification Clusters:
Prep date: Monday 8/24/2020
Candidate SHA above
Release Qualification
One day after prep date:
Release date: Monday 8/31/2020
For production or stable releases in the latest major release series
For production or stable releases
[ ] Update docs
Cleanup:
Tests failed: 13 (2 new), passed: 45289, ignored: 139; Build chain finished (success: 16, failed: 3)
Roachtest Nightly - GCE:
https://teamcity.cockroachdb.com/viewLog.html?buildId=2217981&buildTypeId=Cockroach_Nightlies_NightlySuite
All the bulk io failures are known and should be fixed by https://github.com/cockroachdb/cockroach/pull/53367.
Signing off on TestRandomSyntaxSchemaChangeColumn since it looks like a timeout (again). The schema changes seem to be making progress but are slow due to all the jobs overhead. I'm going to look into increasing the timeout since this happens literally every time.
All the storage ones should be fixed by #53494. Signing off.
Signed off on the appdev tests. Django has a skip PR in progress, I will add one for the flaky ActiveRecord and PGJDBC tests.
The version-upgrade failure on the three-node test is caused by:
| I200825 18:11:24.341985 1 workload/cli/run.go:338 retrying after error during init: Could not postload: pq: foreign key requires an existing index on columns ("h_c_w_id", "h_c_d_id", "h_c_id")
I believe @rohany knows about those (https://github.com/cockroachdb/cockroach/pull/52931)
The five-node test had n5 crash. I think it is likely an OOM, though neither the logs nor dmesg has an explicit message about that. The heap profiler shows this (artifacts)

A goroutine dump (taken ~hours before the actual crash) has lots of these:
github.com/cockroachdb/cockroach/pkg/sql/gcjob.schemaChangeGCResumer.Resume(0x81c005e855d0005, 0x4be8140, 0xc000c94960, 0x4304ae0, 0xc063b31b00, 0x0, 0x0, 0x0)
$ grep -cF '.Resume' ~/Downloads/goroutine_dump.2020-08-25T23_24_24.036.double_since_last_dump.000001005.txt
645
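For a per-function breakdown rather than a single count, something like the sketch below might help. It assumes the usual Go goroutine dump layout (an unindented `func(args)` frame line followed by an indented file line), and `count_frames` is a hypothetical helper name, not part of any CockroachDB tooling. Unlike `grep -cF`, it only counts frame lines, so the totals can differ slightly.

```python
from collections import Counter


def count_frames(dump_text, needle=".Resume"):
    """Count stack frames whose function name contains `needle`.

    Assumes frame lines are unindented and end with the argument list,
    e.g. `pkg.Type.Method(0x1, 0x2)`; file/line entries are indented.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        # Skip blank lines, indented file/line entries, and goroutine headers.
        if not line or line[0].isspace() or line.startswith("goroutine "):
            continue
        # Frame lines end with ")"; "created by ..." lines do not.
        if not line.endswith(")"):
            continue
        func = line[: line.rfind("(")]  # strip the argument list
        if needle in func:
            counts[func] += 1
    return counts


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as f:
        for func, n in count_frames(f.read()).most_common():
            print(n, func)
```

Run it with the path to the goroutine dump as the only argument.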
goroutine_dump.2020-08-25T23_24_24.036.double_since_last_dump.000001005.txt
memprof.000000006725779984_2020-08-26T01_48_57.305.txt
cockroach.log
The test ran from 5pm until 2:45am the next morning, so I'm sure it got stuck somewhere in a situation where more and more memory was allocated until n5 eventually gave out.
Its final runtime stats line is this:
I200826 01:49:17.009102 244 server/status/runtime.go:498 [n5] runtime stats: 14 GiB RSS, 1382 goroutines, 5.9 GiB/2.3 GiB/7.9 GiB GO alloc/idle/total, 4.5 GiB/5.6 GiB CGO alloc/total, 17274.1 CGO/sec, 256.0/43.6 %(u/s)time, 0.1 %gc (2x), 131 MiB/131 MiB (r/w)net
Going back through the logs, this is steadily growing.
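To see the growth without scrolling through the log, a rough sketch like the following could pull RSS and goroutine counts out of the `runtime stats` lines. The regex is inferred only from the single line quoted above, so treat the format as an assumption; `parse_runtime_stats` is a hypothetical helper name.

```python
import re

UNIT_BYTES = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

# Pattern inferred from the one quoted log line; other log versions may differ.
STATS_RE = re.compile(
    r"^I(?P<ts>\d{6} \d{2}:\d{2}:\d{2})\.\d+ .*?\[(?P<node>n\d+)\] runtime stats: "
    r"(?P<rss>[\d.]+) (?P<unit>KiB|MiB|GiB) RSS, (?P<goroutines>\d+) goroutines"
)


def parse_runtime_stats(log_lines):
    """Return (timestamp, node, rss_bytes, goroutines) for each stats line."""
    samples = []
    for line in log_lines:
        m = STATS_RE.match(line)
        if m is None:
            continue  # not a runtime stats line
        rss_bytes = float(m["rss"]) * UNIT_BYTES[m["unit"]]
        samples.append((m["ts"], m["node"], rss_bytes, int(m["goroutines"])))
    return samples


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as f:
        for ts, node, rss, goroutines in parse_runtime_stats(f):
            print(f"{ts} {node} rss={rss / 2**30:.1f}GiB goroutines={goroutines}")
```

Pointing it at cockroach.log prints one line per sample, which makes a steady climb in RSS easy to spot.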
tpcc/interleaved is the same index failure:
Could not postload: pq: foreign key requires an existing index on columns ("h_c_w_id", "h_c_d_id", "h_c_id")
so I will leave it to @rohany.
tpccbench failed as follows:
Attempt to create load generator failed. It's been more than 10m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing \n\t\tUPDATE district\n\t\tSET d_next_o_id = d_next_o_id + 1\n\t\tWHERE d_w_id = $1 AND d_id = $2\n\t\tRETURNING d_tax, d_next_o_id: context deadline exceeded\nError: failed to initialize the load generator: preparing \n\t\tUPDATE district\n\t\tSET d_next_o_id = d_next_o_id + 1\n\t\tWHERE d_w_id = $1 AND d_id = $2\n\t\tRETURNING d_tax, d_next_o_id: context deadline exceeded\n
The restore previously went through in ~5h.
It looks like we didn't get a debug zip because the hard-coded 5 minute limit to do so wasn't enough.
The logs show no KV alerts (i.e. no unavailable ranges, etc.), but we see "have been waiting" messages from DistSender, which indicates that there is in fact a problem at the KV layer. I will need to investigate further.
I'm checking off the tpccbench failure since it has been failing for a long time. This needs to change, but getting this test stable should not block the release.
Not sure when this release qualification started, but PRs https://github.com/cockroachdb/cockroach/pull/53450 and https://github.com/cockroachdb/cockroach/pull/53367 need to be part of the roachtest/workload builds used for qualification to avoid the failures.
I'll sign off on the failures (tpcc/interleaved, version/nodes=3) because they are just workload import failures.
I started the test runs on 8/24, and it looks like those PRs went in 22 hours ago.
Signing off on version/mixed/nodes=5. Ultimately this was also #52931-related fallout, and I think nodes=3 didn't OOM just because it's a shorter/smaller test. See https://github.com/cockroachdb/cockroach/issues/53399#issuecomment-682309446 for more details.