Candidate SHA: d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Deployment status: otan-release-v2010-alpha20191216-e9e2a803 / otan-release-v2010-alpha20200123-d818f1ff
Nightly Suite: https://teamcity.cockroachdb.com/viewQueued.html?itemId=1708159&tab=queuedBuildOverviewTab
Release Qualification: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708120&tab=buildResultsDiv&buildTypeId=Cockroach_ReleaseQualification
Deployment Dashboards:
Prep date: 23rd January, 2020
Candidate SHA above; notify #release-announce of the SHA. Deployment status above with clusters, and Nightly Suite with the link to the nightly TeamCity job. One day after prep date:
Release date: 30th January, 2020
For production or stable releases in the latest major release series
[x] Update docs
old info - obsoleted by new SHA -- formerly release: v20.1.0-alpha.20200113
Candidate SHA: 617cb39e3ced263cb4c253aa2e3edd172368f7cf.
Deployment status: dan-release-v2010-alpha20191216-6bfe4a1/dan-release-v2010-alpha20190113-e9e2a80.
Nightly Suite: ReleaseQual NightlySuite.
Picking the SHA is blocked on https://github.com/cockroachdb/cockroach/issues/43616
Thanks! Picked a SHA: 6bfe4a1febd6b4eab43142228a8e037498b8bf3b
Potentially blocked on https://github.com/cockroachdb/cockroach/issues/43827. A bunch of the release teamcity builds are failing because of this. I've tried to work around it, but we may have to cherry-pick a fix
Peter fixed the makefile issue. I'll cherry-pick a new sha and get the roachtests running tomorrow. I started up the roachprod tests today so assuming nothing else goes wrong, we should have enough data to keep the original release date
I've been waiting to triage the roachtest failures until I can get someone to make a call on whether "SQL Logic Test High VModule Nightly" failing to build is an alpha release blocker or not. At this point, instead of restarting everything, I think I'll try pinging that one build with the candidate SHA plus a makefile change that fixes the build, and then triage the roachtest failures we got. This is unlikely to finish on a Friday, so my guess is this alpha is slipping until Tuesday or Wednesday.
obsoleted by new SHA
CDC
KV
SQL Exec
Something definitely regressed in changefeeds recently. I need to look into exactly what, but I think these failures are okay for an alpha. The initial scan one was making progress, though slowly (the most behind span is logged and it kept changing over the course of the test). The other two got further behind than they should, but eventually caught up.
Checked off all sqlsmith failures since those are never blockers (as Matt always says).
Checked off hotspotsplits/nodes=4, which failed due to fatal error: runtime: out of memory
This error has been happening periodically for a while now (#33660). Should not be a release blocker.
The scaledata failures are interesting. They're hitting issues because the creation of the initial empty schema is taking a very long time (> 1 minute), which allows chaos to start before we expect it to. Here's some logging from the first failure, filtered down to just the schema changes triggered here:
I200109 05:06:45.811349 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_table", target: 53, info: {TableName:ds.public.resources Statement:CREATE TABLE resources (id INT8 NOT NULL DEFAULT unique_rowid(), created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ, resource_id INT8, free INT8, total INT8, PRIMARY KEY (id)) User:root}
I200109 05:06:47.330294 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:06:48.795658 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 53, info: {TableName:ds.public.resources IndexName:idx_resources_deleted_at Statement:CREATE INDEX idx_resources_deleted_at ON resources (deleted_at) User:root MutationID:1}
I200109 05:06:53.756401 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:06:57.729798 279 sql/backfill.go:143 [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "resources", v=3, m=1
I200109 05:07:04.577244 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:08.280914 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root,scExec] Event: "finish_schema_change", target: 53, info: {MutationID:1}
I200109 05:07:08.853128 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 3}]
I200109 05:07:12.633817 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_table", target: 54, info: {TableName:ds.public.allocations Statement:CREATE TABLE allocations (id INT8 NOT NULL DEFAULT unique_rowid(), created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ, resource_id INT8, job_id INT8, units INT8, PRIMARY KEY (id)) User:root}
I200109 05:07:14.801515 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=54 (allocations) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:16.644276 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 54, info: {TableName:ds.public.allocations IndexName:idx_allocations_deleted_at Statement:CREATE INDEX idx_allocations_deleted_at ON allocations (deleted_at) User:root MutationID:1}
I200109 05:07:21.844560 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:26.125956 279 sql/backfill.go:143 [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "allocations", v=3, m=1
I200109 05:07:33.747879 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:36.699121 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root,scExec] Event: "finish_schema_change", target: 54, info: {MutationID:1}
I200109 05:07:37.428304 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{allocations 54 3}]
I200109 05:07:40.662192 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=5 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:42.313880 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 53, info: {TableName:ds.public.resources IndexName:idx_pk Statement:CREATE UNIQUE INDEX idx_pk ON resources (resource_id) User:root MutationID:2}
I200109 05:07:46.689986 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 4}]
I200109 05:08:03.614645 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=6 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:08:05.680050 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 5}]
I200109 05:08:08.248063 279 sql/backfill.go:143 [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "resources", v=6, m=2
One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
Okay, sounds good! We'll hold up the alpha until those get investigated, thanks!
cc @lnhsingh (Btw also want to make sure you noticed that the name of the alpha was wrong on the release calendar, it had 2019 in the date part of the version tag. I've fixed all occurrences of this error on the release calendar.)
One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
I'm trying to reproduce anything even remotely close to this sort of latency, but I'm having a hard time. I dug into some more details. One thing to note is that it took almost 20s at the beginning of the test to up-replicate all of the ranges (I see this from the rangelog).
Other things worth noting: all of the jobs took 5s to start, and the initial index creation for idx_resources_deleted_at took 15s after it started. Then each of the other jobs took a while.
I've modified the test to fail if the schema portion of the test takes more than 30 seconds, and I'm running it at scale with roachtest; so far, after 400 iterations, it hasn't failed.
The question is why these schema changes are taking so long. I don't have a great answer, but there's some evidence of waiting for leases to expire. Additionally, there's a stats job that seems to have happened in the middle here. I'm trying to understand what could be going on. I don't have any evidence that this is in any way related to closed timestamps.
After more digging, it seems clear that somehow we're seeing an outstanding lease on every attempt to publish a new version. The other confounding factor is slow up-replication and some lease transfers due to it. All of the other node metrics seem fine. The storage latency numbers are good. The CPU is near zero. The question is whether it's possible for a slow initial lease acquisition to somehow always leave a node with a lease one version behind or something like that.
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:08.853128 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 3}]
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:14.801515 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=54 (allocations) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:21.844560 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:33.747879 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:37.428304 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{allocations 54 3}]
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:40.662192 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=5 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:46.689986 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 4}]
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:08:03.614645 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=6 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:08:05.680050 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 5}]
The question at hand is, how is this lease getting leaked? I've modified the test like so and observed 1000 successful runs. Perhaps the vmodule logging has changed the behavior so I'm going to remove that and try again.
--- a/pkg/cmd/roachtest/scaledata.go
+++ b/pkg/cmd/roachtest/scaledata.go
@@ -38,11 +38,11 @@ func registerScaleData(r *testRegistry) {
for app, flags := range apps {
app, flags := app, flags // copy loop iterator vars
- const duration = 10 * time.Minute
+ const duration = 30 * time.Second
for _, n := range []int{3, 6} {
r.Add(testSpec{
Name: fmt.Sprintf("scaledata/%s/nodes=%d", app, n),
- Timeout: 2 * duration,
+ Timeout: 20 * duration,
Cluster: makeClusterSpec(n + 1),
Run: func(ctx context.Context, t *test, c *cluster) {
runSqlapp(ctx, t, c, app, flags, duration)
@@ -73,7 +73,7 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
c.Put(ctx, b, app, appNode)
c.Put(ctx, cockroach, "./cockroach", roachNodes)
- c.Start(ctx, t, roachNodes)
+ c.Start(ctx, t, roachNodes, startArgs("--args='--vmodule=schema*=2,lease=2,jobs=2,registry=2'"))
// TODO(nvanbenschoten): We are currently running these consistency checks with
// basic chaos. We should also run them in more chaotic environments which
@@ -106,17 +106,22 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
defer sqlappL.close()
t.Status("installing schema")
+ start := time.Now()
err = c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --install_schema "+
"--cockroach_ip_addresses_csv='%s' %s", app, addrStr, flags))
if err != nil {
return err
}
+ if took := time.Since(start); took > 20*time.Second {
+ return fmt.Errorf("it took %v to set up the schema", took)
+ }
t.Status("running consistency checker")
const workers = 16
return c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --duration_secs=%d "+
"--num_workers=%d --cockroach_ip_addresses_csv='%s' %s",
app, int(dur.Seconds()), workers, addrStr, flags))
+ return nil
})
m.Wait()
}
Okay I've discovered something interesting w.r.t. the scaledata failure. The nodes are geo-distributed:
https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1680508:id/scaledata/distributed_semaphore/nodes%3D3/roachprod_state/hosts/teamcity-1680508-1578546114-10-n4cpu4
# user@host locality vpcId
[email protected] cloud=gce,region=us-central1,zone=us-central1-b default
[email protected] cloud=gce,region=us-west1,zone=us-west1-b default
[email protected] cloud=gce,region=europe-west2,zone=europe-west2-b default
[email protected] cloud=gce,region=us-central1,zone=us-central1-b default
@nvanbenschoten confirms that this is the case for the other roachprod failure. Checking off the failures.
It seems that this is all fallout from https://github.com/cockroachdb/cockroach/issues/43898#issuecomment-573732961
@jlinder FYI
Thanks @ajwerner!
Okay, I'm going to let this bake on the long running clusters one more night. If nothing goes wrong, I'll bless the release tomorrow morning
@lnhsingh Binaries have been blessed, turning this over to docs and marketing
We have a late breaking release blocker found just before marketing comms had gone out. Docs PR has been reverted. We'll restart this alpha when there is a fix https://github.com/cockroachdb/cockroach/issues/43979
Going to restart this one, cc @jseldess.
Thanks for the heads-up, @otan. Can we also be sure to a new data in the binary name? Also, can you share the new candidate SHA? I'll have to generate new notes since the last candidate.
haven't picked yet, because waiting for a build to finish - will let you know when it's changed.
it should just be an additional cherry pick of 3c0671c176dc540798e63ed61cd875354d2a6aba
Can we also be sure to a new data in the binary name?
can you clarify what this means?
I think he means that the alpha-<date> in the tag should be updated to the new target release date
Yes, thanks for the clarification, Dan.
sha is d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Just to make sure - https://github.com/cockroachdb/cockroach/issues/43110#issuecomment-577868922 suggests that we can't release an alpha containing https://github.com/cockroachdb/cockroach/commit/379fc8f617519ec34fa87bd211df365956ea4880
The SHA @otan posted (d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3) doesn't contain that commit, so no problem. Just making sure that we don't accidentally bump again and pick that PR up.
Failures: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708159&tab=queuedBuildOverviewTab
sqlsmith/setup=empty/setting=default
sqlsmith/setup=rand-tables/setting=default
sqlsmith/setup=seed/setting=default
sqlsmith/setup=seed/setting=no-ddl
sqlsmith/setup=seed/setting=no-mutations
sqlsmith/setup=tpcc/setting=default
sqlsmith/setup=tpcc/setting=no-ddl
sqlsmith/setup=tpcc/setting=no-mutations
sqlsmith/setup=tpch-sf1/setting=default
sqlsmith/setup=tpch-sf1/setting=no-mutations
sqlsmith/setup=empty/setting=no-ddl
sqlsmith/setup=empty/setting=no-mutations
sqlsmith/setup=rand-tables/setting=no-ddl
sqlsmith/setup=rand-tables/setting=no-mutations
sqlsmith/setup=seed-vec/setting=vec
sqlsmith/setup=tpcc/setting=ddl-nodrop
sqlsmith/setup=tpch-sf1/setting=no-ddl
Checked off vmodule panic. It is fixed on master (#43922) but wasn't cherry-picked into this alpha. The crash can occur only with high verbosity in special circumstances, so I'd not consider it a blocker.
Signing off on acceptance/decommission; I have a handle on that.
I'm looking at the mixed-headroom failure. It looks again like the workload just quits 🤷, but this has happened before and I will have to investigate. When I ran this manually the other day, they either all passed or all failed (depending on whether I had the parent of https://github.com/cockroachdb/cockroach/issues/43754#issuecomment-578127795 or the failing commit). I'll kick off a few runs on the provisional SHA to get more of an idea of what's up here.
By the way, can we mention who signed off on each failure? For example, tpcc/headroom/n4cpu16 is signed off on, but I think we want accountability on who did it, so that a stray click can't cause us to release something broken.
Actually it seems there's something less benign going on here. The log indicates that roachtest hung after this test got terminated. So we didn't run all tests on AWS (this is the build below).
[06:27:12]
[Step 2/2] [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Atpcc%2Fheadroom%2Fn4cpu16.%2A&sort=title&restgroup=false&display=lastcommented+project)
[06:27:12]
[Step 2/2] <sub>powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)</sub></p></details>
[06:27:12]
[Step 2/2] ", Labels:["O-robot" "C-test-failure" "C-test-failure" "branch-provisional_202001231814_v20.1.0-alpha.20200123" "O-roachtest" "release-blocker"], Assignee:"", Milestone:21}: POST https://api.github.com/repos/cockroachdb/cockroach/issues: 422 Validation Failed [{Resource:Label Field:name Code:invalid Message:}]
[06:27:13]
[Step 2/2] Failed tests detected
[06:27:12]
[Step 2/2] Publishing artifacts (19s)
[06:27:12]
[Step 2/2] ##teamcity[publishArtifacts '/home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200123-1708154/tpcc/headroom/n4cpu16/run_1/** => tpcc/headroom/n4cpu16']
[17:03:42]
[Step 2/2] The build Nightlies::Roachtest Nightly - AWS #1401 {buildId=1708154} has been running for more than 1,080 minutes. Terminating...
The failure mode for headroom, mixed-headroom, and version/mixed all seems the same. It makes me all the more curious who signed off on headroom - is there something I'm missing here? If we can't run TPCC, I'd be hesitant to release.
You can click the "edited by" drop-down on the checklist and see who marked what.
@andreimatei you're the one who's missing a note about why you signed off.
@knz I had no idea! Thanks, it's really comforting that this exists.
I signed off on tpcc/headroom/n4cpu16 because I didn't see any particular reason not to. I looked for a while trying to understand what timed out, to no avail. It also looked to me like all the AWS tests did run, but perhaps I looked wrong.
FWIW, I just ran mixed-headroom 5x on GCE and it passed.
The AWS build did time out (execution timeout), so we probably missed a fair share of tests there. I'm going to run the offending test (the last one to fail before the stall), tpcc/headroom/n4cpu16, 5x on AWS.
Signing off on the schema change tests. They look like failures due to context cancellation, which shouldn't block the release.
@andreimatei and I figured out through arcane magic that the AWS build was really failed by tpccbench, but that we didn't manage to get any logs. I'm going to own fixing that infrastructure issue.
Headroom passed 5x on aws as well. Since the failure mode is so similar, I'm checking off the version-mixed ones too.
I believe the cancel failure is owned by sql: https://github.com/cockroachdb/cockroach/issues/42103
It looks like the same failure we've been seeing for close to two months now. @asubiotto I'll take the liberty to check the box.
tpccbench (cpu=16) has a panic on n2. @andreimatei or @nvanbenschoten do you have context on this?
F200124 03:39:17.909332 8180460 kv/txn_interceptor_heartbeater.go:279 [n2,txn-hb=c715b681] txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" meta={id=c715b681 key=/Table/54/1/1228/2/0 pri=0.05066344 epo=1 ts=1579837154.462135149,2 min=1579837145.904027058,0 seq=49} rw=true stat=ABORTED rts=1579837146.238462972,2 wto=false max=1579837146.404027058,0
goroutine 8180460 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7056901, 0xed5bc5de5, 0x0, 0xc018a3a000)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x7053740, 0xc000000004, 0x66fc8be, 0x21, 0x117, 0xc02ff43b80, 0x13f)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:214 +0xb92
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x492ef00, 0xc0393d97c0, 0x4, 0x2, 0x41d4346, 0x51, 0xc049227c98, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x492ef00, 0xc0393d97c0, 0x1, 0x4, 0x41d4346, 0x51, 0xc04dba7c98, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).heartbeat(0xc044f37830, 0x492ef00, 0xc0393d97c0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:279 +0x18b
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).heartbeatLoop(0xc044f37830, 0x492ef00, 0xc0393d97c0)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:249 +0x1d1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001c0240, 0x492ef00, 0xc0393d97c0, 0xc038e68270, 0x29, 0x4995640, 0xc000231b60, 0xc03a15b930)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322 +0x152
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:317 +0x131
I'm signing off on the other tpccbench failure since it's the same mystery 255 killed thing.
@ajwerner could you pass judgement on the follower reads test failure:
follower_reads.go:209,test_runner.go:719: fewer than 2 follower reads occurred: saw [0 0 0] before and [0 1 0] after
Signing off on the gossip chaos test. Looks like an infra failure, we weren't even able to connect to the machines via SSH to fetch logs in the end.
the chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:
03:18:34 test.go:182: test status:
03:18:34 tpcc.go:839: --- PASS: tpcc 970 resulted in 11017.9 tpmC (90.1% of max tpmC)
03:18:34 test.go:182: test status: stopping cluster
03:18:34 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1-9
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting..
03:18:36 test.go:182: test status: starting cluster
03:18:36 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1708156-1579821026-27-n10cpu4:1-9
teamcity-1708156-1579821026-27-n10cpu4: starting......................
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings
03:19:14 test.go:182: test status: running benchmark, warehouses=1050
03:19:14 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1708156-1579821026-27-n10cpu4:10 -- ./workload run tpcc --warehouses=2000 --active-warehouses=1050 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 {pgurl:10} --histograms=perf/warehouses=1050/stats.json
3: 10339
8: 9926
5: 9361
4: 11011
2: 9030
7: 9499
1: 11356
6: 9499
9: 10026
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200124 03:20:20.768517 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:20:44 chaos.go:90: killing :4
03:20:44 test.go:182: test status: stopping cluster
03:20:44 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:4
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.
4: kill exited nonzero
4: dead
CHAOS: 03:20:51 chaos.go:119: restarting :4 after 5s of downtime
03:20:51 test.go:182: test status: starting cluster
03:20:51 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:4
teamcity-1708156-1579821026-27-n10cpu4: starting.
4: 11429
Initializing 10500 workers and preparing statements...
I200124 03:21:24.783933 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:22:21 chaos.go:90: killing :1
03:22:21 test.go:182: test status: stopping cluster
03:22:21 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead
CHAOS: 03:22:28 chaos.go:119: restarting :1 after 5s of downtime
03:22:28 test.go:182: test status: starting cluster
03:22:28 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: startingI200124 03:22:29.250432 1 workload/cli/run.go:361 retrying after error while creating load: EOF
..
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings1: 11960
Initializing 2100 connections...
CHAOS: 03:23:58 chaos.go:90: killing :1
03:23:58 test.go:182: test status: stopping cluster
03:23:58 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead
CHAOS: 03:24:05 chaos.go:119: restarting :1 after 5s of downtime
03:24:05 test.go:182: test status: starting cluster
03:24:05 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: starting.
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings1: 12348
Initializing 10500 workers and preparing statements...
I200124 03:24:24.361240 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200124 03:25:28.126433 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:25:35 chaos.go:90: killing :3
[repeats similarly for hours]
@nvanbenschoten have you seen this before/can you make sense of it?
@ajwerner could you pass judgement on the follower reads test failure:
follower_reads.go:209,test_runner.go:719: fewer than 2 follower reads occurred: saw [0 0 0] before and [0 1 0] after
This failed early on in the test, while the data was seemingly still being up-replicated. The test failed at 00:05:03, which is before the user table had been fully up-replicated:
I200124 00:05:06.775279 469 storage/replica_raftstorage.go:806 [n3,s3,r22/3:/Table/2{6-7}] applied LEARNER snapshot [total=17ms ingestion=4@15ms id=c3aa29b3 index=24]
I'll take an action item to wait for the data to be fully replicated before starting the test.
I was just looking at the tpccbench (cpu=16) crash yesterday in another context.
I suspect it was broken by the following commit: https://github.com/cockroachdb/cockroach/pull/43606/commits/69977105f331301cd931c3cb857b0e2665653750
We probably shouldn't release with that crash. I'll see what to do.
sqlalchemy, typeorm, pgx are broken tests.
hibernate is failing due to #43969 (#44031 is the fix which did not make the release).
I have #44451 which I hope fixed tpccbench (cpu=16), but I'm not sure it's the right thing to do yet.
the chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:
Yes, I've seen this failure mode in the past occasionally. The workload gets stuck trying to prepare statements on its 2100 connections and doesn't seem to be able to complete this before it gets killed again (~1m later), so it loops indefinitely. Signing off on this failure while we look more into the issue because it's not an issue with the alpha.
Signing off on tpccbench (cpu=16) because the cause will hopefully be fixed. But it'd be good to cherry-pick this PR into the release...
@andreimatei If you feel that this PR is critical to be included in this alpha, please follow instructions: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/160235536/Changing+a+Scheduled+Release