Candidate SHA: d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Deployment status: otan-release-v2010-alpha20191216-e9e2a803 / otan-release-v2010-alpha20200123-d818f1ff
Nightly Suite: https://teamcity.cockroachdb.com/viewQueued.html?itemId=1708159&tab=queuedBuildOverviewTab
Release Qualification: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708120&tab=buildResultsDiv&buildTypeId=Cockroach_ReleaseQualification
Deployment Dashboards:
Prep date: 23rd January, 2020
Candidate SHA above; notify #release-announce of the SHA. Deployment status above with clusters, and Nightly Suite with the link to the nightly TeamCity job. One day after prep date:
Release date: 30th January, 2020
For production or stable releases in the latest major release series
[x] Update docs
old info - obsoleted by new SHA -- formerly release: v20.1.0-alpha.20200113
Candidate SHA: 617cb39e3ced263cb4c253aa2e3edd172368f7cf.
Deployment status: dan-release-v2010-alpha20191216-6bfe4a1/dan-release-v2010-alpha20190113-e9e2a80.
Nightly Suite: ReleaseQual NightlySuite.
Picking the SHA is blocked on https://github.com/cockroachdb/cockroach/issues/43616
Thanks! Picked a SHA: 6bfe4a1febd6b4eab43142228a8e037498b8bf3b
Potentially blocked on https://github.com/cockroachdb/cockroach/issues/43827. A bunch of the release teamcity builds are failing because of this. I've tried to work around it, but we may have to cherry-pick a fix
Peter fixed the makefile issue. I'll cherry-pick a new sha and get the roachtests running tomorrow. I started up the roachprod tests today so assuming nothing else goes wrong, we should have enough data to keep the original release date
I've been waiting to triage the roachtest failures until I can get someone to make a call on whether "SQL Logic Test High VModule Nightly" failing to build is an alpha release blocker or not. At this point, instead of restarting everything, I think I'll try pinging that one build with the candidate SHA plus a makefile change that fixes the build, and then triage the roachtest failures we got. This is unlikely to finish on a Friday, so my guess is this alpha is slipping until Tuesday or Wednesday.
obsoleted by new SHA
CDC
KV
SQL Exec
Something definitely regressed in changefeeds recently. I need to look into exactly what, but I think these failures are okay for an alpha. The initial scan one was making progress, though slowly (the most behind span is logged and it kept changing over the course of the test). The other two got further behind than they should, but eventually caught up.
Checked off all sqlsmith failures since those are never blockers (as Matt always says).
Checked off hotspotsplits/nodes=4, which failed due to fatal error: runtime: out of memory
This error has been happening periodically for a while now (#33660). Should not be a release blocker.
The scaledata failures are interesting. They're hitting issues because the creation of the initial empty schema is taking a very long time (> 1 minute), which allows chaos to start before we expect it to. Here's some logging from the first failure, filtered down to just the schema changes triggered here:
I200109 05:06:45.811349 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_table", target: 53, info: {TableName:ds.public.resources Statement:CREATE TABLE resources (id INT8 NOT NULL DEFAULT unique_rowid(), created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ, resource_id INT8, free INT8, total INT8, PRIMARY KEY (id)) User:root}
I200109 05:06:47.330294 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:06:48.795658 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 53, info: {TableName:ds.public.resources IndexName:idx_resources_deleted_at Statement:CREATE INDEX idx_resources_deleted_at ON resources (deleted_at) User:root MutationID:1}
I200109 05:06:53.756401 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:06:57.729798 279 sql/backfill.go:143 [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "resources", v=3, m=1
I200109 05:07:04.577244 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:08.280914 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root,scExec] Event: "finish_schema_change", target: 53, info: {MutationID:1}
I200109 05:07:08.853128 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 3}]
I200109 05:07:12.633817 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_table", target: 54, info: {TableName:ds.public.allocations Statement:CREATE TABLE allocations (id INT8 NOT NULL DEFAULT unique_rowid(), created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ, resource_id INT8, job_id INT8, units INT8, PRIMARY KEY (id)) User:root}
I200109 05:07:14.801515 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=54 (allocations) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:16.644276 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 54, info: {TableName:ds.public.allocations IndexName:idx_allocations_deleted_at Statement:CREATE INDEX idx_allocations_deleted_at ON allocations (deleted_at) User:root MutationID:1}
I200109 05:07:21.844560 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:26.125956 279 sql/backfill.go:143 [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "allocations", v=3, m=1
I200109 05:07:33.747879 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:36.699121 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root,scExec] Event: "finish_schema_change", target: 54, info: {MutationID:1}
I200109 05:07:37.428304 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{allocations 54 3}]
I200109 05:07:40.662192 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=5 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:42.313880 279 sql/event_log.go:132 [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 53, info: {TableName:ds.public.resources IndexName:idx_pk Statement:CREATE UNIQUE INDEX idx_pk ON resources (resource_id) User:root MutationID:2}
I200109 05:07:46.689986 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 4}]
I200109 05:08:03.614645 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=6 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:08:05.680050 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 5}]
I200109 05:08:08.248063 279 sql/backfill.go:143 [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "resources", v=6, m=2
One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
Okay, sounds good! We'll hold up the alpha until those get investigated, thanks!
cc @lnhsingh (Btw also want to make sure you noticed that the name of the alpha was wrong on the release calendar, it had 2019 in the date part of the version tag. I've fixed all occurrences of this error on the release calendar.)
One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
I'm trying to reproduce anything even remotely close to this sort of latency, but I'm having a hard time. I dug into some more details. One thing to note is that it took almost 20s at the beginning of the test to up-replicate all of the ranges (I see this from the rangelog).
Other things worth noting: all of the jobs took 5s to start, and the initial index creation for idx_resources_deleted_at took 15s after it started. Then each of the other jobs took a while.
I've modified the test to fail if the schema portion of the test takes more than 30 seconds, and I'm running it at scale with roachtest; so far, after 400 iterations, it hasn't failed.
The question is why these schema changes are taking so long. I don't have a great answer, but there's some evidence of waiting for leases to expire. Additionally, there's a stats job that seems to have happened in the middle here. I'm trying to understand what could be going on. I don't have any evidence that this is in any way related to closed timestamps.
After more digging, it seems clear that somehow we're seeing an outstanding lease on every attempt to publish a new version. The other confounding factor is slow up-replication and some lease transfers due to it. All of the other node metrics seem fine. The storage latency numbers are good. The CPU is near zero. The question is whether it's possible for a slow initial lease acquisition to somehow always leave a node with a lease one version behind or something like that.
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:08.853128 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 3}]
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:14.801515 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=54 (allocations) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:21.844560 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:33.747879 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:37.428304 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{allocations 54 3}]
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:40.662192 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=5 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:46.689986 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 4}]
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:08:03.614645 279 sql/sqlbase/structured.go:1462 [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=6 mtime=1970-01-01 00:00:00 +0000 UTC
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:08:05.680050 279 sql/lease.go:326 [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 5}]
The question at hand is, how is this lease getting leaked? I've modified the test like so and observed 1000 successful runs. Perhaps the vmodule logging has changed the behavior so I'm going to remove that and try again.
--- a/pkg/cmd/roachtest/scaledata.go
+++ b/pkg/cmd/roachtest/scaledata.go
@@ -38,11 +38,11 @@ func registerScaleData(r *testRegistry) {
for app, flags := range apps {
app, flags := app, flags // copy loop iterator vars
- const duration = 10 * time.Minute
+ const duration = 30 * time.Second
for _, n := range []int{3, 6} {
r.Add(testSpec{
Name: fmt.Sprintf("scaledata/%s/nodes=%d", app, n),
- Timeout: 2 * duration,
+ Timeout: 20 * duration,
Cluster: makeClusterSpec(n + 1),
Run: func(ctx context.Context, t *test, c *cluster) {
runSqlapp(ctx, t, c, app, flags, duration)
@@ -73,7 +73,7 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
c.Put(ctx, b, app, appNode)
c.Put(ctx, cockroach, "./cockroach", roachNodes)
- c.Start(ctx, t, roachNodes)
+ c.Start(ctx, t, roachNodes, startArgs("--args='--vmodule=schema*=2,lease=2,jobs=2,registry=2'"))
// TODO(nvanbenschoten): We are currently running these consistency checks with
// basic chaos. We should also run them in more chaotic environments which
@@ -106,17 +106,22 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
defer sqlappL.close()
t.Status("installing schema")
+ start := time.Now()
err = c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --install_schema "+
"--cockroach_ip_addresses_csv='%s' %s", app, addrStr, flags))
if err != nil {
return err
}
+ if took := time.Since(start); took > 20*time.Second {
+ return fmt.Errorf("it took %v to set up the schema", took)
+ }
t.Status("running consistency checker")
const workers = 16
return c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --duration_secs=%d "+
"--num_workers=%d --cockroach_ip_addresses_csv='%s' %s",
app, int(dur.Seconds()), workers, addrStr, flags))
+ return nil
})
m.Wait()
}
Okay I've discovered something interesting w.r.t. the scaledata failure. The nodes are geo-distributed:
https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1680508:id/scaledata/distributed_semaphore/nodes%3D3/roachprod_state/hosts/teamcity-1680508-1578546114-10-n4cpu4
# user@host locality vpcId
[email protected] cloud=gce,region=us-central1,zone=us-central1-b default
[email protected] cloud=gce,region=us-west1,zone=us-west1-b default
[email protected] cloud=gce,region=europe-west2,zone=europe-west2-b default
[email protected] cloud=gce,region=us-central1,zone=us-central1-b default
@nvanbenschoten confirms that this is the case for the other roachprod failure. Checking off the failures.
It seems that this is all fallout from https://github.com/cockroachdb/cockroach/issues/43898#issuecomment-573732961
@jlinder FYI
Thanks @ajwerner!
Okay, I'm going to let this bake on the long running clusters one more night. If nothing goes wrong, I'll bless the release tomorrow morning
@lnhsingh Binaries have been blessed, turning this over to docs and marketing
We have a late breaking release blocker found just before marketing comms had gone out. Docs PR has been reverted. We'll restart this alpha when there is a fix https://github.com/cockroachdb/cockroach/issues/43979
Going to restart this one, cc @jseldess.
Thanks for the heads-up, @otan. Can we also be sure to a new data in the binary name? Also, can you share the new candidate SHA? I'll have to generate new notes since the last candidate.
haven't picked yet, because waiting for a build to finish - will let you know when it's changed.
it should just be an additional cherry pick of 3c0671c176dc540798e63ed61cd875354d2a6aba
Can we also be sure to a new data in the binary name?
can you clarify what this means?
I think he means that the alpha-<date> in the tag should be updated to the new target release date
Yes, thanks for the clarification, Dan.
sha is d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Just to make sure - https://github.com/cockroachdb/cockroach/issues/43110#issuecomment-577868922 suggests that we can't release an alpha containing https://github.com/cockroachdb/cockroach/commit/379fc8f617519ec34fa87bd211df365956ea4880
The SHA @otan posted (d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3) doesn't contain that commit, so no problem. Just making sure that we don't accidentally bump again and pick that PR up.
Failures: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708159&tab=queuedBuildOverviewTab
sqlsmith/setup=empty/setting=default
sqlsmith/setup=rand-tables/setting=default
sqlsmith/setup=seed/setting=default
sqlsmith/setup=seed/setting=no-ddl
sqlsmith/setup=seed/setting=no-mutations
sqlsmith/setup=tpcc/setting=default
sqlsmith/setup=tpcc/setting=no-ddl
sqlsmith/setup=tpcc/setting=no-mutations
sqlsmith/setup=tpch-sf1/setting=default
sqlsmith/setup=tpch-sf1/setting=no-mutations
sqlsmith/setup=empty/setting=no-ddl
sqlsmith/setup=empty/setting=no-mutations
sqlsmith/setup=rand-tables/setting=no-ddl
sqlsmith/setup=rand-tables/setting=no-mutations
sqlsmith/setup=seed-vec/setting=vec
sqlsmith/setup=tpcc/setting=ddl-nodrop
sqlsmith/setup=tpch-sf1/setting=no-ddl
Checked off vmodule panic. It is fixed on master (#43922) but wasn't cherry-picked into this alpha. The crash can occur only with high verbosity in special circumstances, so I'd not consider it a blocker.
Signing off on acceptance/decommission; I have a handle on that.
I'm looking at the mixed-headroom failure. It looks again like the workload just quits 🤷, but this has happened before and I will have to investigate. When I ran this manually the other day, they either all passed or all failed (depending on whether I had the parent of https://github.com/cockroachdb/cockroach/issues/43754#issuecomment-578127795 or the failing commit). I'll kick off a few runs on the provisional SHA to get more of an idea of what's up here.
By the way, can we mention who signed off on each failure? For example, tpcc/headroom/n4cpu16 is signed off on, but I think we want accountability on who did it, so that a stray click can't cause us to release something broken.
Actually it seems there's something less benign going on here. The log indicates that roachtest hung after this test got terminated. So we didn't run all tests on AWS (this is the build below).
[06:27:12]
[Step 2/2] [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Atpcc%2Fheadroom%2Fn4cpu16.%2A&sort=title&restgroup=false&display=lastcommented+project)
[06:27:12]
[Step 2/2] <sub>powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)</sub></p></details>
[06:27:12]
[Step 2/2] ", Labels:["O-robot" "C-test-failure" "C-test-failure" "branch-provisional_202001231814_v20.1.0-alpha.20200123" "O-roachtest" "release-blocker"], Assignee:"", Milestone:21}: POST https://api.github.com/repos/cockroachdb/cockroach/issues: 422 Validation Failed [{Resource:Label Field:name Code:invalid Message:}]
[06:27:13]
[Step 2/2] Failed tests detected
[06:27:12]
[Step 2/2] Publishing artifacts (19s)
[06:27:12]
[Step 2/2] ##teamcity[publishArtifacts '/home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200123-1708154/tpcc/headroom/n4cpu16/run_1/** => tpcc/headroom/n4cpu16']
[17:03:42]
[Step 2/2] The build Nightlies::Roachtest Nightly - AWS #1401 {buildId=1708154} has been running for more than 1,080 minutes. Terminating...
The failure mode for headroom, mixed-headroom, and version/mixed all seems the same. It makes me all the more curious who signed off on headroom - is there something I'm missing here? If we can't run TPCC, I'd be hesitant to release.
You can click the "edited by" drop-down on the checklist and see who marked what.
@andreimatei you're the one who's missing a note about why you signed off.
@knz I had no idea! Thanks, it's really comforting that this exists.
I signed off on tpcc/headroom/n4cpu16 because I didn't see any particular reason not to. I looked for a while trying to understand what timed out, to no avail. It also looked to me like all the AWS tests did run, but perhaps I looked wrong.
FWIW, I just ran mixed-headroom 5x on GCE and it passed.
The AWS build did time out (execution timeout), so we probably missed a fair share of tests there. I'm going to run the offending test (the last one to fail before the stall), tpcc/headroom/n4cpu16, 5x on AWS.
Signing off on the schema change tests. They look like failures due to context cancellation, which shouldn't block the release.
@andreimatei and I figured out through arcane magic that the AWS build was really failed by tpccbench, but that we didn't manage to get any logs. I'm going to own fixing that infrastructure issue.
Headroom passed 5x on aws as well. Since the failure mode is so similar, I'm checking off the version-mixed ones too.
I believe the cancel failure is owned by sql: https://github.com/cockroachdb/cockroach/issues/42103
It looks like the same failure we've been seeing for close to two months now. @asubiotto I'll take the liberty to check the box.
tpccbench (cpu=16) has a panic on n2. @andreimatei or @nvanbenschoten do you have context on this?
F200124 03:39:17.909332 8180460 kv/txn_interceptor_heartbeater.go:279 [n2,txn-hb=c715b681] txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" meta={id=c715b681 key=/Table/54/1/1228/2/0 pri=0.05066344 epo=1 ts=1579837154.462135149,2 min=1579837145.904027058,0 seq=49} rw=true stat=ABORTED rts=1579837146.238462972,2 wto=false max=1579837146.404027058,0
goroutine 8180460 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7056901, 0xed5bc5de5, 0x0, 0xc018a3a000)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x7053740, 0xc000000004, 0x66fc8be, 0x21, 0x117, 0xc02ff43b80, 0x13f)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:214 +0xb92
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x492ef00, 0xc0393d97c0, 0x4, 0x2, 0x41d4346, 0x51, 0xc049227c98, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x492ef00, 0xc0393d97c0, 0x1, 0x4, 0x41d4346, 0x51, 0xc04dba7c98, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).heartbeat(0xc044f37830, 0x492ef00, 0xc0393d97c0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:279 +0x18b
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).heartbeatLoop(0xc044f37830, 0x492ef00, 0xc0393d97c0)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:249 +0x1d1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001c0240, 0x492ef00, 0xc0393d97c0, 0xc038e68270, 0x29, 0x4995640, 0xc000231b60, 0xc03a15b930)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322 +0x152
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:317 +0x131
I'm signing off on the other tpccbench failure since it's the same mystery 255 killed thing.
@ajwerner could you pass judgement on the follower reads test failure:
follower_reads.go:209,test_runner.go:719: fewer than 2 follower reads occurred: saw [0 0 0] before and [0 1 0] after
Signing off on the gossip chaos test. Looks like an infra failure, we weren't even able to connect to the machines via SSH to fetch logs in the end.
the chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:
03:18:34 test.go:182: test status:
03:18:34 tpcc.go:839: --- PASS: tpcc 970 resulted in 11017.9 tpmC (90.1% of max tpmC)
03:18:34 test.go:182: test status: stopping cluster
03:18:34 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1-9
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting..
03:18:36 test.go:182: test status: starting cluster
03:18:36 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1708156-1579821026-27-n10cpu4:1-9
teamcity-1708156-1579821026-27-n10cpu4: starting......................
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings
03:19:14 test.go:182: test status: running benchmark, warehouses=1050
03:19:14 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1708156-1579821026-27-n10cpu4:10 -- ./workload run tpcc --warehouses=2000 --active-warehouses=1050 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 {pgurl:10} --histograms=perf/warehouses=1050/stats.json
3: 10339
8: 9926
5: 9361
4: 11011
2: 9030
7: 9499
1: 11356
6: 9499
9: 10026
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200124 03:20:20.768517 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:20:44 chaos.go:90: killing :4
03:20:44 test.go:182: test status: stopping cluster
03:20:44 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:4
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.
4: kill exited nonzero
4: dead
CHAOS: 03:20:51 chaos.go:119: restarting :4 after 5s of downtime
03:20:51 test.go:182: test status: starting cluster
03:20:51 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:4
teamcity-1708156-1579821026-27-n10cpu4: starting.
4: 11429
Initializing 10500 workers and preparing statements...
I200124 03:21:24.783933 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:22:21 chaos.go:90: killing :1
03:22:21 test.go:182: test status: stopping cluster
03:22:21 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead
CHAOS: 03:22:28 chaos.go:119: restarting :1 after 5s of downtime
03:22:28 test.go:182: test status: starting cluster
03:22:28 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: startingI200124 03:22:29.250432 1 workload/cli/run.go:361 retrying after error while creating load: EOF
..
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings1: 11960
Initializing 2100 connections...
CHAOS: 03:23:58 chaos.go:90: killing :1
03:23:58 test.go:182: test status: stopping cluster
03:23:58 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead
CHAOS: 03:24:05 chaos.go:119: restarting :1 after 5s of downtime
03:24:05 test.go:182: test status: starting cluster
03:24:05 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: starting.
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings1: 12348
Initializing 10500 workers and preparing statements...
I200124 03:24:24.361240 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200124 03:25:28.126433 1 workload/cli/run.go:361 retrying after error while creating load: preparing
UPDATE district
SET d_next_o_id = d_next_o_id + 1
WHERE d_w_id = $1 AND d_id = $2
RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:25:35 chaos.go:90: killing :3
[repeats similarly for hours]
@nvanbenschoten have you seen this before/can you make sense of it?
@ajwerner could you pass judgement on the follower reads test failure:
follower_reads.go:209,test_runner.go:719: fewer than 2 follower reads occurred: saw [0 0 0] before and [0 1 0] after
This failed early on in the test, while the data was seemingly still being up-replicated. The test failed at 00:05:03, which is before the user table had been fully up-replicated:
I200124 00:05:06.775279 469 storage/replica_raftstorage.go:806 [n3,s3,r22/3:/Table/2{6-7}] applied LEARNER snapshot [total=17ms ingestion=4@15ms id=c3aa29b3 index=24]
I'll take an action item to wait for the data to be fully replicated before starting the test.
I was just looking at the tpccbench (cpu=16) crash yesterday in another context.
I suspect it was broken by the following commit: https://github.com/cockroachdb/cockroach/pull/43606/commits/69977105f331301cd931c3cb857b0e2665653750
We probably shouldn't release with that crash. I'll see what to do.
sqlalchemy, typeorm, pgx are broken tests.
hibernate is failing due to #43969 (#44031 is the fix which did not make the release).
I have #44451 which I hope fixed tpccbench (cpu=16), but I'm not sure it's the right thing to do yet.
the chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:
Yes, I've seen this failure mode in the past occasionally. The workload gets stuck trying to prepare statements on its 2100 connections and doesn't seem to be able to complete this before it gets killed again (~1m later), so it loops indefinitely. Signing off on this failure while we look more into the issue because it's not an issue with the alpha.
Signing off on tpccbench (cpu=16) because the cause will hopefully be fixed. But it'd be good to cherry-pick this PR into the release...
@andreimatei If you feel that this PR is critical to be included in this alpha, please follow instructions: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/160235536/Changing+a+Scheduled+Release