Cockroach: release: v19.2.0-beta.20190930

Created on 26 Sep 2019 · 28 comments · Source: cockroachdb/cockroach

Candidate SHA: 250f4c36de2b88eff443cf9be9cd5d2759312c88
Deployment status: http://mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster
Older: http://mjibson-release-v1920-beta20190930-0001.roachprod.crdb.io:26258/
Even older: http://mjibson-release-v1920-alpha20190805-0001.roachprod.crdb.io:26258/

Release qualification:

Nightly Suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite
Old nightly suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1511951&buildTypeId=Cockroach_Nightlies_NightlySuite
Even older nightly suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1509990&buildTypeId=Cockroach_Nightlies_NightlySuite

Release process checklist

Prep date: 2019-09-30

  • [x] [Pick a SHA](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-PickingaSHA), fill in Candidate SHA above, notify #release-process of SHA.
  • [x] [Tag the provisional SHA](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-TagtheprovisionalSHA)
  • [x] [Publish provisional binaries](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-Publishtheprovisionalbinaries)
  • [x] [Check binaries](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-Checkbinaries)
  • [x] [Deploy to test clusters](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-DeploytoTestClusters)
  • [x] [Verify node crash reports](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-Verifynodecrashreportsappearinsentry.io)
  • [ ] [Start release qualification suite](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-RuntheSuiteofNightlyRoachtests)
  • [x] [Start nightly suite](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-RuntheSuiteofNightlyRoachtests)
  • [x] Fill in Deployment status above with the test clusters, and Nightly Suite with the link to the nightly TeamCity job
  • Keep an eye on clusters until release date. Do not proceed below until the release date.

Release date: 2019-10-02

  • [x] [Check cluster status](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-Checkclusterstatus)
  • [x] [Tag release](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-Tagtherelease)
  • [x] [Bless provisional binaries](https://wiki.crdb.io/wiki/spaces/CRDB/pages/73105625/Release+process#Releaseprocess-Blesstheprovisionalbinaries)
  • For production or stable releases in the latest major release series

  • [ ] Update docs

  • [ ] External communications for release

All 28 comments

Changing the SHA to cf5c2bd2372e633d2f63e08e5bffca7c2a7ec59f from 77f26d185efb436aaac88243de19a27caa5da9b6 to pick up the fix for #41145.

Test clusters are still running 77f26d1.

Failed roachtest nightlies (https://teamcity.cockroachdb.com/viewLog.html?buildId=1509990&buildTypeId=Cockroach_Nightlies_NightlySuite):

  • import/tpcc/warehouses=1000/nodes=32 @dt
  • restore2TB/nodes=10 @dt
  • schemachange/mixed/tpcc @dt
  • tpcc/mixed-headroom/n5cpu16
  • version/mixed/nodes=5
  • cancel/tpcc/distsql/w=10,nodes=3
  • clearrange/checks=false
  • clearrange/checks=true
  • hibernate
  • kv50/rangelookups/relocate/nodes=8
  • tpccbench/nodes=9/cpu=4/chaos/partition
  • typeorm
  • acceptance/bank/zerosum-splits

Edit 2019-10-01: this list is replaced by the one below https://github.com/cockroachdb/cockroach/issues/41128#issuecomment-536917628

This release is canceled due to some new bugs. I'm going to start a new roachprod cluster comparison with master as of right now. On Monday the release manager and others will decide what to do. I'll post links to those test clusters here.

Latency on the master cluster is going up. It is below 20ms for the previous release. Read ops are similar.

[screenshot]

This is true for most metrics that are related to SQL or storage.

[screenshot]

For comparison, the same image of the 0805 release cluster.

Starting a nightly suite for 250f4c36de2b88eff443cf9be9cd5d2759312c88 at https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite

Current status: we are going to try to release 250f4c36de2b88eff443cf9be9cd5d2759312c88 on Wednesday (Oct 2).

(sorry wrong button)

Regarding the 3rd issue on the list: @lucy-zhang found that VALIDATE CONSTRAINT running concurrently with TPC-C 1K reveals invalid FK relations. The output is OK when TPC-C is not running concurrently, or when running lighter TPC-C workloads. This means that there are isolation problems.

I am tempted to interpret this as a real-world instance of #41173, @andreimatei what do you think?

from @lucy-zhang offline:

I'm not sure if Andrei's issue is the same thing. The rows that are supposed to be missing are in tpcc.warehouse, which we never update (AFAIK) after we restore the fixtures, so even if the read timestamp were slightly behind for some parts of the reads, it shouldn't matter.
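
For context, a minimal sketch of the kind of orphan check that VALIDATE CONSTRAINT performs, written by hand against the TPC-C schema (the table, column, and constraint names are assumptions based on the standard TPC-C layout, not taken from the failing test):

```sql
-- Hand-rolled version of the FK check: find district rows whose parent
-- warehouse row is missing. Under serializable isolation this should
-- return zero rows, with or without TPC-C running concurrently.
SELECT d.d_w_id, d.d_id
  FROM tpcc.district AS d
  LEFT JOIN tpcc.warehouse AS w ON w.w_id = d.d_w_id
 WHERE w.w_id IS NULL;

-- The statement at issue in #40935 (constraint name illustrative):
ALTER TABLE tpcc.district VALIDATE CONSTRAINT fk_d_w_id_ref_warehouse;
```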

Failed nightlies as of 2019-10-01:

  • [x] [cancel/tpcc/distsql/w=10,nodes=3](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId-9001178873075005748) → #38417 @jordanlewis
  • [x] [clearrange/checks=true](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId4040650493650131270) → #38720 @dt @ajwerner
  • [x] [clearrange/checks=false](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId-7549905676748780804) → #41123 @dt @ajwerner
  • [x] [hibernate](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId-3931757827174594180) → #40538 @rafiss
  • [x] [import/tpcc/warehouses=1000/nodes=32](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId-4551784729715115627) → #41154 @dt @andreimatei
  • [x] [kv50/rangelookups/relocate/nodes=8](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId7808114564804200759) → #40359 @nvanbenschoten
  • [x] [network/tpcc/nodes=4](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId285598859115364203) → #37259 @ajwerner
  • [x] [restore2TB/nodes=10](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId741854490244253129) → #41152 @dt
  • [x] [schemachange/mixed/tpcc](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId5401367090847010053) → #40935 @dt @lucy-zhang
  • [x] [tpccbench/nodes=9/cpu=4/chaos/partition](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId7647859149799979980) → #41027 @andreimatei
  • [x] [TestRandomSyntaxSQLSmith](https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite#testNameId-1657705641918481982)

Note that at 10.00am CEST the test suite is still running (19 hours after starting). There may be more failures incoming.

Edit: at 1pm CEST it's still running.

Edit: at 3:30pm CEST it finished running and one additional issue dripped out (tpccbench/chaos/partition, added to the list above).

For reference, the following failures from the original list on 2019-09-27 are not there any more:

  • tpcc/mixed-headroom/n5cpu16
  • version/mixed/nodes=5
  • tpccbench/nodes=9/cpu=4/chaos/partition
  • typeorm
  • acceptance/bank/zerosum-splits

The following are new:

  • network/tpcc/nodes=4
  • TestRandomSyntaxSQLSmith (although I suspect we're going to ignore this since this test is dripping new issues on every run by design)

My analysis of the issues (same order as above):

  • #38417: the cancel test fails with "query completed before it could be canceled". I have not looked at the test source, but it may be a bug in the test rather than a bug in CockroachDB.

    • My assessment: not release blocker + further investigation needed.

    • Edit: @jordanlewis confirms (see comment below)

  • #38720 + #41123: import times out before the test can start.

    • My assessment: unsure. The error means that the clearrange test does not even get to run. Do we have confidence that the code the test is checking is otherwise OK? @dt please provide input. If the clearrange test has not run for a long time, that would be a release blocker.

    • Also, we need to change the TC reporting so that a test that does not even start does not get marked as "failed" (and instead keep a metric of how often import is holding up a test). This is already on the radar; see #33377.

  • #40538: Hibernate failure. This time we see "373 tests failed unexpectedly" instead of the usual "NN tests succeeded unexpectedly". This likely marks a regression.

    • My assessment: maybe release blocker (depending on the failures). @rafiss @jordanlewis please provide input.

    • Also, we need to change the way this roachtest reports its failures, so that different issues get created for "unexpected success" and "unexpected failure". See #41226

  • #41154: import times out before the test can start. Same as #38720 / #41123 above. However, this time the test itself is about import, not clearrange.

    • My assessment: not release blocker - we know that import succeeds often enough because many other tests use tpcc.

  • #40359: something about the output of the kv workload. I don't understand the failure log. @nvanbenschoten can you explain what's going on?

    • My assessment: ~unsure + more investigation needed~ not release blocker.

    • (Edit @nvanbenschoten signing off on this one see comment below)

  • #37259: inter-node network mishaps cause a tpcc run to fail.

    • My understanding: the test uses toxiproxy between nodes to introduce network partitions, and then asserts that there is no goroutine peak in the server. However, the partition also causes legitimate SQL errors in clients, and that makes the test abort in error, in a way that's not relevant to the primary thing being tested. ~@ajwerner do you agree with my analysis?~ (Andrew agrees)

    • My assessment: not release blocker

    • Also, if the analysis is correct, the test needs to be fixed.

  • #41152: restore 2TB fails. This seems to be an import timeout, like #41154 and #38720 / #41123, related to #33377.

    • My assessment: same as for the clearrange above: unsure, this is not a release blocker unless we have confidence that the primary feature exercised by the test has been tested to work OK often enough. @dt please provide input.

  • #40935: VALIDATE CONSTRAINT run concurrently with a large TPC-C workload finds erroneous rows. There's a repro readily available.

    • My assessment: ~probably release blocker if this confirms to be a serializability violation (isolation error)~

    • Edit: release blocker but not beta blocker after discussion with @andy-kimball @bdarnell @lucy-zhang @jordanlewis (see below)

    • Discussed with @lucy-zhang: need to bisect to see where the problem was introduced. Determine the relationship with the other issue #41173 found by Andrei, perhaps by running the TPC-C load with distsql=off (see the sketch after this list).

  • #41027: tpccbench + partition chaos fails with a timeout. This might be a bulk i/o timeout, like #41154 and #38720 / #41123 above, but I am much less sure about it (test fails after 10 hours).

    • My assessment: unsure:

    • we could choose not to treat it as a release blocker, even if a clear underlying issue is confirmed, because it's behavior under chaos and may be exercising a rare case

    • OTOH, there is currently no clear analysis of the root cause, so this may hide a serious issue underneath.

    • More investigation needed.
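
As a footnote to the distsql=off suggestion under #40935 above: a minimal sketch of how one might turn DistSQL off while re-running the validation, assuming the standard session variable and cluster setting (illustrative only, not the exact commands used in the investigation):

```sql
-- Disable distributed execution for the current session only...
SET distsql = off;

-- ...or for all new sessions cluster-wide while the experiment runs.
SET CLUSTER SETTING sql.defaults.distsql = off;

-- Then rerun the validation under TPC-C load and compare the results
-- (constraint name is illustrative, as in the earlier sketch).
ALTER TABLE tpcc.district VALIDATE CONSTRAINT fk_d_w_id_ref_warehouse;
```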

Regarding this:

37259: inter-node network mishaps cause a tpcc run to fail.

My understanding: the test uses toxiproxy between nodes to introduce network partitions, and then asserts that there is no goroutine peak in the server. However, the partition also causes legitimate SQL errors in clients, and that makes the test abort in error, in a way that's not relevant to the primary thing being tested. @ajwerner do you agree with my analysis?

Answer from @ajwerner:

That initial analysis seems sound.

(thus classifying the issue as not a release blocker)

#40359 can be ignored. It appears to be a testing error caused by new retryable errors that we see during RELOCATE RANGE statements. See https://github.com/cockroachdb/cockroach/pull/41106 (which I'm planning to return to later today).

Signing off on the cancel test. It's a testing error caused by the operation in question finishing too quickly.
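
For readers unfamiliar with that roachtest, a rough sketch of the cancellation pattern it exercises, expressed as plain SQL (the real test drives this from Go, and the filter below is purely illustrative):

```sql
-- Cancel any in-flight TPC-C statements by ID; CANCEL QUERIES accepts a
-- subquery of query IDs taken from SHOW CLUSTER QUERIES.
CANCEL QUERIES
  SELECT query_id
    FROM [SHOW CLUSTER QUERIES]
   WHERE query LIKE '%tpcc%';

-- The failure above is a race on the test side: the statement it launched
-- finished before the cancellation landed, so the test reported
-- "query completed before it could be canceled" rather than a server error.
```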

Regarding #40935, an update from @lucy-zhang:

The current update is that @jordan and I have tried to reproduce this by turning loadgen back on for one of the clusters where I got this test failure, and running that select query manually, and neither of us has seen it again.
So it looks like some additional state is needed to repro aside from "heavy load" (possibly related to the early state of the cluster?).
So this does seem rarer than I first expected.

Separately, @andy-kimball states, with agreement from @bdarnell:

If this repros only under heavy stress, and we have seen it nowhere else, I don't think it should be classified as a beta blocker. The bar should be very high now. I'd only block beta for "high severity" (which this is) and "common" (which this isn't).

So I'm checking this off as signed off by Lucy, Jordan, and Andy.

The Hibernate issues (#40538) are not a beta blocker. The tests are concerningly flaky though, and we will continue investigating this during the rest of the release period. The issues stem from the Hibernate tests occasionally being unable to connect to the DB.

I checked off the sqlsmith failure because it should never block a release.

There's discussion on whether to adopt the latest changes that improve on #41206 (a perf regression).
If we bump the SHA, here is the diff:

New features:

  • #41190 - decommissioning via atomic replication changes
  • #40954 - SHOW RANGE FOR ROW
  • #41138 - stats collection in movr

Bug fixes:

  • #41153 - RocksDB assert revert
  • #41244 - RocksDB compaction bug fix
  • #41194 - AddSSTable bug fix
  • #41195 - AddSSTable bug fix
  • #41196 - AddSSTable bug fix
  • #41217 - libroach iterator bug fix
  • #41187 - SQL planning fix
  • #41212 - SQL planning fix
  • #41241 - FastIntSet bug fix (SQL planning & others)
  • #41231 - memory leak bug fix

Perf:

  • #41220 - SQL planning perf / memory usage improvement

Polish:

  • #40493 - SQL polish: zone config introspection
  • #40948 - SQL polish
  • #41129 - DistSQL plan viz polish
  • #41192 - roachtest improvement
  • #41215 - code polish
  • #41221 - k8s config update
  • #41235 - test fixes
  • #41237 - movr polish

Input from @awoods187 and @bdarnell: continue with the same SHA.

Note: I have checked cluster health at http://mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster

The cluster displays a performance anomaly for the period up to and including Sept 30.

Then yesterday (Oct 1) Nathan uploaded a new binary with the fix from #41220 (and other fixes), which demonstrates that the anomaly disappears. Note, however, that these fixes are not present in today's release.

@dt about the clearrange tests:

I’m fine with clearrange failures for now — we’re investigating but AFAIK, it is just slow, not wrong.

(considering this as sign-off)

@dt about the restore2TB test:

The RESTORE one I took a very cursory look at and didn't see what killed it, just exit 255.
Might have OOMed I guess — I wonder if it was the rocks logging leak? I dunno. I'm fine signing off on that too, I guess.

@irfansharif and @andreimatei do not have anything to say on the remaining roachtest failure.

So I did go and investigate the log files. I am not seeing errors in the CockroachDB logs themselves other than the expected "cannot connect to node" (the node being shut down under chaos).

@bdarnell says "go"

The version is tagged, the binaries are uploaded, and the Docker image works, so from engineering's perspective the release is ready to go.
The release notes PR is here: https://github.com/cockroachdb/docs/pull/5250

Handing this off to docs.
