Go: x/build/cmd/release{,bot}: include long tests in pre-release testing

Created on 14 Dec 2018  ·  24Comments  ·  Source: golang/go

I tested CL 154101 and the subsequent security patches using a combination of go test -run TestScript/[…] and all.bash. Unfortunately, significant parts of go get (including path-to-repository-resolution) are only exercised in non-short tests, and all.bash by default only runs the short tests, despite the name. (I remember that latter point occasionally — but apparently not frequently enough.)

Even more unfortunately, releasebot suggests all.bash for security releases as well, and release runs the same all.bash commands as the regular builders.

As a result, a significant regression (#29241) made it all the way through development, code review, and release building without running the existing tests that should have caught it.

We should ensure that the commands release executes and the instructions releasebot prints for both kinds of releases include the non-short tests on at least one platform.

(CC @bradfitz @dmitshur @FiloSottile)

Builders NeedsInvestigation release-blocker

Most helpful comment

CL 234531 adds automatic long tests to the release process. Once submitted, releasebot -mode=prepare will run tests on the linux-amd64-longtest and windows-amd64-longtest builders, and it will not silently proceed if there is a regression caught by a long test.

While it increases safety, we want to be careful not to introduce a situation where a release cannot be made because long tests are not passing on the day of the release. Also, long tests are more likely to be flaky, so there can be issues with false positives.

As part of submitting that CL, I plan to make take these steps:

  1. [x] Make progress on #37827 to reduce the chance of a CL being submitted to a release branch that causes a long test to fail.
  2. [x] Add a flag to releasebot to skip long tests. It is meant to be used in case a long test proves to be flaky, and enough confidence can be gained through testing elsewhere that the failure is not a regression caused by a change merged to the release branch.
  3. [x] Improve releasebot -mode=prepare -dry-run to make it possible to use it to run tests in advance of a release (works for beta releases now, other release types in the future, see #40214).

There is also room for improvement for us to consolidate the mechanism used to execute long tests between post-submit builders and the release command. This is something we will consider in long term improvements to release automation.

This issue is currently marked as a release blocker for 1.15. I don't believe it needs to block the 1.15 release if it's not fully resolved in time, and I believe we've made good progress on it this cycle, so I plan to unmark it release blocker if isn't ready in time. This will let us spend more time on other issues that do block the release and cannot be worked around with manual work in the release process.

All 24 comments

Additionally: normally we also look at build.golang.org for a happy row of "ok" before doing a release, but because this was a security release without any of our normal infrastructure, we didn't have build.golang.org and didn't have the existing linux-amd64-longtest builder.

The longtest builder is highlighted here and shows the breakage, which we would've been much more likely to see using the normal infrastructure:

screen shot 2018-12-14 at 6 03 12 am

Yep, there were a lot of factors at play. This one seems like a relatively easy win, though: we don't cut all that many releases, so “run _all_ of the tests one last time to be sure” seems like a reasonable step in the release process.

CC @toothrot

Marking release-blocker for 1.14. We really ought to be running all of the tests before we cut a release.

@bcmills Would it be sufficient for release to query the dashboard/coordinator for longtest status? On one hand, I would love to run longtest at build time, but on the other hand I am hesitant to increase the duration of our release process significantly.

I also vote we just query the dashboard to gate releases. cmd/release is extra slow in all.bash mode as-is (without adding long tests) because it doesn't shard test execution over N machines. Adding long tests just makes a slow situation even worse.

Now that the build system has a scheduler, we can even tweak the scheduler to make sure that of release-branch HEADs are highest priority. (It might already be doing close to that as-is, actually)

We could even go a step further and have cmd/release not even run make.bash and instead just pick up the artifacts from the previous build (which are already known to be good artifacts if all the tests passed). But that's for another day. (Even further: run cmd/release on every release and then releasebot just downloads them)

Querying the dashboard seems fine for regular releases, as long as we're checking the result for the actual commit that we're about to release.

I think that still leaves a testing gap for security releases, though.

FWIW, it looks like we released 1.12.14 and 1.13.5 with failing longtest builds again. 😞

https://golang.org/cl/205438 and https://golang.org/cl/205439 need to be reviewed and merged before the next point releases.

@bcmills As of today, this is still a manual step in our process. I've noticed another possible brittle test appear in the longtest failures for 1.12, and we'll look into addressing that.

We'll still do the effort to have our release automation query the branch status before tagging a release.

The build dashboards for 1.13 and 1.12 have some ports that are consistently failing. It seems like we should reconsider the validity of some ports based on their builder status.

Change https://golang.org/cl/214433 mentions this issue: dashboard: upsize freebsd-amd64-race builder to 7.2 GB RAM

Assigned the issue to myself since I will be tracking progress.

Moving this to the next major milestone. The long tests have been run manually.

Change https://golang.org/cl/227859 mentions this issue: dashboard: upsize freebsd-amd64-race builder to 16 vCPUs, 14.4 GB mem

Change https://golang.org/cl/227859 mentions this issue: dashboard: upsize freebsd-amd64-race builder to 16 vCPUs, 14.4 GB RAM

Change https://golang.org/cl/233898 mentions this issue: cmd/dist: use GO_TEST_SHORT value more consistently

Change https://golang.org/cl/233901 mentions this issue: dashboard: use more consistent definition of longtest builder

I've been looking into this issue the last few days. We've made progress in various areas that have made it more viable to resolve it now. For instance, work was done to apply fixes to release-branch.go1.13, and the windows-amd64-longtest builder is passing on it now. The addition of SlowBots made it possible to request longtest builders as part of pre-submit trybot testing.

I explored options such as: querying test results from the dashboard, using SlowBots, modifying the tests to make them long instead of short.

The first two don't work for security releases, and the third proved tricky due to #39054.

I ended up iterating to a fourth option, which is an adjustment to the third. The idea is to re-use longtest builders for the purpose running tests only. This has favorable properties including working for security releases, and being closer to what build.golang.org reports. It's a purely additive change: it keeps all the existing (short) tests we have unmodified. Finally, it doesn't increase time to prepare a release significantly, as the longtest builders use more powerful VMs, and will run in parallel with the other builds. This approach makes it easier to also add race builders in future changes.

During this investigation, I noticed it's important to try to reuse as much existing infrastructure and configuration needed to run long tests as possible, so that seeing passing results in post-submit builders at build.golang.org translates to the same outcome during a release. This is especially important now, since we don't have nightly releases, so any deviation from post-submit builders means it'll be caught on release day rather than in advance. This is something we're planning to improve on in future release automation work (/cc @toothrot).

Change https://golang.org/cl/234531 mentions this issue: cmd/releasebot, cmd/release: include long tests in release process

CL 234531 adds automatic long tests to the release process. Once submitted, releasebot -mode=prepare will run tests on the linux-amd64-longtest and windows-amd64-longtest builders, and it will not silently proceed if there is a regression caught by a long test.

While it increases safety, we want to be careful not to introduce a situation where a release cannot be made because long tests are not passing on the day of the release. Also, long tests are more likely to be flaky, so there can be issues with false positives.

As part of submitting that CL, I plan to make take these steps:

  1. [x] Make progress on #37827 to reduce the chance of a CL being submitted to a release branch that causes a long test to fail.
  2. [x] Add a flag to releasebot to skip long tests. It is meant to be used in case a long test proves to be flaky, and enough confidence can be gained through testing elsewhere that the failure is not a regression caused by a change merged to the release branch.
  3. [x] Improve releasebot -mode=prepare -dry-run to make it possible to use it to run tests in advance of a release (works for beta releases now, other release types in the future, see #40214).

There is also room for improvement for us to consolidate the mechanism used to execute long tests between post-submit builders and the release command. This is something we will consider in long term improvements to release automation.

This issue is currently marked as a release blocker for 1.15. I don't believe it needs to block the 1.15 release if it's not fully resolved in time, and I believe we've made good progress on it this cycle, so I plan to unmark it release blocker if isn't ready in time. This will let us spend more time on other issues that do block the release and cannot be worked around with manual work in the release process.

Change https://golang.org/cl/235338 mentions this issue: dashboard: include longtest builders for trybots on release branches

Change https://golang.org/cl/237602 mentions this issue: [release-branch.go1.13] net: add more timing slop for TestDialParallel on Windows

Change https://golang.org/cl/238539 mentions this issue: cmd/releasebot: opt into windows-amd64-longtest target by default

We've made good progress on incorporating automatic long-test testing into the release process by now, and have a better sense of the remaining work to be able to automate this further. We have a manual process in place for filling in the gap until this issue is resolved fully.

This is sufficient progress for the 1.15 timeframe. Moving to 1.16 as this doesn't need to block the Go 1.15 release.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

michaelsafyan picture michaelsafyan  ·  3Comments

Miserlou picture Miserlou  ·  3Comments

natefinch picture natefinch  ·  3Comments

OneOfOne picture OneOfOne  ·  3Comments

myitcv picture myitcv  ·  3Comments