Go: testing: summary and test output interleaved

Created on 13 Aug 2020 · 50 Comments · Source: golang/go

What version of Go are you using (go version)?

>> go version
go version go1.14.6 linux/amd64

Does this issue reproduce with the latest release?

We haven't yet picked up 1.15, which was cut just yesterday, and this is an intermittent flake for us.

What operating system and processor architecture are you using (go env)?

go env Output

go version
go version go1.14.6 linux/amd64

What did you do?

We are running our e2e tests with t.Parallel() and -v through gotestsum.

What did you expect to see?

gotestsum uses test2json to parse the build log output, and I expect to see passing test results for all of the tests run.

What did you see instead?

Action: pass events are typically missing for two of the tests.

We were originally chasing this here (has links to richer logs): https://github.com/gotestyourself/gotestsum/issues/141

To include some relevant context here, we see several test summaries that are interleaved with other output, which seems to trip up test2json. Here are three examples from a single run (three different API versions):

Normal (what you'd normally see):

--- PASS: TestContainerExitingMsg (0.01s)
    --- PASS: TestContainerExitingMsg/tcp (27.63s)
    --- PASS: TestContainerExitingMsg/http (160.37s)

Weird (interleaved, but surfaces events properly):

--- PASS: TestContainerExitingMsg (0.00s)
        +               Message:            "Waiting for Envoys to receive Endpoints data.",
    --- PASS: TestContainerExitingMsg/http (179.23s)
                    },
                },
                Annotations: nil,
    --- PASS: TestContainerExitingMsg/tcp (194.07s)

Broken (interleaved with another test context switch, breaks things):

--- PASS: TestContainerExitingMsg (0.00s)
=== CONT  TestBlueGreenRoute
    kubelogs.go:201: I 01:01:58.381 webhook-98759d99-pfkgh [conversion/conversion.go:133] [serving-tests/blue-green-route-atsgeusz-q56cd] Converting [kind=Revision group=serving.knative.dev version=v1] to version serving.knative.dev/v1beta1
    --- PASS: TestContainerExitingMsg/http (49.33s)
    --- PASS: TestContainerExitingMsg/tcp (165.25s)

If I change the ordering of lines in this last section, I get the appropriate set of events.

NeedsFix

Most helpful comment

Alrighty, I think I have good news, but you should take this with a grain of salt, since this thing is obviously non-deterministic 😅

So in our typical "contour-latest" leg, we see this manifesting typically a couple of times a day (periodic run every ~2 hours).


Our test-infra folks added an experimental leg that's been running with this change, and clicking through each of the legs to see their results in "spyglass" (we don't have it in "testgrid"), I don't see any manifestation of this in the ~31 runs so far.

So I'm optimistic that this fixes things 🤞 🎉

All 50 comments

@bcmills @matloob @jayconrod

The test2json bug can be seen in these logs:

{"Time":"2020-08-11T01:03:41.641590332Z","Action":"output","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestContainerExitingMsg","Output":"--- PASS: TestContainerExitingMsg (0.00s)\n"}
{"Time":"2020-08-11T01:03:41.641597357Z","Action":"pass","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestContainerExitingMsg","Elapsed":0}
{"Time":"2020-08-11T01:03:41.641625254Z","Action":"output","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestBlueGreenRoute","Output":"    --- PASS: TestContainerExitingMsg/http (49.33s)\n"}
{"Time":"2020-08-11T01:03:41.641632549Z","Action":"output","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestBlueGreenRoute","Output":"    --- PASS: TestContainerExitingMsg/tcp (165.25s)\n"}

The --- PASS: TestContainerExitingMsg output is attributed to TestBlueGreenRoute, and test2json does not create Action: pass events for those two subtests.

I actually suspect that the testing package has races related to the === markers.

Here's another strange interleaving I see, which ultimately leaves certain log lines under the incorrect heading:

=== CONT  TestSecretVolume             <-- Nothing logged wtf?
=== CONT  TestRevisionTimeout
    kubelogs.go:201: I 01:00:41.328 autoscaler-86d4bd9bcb-hvnt8 [kpa-class-podautoscaler-controller] [serving-tests/revision-timeout-when-b67b736422814c3736c90c658f29bd58-zwcnw8nw] Sleeping additionally for 57.671904984s before can scale to 0
    # These log lines actually belong to the first empty heading!
    kubelogs.go:201: I 01:00:49.594 webhook-98759d99-pfkgh [conversion/conversion.go:133] [serving-tests/secret-volume-dsivzkqq] Converting [kind=Route group=serving.knative.dev version=v1] to version serving.knative.dev/v1alpha1
    kubelogs.go:201: I 01:00:52.333 contour-ingress-controller-5dc846564f-jtmpz [controller/controller.go:520] [serving-tests/projected-secret-volume-erpaxwuw] Reconcile succeeded. Time taken: 31.082µs

Is there a way to disable the streaming interleaved output? If so, then we can probably mitigate many of these issues by simply throwing that into our CI setup.

This is probably related to the fix for #39308, and thus also to #40657. (CC @jadekler)

@mattmoor, the output in that last snippet does not look like a call to t.Logf. Was it written directly to os.Stdout or os.Stderr? (If so, the answer is “don't write to os.Stdout or os.Stderr in a parallel test if you want the output to associate correctly”.)

Should all be t.Logf: https://github.com/knative/pkg/blob/c9f9284521f197a17fcdbaf5c4360bef8ef8570d/test/logstream/kubelogs.go#L222

@bcmills @dnephin Is there any way to mitigate this?

Ideally we'd want the -v behavior prior to streaming, but I'd also consider/propose dropping -v in CI if that's the only way until this is fixed.

You may be looking at this formatting, but that's going through the logf: https://github.com/knative/pkg/blob/c9f9284521f197a17fcdbaf5c4360bef8ef8570d/test/logstream/kubelogs.go#L185

Switching to gotestsum's equivalent of "no -v" doesn't help; I suspect that's because -json is purely output based? 😞

@mattmoor, are these t.Logf calls occurring after the test has already completed? (Is the --- PASS line actually chronologically misplaced, or is this more just a matter of missing === CONT headers?)

No, we stop calling t.Logf for a particular test when the cancel function is called, which is registered via t.Cleanup
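
For context, the pattern being described looks roughly like the sketch below (hypothetical names, not knative's actual logstream API): a background goroutine forwards captured log lines to t.Logf, and the cancel that stops it is registered via t.Cleanup, so it only fires after the test function has already returned.

package logstream_test

import "testing"

// startLogStream is a hypothetical sketch of the wiring described above:
// a goroutine streams log lines to t.Logf until a cancel function,
// registered via t.Cleanup, shuts it down.
func startLogStream(t *testing.T, lines <-chan string) {
    done := make(chan struct{})
    go func() {
        for {
            select {
            case line := <-lines:
                // If this fires after the test function has returned but
                // before (or while) the cleanups run, the output can be
                // attributed to the wrong test, or race with the cleanup.
                t.Logf("%s", line)
            case <-done:
                return
            }
        }
    }()
    // The cancel only runs during t.Cleanup, i.e. after the test returns.
    t.Cleanup(func() { close(done) })
}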

Could you please provide a link to an actual test that is failing? I can see how this library is wired to its exported API, but that doesn't really help me understand the sequencing or concurrency properties of the calls in the actual test.

A lot of tests are failing with this. The ones pertinent to the logs in the linked issue are here: https://github.com/knative/serving/blob/d9d93bd2ce10140af538acaa787aec031c526401/test/conformance/api/v1/errorcondition_test.go#L160

The specific callsite I link above is what initializes the logstream here:
https://github.com/knative/serving/blob/d9d93bd2ce10140af538acaa787aec031c526401/test/conformance.go#L72-L73

@mattmoor, could you provide a complete test output demonstrating the problem? I'd like to see which messages (particularly === CONT and === PAUSE messages) precede the misordered ones.

Did this start as of go1.14.6, or has it been broken across all go1.14 releases?

I believe the event problem was new in 1.14.6, but we had other issues related to the fix that landed in .6

The top comment of https://github.com/gotestyourself/gotestsum/issues/141 has a link to download the full build log. It's interleaved with some other bash output, but when I was playing with test2json locally I just snipped the go test output out of one of these.
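
(For anyone trying to reproduce locally: a snipped verbose log can be replayed through test2json directly; the filename here is just a placeholder.)

go tool test2json < snipped-go-test-output.log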

@mattmoor, one more question: do the tests in question pass if the -race flag is set? Trying to rule out an unlikely data race.

Running go test with args: -race -count=1 -timeout=30m ./test/conformance/api/... ./test/conformance/runtime/... ./test/e2e --resolvabledomain=false --ingressClass=contour.ingress.networking.knative.dev -tags=e2e

Looks like this is with -race.

Thanks, that's helpful. I haven't been able to reproduce this with all logging occurring synchronously within the test, but I can get a very similar symptom by logging after the test has returned (which appears to bypass both the streaming output and synchronization):

https://play.golang.org/p/SnpZP6n0lvk

package main_test

import (
    "fmt"
    "sync"
    "testing"
)

func TestWithDelayedOutput(t *testing.T) {
    t.Parallel()

    subtestsDone := make(chan struct{})
    var delayed sync.WaitGroup
    for i := 0; i < 3; i++ {
        t.Run(fmt.Sprint(i), func(t *testing.T) {
            delayed.Add(1)
            go func() {
                <-subtestsDone
                defer delayed.Done()
                // Logs to the subtest's t after the subtest has returned.
                t.Logf(t.Name())
            }()
        })
    }

    close(subtestsDone)
    delayed.Wait()
}
=== RUN   TestWithDelayedOutput
=== PAUSE TestWithDelayedOutput
=== CONT  TestWithDelayedOutput
=== RUN   TestWithDelayedOutput/0
=== RUN   TestWithDelayedOutput/1
=== RUN   TestWithDelayedOutput/2
--- PASS: TestWithDelayedOutput (0.00s)
    --- PASS: TestWithDelayedOutput/0 (0.00s)
    --- PASS: TestWithDelayedOutput/1 (0.00s)
    --- PASS: TestWithDelayedOutput/2 (0.00s)
    prog.go:22: TestWithDelayedOutput/0
    prog.go:22: TestWithDelayedOutput/1
    prog.go:22: TestWithDelayedOutput/2
PASS

I suspect that it won't be hard to extrapolate from that to a racy write.

I am able to reproduce the interrupting === CONT line by wrapping this bytes.NewReader in an iotest.OneByteReader, but I still can't get it to reproduce without.

@bcmills What if there is logging after the test function returns, but before the t.Cleanups have executed?

This is actually very possible today since the t.Cleanup is what cancels our logstream to the test's t.Log

@mattmoor, it appears that logging to a test after the test function but before the t.Cleanup is generally racy.

Logging to a subtest before the main test's Cleanup results in a data race:

https://play.golang.org/p/EaLLhYQejXa

==================
WARNING: DATA RACE
Read at 0x00c000104658 by goroutine 10:
  testing.(*common).frameSkip()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:479 +0x26e
  testing.(*common).decorate()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:524 +0x64
  testing.(*common).logDepth()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:734 +0x1a9
  testing.(*common).log()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:720 +0x8f
  testing.(*common).Logf()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:766 +0x21
  example%2ecom.TestWithDelayedOutput.func1.1()
      /tmp/tmp.BqaVZz8DDs/example.com/main_test.go:20 +0xe5

Previous write at 0x00c000104658 by goroutine 7:
  testing.(*common).Cleanup.func1()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:858 +0xba
  testing.(*common).runCleanup()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:936 +0xca
  testing.tRunner.func2()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1102 +0x74
  testing.tRunner()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1112 +0x22a

Goroutine 10 (running) created at:
  example%2ecom.TestWithDelayedOutput.func1()
      /tmp/tmp.BqaVZz8DDs/example.com/main_test.go:17 +0x92
  testing.tRunner()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202

Goroutine 7 (running) created at:
  testing.(*T).Run()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1159 +0x796
  testing.runTests.func1()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1430 +0xa6
  testing.tRunner()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202
  testing.runTests()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1428 +0x5aa
  testing.(*M).Run()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1338 +0x4eb
  main.main()
      _testmain.go:43 +0x236
==================
--- FAIL: TestWithDelayedOutput (0.00s)
    main_test.go:20: TestWithDelayedOutput/0
    main_test.go:20: TestWithDelayedOutput/2
    main_test.go:20: TestWithDelayedOutput/1
    testing.go:1023: race detected during execution of test
FAIL
exit status 1
FAIL    example.com 0.020s

Likewise for logging to the main test function itself:

https://play.golang.org/p/sGxwXx_D8KH

==================
WARNING: DATA RACE
Read at 0x00c000114358 by goroutine 9:
  testing.(*common).frameSkip()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:479 +0x26e
  testing.(*common).decorate()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:524 +0x64
  testing.(*common).logDepth()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:751 +0x5ab
  testing.(*common).log()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:720 +0x8f
  testing.(*common).Logf()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:766 +0x21
  example%2ecom.TestWithDelayedOutput.func1()
      /tmp/tmp.BqaVZz8DDs/example.com/main_test.go:14 +0x71

Previous write at 0x00c000114358 by goroutine 7:
  testing.(*common).Cleanup.func1()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:858 +0xba
  testing.(*common).runCleanup()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:936 +0xca
  testing.tRunner.func2()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1102 +0x74
  testing.tRunner()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1112 +0x22a

Goroutine 9 (running) created at:
  example%2ecom.TestWithDelayedOutput()
      /tmp/tmp.BqaVZz8DDs/example.com/main_test.go:12 +0x86
  testing.tRunner()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202

Goroutine 7 (running) created at:
  testing.(*T).Run()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1159 +0x796
  testing.runTests.func1()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1430 +0xa6
  testing.tRunner()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202
  testing.runTests()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1428 +0x5aa
  testing.(*M).Run()
      /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1338 +0x4eb
  main.main()
      _testmain.go:43 +0x236
==================
--- FAIL: TestWithDelayedOutput (0.00s)
    main_test.go:14: BRAAAAINS
    testing.go:1023: race detected during execution of test
FAIL
exit status 1
FAIL    example.com 0.022s

On the other hand, logging to a test before that test function's own Cleanup seems to be tolerated (https://play.golang.org/p/X514jUu6UPM). I'm not sure why that is. If you've got a use-case where it matters, I would suggest filing a separate issue about the safety of t.Log during a t.Cleanup function.
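
For reference, one shape of the tolerated pattern is a goroutine that logs after the test function returns but is waited for by that same test's Cleanup, so no t.Log call outlives the test's own cleanup. A minimal sketch (illustrative only, not the playground code verbatim):

package main_test

import (
    "testing"
    "time"
)

func TestToleratedLogging(t *testing.T) {
    finished := make(chan struct{})

    go func() {
        defer close(finished)
        time.Sleep(10 * time.Millisecond) // likely fires after the test function has returned
        t.Log("logged before this test's own Cleanup has finished")
    }()

    // Registered on the same t being logged to: the Cleanup waits for the
    // goroutine, so the stray t.Log never outlives this test's cleanup.
    t.Cleanup(func() { <-finished })
}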

Change https://golang.org/cl/249026 mentions this issue: testing: flush test summaries to stdout atomically when streaming output

@mattmoor, I still can't reproduce the reported behavior without modifying the code, but I think I understand where it's coming from. Could you try patching in https://golang.org/cl/249026 and see if that fixes the test output in your environment?

@gopherbot, please backport to Go 1.14 and 1.15. This appears to be a regression in the 1.14.6 point release, and there is no apparent workaround for users who require JSON-formatted test output.

Backport issue(s) opened: #40880 (for 1.14), #40881 (for 1.15).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

It's less about logging in t.Cleanup and more about using t.Cleanup to tell the thing being cancelled to stop.

I believe we also have cases where, if the cleanup fails, we might t.Error, which I assume is similarly problematic?

I don't know how hard it would be to patch a change into the Go used in our CI environment, but perhaps @chaodaig or @chizhg know

I filed the Cleanup issue separately as #40908.

FWIW, it appears that the Cleanup issue is specific to Go 1.15. (It does not affect 1.14.*, and will ideally be fixed in 1.15.1.)

Awesome. Is there a general schedule for patch releases? When should I look to pick this up?

Minor releases are periodic, although I'm not sure at what interval. (Certainly every couple of months.)

Note that the more general fix for the interleaved output is still out for review (and still lacks a regression test because I couldn't actually reproduce the symptoms you described). It would be really helpful to have some data on whether that pending change fixes your use-case. 🙂

Change https://golang.org/cl/252637 mentions this issue: [release-branch.go1.15] testing: flush test summaries to stdout atomically when streaming output

Any chance this landed in the 1.15.1 that I just saw was cut?

@mattmoor, this should be fixed at head, but since I couldn't reproduce the failure without injecting a OneByteReader I'm a little uncomfortable backporting to 1.14.9 or 1.15.2 without further testing. Any chance you could try a build from source (https://golang.org/doc/install/source#head) or use golang.org/dl/gotip to confirm that it fixes the symptoms you observed?

Change https://golang.org/cl/252638 mentions this issue: [release-branch.go1.14] testing: flush test summaries to stdout atomically when streaming output

I'll talk to the folks that operate the Knative CI infrastructure. This hits about once a day in each of our post-submit test legs, so if we could convert even just one of those, it'd be a pretty strong signal IMO.

If you want to minimize other variables, you could also try building from source on release-branch.go1.14 with CL 252638 manually patched in, and/or release-branch.go1.15 with CL 252637 manually patched in.

Alrighty, I think I have good news, but you should take this with a grain of salt, since this thing is obviously non-deterministic 😅

So in our typical "contour-latest" leg, we see this manifesting typically a couple of times a day (periodic run every ~2 hours).


Our test-infra folks added an experimental leg that's been running with this change, and clicking through each of the legs to see their results in "spyglass" (we don't have it in "testgrid"), I don't see any manifestation of this in the ~31 runs so far.

So I'm optimistic that this fixes things 🤞 🎉

This is still biting us pretty pervasively across our various e2e tests. When do we think we can expect this in a patch release on 1.14 or 1.15?

@mattmoor, if we don't get any reports of regressions at head, then I would guess that this can make 1.15.3.

(It would help to have more data supporting that it fixes your issue — you mentioned ~31 runs earlier; have you had any occurrences of the bug using the patched version since then?)

I will try to dig the link to those runs back up and see if they are still running, but I haven't looked since posting above.

Something had broken the test image, but they kicked it today and it's running again. Every build since has been clean, and I'll keep an eye on it over the weekend.

For my own sanity: https://prow.knative.dev/?job=ci-knative-serving-contour-latest-beta-prow-tests

So far another 16 runs without this issue manifesting 🤞

Up to 29 runs without this issue. As a parallel datapoint, in our unmodified leg over roughly the same period, 6 of the last 35 runs have failed with this.

Up to 44 runs without this issue manifesting. How much longer should we do this? I think we'll soon fall off the end of our run history and fall victim to my horrible mental bookmarking/checkpointing of these numbers.

Future self: Green run number 44 was 9/21 at 15:43:56.

That seems like pretty solid evidence to me. I'll talk with the folks on the release team and see if we can get this into the next minor release.

Thanks for following up!

Hi.

It seems that this change to only flush on completion essentially prevents you from watching tests in the terminal. This is not too much of an issue for CI/CD, but it's unsettling when running tests manually. Is there any workaround to still be able to see some progress while the tests run, and not just a wall of text after 30 minutes? Am I missing something big here?

@calexicoz, the change for this issue was not “to only flush on completion”. Flushing on completion is the default behavior for Go tests.

The major change was to _add_ support for streaming output while the test is running, but only if the -v flag is set (#24929). This issue was a fix to the synchronization in the implementation of that feature.
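
Concretely, the difference is just the -v flag (the streaming behavior referenced above, from #24929):

go test ./...      # default: each package's output is flushed when that package finishes
go test -v ./...   # verbose: per-test output streams while the tests are running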

Mmm. Something changed recently though. I am running my tests as:

go test -v -tags integration_testing ./pkg/test/integration/...

And it definitely does not stream. It's a long test suite, so it runs for 20min or so then outputs everything. I'm on go version go1.15.5 darwin/amd64. Should I open a new issue? It was working until yesterday morning when I updated Homebrew and got the newest Go version. Moving the -v flag around to different locations on the command line does not work either.

@calexicoz, yes, please open a new issue (ideally with steps to reproduce it).

(Is it possible that all of the logging in your test is going to a *testing.T for which the corresponding Test function or Run callback has already returned?)

(Is it possible that all of the logging in your test is going to a *testing.T for which the corresponding Test function or Run callback has already returned?)

I don't have any logging in the test proper. All I expect is simply to see the RUN and PASS statements as it goes. It used to do that and does not anymore. I'll see if I can reproduce it in a test repo and then I'll open a new issue.
