What version of Go are you using (go version)?
go version go1.14.6 linux/amd64
We haven't yet picked up 1.15, which was cut just yesterday, and this is an intermittent flake for us.
We are running our e2e tests with t.Parallel() and -v through gotestsum.
gotestsum uses test2json to parse the build log output, and I expect to see passing test results for all of the tests run.
Action: pass events are typically missing for two of the tests.
We were originally chasing this here (has links to richer logs): https://github.com/gotestyourself/gotestsum/issues/141
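(For readers unfamiliar with the setup: gotestsum normally drives "go test -json" under the hood, which is what brings test2json into the picture. The flags below are illustrative rather than copied from the Knative CI config, but the invocation is roughly of this shape:)
gotestsum --format standard-verbose -- -race -count=1 -timeout=30m ./test/e2e/...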
To give some relevant context here: we see several test summaries interleaved with other output, which seems to trip up test2json. Here are three from a single run (three different API versions):
Normal (what you'd normally see):
--- PASS: TestContainerExitingMsg (0.01s)
--- PASS: TestContainerExitingMsg/tcp (27.63s)
--- PASS: TestContainerExitingMsg/http (160.37s)
Weird (interleaved, but surfaces events properly):
--- PASS: TestContainerExitingMsg (0.00s)
+ Message: "Waiting for Envoys to receive Endpoints data.",
--- PASS: TestContainerExitingMsg/http (179.23s)
},
},
Annotations: nil,
--- PASS: TestContainerExitingMsg/tcp (194.07s)
Broken (interleaved across another test's context switch, which breaks things):
--- PASS: TestContainerExitingMsg (0.00s)
=== CONT TestBlueGreenRoute
kubelogs.go:201: I 01:01:58.381 webhook-98759d99-pfkgh [conversion/conversion.go:133] [serving-tests/blue-green-route-atsgeusz-q56cd] Converting [kind=Revision group=serving.knative.dev version=v1] to version serving.knative.dev/v1beta1
--- PASS: TestContainerExitingMsg/http (49.33s)
--- PASS: TestContainerExitingMsg/tcp (165.25s)
If I change the ordering of lines in this last section, I get the appropriate set of events.
@bcmills @matloob @jayconrod
The test2json bug can be seen in these logs:
{"Time":"2020-08-11T01:03:41.641590332Z","Action":"output","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestContainerExitingMsg","Output":"--- PASS: TestContainerExitingMsg (0.00s)\n"}
{"Time":"2020-08-11T01:03:41.641597357Z","Action":"pass","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestContainerExitingMsg","Elapsed":0}
{"Time":"2020-08-11T01:03:41.641625254Z","Action":"output","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestBlueGreenRoute","Output":" --- PASS: TestContainerExitingMsg/http (49.33s)\n"}
{"Time":"2020-08-11T01:03:41.641632549Z","Action":"output","Package":"knative.dev/serving/test/conformance/api/v1","Test":"TestBlueGreenRoute","Output":" --- PASS: TestContainerExitingMsg/tcp (165.25s)\n"}
The --- PASS: TestContainerExitingMsg subtest output is attributed to TestBlueGreenRoute, so test2json does not create Action: pass events for those two subtests.
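For context, the JSON lines above are test2json events. A minimal sketch of a consumer that decodes the stream and tracks which tests ever received an Action: pass event (the field names come from the cmd/test2json documentation; everything else here is illustrative):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// TestEvent mirrors the event format documented for cmd/test2json.
type TestEvent struct {
	Time    time.Time
	Action  string // "run", "output", "pass", "fail", ...
	Package string
	Test    string
	Elapsed float64
	Output  string
}

func main() {
	passed := map[string]bool{}
	sc := bufio.NewScanner(os.Stdin) // e.g. the output of: go test -json ./...
	for sc.Scan() {
		var ev TestEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue // ignore non-JSON lines
		}
		if ev.Action == "pass" && ev.Test != "" {
			passed[ev.Package+"/"+ev.Test] = true
		}
	}
	fmt.Println("tests with a pass event:", len(passed))
}

In the broken run above, the pass events for those two subtests simply never appear in the stream, so a consumer like this (or gotestsum) never sees them.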
I actually suspect that the testing package has races related to the === markers.
Here's another strange interleaving I see, which ultimately leaves certain log lines under the incorrect heading:
=== CONT TestSecretVolume <-- Nothing logged wtf?
=== CONT TestRevisionTimeout
kubelogs.go:201: I 01:00:41.328 autoscaler-86d4bd9bcb-hvnt8 [kpa-class-podautoscaler-controller] [serving-tests/revision-timeout-when-b67b736422814c3736c90c658f29bd58-zwcnw8nw] Sleeping additionally for 57.671904984s before can scale to 0
# These log lines actually belong to the first empty heading!
kubelogs.go:201: I 01:00:49.594 webhook-98759d99-pfkgh [conversion/conversion.go:133] [serving-tests/secret-volume-dsivzkqq] Converting [kind=Route group=serving.knative.dev version=v1] to version serving.knative.dev/v1alpha1
kubelogs.go:201: I 01:00:52.333 contour-ingress-controller-5dc846564f-jtmpz [controller/controller.go:520] [serving-tests/projected-secret-volume-erpaxwuw] Reconcile succeeded. Time taken: 31.082µs
Is there a way to disable the streaming interleaved output? If so, we can probably mitigate many of these issues by simply throwing that into our CI setup.
This is probably related to the fix for #39308, and thus also to #40657. (CC @jadekler)
@mattmoor, the output in that last snippet does not look like a call to t.Logf. Was it written directly to os.Stdout or os.Stderr? (If so, the answer is “don't write to os.Stdout or os.Stderr in a parallel test if you want the output to associate correctly”.)
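(For anyone skimming, the distinction being drawn is roughly the following; the test body is made up for illustration:)

package example_test

import (
	"fmt"
	"os"
	"testing"
)

func TestExample(t *testing.T) {
	t.Parallel()

	// Buffered by the testing package and attributed to TestExample
	// in both -v and -json output:
	t.Logf("reconciler synced")

	// Written directly to the process's stdout; with parallel tests and
	// streamed -v output, this line can land under whichever test's
	// heading happens to be active at the time:
	fmt.Fprintln(os.Stdout, "reconciler synced")
}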
Should all be t.Logf: https://github.com/knative/pkg/blob/c9f9284521f197a17fcdbaf5c4360bef8ef8570d/test/logstream/kubelogs.go#L222
@bcmills @dnephin Is there any way to mitigate this?
Ideally we'd want the -v behavior prior to streaming, but I'd also consider/propose dropping -v in CI if that's the only way until this is fixed.
You may be looking at this formatting, but that's going through the logf: https://github.com/knative/pkg/blob/c9f9284521f197a17fcdbaf5c4360bef8ef8570d/test/logstream/kubelogs.go#L185
Switching to gotestsum's equivalent of "no -v" doesn't help; I suspect that's because -json is purely output-based? 😞
@mattmoor, are these t.Logf calls occurring after the test has already completed? (Is the --- PASS line actually chronologically misplaced, or is this more just a matter of missing === CONT headers?)
No, we stop calling t.Logf for a particular test when the cancel function is called, which is registered via t.Cleanup
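(To make the shape of that concrete, here is a simplified sketch of the pattern, not the actual knative logstream code:)

package example_test

import (
	"bufio"
	"context"
	"io"
	"strings"
	"testing"
)

// streamLogs is a simplified stand-in for the logstream helper: it copies
// lines to t.Logf on a background goroutine until cancel is called.
func streamLogs(t *testing.T, r io.Reader) context.CancelFunc {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		sc := bufio.NewScanner(r)
		for sc.Scan() {
			select {
			case <-ctx.Done():
				return
			default:
				t.Logf("%s", sc.Text())
			}
		}
	}()
	return cancel
}

func TestWithLogStream(t *testing.T) {
	t.Parallel()
	cancel := streamLogs(t, strings.NewReader("line 1\nline 2\n"))
	// Registering cancel via t.Cleanup means the goroutine is only told to
	// stop after the test function has returned, so a t.Logf call can still
	// be in flight in that window -- the window discussed in this issue.
	t.Cleanup(cancel)
}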
Could you please provide a link to an actual test that is failing? I can see how this library is wired to its exported API, but that doesn't really help me understand the sequencing or concurrency properties of the calls in the actual test.
A lot of tests are failing with this. The ones pertinent to the logs in the linked issue are here: https://github.com/knative/serving/blob/d9d93bd2ce10140af538acaa787aec031c526401/test/conformance/api/v1/errorcondition_test.go#L160
The specific callsite I link above is what initializes the logstream here:
https://github.com/knative/serving/blob/d9d93bd2ce10140af538acaa787aec031c526401/test/conformance.go#L72-L73
@mattmoor, could you provide a complete test output demonstrating the problem? I'd like to see which messages (particularly === CONT and === PAUSE messages) precede the misordered ones.
Did this start as of go1.14.6, or has it been broken across all go1.14 releases?
I believe the event problem was new in 1.14.6, but we had other issues related to the fix that landed in .6
The top comment of https://github.com/gotestyourself/gotestsum/issues/141 has a link to download the full build log. It's interleaved with some other bash output, but when I was playing with test2json locally I just snipped the go test output out of one of these.
@mattmoor, one more question: do the tests in question pass if the -race flag is set? Trying to rule out an unlikely data race.
Running go test with args: -race -count=1 -timeout=30m ./test/conformance/api/... ./test/conformance/runtime/... ./test/e2e --resolvabledomain=false --ingressClass=contour.ingress.networking.knative.dev -tags=e2e
Looks like this is with -race.
Thanks, that's helpful. I haven't been able to reproduce this with all logging occurring synchronously within the test, but I can get a very similar symptom by logging after the test has returned (which appears to bypass both the streaming output and synchronization):
https://play.golang.org/p/SnpZP6n0lvk
package main

import (
	"fmt"
	"sync"
	"testing"
)

func TestWithDelayedOutput(t *testing.T) {
	t.Parallel()

	subtestsDone := make(chan struct{})
	var delayed sync.WaitGroup
	for i := 0; i < 3; i++ {
		t.Run(fmt.Sprint(i), func(t *testing.T) {
			delayed.Add(1)
			go func() {
				<-subtestsDone
				defer delayed.Done()
				// Logs to the subtest's t after the subtest (and its
				// parent) have already finished running.
				t.Logf(t.Name())
			}()
		})
	}
	close(subtestsDone)
	delayed.Wait()
}
=== RUN TestWithDelayedOutput
=== PAUSE TestWithDelayedOutput
=== CONT TestWithDelayedOutput
=== RUN TestWithDelayedOutput/0
=== RUN TestWithDelayedOutput/1
=== RUN TestWithDelayedOutput/2
--- PASS: TestWithDelayedOutput (0.00s)
--- PASS: TestWithDelayedOutput/0 (0.00s)
--- PASS: TestWithDelayedOutput/1 (0.00s)
--- PASS: TestWithDelayedOutput/2 (0.00s)
prog.go:22: TestWithDelayedOutput/0
prog.go:22: TestWithDelayedOutput/1
prog.go:22: TestWithDelayedOutput/2
PASS
I suspect that it won't be hard to extrapolate from that to a racy write.
I am able to reproduce the interrupting === CONT line by wrapping this bytes.NewReader in an iotest.OneByteReader, but I still can't get it to reproduce without.
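(For anyone unfamiliar with that helper: testing/iotest's OneByteReader wraps a reader so that every Read returns at most one byte, forcing the consumer to handle input split at arbitrary boundaries. A standalone illustration, not the actual test2json test:)

package main

import (
	"bytes"
	"fmt"
	"io"
	"testing/iotest"
)

func main() {
	src := bytes.NewReader([]byte("=== CONT TestFoo\n"))
	r := iotest.OneByteReader(src) // each Read returns at most one byte
	buf := make([]byte, 16)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			fmt.Printf("read %q\n", buf[:n])
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
	}
}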
@bcmills What if there is logging after the test function returns, but before the t.Cleanups have executed?
This is actually very possible today, since the t.Cleanup is what cancels our logstream to the test's t.Log.
@mattmoor, it appears that logging to a test after the test function has returned, but before its t.Cleanup runs, is generally racy.
Logging to a subtest before the main test's Cleanup results in a data race:
https://play.golang.org/p/EaLLhYQejXa
==================
WARNING: DATA RACE
Read at 0x00c000104658 by goroutine 10:
testing.(*common).frameSkip()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:479 +0x26e
testing.(*common).decorate()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:524 +0x64
testing.(*common).logDepth()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:734 +0x1a9
testing.(*common).log()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:720 +0x8f
testing.(*common).Logf()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:766 +0x21
example%2ecom.TestWithDelayedOutput.func1.1()
/tmp/tmp.BqaVZz8DDs/example.com/main_test.go:20 +0xe5
Previous write at 0x00c000104658 by goroutine 7:
testing.(*common).Cleanup.func1()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:858 +0xba
testing.(*common).runCleanup()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:936 +0xca
testing.tRunner.func2()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1102 +0x74
testing.tRunner()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1112 +0x22a
Goroutine 10 (running) created at:
example%2ecom.TestWithDelayedOutput.func1()
/tmp/tmp.BqaVZz8DDs/example.com/main_test.go:17 +0x92
testing.tRunner()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202
Goroutine 7 (running) created at:
testing.(*T).Run()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1159 +0x796
testing.runTests.func1()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1430 +0xa6
testing.tRunner()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202
testing.runTests()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1428 +0x5aa
testing.(*M).Run()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1338 +0x4eb
main.main()
_testmain.go:43 +0x236
==================
--- FAIL: TestWithDelayedOutput (0.00s)
main_test.go:20: TestWithDelayedOutput/0
main_test.go:20: TestWithDelayedOutput/2
main_test.go:20: TestWithDelayedOutput/1
testing.go:1023: race detected during execution of test
FAIL
exit status 1
FAIL example.com 0.020s
Likewise for logging to the main test function itself:
https://play.golang.org/p/sGxwXx_D8KH
==================
WARNING: DATA RACE
Read at 0x00c000114358 by goroutine 9:
testing.(*common).frameSkip()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:479 +0x26e
testing.(*common).decorate()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:524 +0x64
testing.(*common).logDepth()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:751 +0x5ab
testing.(*common).log()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:720 +0x8f
testing.(*common).Logf()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:766 +0x21
example%2ecom.TestWithDelayedOutput.func1()
/tmp/tmp.BqaVZz8DDs/example.com/main_test.go:14 +0x71
Previous write at 0x00c000114358 by goroutine 7:
testing.(*common).Cleanup.func1()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:858 +0xba
testing.(*common).runCleanup()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:936 +0xca
testing.tRunner.func2()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1102 +0x74
testing.tRunner()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1112 +0x22a
Goroutine 9 (running) created at:
example%2ecom.TestWithDelayedOutput()
/tmp/tmp.BqaVZz8DDs/example.com/main_test.go:12 +0x86
testing.tRunner()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202
Goroutine 7 (running) created at:
testing.(*T).Run()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1159 +0x796
testing.runTests.func1()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1430 +0xa6
testing.tRunner()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1108 +0x202
testing.runTests()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1428 +0x5aa
testing.(*M).Run()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1338 +0x4eb
main.main()
_testmain.go:43 +0x236
==================
--- FAIL: TestWithDelayedOutput (0.00s)
main_test.go:14: BRAAAAINS
testing.go:1023: race detected during execution of test
FAIL
exit status 1
FAIL example.com 0.022s
On the other hand, logging to a test before that test function's own Cleanup seems to be tolerated (https://play.golang.org/p/X514jUu6UPM). I'm not sure why that is. If you've got a use-case where it matters, I would suggest filing a separate issue about the safety of t.Log during a t.Cleanup function.
Change https://golang.org/cl/249026 mentions this issue: testing: flush test summaries to stdout atomically when streaming output
@mattmoor, I still can't reproduce the reported behavior without modifying the code, but I think I understand where it's coming from. Could you try patching in https://golang.org/cl/249026 and see if that fixes the test output in your environment?
@gopherbot, please backport to Go 1.14 and 1.15. This appears to be a regression in the 1.14.6 point release, and there is no apparent workaround for users who require JSON-formatted test output.
Backport issue(s) opened: #40880 (for 1.14), #40881 (for 1.15).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
It's less about logging in t.Cleanup and more about using t.Cleanup to tell the thing being cancelled to stop.
I believe we also have cases where, if the cleanup fails, we might t.Error, which I assume is similarly problematic?
I don't know how hard it would be to patch a change into the Go used in our CI environment, but perhaps @chaodaig or @chizhg know.
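(For concreteness, the t.Error-on-failed-cleanup pattern in question looks roughly like this; the resource type is a made-up stand-in:)

package example_test

import "testing"

// fakeResource stands in for whatever external resource the test creates.
type fakeResource struct{}

func (fakeResource) Delete() error { return nil }

func TestWithTeardown(t *testing.T) {
	t.Parallel()
	res := fakeResource{}
	t.Cleanup(func() {
		// Runs after the test function returns; reports a failure if the
		// teardown itself goes wrong.
		if err := res.Delete(); err != nil {
			t.Error("cleanup failed:", err)
		}
	})
}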
I filed the Cleanup issue separately as #40908.
FWIW, it appears that the Cleanup issue is specific to Go 1.15. (It does not affect 1.14.*, and will ideally be fixed in 1.15.1.)
Awesome. Is there a general schedule for patch releases? When should I look to pick this up?
Minor releases are periodic, although I'm not sure at what interval. (Certainly every couple of months.)
Note that the more general fix for the interleaved output is still out for review (and still lacks a regression test because I couldn't actually reproduce the symptoms you described). It would be really helpful to have some data on whether that pending change fixes your use-case. 🙂
Change https://golang.org/cl/252637 mentions this issue: [release-branch.go1.15] testing: flush test summaries to stdout atomically when streaming output
Any chance this landed in the 1.15.1 that I just saw was cut?
@mattmoor, this should be fixed at head, but since I couldn't reproduce the failure without injecting a OneByteReader I'm a little uncomfortable backporting to 1.14.9 or 1.15.2 without further testing. Any chance you could try a build from source (https://golang.org/doc/install/source#head) or use golang.org/dl/gotip to confirm that it fixes the symptoms you observed?
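(For anyone else wanting to try head without a full source build, the golang.org/dl/gotip wrapper works roughly as follows; double-check the commands against that package's documentation:)
go get golang.org/dl/gotip
gotip download
gotip test ./...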
Change https://golang.org/cl/252638 mentions this issue: [release-branch.go1.14] testing: flush test summaries to stdout atomically when streaming output
I'll talk to the folks that operate the Knative CI infrastructure. This hits about once a day in each of our post-submit test legs, so if we could convert even just one of those, it'd be a pretty strong signal IMO.
Alrighty, I think I have good news, but you should take this with the grain of salt that this thing is obvs non-deterministic 😅
So in our typical "contour-latest" leg, we see this manifesting typically a couple times a day (periodic run every ~2 hours).

Our test-infra folks added an experimental leg that's been running with this change, and clicking through each of the legs to see their results in "spyglass" (we don't have it in "testgrid"), I don't see any manifestations of this in ~31 runs so far.
So I'm optimistic that this fixes things 🤞 🎉
This is still biting us pretty pervasively across our various e2e tests. When do we think we can expect this in a patch release on 1.14 or 1.15?
@mattmoor, if we don't get any reports of regressions at head, then I would guess that this can make 1.15.3.
(It would help to have more data supporting that it fixes your issue — you mentioned ~31 runs earlier; have you had any occurrences of the bug using the patched version since then?)
I will try to dig the link to those runs back up and see if they are still running, but I haven't looked since posting above.
Something had broken the test image, but they kicked it today and it's running again. Every build since has been clean, and I'll keep an eye on it over the weekend.
For my own sanity: https://prow.knative.dev/?job=ci-knative-serving-contour-latest-beta-prow-tests
So far another 16 runs without this issue manifesting 🤞
Up to 29 runs without this issue. As a parallel datapoint, in our unmodified leg over roughly the same period 6 of the last 35 runs have failed with this.
Up to 44 runs without this issue manifesting. How much longer should we do this? I think we'll soon start to fall off of our run history and start to fall victim to my horrible mental bookmarking / checkpointing in these numbers.
Future self: Green run number 44 was 9/21 at 15:43:56.
That seems like pretty solid evidence to me. I'll talk with the folks on the release team and see if we can get this into the next minor release.
Thanks for following up!
Hi.
It seems that this change to only flush on completion essentially prevents you from watching tests in the terminal. This is not too much of an issue for CI/CD, but it's unsettling when running tests manually. Is there any workaround to still be able to see some progress while tests run, and not just a wall of text after 30 minutes? Am I missing something big here?
@calexicoz, the change for this issue was not to “to only flush on completion”. Flushing on completion is the default behavior for Go tests.
The major change was to _add_ support for streaming output while the test is running, but only if the -v flag is set (#24929). This issue was a fix to the synchronization in the implementation of that feature.
Mmm. Something changed recently though. I am running my tests as:
go test -v -tags integration_testing ./pkg/test/integration/...
And it definitely does not stream. It's a long test suite, so it runs for 20 minutes or so and then outputs everything. I'm on go version go1.15.5 darwin/amd64. Should I open a new issue? It was working until yesterday morning, when I updated Homebrew and got the newest Go version. Moving the -v flag to different locations on the command line doesn't help either.
@calexicoz, yes, please open a new issue (ideally with steps to reproduce it).
(Is it possible that all of the logging in your test is going to a *testing.T for which the corresponding Test function or Run callback has already returned?)
I don't have any logging in the test proper. All I expect is simply to see the RUN and PASS statements as it goes. It used to do that and does not anymore. I'll see if I can reproduce it in a test repo and then I'll open a new issue.