Nightly build unit test failed:
TaskRun is: pipeline-release-nightly-gkd24-unit-tests-l2pdb
Failure:
--- FAIL: TestReconcile_ExplicitDefaultSA/success (0.08s)
Tekton Pipeline version: v0.13.2
I tried to reproduce this locally and failed:
go test -race -cover -count=10 -run="TestReconcile_ExplicitDefaultSA" ./pkg/reconciler/taskrun
@dibyom any logs below the FAIL: line? (to see where it failed)
/area testing
Nothing unfortunately:
[unit-test] --- FAIL: TestReconcile_ExplicitDefaultSA (0.14s)
[unit-test] --- FAIL: TestReconcile_ExplicitDefaultSA/success (0.08s)
[unit-test] FAIL
[unit-test] FAIL github.com/tektoncd/pipeline/pkg/reconciler/taskrun 22.590s
[unit-test] FAIL
The command I used (you have to be connected to the dogfooding cluster):
tkn tr logs pipeline-release-nightly-gkd24-unit-tests-l2pdb
This looks like it's passing now: https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-snb9k
Should this be closed or dropped in priority? It's still marked critical-urgent.
I think the point was that this is non-deterministic - let's leave it open until we feel we understand why the race occurred.
This test has failed during nightlies on two of my build cop rotations in a row. I'll assign this to myself and see if I can figure out what's triggering the random behaviour.
Error capture from latest nightly: https://gist.github.com/sbwsg/ce24e70e389e205dcb9d0890afc4193c
* failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
I'm not able to reproduce this. Tried with various combinations of -cpu, -race and -count with no failures. I'm going to unassign myself because I'm not sure how to make progress.
fyi this test failed again last night: https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-h75d5
Closing for now. We've merged a couple fixes towards this so should reopen or start a new one if we see the problem surface again.
Nice!!! Thanks @sbwsg and @jerop :D
Looks like this failed last night (https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-qnsg6?pipelineTask=unit-tests&step=unit-test):
[unit-test] 2020-10-09T02:04:12.139Z ERROR TestReconcile_ExplicitDefaultSA/success taskrun/reconciler.go:294 Returned an error {"targetMethod": "ReconcileKind", "targetMethod": "ReconcileKind", "error": "1 error occurred:\n\t* failed to create task run pod \"test-taskrun-run-success\": translating TaskSpec to Pod: serviceaccounts \"default\" not found. Maybe missing or invalid Task foo/test-task\n\n"}
[unit-test] github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile
[unit-test] /workspace/src/github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294
[unit-test] github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestReconcile_ExplicitDefaultSA.func1
[unit-test] /workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:532
[unit-test] testing.tRunner
[unit-test] /usr/local/go/src/testing/testing.go:1127
[unit-test] taskrun_test.go:533: expected no error. Got error 1 error occurred:
[unit-test] * failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
...
[unit-test] --- FAIL: TestReconcile_ExplicitDefaultSA (0.17s)
[unit-test] --- FAIL: TestReconcile_ExplicitDefaultSA/success (0.03s)
[unit-test] --- PASS: TestReconcile_ExplicitDefaultSA/serviceaccount (0.14s)
Failed again last night (https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-wpbpz?pipelineTask=unit-tests&step=unit-test):
=== RUN TestReconcile_ExplicitDefaultSA/success
2020-11-02T02:05:13.149Z ERROR TestReconcile_ExplicitDefaultSA/success taskrun/taskrun.go:394 Failed to create task run pod for taskrun "test-taskrun-run-success": failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).reconcile
/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:394
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).ReconcileKind
/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:192
github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile
/workspace/src/github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:240
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestReconcile_ExplicitDefaultSA.func1
/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:536
testing.tRunner
/usr/local/go/src/testing/testing.go:1123
2020-11-02T02:05:13.149Z ERROR TestReconcile_ExplicitDefaultSA/success taskrun/taskrun.go:193 Reconcile: failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).ReconcileKind
/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:193
github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile
/workspace/src/github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:240
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestReconcile_ExplicitDefaultSA.func1
/workspace/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:536
testing.tRunner
/usr/local/go/src/testing/testing.go:1123
/kind bug
Okay so I think I know what's going on here and I think it's a bug (though not one that would be hit very often, or maybe ever).
🐛 🐛 🐛
_TL;DR: There is a race where the config store sometimes hasn't loaded default values before the test code executes b/c it loads them in separate go routines. This could manifest in actual problems if TaskRuns are created immediately on controller startup (tho it's quite unlikely)._
🐛 🐛 🐛
Because of the way the reconciler code is generated I can't easily reproduce it in a test that we can commit back, but I hacked together a test that usually reproduced it in https://github.com/bobcatfish/pipeline/commit/1fbc3889b8ab1fbdcb2f0510a8c0674b436c3bf7
Even then, it only reproduced the issue every once in a while - running doit.sh will usually cause the test to fail eventually (but it's running a test that spawns 200 subtests 1000 times!)
(The other way to more easily make this failure happen is to introduce a sleep into InformedWatcher.addConfigMapEvent, i.e. delay processing of the config map by the config store such that the test code always executes first.)
I believe that what is happening is: the config store loads values from the config maps in a separate goroutine, so the test code can execute before those values have been loaded, and the reconciler falls back to the hard-coded defaults (`default` in the case of the service account).
What this means is that there is a (very small) chance that if a TaskRun is created immediately after a controller starts up, the config store might not have yet loaded the default values and will create the TaskRun using the fall-back hard-coded default values and not whatever is configured in the config maps (this would apply to all values configurable via config maps).
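A minimal Go sketch of the race, under my assumptions about how it plays out (`configStore` and its fields are illustrative names, not the real Tekton types): the watcher populates the store on its own goroutine, so a reconcile that runs first only sees the hard-coded fallback.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// configStore mimics a store whose values are populated by a watcher
// callback running on a separate goroutine (illustrative only).
type configStore struct {
	defaultSA atomic.Value
}

// load returns the config-map value if it has arrived, otherwise the
// hard-coded fallback, just like the behaviour described above.
func (s *configStore) load() string {
	if v := s.defaultSA.Load(); v != nil {
		return v.(string) // value from the config map
	}
	return "default" // hard-coded fallback
}

func main() {
	s := &configStore{}

	// The watcher processes the config map asynchronously.
	go func() {
		time.Sleep(10 * time.Millisecond) // simulate informer delay
		s.defaultSA.Store("configured-sa")
	}()

	// A reconcile that runs immediately usually sees the fallback...
	fmt.Println(s.load())

	// ...while one that runs later sees the configured value.
	time.Sleep(20 * time.Millisecond)
	fmt.Println(s.load())
}
```

This is also why the failure is so hard to trigger on a laptop: the window between controller startup and the first callback is tiny.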
I think this might be the same problem that is causing the knative/pkg folks to see a flaky test: knative/pkg#1907
If you look at the code that fails in that case:
https://github.com/knative/pkg/blob/c5296cd61e9f34e183f8010a40dc866f895252b9/configmap/informed_watcher_test.go#L86-L92
The comment says:
When Start returns the callbacks should have been called with the version of the objects that is available.
So it seems the intention is that when Start returns, the callbacks have been called (and in the config store case, this means the callbacks that process the config map), but the flakes in this test and the knative test are showing that this isn't the case.
I think the fix is to introduce that guarantee; i.e. that after Start has returned, the callbacks have been called with the available version of the objects, so I'm gonna update that bug with my findings and see if the knative folks agree.
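One way to provide that guarantee, sketched here as a toy stand-in rather than the actual knative/pkg implementation: have Start dispatch the initial callbacks but block until they have all been delivered, so defaults are guaranteed to be loaded by the time Start returns.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is an illustrative stand-in for an informed watcher. Start
// delivers the initial callbacks on a goroutine but waits for them to
// finish, giving callers the "after Start returns, callbacks have been
// called" guarantee discussed above.
type watcher struct {
	callbacks []func(value string)
}

// Watch registers a callback, like watching a config map key.
func (w *watcher) Watch(cb func(value string)) {
	w.callbacks = append(w.callbacks, cb)
}

// Start delivers the currently available value to every callback and
// only returns once all of them have run.
func (w *watcher) Start(initial string) {
	var wg sync.WaitGroup
	wg.Add(len(w.callbacks))
	go func() {
		for _, cb := range w.callbacks {
			cb(initial) // deliver the current config map contents
			wg.Done()
		}
	}()
	wg.Wait() // block until every callback has been called
}

func main() {
	w := &watcher{}
	var defaultSA string
	w.Watch(func(v string) { defaultSA = v })

	w.Start("configured-sa")
	fmt.Println(defaultSA) // guaranteed to be set by now
}
```

With that behaviour, a controller that creates objects immediately after Start can never observe the hard-coded fallbacks.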
(Oh yeah one last piece of trivia around this bug that I thought was interesting: note that this was reported June 12 and https://github.com/tektoncd/pipeline/pull/2637 (making it so that we started using the config map watcher to hot reload these values) was merged on June 8.)
Being able to fix this will depend on how the knative folks want to handle the underlying issue: https://github.com/knative/pkg/issues/1960