Expected behavior: TaskRun to complete successfully.
Actual behavior: Controller panics.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.17", GitCommit:"bdceba0734835c6cb1acbd1c447caf17d8613b44", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:13Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Tekton Pipeline version: v0.10.1
The panic from the controller logs:
panic: runtime error: index out of range [5] with length 5
goroutine 258 [running]:
github.com/tektoncd/pipeline/pkg/pod.(*stepStateSorter).changeIndex(...)
github.com/tektoncd/pipeline/pkg/pod/status.go:386
github.com/tektoncd/pipeline/pkg/pod.(*stepStateSorter).Swap(0xc000caad40, 0x2, 0x1)
github.com/tektoncd/pipeline/pkg/pod/status.go:393 +0x514
sort.insertionSort(0x1b7c1a0, 0xc000caad40, 0x0, 0x5)
sort/sort.go:28 +0x57
sort.quickSort(0x1b7c1a0, 0xc000caad40, 0x0, 0x5, 0x6)
sort/sort.go:209 +0x201
sort.Sort(0x1b7c1a0, 0xc000caad40)
sort/sort.go:218 +0x79
github.com/tektoncd/pipeline/pkg/pod.sortTaskRunStepOrder(0xc000fea240, 0x5, 0x8, 0xc000ab8000, 0x7, 0x8, 0x2, 0x0, 0x0)
github.com/tektoncd/pipeline/pkg/pod/status.go:357 +0xb5
github.com/tektoncd/pipeline/pkg/pod.MakeTaskRunStatus(0x0, 0x0, 0x0, 0x0, 0xc000828630, 0x2c, 0x0, 0x0, 0xc0008286f0, 0x29, ...)
github.com/tektoncd/pipeline/pkg/pod/status.go:177 +0x4f7
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).reconcile(0xc000106750, 0x1b8e3a0, 0xc0001f01e0, 0xc000aadb80, 0xed5fc68ab, 0x27380c0)
github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:360 +0x115e
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).Reconcile(0xc000106750, 0x1b8e3a0, 0xc0001f01e0, 0xc0009bfc80, 0x56, 0xc0000b9e00, 0x1b8e3a0)
github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:153 +0x841
knative.dev/pkg/controller.(*Impl).processNextWorkItem(0xc0002e4600, 0x0)
knative.dev/[email protected]/controller/controller.go:335 +0x654
knative.dev/pkg/controller.(*Impl).Run.func1(0xc0009224f0, 0xc0002e4600)
knative.dev/[email protected]/controller/controller.go:285 +0x53
created by knative.dev/pkg/controller.(*Impl).Run
knative.dev/[email protected]/controller/controller.go:283 +0x1ac
I've managed to modify the ./pkg/pod/status_test.go file (TestSortTaskRunStepOrder) to recreate the error:
diff --git a/pkg/pod/status_test.go b/pkg/pod/status_test.go
index 9215e116..3f96a41a 100644
--- a/pkg/pod/status_test.go
+++ b/pkg/pod/status_test.go
@@ -635,6 +635,10 @@ func TestSidecarsReady(t *testing.T) {
 func TestSortTaskRunStepOrder(t *testing.T) {
 	steps := []v1alpha1.Step{{Container: corev1.Container{
 		Name: "hello",
+	}}, {Container: corev1.Container{
+		Name: "extra-1",
+	}}, {Container: corev1.Container{
+		Name: "extra-2",
 	}}, {Container: corev1.Container{
 		Name: "exit",
 	}}, {Container: corev1.Container{
$ go test ./pkg/pod -run TestSortTaskRunStepOrder -v
I can reproduce with the modifications to the unit test you mentioned, but I'm curious whether you see this with all ClusterTasks. I'm not able to reproduce with the trivial ClusterTask example in examples/v1beta1/taskruns/clustertask.yaml.
I have a fix that seems to work for the unit test example, but I'd really like to see an example of a ClusterTask / TaskRun that's hitting the same error so that I can confirm my fix works for that too. Can you nail down what specifically in the ClusterTask configuration is the source of the problem, or provide your ClusterTask for me to test?
(Even just copy/pasting the output of kubectl describe taskrun would be helpful - I'm trying to understand where the mismatch between the number of steps and the number of step states is coming from.)
So it looks like the problem happens when a step fails.
Woo this is a doozy.
The image digest exporter (part of the Image Output Resource) is configured with "terminationMessagePolicy": "FallbackToLogsOnError".
When a previous step has failed in the Task, our entrypoint wrapping the exporter emits the following log line: 2020/03/13 12:03:26 Skipping step because a previous step failed.
That line gets read by the Tekton controller, which expects JSON in the termination message. It fails to parse the message from the image digest exporter and stops trying to read any further step statuses.
That results in a mismatch between the length of the list of steps and the length of the list of step statuses, and finally our sort method panics with an out-of-bounds error because it assumes the two lists are the same length.
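To make that mechanism concrete, here is a simplified sketch (hypothetical names and message format, not Tekton's exact parsing code) of how one plain-text termination message cuts the status collection short:

package main

import (
	"encoding/json"
	"fmt"
)

// result stands in for the structured data expected in a termination
// message; the field names here are illustrative only.
type result struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// parseStepStates decodes each termination message as JSON and stops at the
// first one that fails to parse, so a plain log line yields fewer step
// states than there are steps.
func parseStepStates(messages []string) []result {
	var states []result
	for _, msg := range messages {
		var r []result
		if err := json.Unmarshal([]byte(msg), &r); err != nil {
			return states // early return: the source of the mismatch
		}
		states = append(states, r...)
	}
	return states
}

func main() {
	messages := []string{
		`[{"key":"StartedAt","value":"2020-03-13T12:03:26Z"}]`,
		"2020/03/13 12:03:26 Skipping step because a previous step failed",
		`[{"key":"StartedAt","value":"2020-03-13T12:03:27Z"}]`,
	}
	fmt.Printf("parsed %d of %d step states\n", len(parseStepStates(messages)), len(messages))
	// Output: parsed 1 of 3 step states
}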
I'm working on a couple of fixes for this and will make a PR later today.
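As a rough illustration of the shape such a fix could take (a sketch only, not necessarily what the actual PR does), the sort can validate the lengths up front instead of assuming they match:

package main

import "fmt"

// sortStepStatesSafely is a hypothetical guard, not Tekton's real code: if
// the two lists disagree in length, refuse to sort rather than index out of
// bounds partway through.
func sortStepStatesSafely(stepStates, steps []string) error {
	if len(stepStates) != len(steps) {
		return fmt.Errorf("refusing to sort: %d step states vs %d steps", len(stepStates), len(steps))
	}
	// ... safe to reorder stepStates to match the order of steps ...
	return nil
}

func main() {
	fmt.Println(sortStepStatesSafely([]string{"a", "b"}, []string{"a", "b", "c"}))
}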
Reopening until this is backported to 0.10
Marking as closed now that we've backported to 0.10 and released in 0.11.