Expected behavior: TaskRun to complete successfully.
Actual behavior: Controller panics.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.17", GitCommit:"bdceba0734835c6cb1acbd1c447caf17d8613b44", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:13Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Tekton Pipeline version: v0.10.1
The panic from the controller logs:
panic: runtime error: index out of range [5] with length 5
goroutine 258 [running]:
github.com/tektoncd/pipeline/pkg/pod.(*stepStateSorter).changeIndex(...)
github.com/tektoncd/pipeline/pkg/pod/status.go:386
github.com/tektoncd/pipeline/pkg/pod.(*stepStateSorter).Swap(0xc000caad40, 0x2, 0x1)
github.com/tektoncd/pipeline/pkg/pod/status.go:393 +0x514
sort.insertionSort(0x1b7c1a0, 0xc000caad40, 0x0, 0x5)
sort/sort.go:28 +0x57
sort.quickSort(0x1b7c1a0, 0xc000caad40, 0x0, 0x5, 0x6)
sort/sort.go:209 +0x201
sort.Sort(0x1b7c1a0, 0xc000caad40)
sort/sort.go:218 +0x79
github.com/tektoncd/pipeline/pkg/pod.sortTaskRunStepOrder(0xc000fea240, 0x5, 0x8, 0xc000ab8000, 0x7, 0x8, 0x2, 0x0, 0x0)
github.com/tektoncd/pipeline/pkg/pod/status.go:357 +0xb5
github.com/tektoncd/pipeline/pkg/pod.MakeTaskRunStatus(0x0, 0x0, 0x0, 0x0, 0xc000828630, 0x2c, 0x0, 0x0, 0xc0008286f0, 0x29, ...)
github.com/tektoncd/pipeline/pkg/pod/status.go:177 +0x4f7
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).reconcile(0xc000106750, 0x1b8e3a0, 0xc0001f01e0, 0xc000aadb80, 0xed5fc68ab, 0x27380c0)
github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:360 +0x115e
github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).Reconcile(0xc000106750, 0x1b8e3a0, 0xc0001f01e0, 0xc0009bfc80, 0x56, 0xc0000b9e00, 0x1b8e3a0)
github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:153 +0x841
knative.dev/pkg/controller.(*Impl).processNextWorkItem(0xc0002e4600, 0x0)
knative.dev/[email protected]/controller/controller.go:335 +0x654
knative.dev/pkg/controller.(*Impl).Run.func1(0xc0009224f0, 0xc0002e4600)
knative.dev/[email protected]/controller/controller.go:285 +0x53
created by knative.dev/pkg/controller.(*Impl).Run
knative.dev/[email protected]/controller/controller.go:283 +0x1ac
I've managed to modify the ./pkg/pod/status_test.go file (TestSortTaskRunStepOrder) to recreate the error:
diff --git a/pkg/pod/status_test.go b/pkg/pod/status_test.go
index 9215e116..3f96a41a 100644
--- a/pkg/pod/status_test.go
+++ b/pkg/pod/status_test.go
@@ -635,6 +635,10 @@ func TestSidecarsReady(t *testing.T) {
 func TestSortTaskRunStepOrder(t *testing.T) {
 	steps := []v1alpha1.Step{{Container: corev1.Container{
 		Name: "hello",
+	}}, {Container: corev1.Container{
+		Name: "extra-1",
+	}}, {Container: corev1.Container{
+		Name: "extra-2",
 	}}, {Container: corev1.Container{
 		Name: "exit",
 	}}, {Container: corev1.Container{
$ go test ./pkg/pod -run TestSortTaskRunStepOrder -v
I can reproduce with the modifications to the unit test you mentioned, but I'm curious whether you see this with all ClusterTasks. I'm not able to reproduce with the trivial ClusterTask example in examples/v1beta1/taskruns/clustertask.yaml.
I have a fix that seems to work for the unit test example, but I'd really like to see an example of a ClusterTask / TaskRun that's hitting the same error so that I can confirm my fix works for that too. Can you nail down what specifically in the ClusterTask configuration is the source of the problem, or provide your ClusterTask for me to test?
(Even just copy/pasting the output of kubectl describe taskrun would be helpful - I'm trying to understand where the mismatch between the number of steps and the number of step states is coming from.)
So it looks like the problem happens when a step fails.
Woo this is a doozy.
The image digest exporter (part of the Image Output Resource) is configured with "terminationMessagePolicy": "FallbackToLogsOnError".
When a previous step has failed in the Task, our entrypoint wrapping the exporter emits the following log line: 2020/03/13 12:03:26 Skipping step because a previous step failed.
That line gets read by the Tekton controller, which expects JSON in the termination message. It fails to parse the message from the image digest exporter and stops trying to read any further step statuses.
That results in a mismatch between the length of the list of steps and the length of the list of step statuses, and finally our sort method panics with an out-of-bounds error because it assumes the two lists are the same length.
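To make that mechanism concrete, here is a simplified sketch (hypothetical names and message format, not Tekton's exact parsing code) of how one plain-text termination message cuts the status collection short:

package main

import (
	"encoding/json"
	"fmt"
)

// result stands in for the structured data expected in a termination
// message; the field names here are illustrative only.
type result struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// parseStepStates decodes each termination message as JSON and stops at the
// first one that fails to parse, so a plain log line yields fewer step
// states than there are steps.
func parseStepStates(messages []string) []result {
	var states []result
	for _, msg := range messages {
		var r []result
		if err := json.Unmarshal([]byte(msg), &r); err != nil {
			return states // early return: the source of the mismatch
		}
		states = append(states, r...)
	}
	return states
}

func main() {
	messages := []string{
		`[{"key":"StartedAt","value":"2020-03-13T12:03:26Z"}]`,
		"2020/03/13 12:03:26 Skipping step because a previous step failed",
		`[{"key":"StartedAt","value":"2020-03-13T12:03:27Z"}]`,
	}
	fmt.Printf("parsed %d of %d step states\n", len(parseStepStates(messages)), len(messages))
	// Output: parsed 1 of 3 step states
}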
I'm working on a couple of fixes for this and will make a PR later today.
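As a rough illustration of the shape such a fix could take (a sketch only, not necessarily what the actual PR does), the sort can validate the lengths up front instead of assuming they match:

package main

import "fmt"

// sortStepStatesSafely is a hypothetical guard, not Tekton's real code: if
// the two lists disagree in length, refuse to sort rather than index out of
// bounds partway through.
func sortStepStatesSafely(stepStates, steps []string) error {
	if len(stepStates) != len(steps) {
		return fmt.Errorf("refusing to sort: %d step states vs %d steps", len(stepStates), len(steps))
	}
	// ... safe to reorder stepStates to match the order of steps ...
	return nil
}

func main() {
	fmt.Println(sortStepStatesSafely([]string{"a", "b"}, []string{"a", "b", "c"}))
}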
Reopening until this is backported to 0.10
Marking as closed now that we've backported to 0.10 and released in 0.11.