Pipeline: TaskRun status wrong when unnamed step fails (followed by named step)

Created on 16 Apr 2020 · 7 Comments · Source: tektoncd/pipeline

Expected Behavior

TaskRun status should correctly identify the failed step.

Actual Behavior

When an unnamed step fails and it is followed by a named step, the TaskRun status incorrectly identifies the named step as the reason for the failure.

Steps to Reproduce the Problem

kubectl create -f - << EOF
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: test-unnamed-correct-
spec:
  taskSpec:
    steps:
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 0
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 1
        false
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 2
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 3
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: test-unnamed-wrong-
spec:
  taskSpec:
    steps:
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 0
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 1
        false
    - image: ubuntu
      name: named-step
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 2
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 3
EOF

In both cases 'step 1' fails (expected).
The TaskRun status for test-unnamed-correct- identifies 'step 1' correctly (unnamed-1 in this case).
test-unnamed-wrong- incorrectly identifies 'step 2' ('named-step') as the culprit.

Status:
  Conditions:
    Message:               "step-named-step" exited with code 1 (image: "docker-pullable://ubuntu@sha256:bec5a2727be7fff3d308193cfde3491f8fba1a2ba392b7546b43a051853a341d"); for logs run: kubectl -n default logs test-unnamed-wrong-8l6sd-pod-dfsn6 -c step-named-step

Additional Info

Kubernetes version:

Output of kubectl version:

Client Version: v1.16.3
Server Version: v1.15.5

Tekton Pipeline version:

Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.11.1

kind/bug

Source

AlanGreene

Most helpful comment

This is fixed by #2455, closing

vdemeester on 24 Jul 2020


All 7 comments

/kind bug

vdemeester on 16 Apr 2020

I checked the code: we sort pod.Status.ContainerStatuses by FinishedAt, then take the first one whose exit code is not 0.
https://github.com/tektoncd/pipeline/blob/6b1579c89d75ec4e58c5630819429709e21f7332/pkg/pod/status.go#L265-L274

Sometimes FinishedAt is the same for several steps. In that case the sort changes nothing, which means the containers (steps) with the same FinishedAt stay ordered by name. That is not what we expect.

Could we sort by the order in which the steps are defined in the Task instead? @vdemeester
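
To illustrate the behavior described in this comment, here is a minimal sketch in Go. It is not the Tekton code: `containerStatus`, `firstFailedByFinishTime`, `firstFailedBySpecOrder`, and the sample timestamps are hypothetical names and values made up for the example. It also assumes that the statuses arrive ordered by container name and that steps after the failed one exit non-zero in the same second.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// containerStatus is a simplified, hypothetical stand-in for the few fields
// of a container status that matter for this issue.
type containerStatus struct {
	Name       string
	ExitCode   int32
	FinishedAt time.Time // assumed to have one-second resolution
}

// firstFailedByFinishTime sorts a copy of the statuses by FinishedAt and
// returns the first name with a non-zero exit code. Ties keep input order.
func firstFailedByFinishTime(statuses []containerStatus) string {
	sorted := append([]containerStatus(nil), statuses...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FinishedAt.Before(sorted[j].FinishedAt)
	})
	for _, s := range sorted {
		if s.ExitCode != 0 {
			return s.Name
		}
	}
	return ""
}

// firstFailedBySpecOrder walks the step names in the order they are defined
// in the task and returns the first one with a non-zero exit code.
func firstFailedBySpecOrder(stepOrder []string, statuses []containerStatus) string {
	byName := make(map[string]containerStatus, len(statuses))
	for _, s := range statuses {
		byName[s.Name] = s
	}
	for _, name := range stepOrder {
		if s, ok := byName[name]; ok && s.ExitCode != 0 {
			return name
		}
	}
	return ""
}

// sampleStatuses mimics the test-unnamed-wrong- run: input is ordered by
// name, and every step after step 0 is assumed to finish in the same second.
func sampleStatuses() []containerStatus {
	t := time.Date(2020, 4, 16, 12, 0, 0, 0, time.UTC)
	return []containerStatus{
		{"step-named-step", 1, t.Add(6 * time.Second)},
		{"step-unnamed-0", 0, t.Add(3 * time.Second)},
		{"step-unnamed-1", 1, t.Add(6 * time.Second)},
		{"step-unnamed-3", 1, t.Add(6 * time.Second)},
	}
}

func main() {
	specOrder := []string{"step-unnamed-0", "step-unnamed-1", "step-named-step", "step-unnamed-3"}
	fmt.Println("by finish time:", firstFailedByFinishTime(sampleStatuses()))
	fmt.Println("by spec order: ", firstFailedBySpecOrder(specOrder, sampleStatuses()))
}
```

With these sample values the finish-time sort reports step-named-step, while walking the steps in task order reports step-unnamed-1.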

vincent-pli on 16 Apr 2020

Sounds like there might be some overlap with https://github.com/tektoncd/pipeline/issues/2416

I wonder whether there's a reason we're sorting on finish time instead of start time. It looks like we don't have enough precision in the recorded times either way to rely on them for accurate sorting.

AlanGreene on 16 Apr 2020

Maybe we can use FinishedAt and StartAt together.

vincent-pli on 20 Apr 2020

There was the same discussion about this problem in #2029, but it ground to a halt looking for a solution. Using a resolution of seconds (which is all the Kubernetes API server has for finish time) doesn't work. Tekton manages its own start times, so that might work with a higher resolution.

GregDritschler on 20 Apr 2020

@GregDritschler
I read all the comments in #2029. Very helpful, thanks.
What do you think about using StartAt for the sorting when the FinishedAt values are exactly the same?
Since the goal is to find the first failed step, StartAt and FinishedAt are the simplest and most direct solution.
Moreover, as you mentioned, Tekton controls StartAt, so we can adopt a higher-resolution value for it.
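
The tie-break proposed here could look roughly like the following Go sketch. It is an illustration under stated assumptions, not the fix that was eventually merged: `stepTiming`, `firstFailedWithTieBreak`, and the sample timestamps are hypothetical, and the sketch assumes that every step (including ones that never ran their script) has a sub-second start time recorded and that the failed step started before the steps after it.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// stepTiming is a hypothetical record of one step's result and timings.
type stepTiming struct {
	Name       string
	ExitCode   int32
	StartedAt  time.Time // assumed to be recorded with sub-second resolution
	FinishedAt time.Time // truncated to whole seconds
}

// firstFailedWithTieBreak sorts by FinishedAt, falls back to StartedAt when
// the finish times are equal, and returns the first step that exited non-zero.
func firstFailedWithTieBreak(steps []stepTiming) string {
	sorted := append([]stepTiming(nil), steps...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if !a.FinishedAt.Equal(b.FinishedAt) {
			return a.FinishedAt.Before(b.FinishedAt)
		}
		return a.StartedAt.Before(b.StartedAt)
	})
	for _, s := range sorted {
		if s.ExitCode != 0 {
			return s.Name
		}
	}
	return ""
}

// sampleTimings uses made-up timestamps in which three steps share the same
// truncated finish second but have distinct start times.
func sampleTimings() []stepTiming {
	base := time.Date(2020, 4, 20, 12, 0, 0, 0, time.UTC)
	at := func(ms int) time.Time { return base.Add(time.Duration(ms) * time.Millisecond) }
	sec := func(ms int) time.Time { return at(ms).Truncate(time.Second) }
	return []stepTiming{
		{"step-named-step", 1, at(6040), sec(6050)},
		{"step-unnamed-0", 0, at(10), sec(3020)},
		{"step-unnamed-1", 1, at(3020), sec(6030)},
		{"step-unnamed-3", 1, at(6060), sec(6070)},
	}
}

func main() {
	fmt.Println("first failed step:", firstFailedWithTieBreak(sampleTimings()))
}
```

With these sample values the tie at the sixth second is resolved by start time, so step-unnamed-1 is reported. The approach only helps if the start times really are distinct and ordered, which is why the higher-resolution timestamp matters.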