TaskRun status should correctly identify the failed step.
When an unnamed step fails and it is followed by a named step, the TaskRun status incorrectly identifies the named step as the reason for the failure.
kubectl create -f - << EOF
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: test-unnamed-correct-
spec:
  taskSpec:
    steps:
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 0
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 1
        false
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 2
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 3
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: test-unnamed-wrong-
spec:
  taskSpec:
    steps:
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 0
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 1
        false
    - image: ubuntu
      name: named-step
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 2
    - image: ubuntu
      script: |
        #!/usr/bin/env bash
        sleep 3 && echo step 3
EOF
In both cases 'step 1' fails (expected).
The TaskRun status for test-unnamed-correct- identifies 'step 1' correctly (unnamed-1 in this case).
test-unnamed-wrong- incorrectly identifies 'step 2' ('named-step') as the culprit.
Status:
  Conditions:
    Message: "step-named-step" exited with code 1 (image: "docker-pullable://ubuntu@sha256:bec5a2727be7fff3d308193cfde3491f8fba1a2ba392b7546b43a051853a341d"); for logs run: kubectl -n default logs test-unnamed-wrong-8l6sd-pod-dfsn6 -c step-named-step
Kubernetes version:
Output of kubectl version:
Client Version: v1.16.3
Server Version: v1.15.5
Tekton Pipeline version:
Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'
v0.11.1
/kind bug
I checked the code: we sort pod.Status.ContainerStatuses by FinishedAt, then take the first one whose exit code is not 0.
https://github.com/tektoncd/pipeline/blob/6b1579c89d75ec4e58c5630819429709e21f7332/pkg/pod/status.go#L265-L274
Sometimes FinishedAt is the same for several steps. In that case the sort does nothing for them, which means the containers (steps) with the same FinishedAt end up ordered by name. That is not what we want.
Could we sort by the order in which the steps are defined in the task instead? @vdemeester
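To make the suggestion concrete, here is a minimal Go sketch of ordering container statuses by the declaration order of the steps before looking for the first non-zero exit code. It is not the code from pkg/pod/status.go: containerStatus is a hypothetical, simplified stand-in for the Kubernetes container status type, and the sample data assumes that the steps after the failing one also report a non-zero exit code and that the statuses arrive ordered by container name.

```go
package main

import (
	"fmt"
	"sort"
)

// containerStatus is a hypothetical, simplified stand-in for the
// Kubernetes container status; only the fields needed here are kept.
type containerStatus struct {
	Name     string
	ExitCode int32
}

// sortByStepOrder reorders statuses to match the order in which the steps
// were declared in the task. Names missing from stepNames sort last.
func sortByStepOrder(statuses []containerStatus, stepNames []string) {
	order := make(map[string]int, len(stepNames))
	for i, n := range stepNames {
		order[n] = i
	}
	rank := func(name string) int {
		if i, ok := order[name]; ok {
			return i
		}
		return len(stepNames)
	}
	sort.SliceStable(statuses, func(i, j int) bool {
		return rank(statuses[i].Name) < rank(statuses[j].Name)
	})
}

// firstFailed returns the name of the first status with a non-zero exit code.
func firstFailed(statuses []containerStatus) (string, bool) {
	for _, s := range statuses {
		if s.ExitCode != 0 {
			return s.Name, true
		}
	}
	return "", false
}

func main() {
	// Declaration order of the steps in test-unnamed-wrong-.
	steps := []string{"step-unnamed-0", "step-unnamed-1", "step-named-step", "step-unnamed-3"}

	// Assumed input: statuses ordered by name, with the steps after the
	// failure also reporting exit code 1.
	statuses := []containerStatus{
		{Name: "step-named-step", ExitCode: 1},
		{Name: "step-unnamed-0", ExitCode: 0},
		{Name: "step-unnamed-1", ExitCode: 1},
		{Name: "step-unnamed-3", ExitCode: 1},
	}

	before, _ := firstFailed(statuses)
	sortByStepOrder(statuses, steps)
	after, _ := firstFailed(statuses)
	fmt.Println("before:", before, "after:", after)
}
```

With name ordering, step-named-step sorts ahead of every step-unnamed-N container, which would explain why only the TaskRun with a named step reports the wrong culprit.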
Sounds like there might be some overlap with https://github.com/tektoncd/pipeline/issues/2416
I wonder whether there is a reason we're sorting on the finish time instead of the start time. It looks like we don't have enough precision in the recorded times either way to rely on them for accurate sorting.
Maybe we can use FinishedAt and StartAt together.
The same problem was discussed in #2029, but that discussion ground to a halt looking for a solution. A resolution of seconds (which is all the k8s API server has for the finish time) doesn't work. Tekton manages its own start times, so that might work with a higher resolution.
@GregDritschler
I checked all comments in #2029, very helpful, thanks.
What do you think about introducing StartAt as a tiebreaker for the sort when the FinishedAt values are exactly the same?
Since the goal is to find the first failed step, using StartAt and FinishedAt together is the simplest and most direct solution.
Moreover, as you mentioned, Tekton controls StartAt, so we can adopt a higher-resolution value for it.
This is fixed by #2455, closing