The (*PipelineRun).IsDone() function returns true when the Success condition is no longer Unknown. However, as soon as the first task in the PipelineRun fails it sets the PipelineRun's Success condition to False, and IsDone() immediately returns true, despite several other tasks still Running.
Instead, IsDone() should check Status.CompletionTime or iterate through all tasks and check whether all tasks have finished executing.
Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.
I'm happy to submit a PR if this sounds sensible.
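To make the proposal concrete, here is a minimal sketch of how `HasSucceeded()` and `HasFailed()` could be split out from the current `IsDone()` behaviour. These are simplified stand-in types, not Tekton's real API (where the Succeeded condition comes from `knative.dev/pkg/apis`):

```go
package main

// Minimal sketch of the proposed split, using simplified stand-in types
// (not Tekton's real API, where the Succeeded condition comes from
// knative.dev/pkg/apis).

type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// PipelineRun is a stand-in holding only the Succeeded condition status.
type PipelineRun struct {
	Succeeded ConditionStatus
}

// HasSucceeded: the run finished and every task passed.
func (pr *PipelineRun) HasSucceeded() bool { return pr.Succeeded == ConditionTrue }

// HasFailed: at least one task failed; other tasks may still be running.
func (pr *PipelineRun) HasFailed() bool { return pr.Succeeded == ConditionFalse }

// IsDone as currently implemented: true as soon as the condition leaves
// Unknown, which is the behaviour this issue describes.
func (pr *PipelineRun) IsDone() bool { return pr.Succeeded != ConditionUnknown }
```

With this split, callers who care about failure can use `HasFailed()` without implying that the run has finished.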
@gpaul thanks for the issue, I think this is by design "initially", aka we mark the PipelineRun as failed as soon as one TaskRun fails, even if others are still running, as we know the PipelineRun is gonna fail anyway.
That said, it might not be the best behaviour, and even more in relation to #1394, #1376 and #1023 :angel:
/cc @bobcatfish @ImJasonH
/kind question
/area api
> @gpaul thanks for the issue, I think this is by design "initially", aka we mark the PipelineRun as failed as soon as one TaskRun fails, even if others are still running, as we know the PipelineRun is gonna fail anyway.
Yes, that makes total sense. The issue is really that we treat this state as IsDone() == true, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)
> Yes, that makes total sense. The issue is really that we treat this state as `IsDone() == true`, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)
That's fair :stuck_out_tongue_closed_eyes:.
> Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.
Sounds reasonable!
> I'm happy to submit a PR if this sounds sensible.
That'd be awesome! Thanks :pray:
(Also remove area/api label since this only technically affects a function in the reconciler)
I've had a look and it's actually rather tricky.
Two strategies present themselves:
1. Use `PipelineRun.CompletionTime` to determine whether or not a PipelineRun is done:

```go
func (pr *PipelineRun) IsDone() bool {
	return pr.Status.CompletionTime != nil && !pr.Status.CompletionTime.IsZero()
}
```
This is very simple and assumes that a PipelineRun is done if and only if it has a CompletionTime set.
This sounds very reasonable, but it is not currently true.
For example, this line sets the PipelineRun's CompletionTime when a Success/Failed condition is set, which happens when the first taskrun fails:
https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L627-L631
2. Iterate through `PipelineRun.Status.TaskRuns` and check whether all task runs have a status of Success or Failure:

```go
func (pr *PipelineRun) IsDone() bool {
	for _, trs := range pr.Status.TaskRuns {
		if trs.Status.GetCondition(apis.ConditionSucceeded).IsUnknown() {
			return false
		}
	}
	return true
}
```
However, this naive approach means that a PipelineRun that does not yet have `.Status.TaskRuns` populated with all the tasks that need to execute will be considered done. For example, a PipelineRun that was created a moment ago and has not yet had its Status updated at all will report `IsDone() == true`. That's also pretty bad.
It looks like the only place where we really know what state a PipelineRun is in, is in the reconciler code (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446). Here we seem to have intimate knowledge of which tasks should have been executed, which have done so successfully, which failed, and which are still pending.
I propose we introduce a new invariant: PipelineRun.CompletionTime is set if and only if all TaskRuns that comprise that PipelineRun and will execute as part of that PipelineRun have completed.
A special case is where the PipelineRun has timed out/canceled. At that point, its TaskRuns are already timed out/canceled, too.
We then update the reconciliation logic here (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446) to set the CompletionTime if and only if all non-skipped TaskRuns have succeeded or failed (whether due to regular failure, cancellation, or timeout.)
I've prepared a WIP PR (some outstanding tests) here: https://github.com/tektoncd/pipeline/pull/1749
I'm not sure how you'd like to proceed. Using CompletionTime to carry a binary "Completed" state feels a little strange to me; I would prefer to set the ConditionSucceeded=False condition on a PipelineRun once all tasks have completed, timed out, or been cancelled. And if someone is interested in whether any of its individual tasks have failed, they can query such a condition on the pipelinerun or loop through its Status.TaskRuns.
> @gpaul thanks for the issue, I think this is by design "initially", aka we mark the PipelineRun as failed as soon as one TaskRun fails, even if others are still running, as we know the PipelineRun is gonna fail anyway.
> Yes, that makes total sense. The issue is really that we treat this state as `IsDone() == true`, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)
I agree that the semantics of the name isDone are not entirely correct. The function is what the controller needs as an early check to stop processing the resource, so it's "done" from a controller POV.
What are the use cases for which we need an isDone function that takes into account whether all tasks are completed?
I'm not sure if this has an impact on metrics @hrishin. If the duration reported for a pipeline is that until isDone is reached, it would be misleading. While the time spent by the pipelinerun controller on the pipeline would be correct, it would probably not match the expectation, which would probably be the time between creation in etcd and the time the last taskrun completed.
Looking at the code I see a few issues with the current implementation:
@gpaul your considerations sound very reasonable to me. An alternative could be to automatically cancel all running tasks the moment the first task fails. Even if we did that, it might take a bit of time for all the tasks to end, as they may have cloudEvents to send and - soon enough - "finally" type of steps to be executed.
> What are the use cases for which we need an isDone function that takes into account whether all tasks are completed?
I have two real use cases where I need to wait for the entire pipeline to finish:
I have a process that waits for a pipeline to finish, then archives the pod logs of all the taskruns that ran. That process needs to know when it should run and when it can stop watching a given pipelinerun.
I have a process that sets CI task statuses on GitHub PRs, one for each Task in the pipeline. If one of two TaskRuns that are executing concurrently fails, the other will continue to run. However, if my process uses pipelinerun.IsDone() to determine whether or not it can stop watching the given pipelinerun, then it needs to immediately mark the still-pending TaskRun as Succeeded/Failed/Canceled despite the fact that the TaskRun will continue running and succeed or fail at a later point. This is awkward, as a user clicking through to the Tekton UI will see tasks succeeding or failing that are shown as neither on the PR.
I currently work around this by waiting for all taskruns to finish if the pipelinerun has failed, or for all expected tasks to have run and completed if the pipelinerun has not failed. In general, knowing which tasks are expected to execute is tricky (e.g., Condition resources), and the logic necessary to calculate whether a pipelinerun has finally completed requires the implementer to have intimate knowledge of the inner workings of Tekton; in fact, I ended up copy/pasting a bunch of code from its guts.
I'm happy to capture the new semantics in a new function. Something like "IsCompleted", perhaps.
In general, the pattern of watching a pipelinerun for updates and waiting until all tasks have completed is a very convenient place to hook into its lifecycle. I would prefer a solution where a pipelinerun transitions through a sequence of lifecycle events and any controller watching it can take appropriate action by querying its current state.
> Looking at the code I see a few issues with the current implementation:
Thanks for looking at the code. I think a more in-depth design is called for to expose pipelinerun lifecycle hooks. I think my PR is too narrow.
@gpaul, in pipelinerun.go there is an additional check, pipelineState.IsDone(), which checks whether any of the tasks still has a taskRun associated with it and returns false if there is any (the reconciler does check pipelineState for any remaining tasks):
https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L409

Hope this helps :smile:
Ran a small test with the following PipelineRun:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: failure-task
spec:
  steps:
    - name: fail
      image: ubuntu
      script: exit 1
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: successful-task
spec:
  steps:
    - name: success
      image: ubuntu
      script: |
        sleep 10;
        exit 0
---
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: 01-pipeline-failure
spec:
  tasks:
    - name: task1
      taskRef:
        name: failure-task
    - name: task2
      taskRef:
        name: successful-task
---
apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: 01-pipelinerun-failure
spec:
  pipelineRef:
    name: 01-pipeline-failure
```
The PipelineRun factored in task2's completion time even though task1 failed before task2:
```
$ tkn pr describe 01-pipelinerun-failure

Name:           01-pipelinerun-failure
Namespace:      default
Pipeline Ref:   01-pipeline-failure

Status

STARTED          DURATION     STATUS
37 seconds ago   19 seconds   Failed

Message

TaskRun 01-pipelinerun-failure-task1-kct6t has failed ("step-fail" exited with code 1 (image: "docker-pullable://ubuntu@sha256:04d48df82c938587820d7b6006f5071dbbffceb7ca01d2814f81857c631d44df"); for logs run: kubectl -n default logs 01-pipelinerun-failure-task1-kct6t-pod-knztl -c step-fail)

Resources

No resources

Params

No params

Taskruns

NAME                                   TASK NAME   STARTED          DURATION     STATUS
✗ 01-pipelinerun-failure-task1-kct6t   task1       37 seconds ago   12 seconds   Failed
✓ 01-pipelinerun-failure-task2-x9gbv   task2       37 seconds ago   19 seconds   Succeeded
```
Good to know, thanks. Note that the problem is not that the completion time is incorrect, but that pr.IsDone returns true before completionTime is set.
:thinking: for the above sample run, the PipelineRun has

`CompletionTime: (*v1.Time)(0xc000b83000)(2020-03-18 16:33:50 +0000 UTC),`

which matches task2's completion time here

`CompletionTime: (*v1.Time)(0xc000b831e0)(2020-03-18 16:33:50 +0000 UTC),`

instead of task1's completion time here

`CompletionTime: (*v1.Time)(0xc000b83080)(2020-03-18 16:33:43 +0000 UTC),`
OK - so you're saying that the CompletionTime is set to the value of the last taskrun that completed?
> OK - so you're saying that the CompletionTime is set to the value of the last taskrun that completed?
yup, the PipelineRun completion time is set to the completion time of the TaskRun that completed last. When a task fails while other tasks are still running, the PipelineRun waits for those tasks to finish, and the latest completion time among them is picked up as the PipelineRun completion time.
pr.IsDone() does return true once PipelineRun.Status.Status.Conditions[*].Status is no longer Unknown, i.e. it is set to False on task failure, but the pipeline completion time is set to the latest of the CompletionTime values among PipelineRun.Status.PipelineRunStatusFields[*].PipelineRunTaskRunStatus[*].CompletionTime.
PipelineRun completion time keeps on updating and in the end is set to last task completion time.
You might be retrieving the completion time as soon as pr.IsDone() returns true, but the reconciler does not return on pr.IsDone() alone (it checks pipelineState as well before returning).
I am trying to think what we can do here without breaking existing callers of IsDone :thinking:
The pipelinerun reconciler defines a Reconcile method, which invokes isDone().
When isDone() returns true, there are a few consequences:
- the reconcile method is not invoked anymore on the pipeline, which is where the extra check on pipelineState is executed

As far as I can tell, the reason the completion time is correct on the pipelinerun is that the taskrun controller updates the parent pipelinerun when it changes status.
> As far as I can tell, the reason for the completion time being correct on the pipelinerun is that the taskrun controller updates the parent pipelinerun when it changes status.
As soon as the first failure is hit, the pipelineRun status is updated to Status: (v1.ConditionStatus) "False", which is not equal to the original status, as the original still has Status: (v1.ConditionStatus) "Unknown"; this results in a call to updateStatus.
The completion time on the pipelinerun is updated by the reconciler's updateStatus method when the pipelineRun Status is either False (matches here in case of first failure) or True, using:

```go
pr.Status.CompletionTime = &metav1.Time{Time: time.Now()}
```
/assign pritidesai
I need to get this fixed while working on finally: finally tasks are executed after one of the pipeline tasks fails, and the pipeline status should not be updated until the finally tasks have finished executing.
finally is being implemented in PR #2661
Lovely!
indeed, all thanks to @afrittoli :smile: