The (*PipelineRun).IsDone() function returns true when the Success condition is no longer Unknown. However, as soon as the first task in the PipelineRun fails it sets the PipelineRun's Success condition to False, and IsDone() immediately returns true, despite several other tasks still Running.
Instead, IsDone() should check Status.CompletionTime or iterate through all tasks and check whether all tasks have finished executing.
Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.
I'm happy to submit a PR if this sounds sensible.
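To make the proposal concrete, here is a minimal sketch of how `HasSucceeded()` and `HasFailed()` could be split out from the current `IsDone()` behaviour. These are simplified stand-in types, not Tekton's real API (where the Succeeded condition comes from `knative.dev/pkg/apis`):

```go
package main

// Minimal sketch of the proposed split, using simplified stand-in types
// (not Tekton's real API, where the Succeeded condition comes from
// knative.dev/pkg/apis).

type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// PipelineRun is a stand-in holding only the Succeeded condition status.
type PipelineRun struct {
	Succeeded ConditionStatus
}

// HasSucceeded: the run finished and every task passed.
func (pr *PipelineRun) HasSucceeded() bool { return pr.Succeeded == ConditionTrue }

// HasFailed: at least one task failed; other tasks may still be running.
func (pr *PipelineRun) HasFailed() bool { return pr.Succeeded == ConditionFalse }

// IsDone as currently implemented: true as soon as the condition leaves
// Unknown, which is the behaviour this issue describes.
func (pr *PipelineRun) IsDone() bool { return pr.Succeeded != ConditionUnknown }
```

With this split, callers who care about failure can use `HasFailed()` without implying that the run has finished.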
@gpaul thanks for the issue, I think this is by design "initially", aka we mark the PipelineRun as failed as soon as one TaskRun fails, even if others are still running, as we know the PipelineRun is gonna fail anyway.
That said, it might not be the best behaviour, and even more in relation to #1394, #1376 and #1023 :angel:
/cc @bobcatfish @ImJasonH
/kind question
/area api
> @gpaul thanks for the issue, I think this is by design "initially", aka we mark the PipelineRun as failed as soon as one TaskRun fails, even if others are still running, as we know the PipelineRun is gonna fail anyway.
Yes, that makes total sense. The issue is really that we treat this state as IsDone() == true, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)
> Yes, that makes total sense. The issue is really that we treat this state as `IsDone() == true`, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)
That's fair :stuck_out_tongue_closed_eyes:.
> Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.
Sounds reasonable!
> I'm happy to submit a PR if this sounds sensible.
That'd be awesome! Thanks :pray:
(Also remove area/api label since this only technically affects a function in the reconciler)
I've had a look and it's actually rather tricky.
Two strategies present themselves:
1. Use `PipelineRun.CompletionTime` to determine whether or not a PipelineRun is done:

```go
func (pr *PipelineRun) IsDone() bool {
	return pr.Status.CompletionTime != nil && !pr.Status.CompletionTime.IsZero()
}
```
This is very simple and assumes that a PipelineRun is done if and only if it has a CompletionTime set.
This sounds very reasonable, but it is not currently true.
For example, this line sets the PipelineRun's CompletionTime when a Success/Failed condition is set, which happens when the first taskrun fails:
https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L627-L631
2. Iterate through `PipelineRun.Status.TaskRuns` and check whether all task runs have a status of Success or Failure:

```go
func (pr *PipelineRun) IsDone() bool {
	for _, trs := range pr.Status.TaskRuns {
		if trs.Status.GetCondition(apis.ConditionSucceeded).IsUnknown() {
			return false
		}
	}
	return true
}
```
However, this naive approach means that a PipelineRun that does not yet have `.Status.TaskRuns` populated with all the tasks that need to execute will be considered done. For example, a PipelineRun that was created a moment ago and has not yet had its Status updated at all will report `IsDone() == true`. That's also pretty bad.
It looks like the only place where we really know what state a PipelineRun is in, is in the reconciler code (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446). Here we seem to have intimate knowledge of which tasks should have been executed, which have done so successfully, which failed, and which are still pending.
I propose we introduce a new invariant: PipelineRun.CompletionTime is set if and only if all TaskRuns that comprise that PipelineRun and will execute as part of that PipelineRun have completed.
A special case is where the PipelineRun has timed out/canceled. At that point, its TaskRuns are already timed out/canceled, too.
We then update the reconciliation logic here (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446) to set the CompletionTime if and only if all non-skipped TaskRuns have succeeded or failed (whether due to regular failure, cancellation, or timeout.)
I've prepared a WIP PR (some outstanding tests) here: https://github.com/tektoncd/pipeline/pull/1749
I'm not sure how you'd like to proceed. Using CompletionTime to carry a binary "Completed" state feels a little strange to me; I would prefer to set the ConditionSucceeded=False condition on a PipelineRun once all tasks have completed, timed out, or been cancelled. And if someone is interested in whether any of its individual tasks have failed, they can query such a condition on the pipelinerun or loop through its Status.TaskRuns.
> @gpaul thanks for the issue, I think this is by design "initially", aka we mark the PipelineRun as failed as soon as one TaskRun fails, even if others are still running, as we know the PipelineRun is gonna fail anyway.
> Yes, that makes total sense. The issue is really that we treat this state as `IsDone() == true`, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)
I agree that the semantics of the name isDone are not entirely correct. The function is what the controller needs as an early check to stop processing the resource, so it's "done" from a controller POV.
What are the use cases for which we need an isDone function that takes into account whether all tasks are completed?
I'm not sure if this has an impact on metrics @hrishin. If the duration reported for a pipeline is that until isDone is reached, it would be misleading. While the time spent by the pipelinerun controller on the pipeline would be correct, it would probably not match the expectation, which would probably be the time between creation in etcd and the time the last taskrun completed.
Looking at the code I see a few issues with the current implementation:
@gpaul your considerations sound very reasonable to me. An alternative could be to automatically cancel all running tasks the moment the first task fails. Even if we did that, it might take a bit of time for all the tasks to end, as they may have cloudEvents to send and - soon enough - "finally" type of steps to be executed.
> What are the use cases for which we need an isDone function that takes into account whether all tasks are completed?
I have two real use cases where I need to wait for the entire pipeline to finish:
I have a process that waits for a pipeline to finish, then archives the pod logs of all the taskruns that ran. That process needs to know when it should run and when it can stop watching a given pipelinerun.
I have a process that sets CI task statuses on GitHub PRs, one for each Task in the pipeline. If one of two TaskRuns that are executing concurrently fails, the other will continue to run. However, if my process uses pipelinerun.IsDone() to determine whether or not it can stop watching the given pipelinerun, then it needs to immediately mark the still-pending TaskRun as Succeeded/Failed/Canceled despite the fact that the TaskRun will continue running and succeed or fail at a later point. This is awkward, as a user clicking through to the Tekton UI will see tasks succeeding or failing that are shown as neither on the PR.
I currently work around this by waiting for all taskruns to finish if the pipelinerun has failed, or for all expected tasks to have run and completed if the pipelinerun has not failed. In general, knowing which tasks are expected to execute is tricky (e.g., Condition resources), and the logic necessary to calculate whether a pipelinerun has finally completed requires the implementer to have intimate knowledge of the inner workings of Tekton; in fact, I ended up copy/pasting a bunch of code from its guts.
I'm happy to capture the new semantics in a new function. Something like "IsCompleted", perhaps.
In general, the pattern of watching a pipelinerun for updates and waiting until all tasks have completed is a very convenient place to hook into its lifecycle. I would prefer a solution where a pipelinerun transitions through a sequence of lifecycle events and any controller watching it can take appropriate action by querying its current state.
> Looking at the code I see a few issues with the current implementation:
Thanks for looking at the code. I think a more in-depth design is called for to expose pipelinerun lifecycle hooks. I think my PR is too narrow.
@gpaul, in pipelinerun.go there is an additional check, pipelineState.IsDone(), which checks whether any of the tasks still has a taskRun associated with it and returns false if there is any (the reconciler does check pipelineState for any remaining tasks):
https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L409

Hope this helps :smile:
Ran a small test with the following PipelineRun:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: failure-task
spec:
  steps:
    - name: fail
      image: ubuntu
      script: exit 1
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: successful-task
spec:
  steps:
    - name: success
      image: ubuntu
      script: |
        sleep 10;
        exit 0
---
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: 01-pipeline-failure
spec:
  tasks:
    - name: task1
      taskRef:
        name: failure-task
    - name: task2
      taskRef:
        name: successful-task
---
apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: 01-pipelinerun-failure
spec:
  pipelineRef:
    name: 01-pipeline-failure
```
The PipelineRun factored in task2's completion time even though task1 failed before task2:
```
$ tkn pr describe 01-pipelinerun-failure

Name:           01-pipelinerun-failure
Namespace:      default
Pipeline Ref:   01-pipeline-failure

Status

STARTED          DURATION     STATUS
37 seconds ago   19 seconds   Failed

Message

TaskRun 01-pipelinerun-failure-task1-kct6t has failed ("step-fail" exited with code 1 (image: "docker-pullable://ubuntu@sha256:04d48df82c938587820d7b6006f5071dbbffceb7ca01d2814f81857c631d44df"); for logs run: kubectl -n default logs 01-pipelinerun-failure-task1-kct6t-pod-knztl -c step-fail)

Resources

No resources

Params

No params

Taskruns

NAME                                   TASK NAME   STARTED          DURATION     STATUS
✗ 01-pipelinerun-failure-task1-kct6t   task1       37 seconds ago   12 seconds   Failed
✓ 01-pipelinerun-failure-task2-x9gbv   task2       37 seconds ago   19 seconds   Succeeded
```
Good to know, thanks. Note that the problem is not that the completion time is incorrect, but that pr.IsDone returns true before completionTime is set.
:thinking: for the above sample run, the PipelineRun has

`CompletionTime: (*v1.Time)(0xc000b83000)(2020-03-18 16:33:50 +0000 UTC),`

which matches task2's completion time here

`CompletionTime: (*v1.Time)(0xc000b831e0)(2020-03-18 16:33:50 +0000 UTC),`

instead of task1's completion time here

`CompletionTime: (*v1.Time)(0xc000b83080)(2020-03-18 16:33:43 +0000 UTC),`
OK - so you're saying that the CompletionTime is set to the value of the last taskrun that completed?
> OK - so you're saying that the CompletionTime is set to the value of the last taskrun that completed?
yup, the PipelineRun completion time is set to the completion time of the TaskRun that completed last. When a task fails while other tasks are still running, the PipelineRun waits for those tasks to finish, and the latest completion time among them is picked up as the PipelineRun completion time.
pr.IsDone() does return true once PipelineRun.Status.Status.Conditions[*].Status is no longer Unknown, i.e. it is set to False on task failure, but the pipeline completion time is set to the latest of the CompletionTime values among PipelineRun.Status.PipelineRunStatusFields[*].PipelineRunTaskRunStatus[*].CompletionTime.
PipelineRun completion time keeps on updating and in the end is set to last task completion time.
You might be retrieving the completion time as soon as pr.IsDone() returns true, but the reconciler does not return on pr.IsDone() alone (it checks pipelineState as well before returning).
I am trying to think what we can do here without breaking existing callers of IsDone :thinking:
The pipelinerun reconciler defines a Reconcile method, which invokes isDone().
When isDone() returns true, there are a few consequences:
- the reconcile method is not invoked anymore on the pipeline, which is where the extra check on pipelineState is executed

As far as I can tell, the reason the completion time is correct on the pipelinerun is that the taskrun controller updates the parent pipelinerun when it changes status.
> As far as I can tell, the reason for the completion time being correct on the pipelinerun is that the taskrun controller updates the parent pipelinerun when it changes status.
As soon as the first failure is hit, the pipelineRun status is updated to Status: (v1.ConditionStatus) "False", which is not equal to the original status, as the original still has Status: (v1.ConditionStatus) "Unknown"; this results in a call to updateStatus.
The completion time on the pipelinerun is updated by the reconciler's updateStatus method when the pipelineRun Status is either False (matches here in case of first failure) or True, using:

```go
pr.Status.CompletionTime = &metav1.Time{Time: time.Now()}
```
/assign pritidesai
I need to get this fixed while working on finally: finally tasks are executed after one of the pipeline tasks fails, and the pipeline status should not be updated until the finally tasks have finished executing.
finally is being implemented in PR #2661
Lovely!
indeed, all thanks to @afrittoli :smile: