Pipeline: Add a field to Step that allows it to ignore failed prior Steps *within the same Task*

Created on 12 Nov 2019  ·  16 Comments  ·  Source: tektoncd/pipeline

This is a sub-issue of #1376 and goes a small way to supporting the functionality described there. This issue is focused solely on running Steps inside a Task when prior steps (within the same Task) have failed. This issue specifically does not deal with running whole Tasks when previous Tasks have failed.

Motivating Use Cases

1) A unit test step fails but a subsequent step in the same Task should upload the results to a bucket regardless of that failure.
2) A Pull Request pipeline resource should update a PR with a comment reflecting the status of a failed build. The build happens in one step, and the PR resource could then inject a step at the end which runs regardless of the error and knows how to update GitHub, GitLab, etc. with the status comment.

Current Support

There currently isn't support for running a subsequent Step if a prior one has already failed in the same Task.

Suggested User-Facing Change

Add a field to the Step type that is named something like ignorePreviousStepErrors. It defaults to false. Setting this field to true will tell that Step to run regardless of whether prior steps in the same Task failed.

Example YAML:

apiVersion: tekton.dev/v1alpha1
kind: TaskRun
metadata:
  generateName: test-and-upload-to-gcs-
spec:
  taskSpec:
    steps:
    - name: test
      image: node-run-tests
      command: ['npm']
      args: ['run', 'tests']
    - name: publish-npm-package
      image: node-publish-package
      command: ['npm']
      args: ['publish']
    - name: upload-test-results
      ignorePreviousStepErrors: true
      image: upload-test-results
      command: ['gsutil']
      args: ['cp', '/workspace/test-results.xml', 'gs://mybucket/test-results.xml']
    - name: deploy-to-prod
      image: deploy-image
      command: ['deploy']

How would this example execute?

  1. First the test step (image node-run-tests) would execute.
  2. Let's say the unit tests fail.
  3. The next step, publish-npm-package, would be skipped. This is the behaviour Tekton currently enforces.
  4. The next step, upload-test-results, includes the ignorePreviousStepErrors field and has it set to true. Therefore this step is executed regardless of the failure in the first step.
  5. The final step, deploy-to-prod, does not run because a prior step has failed.
  6. Any subsequent steps (in the same Task) that also include ignorePreviousStepErrors: true would be allowed to run.

The TaskRun would end with a Failed status, since one of its steps ended with an error, but the step to upload test results would be allowed to run regardless.
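The walk-through above can be sketched as a small simulation. This is only an illustration of the proposed semantics, not Tekton's implementation: the `Step` dataclass, the `run_task` helper, and the status strings are assumptions made up for this sketch, and each step's work is replaced by a function that simply returns success or failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """Hypothetical stand-in for a Tekton Step (not the real API type)."""
    name: str
    run: Callable[[], bool]  # returns True on success, False on failure
    ignore_previous_step_errors: bool = False  # proposed field, defaults to false


def run_task(steps: List[Step]) -> Tuple[Dict[str, str], str]:
    """Run steps in order; return per-step statuses and the overall status."""
    statuses: Dict[str, str] = {}
    failed = False  # becomes True once any step has failed
    for step in steps:
        # After a failure, only steps that opted in are allowed to run.
        if failed and not step.ignore_previous_step_errors:
            statuses[step.name] = "Skipped"
            continue
        ok = step.run()
        statuses[step.name] = "Succeeded" if ok else "Failed"
        failed = failed or not ok
    # One failed step is enough to fail the whole run.
    return statuses, ("Failed" if failed else "Succeeded")


steps = [
    Step("test", run=lambda: False),  # pretend the unit tests fail
    Step("publish-npm-package", run=lambda: True),
    Step("upload-test-results", run=lambda: True,
         ignore_previous_step_errors=True),
    Step("deploy-to-prod", run=lambda: True),
]
statuses, task_status = run_task(steps)
for name, status in statuses.items():
    print(f"{name}: {status}")
print(f"TaskRun: {task_status}")
```

Under these assumptions, only upload-test-results runs after the failure, and the overall status stays Failed because the flag affects whether a step runs, not how the run is reported.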

kind/feature priority/awaiting-more-evidence

All 16 comments

/kind feature

Presented during working group today. Feedback I can remember from the meeting:

  • What's the interaction here with output resources?
  • This feels like it's a "finally" clause in a try/catch.
  • It would be neat if this were generalized to handle other failure strategies e.g.:

    1. Ignore this step's failures and continue with the rest as if nothing errored

    2. Ignore all failures in this task and always consider it a success

It seems to me there are two specific levers that a user might want to be able to pull in relation to this feature:

  1. After one step fails, how does it and the remaining steps react?
  2. In the face of a step failure, how will the Task's status be affected?

  • @dibyom can probably comment on the Resources question.
  • re: finally -- it was more just an observation, although finally really would be useful for a "final" Task in a PipelineRun.
  • re: generalized
    (i) I'm suggesting we do something a bit more flexible. Instead of...
    ignorePrevStepErrors: true/false
    do something like...
    entrypointStrategy: IgnorePreviousStepErrors
    (ii) If we're feeling ambitious, we can add an entrypointStrategy to the Task itself as a mechanism to provide the default strategy for each Step.

On the two levers:

  1. I liked the idea in the current proposal where effectively all Steps make a decision independently.
  2. Maybe let the last step decide??
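
To make the entrypointStrategy suggestion concrete, here is a rough sketch of how a Task-level default and a per-Step override might resolve. Nothing here is an existing Tekton API: the `EntrypointStrategy` enum, the `Task` and `Step` dataclasses, and both helper functions are hypothetical. The only name taken from the thread is IgnorePreviousStepErrors; StopOnPreviousStepErrors is an invented label for today's behaviour.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EntrypointStrategy(Enum):
    # Invented name for the current behaviour: skip once a prior step failed.
    STOP_ON_PREVIOUS_STEP_ERRORS = "StopOnPreviousStepErrors"
    # Name suggested in the thread: run even if a prior step failed.
    IGNORE_PREVIOUS_STEP_ERRORS = "IgnorePreviousStepErrors"


@dataclass
class Step:
    name: str
    # None means "inherit the Task's default strategy".
    entrypoint_strategy: Optional[EntrypointStrategy] = None


@dataclass
class Task:
    steps: List[Step] = field(default_factory=list)
    entrypoint_strategy: EntrypointStrategy = (
        EntrypointStrategy.STOP_ON_PREVIOUS_STEP_ERRORS
    )


def effective_strategy(task: Task, step: Step) -> EntrypointStrategy:
    """A Step's own setting wins; otherwise fall back to the Task default."""
    if step.entrypoint_strategy is not None:
        return step.entrypoint_strategy
    return task.entrypoint_strategy


def steps_allowed_after_failure(task: Task) -> List[str]:
    """Names of steps that would still run once an earlier step has failed."""
    ignore = EntrypointStrategy.IGNORE_PREVIOUS_STEP_ERRORS
    return [s.name for s in task.steps if effective_strategy(task, s) is ignore]


per_step = Task(steps=[
    Step("test"),
    Step("upload-test-results", EntrypointStrategy.IGNORE_PREVIOUS_STEP_ERRORS),
    Step("deploy-to-prod"),
])
task_wide = Task(
    steps=[Step("a"), Step("b")],
    entrypoint_strategy=EntrypointStrategy.IGNORE_PREVIOUS_STEP_ERRORS,
)
print(steps_allowed_after_failure(per_step))
print(steps_allowed_after_failure(task_wide))
```

An enum leaves room for further strategies, such as the other failure modes raised in the working group, without adding one boolean field per behaviour.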

I put up a design doc for an alternative that's aligned with #2437:
https://docs.google.com/document/d/1e1fagYDErliLnIwU1g_N3YKVyAProhs1l14UyPIoZuM/edit?usp=sharing

Let me add another use case here:

I have a pipeline where I want to:

  1. provision some resources,
  2. deploy an application in those provisioned resources,
  3. run some integration tests against deployed application,
  4. tear down the resources (undeploy / unprovision any cloud resources).

I want the tear down task to run regardless of whether tests passed or failed.

@gsaslis This sounds like a use case for a Pipeline comprised of reusable single-purpose Tasks, rather than a single Task which is responsible for doing everything at once. In this case, your Pipeline would have Tasks for each bullet point (provision, deploy, integration-test, teardown), and the ongoing Pipeline-level finally work (#2437) should be useful to you.

By putting this logic into a single Task, you make it less reusable, and since steps within a Task are executed sequentially, you lose out on the option to, for instance, deploy to multiple locations concurrently.

@ImJasonH thanks for the pointer. I absolutely meant for those to be separate Tasks (which is why I referred to it as a "task" above).

What I didn't notice was the big, fat "within the same Task" on this issue title 🤕

Yes, this use case was not meant for this particular issue - I just found this one first. Apologies for the noise.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot: Closing this issue.

Reopening as this is still discussed with some regularity during API WGs and on Slack.

/reopen
/remove-lifecycle stale
/remove-lifecycle rotten

@sbwsg: Reopened this issue.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

/remove-lifecycle stale
