Argo: Exit handlers don't run for terminated workflows

Created on 6 Jun 2019 · 13 Comments · Source: argoproj/argo

Is this a BUG REPORT or FEATURE REQUEST?: I'm not sure

What happened:

If a workflow that has an onExit step is terminated, the onExit step is not run.

What you expected to happen:

I would like a way to force an exit handler to run even if the workflow is terminated. We create test infrastructure and would like to make sure it gets torn down whether the workflow exited normally, errored out, or was manually terminated.

If this is the expected behavior of onExit (not clear from the example doc), an onTerm variant that does run even if a workflow is terminated would be very useful.

How to reproduce it (as minimally and precisely as possible):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: term-test-
spec:
  entrypoint: term-test
  # Always run cleanup
  onExit: cleanup

  templates:
  - name: term-test
    steps:
    - - name: sleep
        template: sleep

  - name: sleep
    container:
      image: alpine:3.9
      command: ["/bin/sh", "-c"]
      args: ["sleep 600"]

  - name: cleanup
    container:
      image: alpine:3.9
      command: ["/bin/sh", "-c"]
      args: ["sleep 60; echo bye > /tmp/bye"]
    outputs:
      artifacts:
      - name: bye
        path: /tmp/bye
$ argo submit term-test.yaml
Name:                term-test-fw552
Namespace:           default
ServiceAccount:      default
Status:              Pending
Created:             Thu Jun 06 11:51:13 -0700 (now)
$ argo terminate term-test-fw552
Workflow 'term-test-fw552' terminated
$ argo watch term-test-fw552
Name:                term-test-fw552
Namespace:           default
ServiceAccount:      default
Status:              Failed (Terminated)
Message:             terminated
Created:             Thu Jun 06 11:51:13 -0700 (12 seconds ago)
Started:             Thu Jun 06 11:51:13 -0700 (12 seconds ago)
Finished:            Thu Jun 06 11:51:23 -0700 (2 seconds ago)
Duration:            10 seconds

STEP                       PODNAME                     DURATION  MESSAGE
 ✖ term-test-fw552                                               child 'term-test-fw552-1901370753' failed
 └---✖ sleep               term-test-fw552-1901370753  9s        terminated

 ✖ term-test-fw552.onExit  term-test-fw552-1268251629  1s        terminated

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version
argo: v2.3.0
  BuildDate: 2019-05-20T22:11:23Z
  GitCommit: 88fcc70dcf6e60697e6716edc7464a403c49b27e
  GitTreeState: clean
  GitTag: v2.3.0
  GoVersion: go1.11.5
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version :
$ kubectl version -o yaml
clientVersion:
  buildDate: "2018-12-13T19:44:19Z"
  compiler: gc
  gitCommit: eec55b9ba98609a46fee712359c7b5b365bdd920
  gitTreeState: clean
  gitVersion: v1.13.1
  goVersion: go1.11.2
  major: "1"
  minor: "13"
  platform: darwin/amd64
serverVersion:
  buildDate: "2019-04-12T22:59:24Z"
  compiler: gc
  gitCommit: 8d9b8641e72cf7c96efa61421e87f96387242ba1
  gitTreeState: clean
  gitVersion: v1.12.7-gke.10
  goVersion: go1.10.8b4
  major: "1"
  minor: 12+
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
$ argo get <workflowname>
  • executor logs:
$ kubectl logs <failedpodname> -c init
$ kubectl logs <failedpodname> -c wait
  • workflow-controller logs:
$ kubectl logs -n kube-system $(kubectl get pods -l app=workflow-controller -n kube-system -o name)
Labels: enhancement

All 13 comments

+1 Any updates on this one?

This makes sense. Would it be useful to have onFailure, onSuccess, and onComplete (names taken from Scala)?

The onExit step is behaving as expected. onExit is a special step that executes after all Step or DAG tasks have completed, regardless of whether they Succeeded or Failed. In the case above, the user is forcefully terminating the workflow, which means killing the process, so all remaining steps are terminated as well.
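For reference, a minimal sketch of the normal (non-terminated) case described above: an exit handler that always runs a teardown step and branches on the `{{workflow.status}}` variable for anything else. The template names (`main`, `exit-handler`, `echo`) and messages are made up for illustration, and the entrypoint fails on purpose so the failure branch is exercised.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-status-
spec:
  entrypoint: main
  # Run exit-handler once the entrypoint has finished, whatever its result
  onExit: exit-handler

  templates:
  - name: main
    container:
      image: alpine:3.9
      command: ["/bin/sh", "-c"]
      args: ["exit 1"]          # fail on purpose

  - name: exit-handler
    steps:
    - - name: teardown          # no `when`: runs on success and on failure
        template: echo
        arguments:
          parameters:
          - name: msg
            value: "tearing down test infrastructure"
      - name: report-failure    # runs only if the workflow did not succeed
        template: echo
        when: "{{workflow.status}} != Succeeded"
        arguments:
          parameters:
          - name: msg
            value: "workflow ended as {{workflow.status}}"

  - name: echo
    inputs:
      parameters:
      - name: msg
    container:
      image: alpine:3.9
      command: ["/bin/sh", "-c"]
      args: ["echo {{inputs.parameters.msg}}"]
```

This covers normal exits and errors only. As reported in this issue, running `argo terminate` on such a workflow terminates the handler as well.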

@sarabala1979 So should I open a separate issue to request exit handler support for terminated workflows?

@kbruner We'd also find an onTerminate feature useful. We have some tear-down we'd like to do if a workflow is forcefully stopped in the middle of execution.

Should onTerminate be another template?
Or could there be a single onExit template for the workflow, as today, with onTerminate as a boolean?
onTerminate: true would trigger the workflow exit handler when the workflow is terminated.
The default would be false.
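To make the boolean variant concrete, the spec might look roughly like this. This is only a sketch of the proposal: `onTerminate` is the name suggested in this thread, not a field that Argo accepts, and its placement at the spec level is an assumption.

```yaml
# Hypothetical fragment: `onTerminate` is the flag proposed in this thread.
# It is NOT a field Argo accepts; this only illustrates the idea.
spec:
  entrypoint: term-test
  onExit: cleanup      # existing field: handler template to run at workflow exit
  onTerminate: true    # proposed: also run the onExit handler after `argo terminate`
                       # proposed default: false (today's behavior)
```

The alternative, a separate onTerminate template, would instead name a second handler next to `onExit`.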

Is there any update on this issue? The onTerminate feature is critical for some specific cases.

+1 for onTerminate feature

+1 too. I'm using Argo for continuous deployment and want to let users abort a deployment without implementing any complex logic.

What is the status of this issue? I would also be very happy with the onTerminate feature.

I've been thinking about this request for a bit this morning.

It seems from how it's implemented in the code that "Terminate" is supposed to act as a non-graceful shutdown of the Workflow: if it is triggered for whatever reason, it will shut things down immediately. It is easy to imagine scenarios where a user would want such functionality to stop execution without any regard.

It therefore seems a bit counter-intuitive (both in principle and in its would-be implementation) that new nodes be executed after a user requests an immediate shutdown. Perhaps the reason the Workflow was terminated had nothing to do with the Workflow itself, but with the cluster it was running on. In such a scenario, scheduling more nodes would defeat the purpose of executing "Terminate".

Perhaps a better solution would be a softer, more graceful shutdown that will still run the onExit handler when activated. "Stop" would be the natural verb for this.
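Later Argo Workflows releases expose a choice along these lines as a shutdown strategy on the workflow spec. The fragment below is a sketch that assumes a `shutdown` field taking the values `Stop` and `Terminate`; check the field reference for your version before relying on it.

```yaml
# Fragment of a Workflow spec, not a complete manifest.
# Assumption: a spec-level `shutdown` field, as in later Argo Workflows releases.
spec:
  onExit: cleanup
  shutdown: Stop         # graceful: stop the running steps, then still run onExit
  # shutdown: Terminate  # immediate: stop everything, onExit is not run
```

The field would be set on a workflow that is already running rather than written into the manifest at submit time, since a workflow submitted with it would shut down straight away.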

For those interested in this issue, I've opened #2352

@simster7 I think this PR only partially answers the need.
IMO, terminating a workflow is neither a failure nor a success. Therefore, we should be able to run different steps from the ones in onExit.

Let me give you my example: I am using Argo for my CI/CD. I'd like to give users an Abort button. But when they use this button, the status should not be Failed but Aborted.

Does it make sense?
