Argo: steps and dag exhibit different behavior when using continueOn failed true

Created on 9 Apr 2020 · 1 comment · Source: argoproj/argo

Checklist:

  • [x] I've included the version.
  • [x] I've included reproduction steps.
  • [x] I've included the workflow YAML.
  • [x] I've included the logs.

What happened:

When continueOn.failed is set to true on a given task, whether in a steps template or a DAG, and subsequent tasks depend on that task, I expect the Workflow to end with a consistent status in both cases. Instead, the steps example finishes with Status: Succeeded while the equivalent DAG example finishes with Status: Failed (see the workflow results below).
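
For reference, the relevant pattern in both examples is the same flag applied to two template types. The following is a condensed, illustrative sketch (not the exact contents of the linked example files; the template and step/task names here are made up):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: continue-on-compare-
spec:
  entrypoint: steps-main          # switch to dag-main to run the DAG form
  templates:
  - name: steps-main
    steps:
    - - name: flakey
        template: intentional-fail
        continueOn:
          failed: true            # tolerate this step's failure
    - - name: after
        template: whalesay
  - name: dag-main
    dag:
      tasks:
      - name: B
        template: intentional-fail
        continueOn:
          failed: true            # same flag, now on a DAG task
      - name: C
        dependencies: [B]
        template: whalesay
  - name: intentional-fail
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo intentional failure; exit 1"]
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["ok"]

In both forms the expectation is the same: a failure tolerated by continueOn should lead to the same final Workflow phase regardless of the template type.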

What you expected to happen:
For both workflows to finish with the same status: either both Succeeded or both Failed.

How to reproduce it (as minimally and precisely as possible):

argo submit https://raw.githubusercontent.com/argoproj/argo/master/examples/status-reference.yaml
argo submit https://raw.githubusercontent.com/argoproj/argo/master/examples/dag-continue-on-fail.yaml

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version
argo: vv2.5.2+4b25e2a.dirty
  BuildDate: 2020-02-24T22:49:11Z
  GitCommit: 4b25e2ac1d495991261e97c86d211d658423ab7f
  GitTreeState: dirty
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: 2019-02-13T11:15:10Z
  compiler: gc
  gitCommit: 954ff68d59e9dc62fa8252ffa9023a90ff8a358c
  gitTreeState: clean
  gitVersion: v1.10.13
  goVersion: go1.9.3
  major: "1"
  minor: "10"
  platform: linux/amd64
serverVersion:
  buildDate: 2018-11-13T11:33:04Z
  compiler: gc
  gitCommit: be1a908c6aa47e0ae1b1dc861a1de6ccfe963aa2
  gitTreeState: clean
  gitVersion: v1.10.10
  goVersion: go1.9.3
  major: "1"
  minor: "10"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:

Workflow result for status-reference:

Name:                status-reference-sslm2
Namespace:           anomaly-detection
ServiceAccount:      review-feature-fa-1ro31a-anomaly-detection-backend
Status:              Succeeded
Created:             Thu Apr 09 13:19:44 -0500 (13 seconds ago)
Started:             Thu Apr 09 13:19:44 -0500 (13 seconds ago)
Finished:            Thu Apr 09 13:19:57 -0500 (now)
Duration:            13 seconds

STEP                                          PODNAME                            DURATION  MESSAGE
 ✔ status-reference-sslm2 (status-reference)
 ├---✖ flakey-container (flakey-container)    status-reference-sslm2-141736205   8s        failed with exit code 1
 └-·-✔ failed (failed)                        status-reference-sslm2-1903185625  3s
   └-○ succeeded (succeeded)                                                               when 'Failed == Succeeded' evaluated false

Workflow result for dag-continue-on-fail:

Name:                dag-contiue-on-fail-7jgvr
Namespace:           anomaly-detection
ServiceAccount:      review-feature-fa-1ro31a-anomaly-detection-backend
Status:              Failed
Created:             Thu Apr 09 13:06:01 -0500 (11 minutes ago)
Started:             Thu Apr 09 13:06:01 -0500 (11 minutes ago)
Finished:            Thu Apr 09 13:07:20 -0500 (10 minutes ago)
Duration:            1 minute 19 seconds

STEP                                     PODNAME                               DURATION  MESSAGE
 ✖ dag-contiue-on-fail-7jgvr (workflow)
 ├-✔ A (whalesay)                        dag-contiue-on-fail-7jgvr-3913376868  49s
 ├-✖ B (intentional-fail)                dag-contiue-on-fail-7jgvr-3963709725  3s        failed with exit code 1
 ├-✔ C (whalesay)                        dag-contiue-on-fail-7jgvr-3946932106  18s
 ├-✖ E (intentional-fail)                dag-contiue-on-fail-7jgvr-3846266392  2s        failed with exit code 1
 ├-✔ F (whalesay)                        dag-contiue-on-fail-7jgvr-3896599249  19s
 └-✔ D (whalesay)                        dag-contiue-on-fail-7jgvr-3863044011  5s

Logs

For status-reference:

kubectl logs <failedpodname> -c init

time="2020-04-09T18:19:47Z" level=info msg="Creating a docker executor"
time="2020-04-09T18:19:47Z" level=info msg="Executor (version: vHEAD+unknown, build_date: 2020-03-04T21:31:18Z) initialized (pod: anomaly-detection/status-reference-sslm2-141736205) with template:\n{\"name\":\"flakey-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"alpine:3.6\",\"command\":[\"sh\",\"-c\"],\"args\":[\"exit 1\"],\"resources\":{},\"source\":\"\"}}"
time="2020-04-09T18:19:47Z" level=info msg="Loading script source to /argo/staging/script"
time="2020-04-09T18:19:47Z" level=info msg="Start loading input artifacts..."
time="2020-04-09T18:19:47Z" level=info msg="Alloc=2357 TotalAlloc=3498 Sys=68610 NumGC=1 Goroutines=3"

kubectl logs <failedpodname> -c wait

time="2020-04-09T18:19:49Z" level=info msg="Creating a docker executor"
time="2020-04-09T18:19:49Z" level=info msg="Executor (version: vHEAD+unknown, build_date: 2020-03-04T21:31:18Z) initialized (pod: anomaly-detection/status-reference-sslm2-141736205) with template:\n{\"name\":\"flakey-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"alpine:3.6\",\"command\":[\"sh\",\"-c\"],\"args\":[\"exit 1\"],\"resources\":{},\"source\":\"\"}}"
time="2020-04-09T18:19:49Z" level=info msg="Waiting on main container"
time="2020-04-09T18:19:51Z" level=info msg="main container started with container ID: a1fbba5c9be4976b61b031a6dee81288c44ee37c2f97147d0d78e4a0f67a4e1f"
time="2020-04-09T18:19:51Z" level=info msg="Starting annotations monitor"
time="2020-04-09T18:19:51Z" level=info msg="docker wait a1fbba5c9be4976b61b031a6dee81288c44ee37c2f97147d0d78e4a0f67a4e1f"
time="2020-04-09T18:19:51Z" level=info msg="Starting deadline monitor"
time="2020-04-09T18:19:51Z" level=info msg="Main container completed"
time="2020-04-09T18:19:51Z" level=info msg="No output parameters"
time="2020-04-09T18:19:51Z" level=info msg="No output artifacts"
time="2020-04-09T18:19:51Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-04-09T18:19:51Z" level=info msg="Killing sidecars"
time="2020-04-09T18:19:51Z" level=info msg="Annotations monitor stopped"
time="2020-04-09T18:19:52Z" level=info msg="Alloc=3055 TotalAlloc=5359 Sys=69442 NumGC=2 Goroutines=8"

For dag-continue-on-fail:

kubectl logs <failedpodname> -c init
# invalid
kubectl logs <failedpodname> -c wait
time="2020-04-09T18:31:09Z" level=info msg="Creating a docker executor"
time="2020-04-09T18:31:09Z" level=info msg="Executor (version: vHEAD+unknown, build_date: 2020-03-04T21:31:18Z) initialized (pod: anomaly-detection/dag-contiue-on-fail-dgxtt-1052672616) with template:\n{\"name\":\"intentional-fail\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"echo intentional failure; exit 1\"],\"resources\":{}}}"
time="2020-04-09T18:31:09Z" level=info msg="Waiting on main container"
time="2020-04-09T18:31:33Z" level=info msg="main container started with container ID: c4bfe034ea2fa7375b2c7ad9a64a04b767a6b87d78944e256ed46fc5fdd28725"
time="2020-04-09T18:31:33Z" level=info msg="Starting annotations monitor"
time="2020-04-09T18:31:33Z" level=info msg="docker wait c4bfe034ea2fa7375b2c7ad9a64a04b767a6b87d78944e256ed46fc5fdd28725"
time="2020-04-09T18:31:33Z" level=info msg="Starting deadline monitor"
time="2020-04-09T18:31:33Z" level=info msg="Main container completed"
time="2020-04-09T18:31:33Z" level=info msg="No output parameters"
time="2020-04-09T18:31:33Z" level=info msg="No output artifacts"
time="2020-04-09T18:31:33Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-04-09T18:31:33Z" level=info msg="Killing sidecars"
time="2020-04-09T18:31:33Z" level=info msg="Annotations monitor stopped"
time="2020-04-09T18:31:33Z" level=info msg="Alloc=3367 TotalAlloc=5377 Sys=70848 NumGC=2 Goroutines=8"
  • workflow-controller logs:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)
# no access :(


Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

bug

Most helpful comment

Opened https://github.com/argoproj/argo/pull/2656 to fix this.

Note that dag-continue-on-fail actually has two failed tasks, B and E, but only B has continueOn set. The bug in this issue is still valid (dag-continue-on-fail still fails if E is removed), but note that dag-continue-on-fail should still not succeed as-is even after this bug is fixed.
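
To make that concrete, here is a minimal, self-contained sketch of the situation being described (task names B and E are taken from the workflow result above; everything else is illustrative, not the exact example file):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: mixed-continue-on-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: B
        template: intentional-fail
        continueOn:
          failed: true            # B's failure is tolerated
      - name: E
        dependencies: [B]
        template: intentional-fail   # E also fails, but has no continueOn,
                                     # so the workflow should still end Failed
  - name: intentional-fail
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; exit 1"]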

Also, we kindly ask that only one 👍 is added _per organization_ for issues. It helps keep things fair 🙂

