What happened:
When setting continueOn.failed to true on a given task, whether it is in a steps template or a DAG, and subsequent tasks depend on that task, I expect the Workflow to end with a consistent status. Instead, the two examples below finish in different phases: the steps version succeeds while the DAG version fails.
What you expected to happen:
For both workflows to end in the same phase: either both Succeeded or both Failed.
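For context, the pattern under test looks roughly like this in both flavors. This is a condensed, illustrative sketch, not the full linked examples:

```yaml
# Condensed sketch of continueOn.failed (illustrative, not the complete examples).

# steps flavor (status-reference.yaml):
steps:
  - - name: flakey-container
      template: flakey-container
      continueOn:
        failed: true

# dag flavor (dag-continue-on-fail.yaml):
dag:
  tasks:
    - name: B
      dependencies: [A]
      template: intentional-fail
      continueOn:
        failed: true
```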
How to reproduce it (as minimally and precisely as possible):
argo submit https://raw.githubusercontent.com/argoproj/argo/master/examples/status-reference.yaml
argo submit https://raw.githubusercontent.com/argoproj/argo/master/examples/dag-continue-on-fail.yaml
Anything else we need to know?:
Environment:
$ argo version
argo: vv2.5.2+4b25e2a.dirty
BuildDate: 2020-02-24T22:49:11Z
GitCommit: 4b25e2ac1d495991261e97c86d211d658423ab7f
GitTreeState: dirty
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
$ kubectl version -o yaml
clientVersion:
buildDate: 2019-02-13T11:15:10Z
compiler: gc
gitCommit: 954ff68d59e9dc62fa8252ffa9023a90ff8a358c
gitTreeState: clean
gitVersion: v1.10.13
goVersion: go1.9.3
major: "1"
minor: "10"
platform: linux/amd64
serverVersion:
buildDate: 2018-11-13T11:33:04Z
compiler: gc
gitCommit: be1a908c6aa47e0ae1b1dc861a1de6ccfe963aa2
gitTreeState: clean
gitVersion: v1.10.10
goVersion: go1.9.3
major: "1"
minor: "10"
platform: linux/amd64
Other debugging information (if applicable):
Logs for status-reference:
Name: status-reference-sslm2
Namespace: anomaly-detection
ServiceAccount: review-feature-fa-1ro31a-anomaly-detection-backend
Status: Succeeded
Created: Thu Apr 09 13:19:44 -0500 (13 seconds ago)
Started: Thu Apr 09 13:19:44 -0500 (13 seconds ago)
Finished: Thu Apr 09 13:19:57 -0500 (now)
Duration: 13 seconds
STEP PODNAME DURATION MESSAGE
 ✔ status-reference-sslm2 (status-reference)
 ├---✖ flakey-container (flakey-container)  status-reference-sslm2-141736205   8s  failed with exit code 1
 ├-·-✔ failed (failed)                      status-reference-sslm2-1903185625  3s
 └---○ succeeded (succeeded)                                                       when 'Failed == Succeeded' evaluated false
Logs for dag-continue-on-fail:
Name: dag-contiue-on-fail-7jgvr
Namespace: anomaly-detection
ServiceAccount: review-feature-fa-1ro31a-anomaly-detection-backend
Status: Failed
Created: Thu Apr 09 13:06:01 -0500 (11 minutes ago)
Started: Thu Apr 09 13:06:01 -0500 (11 minutes ago)
Finished: Thu Apr 09 13:07:20 -0500 (10 minutes ago)
Duration: 1 minute 19 seconds
STEP PODNAME DURATION MESSAGE
 ✖ dag-contiue-on-fail-7jgvr (workflow)
 ├-✔ A (whalesay)          dag-contiue-on-fail-7jgvr-3913376868  49s
 ├-✖ B (intentional-fail)  dag-contiue-on-fail-7jgvr-3963709725  3s   failed with exit code 1
 ├-✔ C (whalesay)          dag-contiue-on-fail-7jgvr-3946932106  18s
 ├-✖ E (intentional-fail)  dag-contiue-on-fail-7jgvr-3846266392  2s   failed with exit code 1
 ├-✔ F (whalesay)          dag-contiue-on-fail-7jgvr-3896599249  19s
 └-✔ D (whalesay)          dag-contiue-on-fail-7jgvr-3863044011  5s
Logs
kubectl logs <failedpodname> -c init
time="2020-04-09T18:19:47Z" level=info msg="Creating a docker executor"
time="2020-04-09T18:19:47Z" level=info msg="Executor (version: vHEAD+unknown, build_date: 2020-03-04T21:31:18Z) initialized (pod: anomaly-detection/status-reference-sslm2-141736205) with template:\n{\"name\":\"flakey-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"alpine:3.6\",\"command\":[\"sh\",\"-c\"],\"args\":[\"exit 1\"],\"resources\":{},\"source\":\"\"}}"
time="2020-04-09T18:19:47Z" level=info msg="Loading script source to /argo/staging/script"
time="2020-04-09T18:19:47Z" level=info msg="Start loading input artifacts..."
time="2020-04-09T18:19:47Z" level=info msg="Alloc=2357 TotalAlloc=3498 Sys=68610 NumGC=1 Goroutines=3"
kubectl logs <failedpodname> -c wait
time="2020-04-09T18:19:49Z" level=info msg="Creating a docker executor"
time="2020-04-09T18:19:49Z" level=info msg="Executor (version: vHEAD+unknown, build_date: 2020-03-04T21:31:18Z) initialized (pod: anomaly-detection/status-reference-sslm2-141736205) with template:\n{\"name\":\"flakey-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"alpine:3.6\",\"command\":[\"sh\",\"-c\"],\"args\":[\"exit 1\"],\"resources\":{},\"source\":\"\"}}"
time="2020-04-09T18:19:49Z" level=info msg="Waiting on main container"
time="2020-04-09T18:19:51Z" level=info msg="main container started with container ID: a1fbba5c9be4976b61b031a6dee81288c44ee37c2f97147d0d78e4a0f67a4e1f"
time="2020-04-09T18:19:51Z" level=info msg="Starting annotations monitor"
time="2020-04-09T18:19:51Z" level=info msg="docker wait a1fbba5c9be4976b61b031a6dee81288c44ee37c2f97147d0d78e4a0f67a4e1f"
time="2020-04-09T18:19:51Z" level=info msg="Starting deadline monitor"
time="2020-04-09T18:19:51Z" level=info msg="Main container completed"
time="2020-04-09T18:19:51Z" level=info msg="No output parameters"
time="2020-04-09T18:19:51Z" level=info msg="No output artifacts"
time="2020-04-09T18:19:51Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-04-09T18:19:51Z" level=info msg="Killing sidecars"
time="2020-04-09T18:19:51Z" level=info msg="Annotations monitor stopped"
time="2020-04-09T18:19:52Z" level=info msg="Alloc=3055 TotalAlloc=5359 Sys=69442 NumGC=2 Goroutines=8"
For dag-continue-on-fail:
kubectl logs <failedpodname> -c init
# invalid
kubectl logs <failedpodname> -c wait
time="2020-04-09T18:31:09Z" level=info msg="Creating a docker executor"
time="2020-04-09T18:31:09Z" level=info msg="Executor (version: vHEAD+unknown, build_date: 2020-03-04T21:31:18Z) initialized (pod: anomaly-detection/dag-contiue-on-fail-dgxtt-1052672616) with template:\n{\"name\":\"intentional-fail\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"echo intentional failure; exit 1\"],\"resources\":{}}}"
time="2020-04-09T18:31:09Z" level=info msg="Waiting on main container"
time="2020-04-09T18:31:33Z" level=info msg="main container started with container ID: c4bfe034ea2fa7375b2c7ad9a64a04b767a6b87d78944e256ed46fc5fdd28725"
time="2020-04-09T18:31:33Z" level=info msg="Starting annotations monitor"
time="2020-04-09T18:31:33Z" level=info msg="docker wait c4bfe034ea2fa7375b2c7ad9a64a04b767a6b87d78944e256ed46fc5fdd28725"
time="2020-04-09T18:31:33Z" level=info msg="Starting deadline monitor"
time="2020-04-09T18:31:33Z" level=info msg="Main container completed"
time="2020-04-09T18:31:33Z" level=info msg="No output parameters"
time="2020-04-09T18:31:33Z" level=info msg="No output artifacts"
time="2020-04-09T18:31:33Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-04-09T18:31:33Z" level=info msg="Killing sidecars"
time="2020-04-09T18:31:33Z" level=info msg="Annotations monitor stopped"
time="2020-04-09T18:31:33Z" level=info msg="Alloc=3367 TotalAlloc=5377 Sys=70848 NumGC=2 Goroutines=8"
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)
# no access :(
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
Opened https://github.com/argoproj/argo/pull/2656 to fix this.
Note that dag-continue-on-fail actually has two failed tasks, A and E, but only A has continueOn set. The bug in this issue is still valid (dag-continue-on-fail still fails if E is removed), but note that dag-continue-on-fail should still not succeed as-is even after this bug is fixed.
Also, we kindly ask that only one 👍 is added _per organization_ for issues. It helps keep things fair 😄
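The note above (A opting into continueOn.failed while E does not) can be sketched as a toy phase-aggregation rule. This is a hypothetical illustration, not Argo's actual controller code; the `continueOnFailed` field name is invented for the example:

```python
# Hypothetical sketch of how a DAG's overall phase could honor
# continueOn.failed. Not Argo's actual controller logic; the
# "continueOnFailed" key is invented for illustration.

def dag_phase(tasks):
    """Fail the DAG only for failed tasks that did not opt in to continueOn.failed."""
    for task in tasks:
        if task["phase"] == "Failed" and not task.get("continueOnFailed", False):
            return "Failed"
    return "Succeeded"

# Mirrors the note above: A failed but opted in, E failed without opting in,
# so the DAG should still end up Failed even once the reported bug is fixed.
tasks = [
    {"name": "A", "phase": "Failed", "continueOnFailed": True},
    {"name": "E", "phase": "Failed"},
]
print(dag_phase(tasks))  # -> Failed
```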