Is this a BUG REPORT or FEATURE REQUEST?:
BUG
What happened:
The workflow stays running forever.
What you expected to happen:
The workflow should finish (be marked Succeeded or Failed) instead of running forever.
How to reproduce it (as minimally and precisely as possible):
Here is the workflow status:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: 2019-01-24T06:44:57Z
  generateName: dag-diamond-
  generation: 1
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: dag-diamond-8q456
  namespace: default
  resourceVersion: "3363752"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/default/workflows/dag-diamond-8q456
  uid: 9059470d-1fa3-11e9-bc14-00163e0337d4
spec:
  activeDeadlineSeconds: 0
  arguments: {}
  entrypoint: diamond
  templates:
  - container:
      command:
      - echo
      - '{{inputs.parameters.message}}'
      image: alpine:3.7
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: message
    metadata: {}
    name: echo
    outputs: {}
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: message
            value: A
        name: A
        template: echo
      - arguments:
          parameters:
          - name: message
            value: B
        dependencies:
        - A
        name: B
        template: echo
      - arguments:
          parameters:
          - name: message
            value: C
        dependencies:
        - A
        name: C
        template: echo
      - arguments:
          parameters:
          - name: message
            value: D
        dependencies:
        - B
        - C
        name: D
        template: echo
    inputs: {}
    metadata: {}
    name: diamond
    outputs: {}
status:
  finishedAt: 2019-01-24T06:45:17Z
  nodes:
    dag-diamond-8q456:
      children:
      - dag-diamond-8q456-1022901548
      displayName: dag-diamond-8q456
      finishedAt: 2019-01-24T06:45:17Z
      id: dag-diamond-8q456
      name: dag-diamond-8q456
      outboundNodes:
      - dag-diamond-8q456-972568691
      phase: Running
      startedAt: 2019-01-24T06:44:57Z
      templateName: diamond
      type: DAG
    dag-diamond-8q456-972568691:
      boundaryID: dag-diamond-8q456
      displayName: D
      finishedAt: 2019-01-24T06:45:16Z
      id: dag-diamond-8q456-972568691
      inputs:
        parameters:
        - name: message
          value: D
      name: dag-diamond-8q456.D
      phase: Succeeded
      startedAt: 2019-01-24T06:45:14Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1022901548:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-1073234405
      - dag-diamond-8q456-1056456786
      displayName: A
      finishedAt: 2019-01-24T06:45:08Z
      id: dag-diamond-8q456-1022901548
      inputs:
        parameters:
        - name: message
          value: A
      name: dag-diamond-8q456.A
      phase: Succeeded
      startedAt: 2019-01-24T06:44:57Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1056456786:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-972568691
      displayName: C
      finishedAt: 2019-01-24T06:45:11Z
      id: dag-diamond-8q456-1056456786
      inputs:
        parameters:
        - name: message
          value: C
      name: dag-diamond-8q456.C
      phase: Succeeded
      startedAt: 2019-01-24T06:45:09Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1073234405:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-972568691
      displayName: B
      finishedAt: 2019-01-24T06:45:11Z
      id: dag-diamond-8q456-1073234405
      inputs:
        parameters:
        - name: message
          value: B
      name: dag-diamond-8q456.B
      phase: Succeeded
      startedAt: 2019-01-24T06:45:09Z
      templateName: echo
      type: Pod
  phase: Running
  startedAt: 2019-01-24T06:44:57Z
You can see that the root DAG node is still Running, even though all of its child nodes have already completed.
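A quick way to confirm this from the stored object (assuming kubectl access to the namespace and jq installed) is to print the phase of every node:

# Show each node's display name and phase for the stuck workflow (requires jq).
kubectl get workflow dag-diamond-8q456 -n default -o json \
  | jq -r '.status.nodes[] | "\(.displayName)\t\(.phase)"'

Every Pod node reports Succeeded, while the DAG node itself stays Running.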
Anything else we need to know?:
Environment:
argo: v2.2.1
BuildDate: 2018-10-11T16:26:28Z
GitCommit: 3b52b26190163d1f72f3aef1a39f9f291378dafb
GitTreeState: clean
GitTag: v2.2.1
GoVersion: go1.10.3
Compiler: gc
Platform: linux/amd64
[root@iZ8vb5qgxqbxakfo1cuvpaZ ~]# argo get dag-diamond-8q456
Name: dag-diamond-8q456
Namespace: default
ServiceAccount: default
Status: Running
Created: Thu Jan 24 14:44:57 +0800 (34 minutes ago)
Started: Thu Jan 24 14:44:57 +0800 (34 minutes ago)
Finished: Thu Jan 24 14:45:17 +0800 (34 minutes ago)
Duration: 20 seconds
STEP                  PODNAME                       DURATION  MESSAGE
 ● dag-diamond-8q456
 ├-✔ A                dag-diamond-8q456-1022901548  11s
 ├-✔ B                dag-diamond-8q456-1073234405  2s
 ├-✔ C                dag-diamond-8q456-1056456786  2s
 └-✔ D                dag-diamond-8q456-972568691   2s
Can anyone help debug this situation?
I've seen the same issue of workflows with DAGs running forever over the last few days, but haven't been able to pinpoint the exact cause.
Reproduce with https://gist.github.com/ObeA/3d037e095be64b167edf88b74224ab79.
@alexmt @jessesuen Can you help point out the root cause?
Here's another way to reproduce it; in this case a step calls a bunch of sub-steps that fail, with no DAGs involved: https://gist.github.com/duboisf/8c3682adbd34593c6b3b7154c5dcc73d
I have seen it as well.
Working on it
Issue was fixed by https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7#diff-0f6d0392a7803ab237934814167f60ec
The controller incorrectly stopped step group processing after the first step failed and did not assess node status. This is fixed now.
@alexmt Please reopen this issue.
The fix in https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7#diff-0f6d0392a7803ab237934814167f60ec is for the Steps type of workflow. That case can be reproduced and is fixed in my PR #1141.
The issue I raised here is about a DAG workflow. It's different from steps: I have already cherry-picked that commit, and my DAG workflow still hangs forever.
This situation is hard to reproduce, so I think you should reopen this issue. Maybe someone else can reproduce a DAG workflow hanging forever. Thanks.
@xianlubird you are right. https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7 did fix one bug that caused a DAG to get stuck in the Running state (when the DAG is a step of a step group), but I did not realize your example consists of only a DAG. I'll keep looking for a fix.
@alexmt @jessesuen
I can also reproduce this pure DAG issue with a diamond pattern. This is blocking us from using Argo in production.
https://gist.github.com/xubofei1983/e73f184e5770c0a6f8677b7c4069b32f
 ● retry-with-dags-krj86
 ├-✔ hello1(0)            retry-with-dags-krj86-2368448136  1s
 ├-✔ hello2(0)            retry-with-dags-krj86-1872593483  1s
 ├-✔ hello32
 | ├-✔ hellosub21(0)      retry-with-dags-krj86-2590853895  2s
 | ├-✔ hellosub22(0)      retry-with-dags-krj86-4155253284  29s
 | └-✔ hellosub23(0)      retry-with-dags-krj86-528723241   2s
 ├-✔ hello33
 | ├-✔ hellosub21(0)      retry-with-dags-krj86-2923825724  1s
 | ├-✔ hellosub22(0)      retry-with-dags-krj86-2922153183  1s
 | └-✔ hellosub23(0)      retry-with-dags-krj86-1164382842  22s
 └-✖ hello31
   ├-✔ hellosub11(0)      retry-with-dags-krj86-1343433011  1s
   ├-✔ hellosub12(0)      retry-with-dags-krj86-1339882672  10s
   └-✖ hellosub13                                                 No more retries left
     ├-✖ hellosub13(0)    retry-with-dags-krj86-913253045   16s   failed with exit code 1
     ├-✖ hellosub13(1)    retry-with-dags-krj86-2859603944  1s    failed with exit code 1
     ├-✖ hellosub13(2)    retry-with-dags-krj86-711627427   1s    failed with exit code 1
     └-✖ hellosub13(3)    retry-with-dags-krj86-242001190   1s    failed with exit code 1
argo list
NAME STATUS AGE DURATION
retry-with-dags-krj86 Running 3m 3m
retry-with-dags-tqqbw Running 9m 9m
@xubofei1983 if the workflow is still available, can you please attach the output of kubectl get workflow retry-with-dags-krj86 -o=yaml?
@alexmt I created another run and here is the output:
https://gist.github.com/xubofei1983/9d2317db84ee2419c2883169e440036f
Also, in such cases "argo terminate" does not work at all, I think because all the pods have already finished.
So to my knowledge there is no way to stop and resubmit; we have to start from the beginning, which is terrible.
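The only workaround we have found so far (it throws away the run's history, and the manifest file name below is just a placeholder for wherever the spec is saved) is to delete the stuck object and submit it again:

# Delete the stuck Workflow object and resubmit the manifest from scratch.
argo delete retry-with-dags-krj86
argo submit retry-with-dags.yaml   # placeholder path for the original manifest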
Thanks for providing a simple way to reproduce the issue, @xubofei1983. Your case should be fixed by https://github.com/argoproj/argo/pull/1208. The root cause was incorrect handling of a successfully completed step with retries.
Here is the simplest workflow which causes it (a DAG with retry):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-dags-
spec:
  entrypoint: retry-with-dags
  templates:
  - name: retry-with-dags
    dag:
      tasks:
      - name: success1
        template: success
      - name: sub-dag1
        template: sub-dag
        dependencies:
        - success1
      - name: success2
        dependencies:
        - sub-dag1
        template: success
  - name: sub-dag
    dag:
      tasks:
      - name: fail
        template: fail
  - container:
      args:
      - import random; import sys; exit_code = 1; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: fail
  - container:
      args:
      - import random; import sys; exit_code = 0; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: success
    retryStrategy:
      limit: 3
I'm still unable to reproduce the original issue described at https://github.com/argoproj/argo/issues/1190#issue-402573294 (pure DAG without retries), though. I'll keep looking into it.
Thanks @alexmt for the quick fix.
What is the best way to install your fix? Do we update the workflow-controller container to point at the latest tag here: https://hub.docker.com/r/argoproj/workflow-controller/tags?
I can easily recreate this issue, and just cause general chaos, with this test:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: testing-workflow-
spec:
  entrypoint: testing-workflow
  templates:
  - name: testing-workflow
    steps:
    - - name: testing-steps
        template: testing-steps
        withItems:
        - "a"
        - "b"
        - "c"
        - "d"
        - "e"
        - "f"
        - "g"
        - "h"
        - "i"
        - "j"
        - "k"
        - "l"
        - "m"
        - "n"
        - "o"
        - "p"
        - "r"
        - "s"
        - "t"
        - "u"
        - "v"
        - "w"
        - "x"
        - "y"
        - "z"
  - name: testing-steps
    dag:
      tasks:
      - name: run-testing1
        template: run-testing
      - name: run-testing2
        template: run-testing
        dependencies: [run-testing1]
  - name: run-testing
    container:
      image: frolvlad/alpine-bash
      imagePullPolicy: Always
      command: [bash, -c]
      args: ["if (( RANDOM % 2 )); then echo 'fail'; exit 1; else echo 'success'; fi"]
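For anyone else trying to reproduce this, I just submit it with the plain argo CLI (the file name below is only a placeholder for wherever you saved the manifest):

# Submit the manifest and watch it; the workflow regularly gets stuck in Running.
argo submit --watch testing-workflow.yaml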
@andreweskeclarke I have the same problem with my DAG workflow so I did this:
kubectl set image deployment/workflow-controller workflow-controller=argoproj/workflow-controller:latest --namespace=argo
It is not ideal, but until the new version is released, this seems to work.
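You can also double-check which image the controller is actually running afterwards (assuming the default install in the argo namespace):

# Print the image currently used by the workflow-controller deployment.
kubectl get deployment workflow-controller -n argo -o jsonpath='{.spec.template.spec.containers[0].image}'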
I'm trying to use @illagrenan's suggestion to pull the latest tag and use that, but the latest tag still fails for this issue. The version I'm currently using is "Workflow Controller (version: v2.3.0+2b0b8f1.dirty)".
Has anyone else had success with this fix?
@alexmt I built the image from the master branch, but for this test case:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-dags-
spec:
  entrypoint: retry-with-dags
  templates:
  - name: retry-with-dags
    dag:
      tasks:
      - name: success1
        template: success
      - name: sub-dag1
        template: sub-dag
        dependencies:
        - success1
      - name: success2
        dependencies:
        - sub-dag1
        template: success
  - name: sub-dag
    dag:
      tasks:
      - name: fail
        template: fail
  - container:
      args:
      - import random; import sys; exit_code = 1; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: fail
  - container:
      args:
      - import random; import sys; exit_code = 0; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: success
    retryStrategy:
      limit: 3
it still ends up running forever every time:
[root@iZ8vbha5mb49ipi1114n6bZ dag-hang]# ags get retry-with-dags-z5kdr
Name: retry-with-dags-z5kdr
Namespace: default
ServiceAccount: default
Status: Running
Created: Mon Mar 25 17:46:53 +0800 (3 minutes ago)
Started: Mon Mar 25 17:46:53 +0800 (3 minutes ago)
Duration: 3 minutes 24 seconds
STEP                     PODNAME                           DURATION  MESSAGE
 ● retry-with-dags-z5kdr
 ├-✔ success1(0)         retry-with-dags-z5kdr-563744608   8s
 └-● sub-dag1
   └-✖ fail              retry-with-dags-z5kdr-1558755693  9s        failed with exit code 1
I'm having the same issue with DAGs, and I can easily replicate it with the above script using v2.3.0-rc1. This is a pretty urgent bug, as errors will go unreported if the jobs do not properly fail.
@alexmt Please take a look.
cc @sarabala1979
Will try to look into it tomorrow and hopefully include a fix in 2.3.
@alexmt Any update on this?
@alexmt is looking into this. It will be fixed soon.
I have a fix for this.