Is this a BUG REPORT or FEATURE REQUEST?:
BUG
What happened:
The workflow stays running forever.
What you expected to happen:
The workflow should finish (be marked Succeeded or Failed) instead of running forever.
How to reproduce it (as minimally and precisely as possible):
Here is the workflow status:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: 2019-01-24T06:44:57Z
  generateName: dag-diamond-
  generation: 1
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: dag-diamond-8q456
  namespace: default
  resourceVersion: "3363752"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/default/workflows/dag-diamond-8q456
  uid: 9059470d-1fa3-11e9-bc14-00163e0337d4
spec:
  activeDeadlineSeconds: 0
  arguments: {}
  entrypoint: diamond
  templates:
  - container:
      command:
      - echo
      - '{{inputs.parameters.message}}'
      image: alpine:3.7
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: message
    metadata: {}
    name: echo
    outputs: {}
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: message
            value: A
        name: A
        template: echo
      - arguments:
          parameters:
          - name: message
            value: B
        dependencies:
        - A
        name: B
        template: echo
      - arguments:
          parameters:
          - name: message
            value: C
        dependencies:
        - A
        name: C
        template: echo
      - arguments:
          parameters:
          - name: message
            value: D
        dependencies:
        - B
        - C
        name: D
        template: echo
    inputs: {}
    metadata: {}
    name: diamond
    outputs: {}
status:
  finishedAt: 2019-01-24T06:45:17Z
  nodes:
    dag-diamond-8q456:
      children:
      - dag-diamond-8q456-1022901548
      displayName: dag-diamond-8q456
      finishedAt: 2019-01-24T06:45:17Z
      id: dag-diamond-8q456
      name: dag-diamond-8q456
      outboundNodes:
      - dag-diamond-8q456-972568691
      phase: Running
      startedAt: 2019-01-24T06:44:57Z
      templateName: diamond
      type: DAG
    dag-diamond-8q456-972568691:
      boundaryID: dag-diamond-8q456
      displayName: D
      finishedAt: 2019-01-24T06:45:16Z
      id: dag-diamond-8q456-972568691
      inputs:
        parameters:
        - name: message
          value: D
      name: dag-diamond-8q456.D
      phase: Succeeded
      startedAt: 2019-01-24T06:45:14Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1022901548:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-1073234405
      - dag-diamond-8q456-1056456786
      displayName: A
      finishedAt: 2019-01-24T06:45:08Z
      id: dag-diamond-8q456-1022901548
      inputs:
        parameters:
        - name: message
          value: A
      name: dag-diamond-8q456.A
      phase: Succeeded
      startedAt: 2019-01-24T06:44:57Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1056456786:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-972568691
      displayName: C
      finishedAt: 2019-01-24T06:45:11Z
      id: dag-diamond-8q456-1056456786
      inputs:
        parameters:
        - name: message
          value: C
      name: dag-diamond-8q456.C
      phase: Succeeded
      startedAt: 2019-01-24T06:45:09Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1073234405:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-972568691
      displayName: B
      finishedAt: 2019-01-24T06:45:11Z
      id: dag-diamond-8q456-1073234405
      inputs:
        parameters:
        - name: message
          value: B
      name: dag-diamond-8q456.B
      phase: Succeeded
      startedAt: 2019-01-24T06:45:09Z
      templateName: echo
      type: Pod
  phase: Running
  startedAt: 2019-01-24T06:44:57Z
You can see that the root DAG node is still Running, even though all of its child nodes have already completed.
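A quick way to confirm this from the stored object (assuming kubectl access to the namespace and jq installed) is to print the phase of every node:

# Show each node's display name and phase for the stuck workflow (requires jq).
kubectl get workflow dag-diamond-8q456 -n default -o json \
  | jq -r '.status.nodes[] | "\(.displayName)\t\(.phase)"'

Every Pod node reports Succeeded, while the DAG node itself stays Running.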
Anything else we need to know?:
Environment:
argo: v2.2.1
BuildDate: 2018-10-11T16:26:28Z
GitCommit: 3b52b26190163d1f72f3aef1a39f9f291378dafb
GitTreeState: clean
GitTag: v2.2.1
GoVersion: go1.10.3
Compiler: gc
Platform: linux/amd64
[root@iZ8vb5qgxqbxakfo1cuvpaZ ~]# argo get dag-diamond-8q456
Name: dag-diamond-8q456
Namespace: default
ServiceAccount: default
Status: Running
Created: Thu Jan 24 14:44:57 +0800 (34 minutes ago)
Started: Thu Jan 24 14:44:57 +0800 (34 minutes ago)
Finished: Thu Jan 24 14:45:17 +0800 (34 minutes ago)
Duration: 20 seconds
STEP                  PODNAME                       DURATION  MESSAGE
 ● dag-diamond-8q456
 ├-✔ A                dag-diamond-8q456-1022901548  11s
 ├-✔ B                dag-diamond-8q456-1073234405  2s
 ├-✔ C                dag-diamond-8q456-1056456786  2s
 └-✔ D                dag-diamond-8q456-972568691   2s
Can anyone help debug this situation?
I've seen the same issue of workflows with DAGs running forever over the last few days, but haven't been able to pinpoint the exact cause.
Reproduce with https://gist.github.com/ObeA/3d037e095be64b167edf88b74224ab79.
@alexmt @jessesuen Can you help point out the root cause?
Here's another way to reproduce it; in this case a step calls a bunch of sub-steps that fail, with no DAGs involved: https://gist.github.com/duboisf/8c3682adbd34593c6b3b7154c5dcc73d
I have seen it as well.
Working on it
Issue was fixed by https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7#diff-0f6d0392a7803ab237934814167f60ec
The controller incorrectly stopped step group processing after the first step failed and did not assess node status. This is fixed now.
@alexmt Please reopen this issue.
The fix in https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7#diff-0f6d0392a7803ab237934814167f60ec is for the Steps type of workflow. That case can be reproduced and is fixed in my PR #1141.
The issue I raised here is about a DAG workflow. It's different from steps: I have already cherry-picked that commit, and my DAG workflow still hangs forever.
This situation is hard to reproduce, so I think you should reopen this issue. Maybe someone else can reproduce a DAG workflow hanging forever. Thanks.
@xianlubird you are right. https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7 did fix one bug that caused a DAG to get stuck in the Running state (when the DAG is a step of a step group), but I did not realize your example consists of only a DAG. I'll keep looking for a fix.
@alexmt @jessesuen
I can also reproduce this pure DAG issue with a diamond pattern. This is blocking us from using Argo in production.
https://gist.github.com/xubofei1983/e73f184e5770c0a6f8677b7c4069b32f
 ● retry-with-dags-krj86
 ├-✔ hello1(0)            retry-with-dags-krj86-2368448136  1s
 ├-✔ hello2(0)            retry-with-dags-krj86-1872593483  1s
 ├-✔ hello32
 | ├-✔ hellosub21(0)      retry-with-dags-krj86-2590853895  2s
 | ├-✔ hellosub22(0)      retry-with-dags-krj86-4155253284  29s
 | └-✔ hellosub23(0)      retry-with-dags-krj86-528723241   2s
 ├-✔ hello33
 | ├-✔ hellosub21(0)      retry-with-dags-krj86-2923825724  1s
 | ├-✔ hellosub22(0)      retry-with-dags-krj86-2922153183  1s
 | └-✔ hellosub23(0)      retry-with-dags-krj86-1164382842  22s
 └-✖ hello31
   ├-✔ hellosub11(0)      retry-with-dags-krj86-1343433011  1s
   ├-✔ hellosub12(0)      retry-with-dags-krj86-1339882672  10s
   └-✖ hellosub13                                                 No more retries left
     ├-✖ hellosub13(0)    retry-with-dags-krj86-913253045   16s   failed with exit code 1
     ├-✖ hellosub13(1)    retry-with-dags-krj86-2859603944  1s    failed with exit code 1
     ├-✖ hellosub13(2)    retry-with-dags-krj86-711627427   1s    failed with exit code 1
     └-✖ hellosub13(3)    retry-with-dags-krj86-242001190   1s    failed with exit code 1
argo list
NAME STATUS AGE DURATION
retry-with-dags-krj86 Running 3m 3m
retry-with-dags-tqqbw Running 9m 9m
@xubofei1983 if the workflow is still available, can you please attach the output of kubectl get workflow retry-with-dags-krj86 -o=yaml?
@alexmt I created another run and here is the output:
https://gist.github.com/xubofei1983/9d2317db84ee2419c2883169e440036f
Also, in such cases "argo terminate" does not work at all, I think because all the pods have already finished.
So to my knowledge there is no way to stop and resubmit; we have to start from the beginning, which is terrible.
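The only workaround we have found so far (it throws away the run's history, and the manifest file name below is just a placeholder for wherever the spec is saved) is to delete the stuck object and submit it again:

# Delete the stuck Workflow object and resubmit the manifest from scratch.
argo delete retry-with-dags-krj86
argo submit retry-with-dags.yaml   # placeholder path for the original manifest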
Thanks for providing a simple way to reproduce the issue, @xubofei1983. Your case should be fixed by https://github.com/argoproj/argo/pull/1208. The root cause was incorrect handling of a successfully completed step with retries.
Here is the simplest workflow which causes it (a DAG with retry):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-dags-
spec:
  entrypoint: retry-with-dags
  templates:
  - name: retry-with-dags
    dag:
      tasks:
      - name: success1
        template: success
      - name: sub-dag1
        template: sub-dag
        dependencies:
        - success1
      - name: success2
        dependencies:
        - sub-dag1
        template: success
  - name: sub-dag
    dag:
      tasks:
      - name: fail
        template: fail
  - container:
      args:
      - import random; import sys; exit_code = 1; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: fail
  - container:
      args:
      - import random; import sys; exit_code = 0; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: success
    retryStrategy:
      limit: 3
I'm still unable to reproduce the original issue described at https://github.com/argoproj/argo/issues/1190#issue-402573294 (pure DAG without retries), though. I'll keep looking into it.
Thanks @alexmt for the quick fix.
What is the best way to install your fix? Do we update the workflow-controller container to point at the latest tag here: https://hub.docker.com/r/argoproj/workflow-controller/tags?
I can easily recreate this issue, and just cause general chaos, with this test:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: testing-workflow-
spec:
  entrypoint: testing-workflow
  templates:
  - name: testing-workflow
    steps:
    - - name: testing-steps
        template: testing-steps
        withItems:
        - "a"
        - "b"
        - "c"
        - "d"
        - "e"
        - "f"
        - "g"
        - "h"
        - "i"
        - "j"
        - "k"
        - "l"
        - "m"
        - "n"
        - "o"
        - "p"
        - "r"
        - "s"
        - "t"
        - "u"
        - "v"
        - "w"
        - "x"
        - "y"
        - "z"
  - name: testing-steps
    dag:
      tasks:
      - name: run-testing1
        template: run-testing
      - name: run-testing2
        template: run-testing
        dependencies: [run-testing1]
  - name: run-testing
    container:
      image: frolvlad/alpine-bash
      imagePullPolicy: Always
      command: [bash, -c]
      args: ["if (( RANDOM % 2 )); then echo 'fail'; exit 1; else echo 'success'; fi"]
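For anyone else trying to reproduce this, I just submit it with the plain argo CLI (the file name below is only a placeholder for wherever you saved the manifest):

# Submit the manifest and watch it; the workflow regularly gets stuck in Running.
argo submit --watch testing-workflow.yaml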
@andreweskeclarke I have the same problem with my DAG workflow so I did this:
kubectl set image deployment/workflow-controller workflow-controller=argoproj/workflow-controller:latest --namespace=argo
It is not ideal, but until the new version is released, this seems to work.
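You can also double-check which image the controller is actually running afterwards (assuming the default install in the argo namespace):

# Print the image currently used by the workflow-controller deployment.
kubectl get deployment workflow-controller -n argo -o jsonpath='{.spec.template.spec.containers[0].image}'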
I'm trying to use @illagrenan's suggestion to pull the latest tag and use that, but the latest tag still fails for this issue. The version I'm currently using is "Workflow Controller (version: v2.3.0+2b0b8f1.dirty)".
Has anyone else had success with this fix?
@alexmt I built the image from the master branch, but for this test case:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-dags-
spec:
  entrypoint: retry-with-dags
  templates:
  - name: retry-with-dags
    dag:
      tasks:
      - name: success1
        template: success
      - name: sub-dag1
        template: sub-dag
        dependencies:
        - success1
      - name: success2
        dependencies:
        - sub-dag1
        template: success
  - name: sub-dag
    dag:
      tasks:
      - name: fail
        template: fail
  - container:
      args:
      - import random; import sys; exit_code = 1; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: fail
  - container:
      args:
      - import random; import sys; exit_code = 0; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: success
    retryStrategy:
      limit: 3
it still ends up running forever every time:
[root@iZ8vbha5mb49ipi1114n6bZ dag-hang]# ags get retry-with-dags-z5kdr
Name: retry-with-dags-z5kdr
Namespace: default
ServiceAccount: default
Status: Running
Created: Mon Mar 25 17:46:53 +0800 (3 minutes ago)
Started: Mon Mar 25 17:46:53 +0800 (3 minutes ago)
Duration: 3 minutes 24 seconds
STEP                     PODNAME                           DURATION  MESSAGE
 ● retry-with-dags-z5kdr
 ├-✔ success1(0)         retry-with-dags-z5kdr-563744608   8s
 └-● sub-dag1
   └-✖ fail              retry-with-dags-z5kdr-1558755693  9s        failed with exit code 1
I'm having the same issue with DAGs, and I can easily replicate it with the above script using v2.3.0-rc1. This is a pretty urgent bug, as errors will go unreported if the jobs do not properly fail.
@alexmt Please take a look.
cc @sarabala1979
Will try to look into it tomorrow and hopefully include a fix in 2.3.
@alexmt Any update on this?
@alexmt is looking into this. It will be fixed soon.
I have a fix for this.