Argo: DAG workflow stays running forever in some long-running situations

Created on 24 Jan 2019 · 25 Comments · Source: argoproj/argo

Is this a BUG REPORT or FEATURE REQUEST?:
BUG
What happened:
The workflow stays running forever.
What you expected to happen:
Workflow failed
How to reproduce it (as minimally and precisely as possible):
Here is the workflow status:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: 2019-01-24T06:44:57Z
  generateName: dag-diamond-
  generation: 1
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: dag-diamond-8q456
  namespace: default
  resourceVersion: "3363752"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/default/workflows/dag-diamond-8q456
  uid: 9059470d-1fa3-11e9-bc14-00163e0337d4
spec:
  activeDeadlineSeconds: 0
  arguments: {}
  entrypoint: diamond
  templates:
  - container:
      command:
      - echo
      - '{{inputs.parameters.message}}'
      image: alpine:3.7
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: message
    metadata: {}
    name: echo
    outputs: {}
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: message
            value: A
        name: A
        template: echo
      - arguments:
          parameters:
          - name: message
            value: B
        dependencies:
        - A
        name: B
        template: echo
      - arguments:
          parameters:
          - name: message
            value: C
        dependencies:
        - A
        name: C
        template: echo
      - arguments:
          parameters:
          - name: message
            value: D
        dependencies:
        - B
        - C
        name: D
        template: echo
    inputs: {}
    metadata: {}
    name: diamond
    outputs: {}
status:
  finishedAt: 2019-01-24T06:45:17Z
  nodes:
    dag-diamond-8q456:
      children:
      - dag-diamond-8q456-1022901548
      displayName: dag-diamond-8q456
      finishedAt: 2019-01-24T06:45:17Z
      id: dag-diamond-8q456
      name: dag-diamond-8q456
      outboundNodes:
      - dag-diamond-8q456-972568691
      phase: Running
      startedAt: 2019-01-24T06:44:57Z
      templateName: diamond
      type: DAG
    dag-diamond-8q456-972568691:
      boundaryID: dag-diamond-8q456
      displayName: D
      finishedAt: 2019-01-24T06:45:16Z
      id: dag-diamond-8q456-972568691
      inputs:
        parameters:
        - name: message
          value: D
      name: dag-diamond-8q456.D
      phase: Succeeded
      startedAt: 2019-01-24T06:45:14Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1022901548:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-1073234405
      - dag-diamond-8q456-1056456786
      displayName: A
      finishedAt: 2019-01-24T06:45:08Z
      id: dag-diamond-8q456-1022901548
      inputs:
        parameters:
        - name: message
          value: A
      name: dag-diamond-8q456.A
      phase: Succeeded
      startedAt: 2019-01-24T06:44:57Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1056456786:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-972568691
      displayName: C
      finishedAt: 2019-01-24T06:45:11Z
      id: dag-diamond-8q456-1056456786
      inputs:
        parameters:
        - name: message
          value: C
      name: dag-diamond-8q456.C
      phase: Succeeded
      startedAt: 2019-01-24T06:45:09Z
      templateName: echo
      type: Pod
    dag-diamond-8q456-1073234405:
      boundaryID: dag-diamond-8q456
      children:
      - dag-diamond-8q456-972568691
      displayName: B
      finishedAt: 2019-01-24T06:45:11Z
      id: dag-diamond-8q456-1073234405
      inputs:
        parameters:
        - name: message
          value: B
      name: dag-diamond-8q456.B
      phase: Succeeded
      startedAt: 2019-01-24T06:45:09Z
      templateName: echo
      type: Pod
  phase: Running
  startedAt: 2019-01-24T06:44:57Z

You can see that the root DAG node is still Running, even though all of its child nodes have finished (the status above shows each of them as Succeeded).
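As a rough illustration of the symptom, the following Python sketch checks a workflow status for this pattern: the workflow phase is Running while every Pod node has already reached a terminal phase. The helper name `looks_stuck` and the trimmed `status` dict are assumptions made for this example, not part of Argo. The check is only a heuristic, since a healthy workflow can briefly look like this between two scheduled tasks.

```python
# Hypothetical helper, not part of Argo: flags a workflow whose recorded
# phase is still "Running" although every Pod node has already finished.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Skipped"}


def looks_stuck(status):
    """Return True if the workflow is Running but all Pod nodes are terminal."""
    if status.get("phase") != "Running":
        return False
    pods = [n for n in status.get("nodes", {}).values() if n.get("type") == "Pod"]
    return bool(pods) and all(n.get("phase") in TERMINAL_PHASES for n in pods)


# Trimmed copy of the status reported above (only the fields the check reads).
status = {
    "phase": "Running",
    "nodes": {
        "dag-diamond-8q456": {"type": "DAG", "phase": "Running"},
        "dag-diamond-8q456-1022901548": {"type": "Pod", "phase": "Succeeded"},
        "dag-diamond-8q456-1073234405": {"type": "Pod", "phase": "Succeeded"},
        "dag-diamond-8q456-1056456786": {"type": "Pod", "phase": "Succeeded"},
        "dag-diamond-8q456-972568691": {"type": "Pod", "phase": "Succeeded"},
    },
}

print(looks_stuck(status))
```

In practice the dict would come from parsing the workflow's YAML or JSON; that step is left out to keep the sketch self-contained.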
Anything else we need to know?:

Environment:

  • Argo version:
argo: v2.2.1
  BuildDate: 2018-10-11T16:26:28Z
  GitCommit: 3b52b26190163d1f72f3aef1a39f9f291378dafb
  GitTreeState: clean
  GitTag: v2.2.1
  GoVersion: go1.10.3
  Compiler: gc
  Platform: linux/amd64
[root@iZ8vb5qgxqbxakfo1cuvpaZ ~]# argo get dag-diamond-8q456
Name:                dag-diamond-8q456
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Thu Jan 24 14:44:57 +0800 (34 minutes ago)
Started:             Thu Jan 24 14:44:57 +0800 (34 minutes ago)
Finished:            Thu Jan 24 14:45:17 +0800 (34 minutes ago)
Duration:            20 seconds

STEP                  PODNAME                       DURATION  MESSAGE
 โ— dag-diamond-8q456
 โ”œ-โœ” A                dag-diamond-8q456-1022901548  11s
 โ”œ-โœ” B                dag-diamond-8q456-1073234405  2s
 โ”œ-โœ” C                dag-diamond-8q456-1056456786  2s
 โ””-โœ” D                dag-diamond-8q456-972568691   2s

Can anyone help debug this situation?

bug

All 25 comments

I've seen the same issue of workflows with DAGs running forever over the last few days, but haven't been able to pinpoint the exact cause.

Reproduce with https://gist.github.com/ObeA/3d037e095be64b167edf88b74224ab79.

@alexmt @jessesuen Can you help pinpoint the root cause?

Here's another way to reproduce it, in this case a step calls a bunch of sub-steps that fail, no DAGs involved: https://gist.github.com/duboisf/8c3682adbd34593c6b3b7154c5dcc73d

I have seen it as well.

Working on it

Issue was fixed by https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7#diff-0f6d0392a7803ab237934814167f60ec

The controller incorrectly stopped processing a step group after the first step failed and did not assess node status. This is fixed now.

@alexmt Please reopen this issue.

The fix in https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7#diff-0f6d0392a7803ab237934814167f60ec is for Steps-type workflows. That case can be reproduced and is fixed in my PR #1141.

The issue I raised is about a DAG workflow. It is different from Steps: I already cherry-picked that commit, and my DAG workflow still hangs forever.

This situation is hard to reproduce, so I think you should reopen this issue. Maybe someone else can reproduce this DAG workflow hanging forever. Thanks.

@xianlubird you are right. https://github.com/argoproj/argo/commit/cb538489a187134577e2146afcf9367f45088ff7 did fix one bug that caused a DAG to get stuck in the Running state (if the DAG is a step of a step group), but I did not realize your example consists only of a DAG. I'll keep looking for a fix.

@alexmt @jessesuen

I can also reproduce this pure DAG issue, in a diamond pattern. This is blocking us from using Argo in production.

https://gist.github.com/xubofei1983/e73f184e5770c0a6f8677b7c4069b32f

โ— retry-with-dags-krj86
โ”œ-โœ” hello1(0) retry-with-dags-krj86-2368448136 1s
โ”œ-โœ” hello2(0) retry-with-dags-krj86-1872593483 1s
โ”œ-โœ” hello32
| โ”œ-โœ” hellosub21(0) retry-with-dags-krj86-2590853895 2s
| โ”œ-โœ” hellosub22(0) retry-with-dags-krj86-4155253284 29s
| โ””-โœ” hellosub23(0) retry-with-dags-krj86-528723241 2s
โ”œ-โœ” hello33
| โ”œ-โœ” hellosub21(0) retry-with-dags-krj86-2923825724 1s
| โ”œ-โœ” hellosub22(0) retry-with-dags-krj86-2922153183 1s
| โ””-โœ” hellosub23(0) retry-with-dags-krj86-1164382842 22s
โ””-โœ– hello31
โ”œ-โœ” hellosub11(0) retry-with-dags-krj86-1343433011 1s
โ”œ-โœ” hellosub12(0) retry-with-dags-krj86-1339882672 10s
โ””-โœ– hellosub13 No more retries left
โ”œ-โœ– hellosub13(0) retry-with-dags-krj86-913253045 16s failed with exit code 1
โ”œ-โœ– hellosub13(1) retry-with-dags-krj86-2859603944 1s failed with exit code 1
โ”œ-โœ– hellosub13(2) retry-with-dags-krj86-711627427 1s failed with exit code 1
โ””-โœ– hellosub13(3) retry-with-dags-krj86-242001190 1s failed with exit code 1

argo list
NAME STATUS AGE DURATION
retry-with-dags-krj86 Running 3m 3m
retry-with-dags-tqqbw Running 9m 9m

@xubofei1983 if the workflow is still available, can you please attach the output of kubectl get workflow retry-with-dags-krj86 -o=yaml?

@alexmt I created another run; here is the output:
https://gist.github.com/xubofei1983/9d2317db84ee2419c2883169e440036f

Also, in this case "argo terminate" does not work at all, I think because all pods have already finished.

So, to my knowledge, there is no way to stop and resubmit. We have to start from the beginning, which is terrible.

Thanks for providing a simple way to reproduce the issue, @xubofei1983. Your case should be fixed by https://github.com/argoproj/argo/pull/1208. The root cause was incorrect handling of a successfully completed step with retries.

Here is the simplest workflow that causes it:

dag with retry

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-dags-
spec:
  entrypoint: retry-with-dags
  templates:
  - name: retry-with-dags

    dag:
      tasks:
      - name: success1
        template: success

      - name: sub-dag1
        template: sub-dag
        dependencies:
        - success1

      - name: success2
        dependencies:
        - sub-dag1
        template: success

  - name: sub-dag
    dag:
      tasks:
      - name: fail
        template: fail

  - container:
      args:
      - import random; import sys; exit_code = 1; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: fail

  - container:
      args:
      - import random; import sys; exit_code = 0; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: success
    retryStrategy:
      limit: 3

I'm still unable to reproduce the original issue described at https://github.com/argoproj/argo/issues/1190#issue-402573294 (pure DAG without retries), though. I'll keep looking at it.
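For reference, the outcome one would expect for the retry-with-dags workflow above can be sketched with a toy model of DAG semantics: a task runs only if all of its dependencies succeeded, and the DAG fails if any task fails. This is an illustration in Python under those stated assumptions, not Argo's controller logic; `expected_dag_phase` and the `NotRun` label are hypothetical names.

```python
# Toy model of DAG semantics (an assumption for illustration, not Argo code).
def expected_dag_phase(dependencies, outcomes):
    """dependencies: task -> list of tasks it depends on (must be acyclic).
    outcomes: task -> "Succeeded" or "Failed", the result if the task runs.
    Returns (final phase of the DAG, per-task phase)."""
    phases = {}

    def resolve(task):
        if task not in phases:
            # A task is only started when every dependency succeeded.
            deps_ok = all(resolve(dep) == "Succeeded" for dep in dependencies[task])
            phases[task] = outcomes[task] if deps_ok else "NotRun"
        return phases[task]

    for task in dependencies:
        resolve(task)
    final = "Succeeded" if all(p == "Succeeded" for p in phases.values()) else "Failed"
    return final, phases


# Top-level DAG of the workflow above; sub-dag1 fails because its only task fails.
dependencies = {"success1": [], "sub-dag1": ["success1"], "success2": ["sub-dag1"]}
outcomes = {"success1": "Succeeded", "sub-dag1": "Failed", "success2": "Succeeded"}

final, phases = expected_dag_phase(dependencies, outcomes)
print(final)   # Failed
print(phases)
```

Under this model the workflow should end as Failed with success2 never started, rather than staying in Running.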

Thanks @alexmt for the quick fix.

What is the best way to install your fix? Do we update the workflow-controller container to point at the latest tag here https://hub.docker.com/r/argoproj/workflow-controller/tags ?

I can easily recreate this issue, and just cause general chaos, with this test:

apiVersion:             argoproj.io/v1alpha1
kind:                   Workflow
metadata:
  generateName:         testing-workflow-
spec:
  entrypoint:           testing-workflow
  templates:
  - name:               testing-workflow
    steps:
    - - name:           testing-steps
        template:       testing-steps
        withItems:
            - "a"
            - "b"
            - "c"
            - "d"
            - "e"
            - "f"
            - "g"
            - "h"
            - "i"
            - "j"
            - "k"
            - "l"
            - "m"
            - "n"
            - "o"
            - "p"
            - "r"
            - "s"
            - "t"
            - "u"
            - "v"
            - "w"
            - "x"
            - "y"
            - "z"
  - name:               testing-steps
    dag:
      tasks:
      - name:           run-testing1
        template:       run-testing
      - name:           run-testing2
        template:       run-testing
        dependencies:   [run-testing1]
  - name:               run-testing
    container:
      image:            frolvlad/alpine-bash
      imagePullPolicy:  Always
      command:          [bash, -c]
      args:             ["if (( RANDOM % 2 )); then echo 'fail'; exit 1; else echo 'success'; fi"]
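A back-of-the-envelope calculation suggests why this test fails on practically every run. It assumes each run-testing container fails independently with probability 1/2 (bash's RANDOM is uniform over 0..32767, so RANDOM % 2 is nonzero half the time) and that run-testing2 only starts after run-testing1 succeeds. The variable names are illustrative.

```python
from fractions import Fraction

P_TASK_OK = Fraction(1, 2)  # assumption: RANDOM % 2 == 0 half of the time
TASKS_PER_ITEM = 2          # run-testing1, then run-testing2
ITEMS = 25                  # entries in withItems ("a" to "z", without "q")

# One expanded testing-steps DAG succeeds only if both of its tasks succeed.
p_item_ok = P_TASK_OK ** TASKS_PER_ITEM

# The whole workflow succeeds only if every expanded item succeeds.
p_workflow_ok = p_item_ok ** ITEMS

print(p_item_ok)             # 1/4
print(float(p_workflow_ok))  # on the order of 1e-15
```

So nearly every submission contains several failing sub-DAGs, which makes this a convenient stress test for the failure-handling path.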

@andreweskeclarke I have the same problem with my DAG workflow so I did this:

kubectl set image deployment/workflow-controller workflow-controller=argoproj/workflow-controller:latest --namespace=argo

It is not ideal. But until the new version is released, this seems to work.

I'm trying to use @illagrenan's suggestion to pull the latest tag and use that, but the latest tag still fails with this issue. The version I'm currently using is "Workflow Controller (version: v2.3.0+2b0b8f1.dirty)".

Has anyone else had success with this fix?

@alexmt I built the image from the master branch, but with this test case:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-dags-
spec:
  entrypoint: retry-with-dags
  templates:
  - name: retry-with-dags

    dag:
      tasks:
      - name: success1
        template: success

      - name: sub-dag1
        template: sub-dag
        dependencies:
        - success1

      - name: success2
        dependencies:
        - sub-dag1
        template: success

  - name: sub-dag
    dag:
      tasks:
      - name: fail
        template: fail

  - container:
      args:
      - import random; import sys; exit_code = 1; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: fail

  - container:
      args:
      - import random; import sys; exit_code = 0; sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
    name: success
    retryStrategy:
      limit: 3

The workflow always stays running forever:

[root@iZ8vbha5mb49ipi1114n6bZ dag-hang]# ags get retry-with-dags-z5kdr
Name:                retry-with-dags-z5kdr
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Mon Mar 25 17:46:53 +0800 (3 minutes ago)
Started:             Mon Mar 25 17:46:53 +0800 (3 minutes ago)
Duration:            3 minutes 24 seconds

STEP                      PODNAME                           DURATION  MESSAGE
 โ— retry-with-dags-z5kdr
 โ”œ-โœ” success1(0)          retry-with-dags-z5kdr-563744608   8s
 โ””-โœ– sub-dag1
   โ””-โœ– fail               retry-with-dags-z5kdr-1558755693  9s        failed with exit code 1

I'm having the same issue with DAGs, and I can easily replicate it with the above script using v2.3.0-rc1. This is a pretty urgent bug, as errors will go unreported if the jobs do not properly fail.

@alexmt Please take a look.

cc @sarabala1979

I will try to look into it tomorrow and hopefully include the fix in 2.3.

@alexmt Any update on this?

@alexmt is looking into this. It will be fixed soon.

I have a fix for this.
