What happened/what you expected to happen?
A workflow DAG step hangs when the DAG has a task that fans out over a list of scheduled jobs, no jobs are scheduled (so the list is empty), and a subsequent task depends on that task.
Please note that this is different from the previously closed issue "Dag workflow will stay running forever in some long run situation". I tested the examples in that issue and they still work as expected in version 2.9.5.
This issue is reproducible in v2.9.5, but works as expected in v2.8.2.
What version of Argo Workflows are you running? v2.9.5
Paste the workflow here, including status:
kubectl get wf -o yaml ${workflow}
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2020-08-11T18:19:28Z"
  generateName: dag-hang-
  generation: 9
  labels:
    workflows.argoproj.io/phase: Running
  name: dag-hang-pcwmr
  namespace: argo
  resourceVersion: "246946368"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflows/dag-hang-pcwmr
  uid: 7a6b50fa-afdd-4843-b226-54170ea849fa
spec:
  arguments: {}
  entrypoint: dag
  templates:
  - arguments: {}
    dag:
      tasks:
      - arguments: {}
        name: scheduler
        template: job-scheduler
      - arguments:
          parameters:
          - name: job_name
            value: '{{item.job_name}}'
        dependencies:
        - scheduler
        name: children
        template: whalesay
        withParam: '{{tasks.scheduler.outputs.parameters.scheduled-jobs}}'
      - arguments: {}
        dependencies:
        - children
        name: postprocess
        template: whalesay
    inputs: {}
    metadata: {}
    name: dag
    outputs: {}
  - arguments: {}
    container:
      args:
      - echo Decided not to schedule any jobs
      command:
      - sh
      - -c
      image: alpine:latest
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: job-scheduler
    outputs:
      parameters:
      - name: scheduled-jobs
        value: '[]'
  - arguments: {}
    container:
      args:
      - hello world
      command:
      - cowsay
      image: docker/whalesay
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: whalesay
    outputs: {}
status:
  finishedAt: null
  nodes:
    dag-hang-pcwmr:
      children:
      - dag-hang-pcwmr-1789179473
      displayName: dag-hang-pcwmr
      finishedAt: null
      id: dag-hang-pcwmr
      name: dag-hang-pcwmr
      phase: Running
      startedAt: "2020-08-11T18:19:28Z"
      templateName: dag
      templateScope: local/dag-hang-pcwmr
      type: DAG
    dag-hang-pcwmr-1415348083:
      boundaryID: dag-hang-pcwmr
      displayName: postprocess
      finishedAt: "2020-08-11T18:19:40Z"
      hostNodeName: dech117
      id: dag-hang-pcwmr-1415348083
      name: dag-hang-pcwmr.postprocess
      outputs:
        exitCode: "0"
      phase: Succeeded
      resourcesDuration:
        cpu: 3
        memory: 3
      startedAt: "2020-08-11T18:19:36Z"
      templateName: whalesay
      templateScope: local/dag-hang-pcwmr
      type: Pod
    dag-hang-pcwmr-1738428083:
      boundaryID: dag-hang-pcwmr
      children:
      - dag-hang-pcwmr-1415348083
      displayName: children
      finishedAt: "2020-08-11T18:19:35Z"
      id: dag-hang-pcwmr-1738428083
      name: dag-hang-pcwmr.children
      phase: Succeeded
      startedAt: "2020-08-11T18:19:35Z"
      templateName: whalesay
      templateScope: local/dag-hang-pcwmr
      type: TaskGroup
    dag-hang-pcwmr-1789179473:
      boundaryID: dag-hang-pcwmr
      children:
      - dag-hang-pcwmr-1738428083
      displayName: scheduler
      finishedAt: "2020-08-11T18:19:33Z"
      hostNodeName: dech113
      id: dag-hang-pcwmr-1789179473
      name: dag-hang-pcwmr.scheduler
      outputs:
        exitCode: "0"
        parameters:
        - name: scheduled-jobs
          value: '[]'
      phase: Succeeded
      resourcesDuration:
        cpu: 4
        memory: 4
      startedAt: "2020-08-11T18:19:28Z"
      templateName: job-scheduler
      templateScope: local/dag-hang-pcwmr
      type: Pod
  phase: Running
  startedAt: "2020-08-11T18:19:28Z"
Paste the logs from the workflow controller:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name) | grep ${workflow}
time="2020-08-11T18:19:28Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Updated phase -> Running" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="DAG node dag-hang-pcwmr initialized Running" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="All of node dag-hang-pcwmr.scheduler dependencies [] completed" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Pod node dag-hang-pcwmr-1789179473 initialized Pending" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Created pod: dag-hang-pcwmr.scheduler (dag-hang-pcwmr-1789179473)" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946246 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:29Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1789179473
time="2020-08-11T18:19:29Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:29Z" level=info msg="Updating node dag-hang-pcwmr-1789179473 message: ContainerCreating"
time="2020-08-11T18:19:29Z" level=info msg="Skipped pod dag-hang-pcwmr.scheduler (dag-hang-pcwmr-1789179473) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:29Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946257 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:30Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:30Z" level=info msg="Skipped pod dag-hang-pcwmr.scheduler (dag-hang-pcwmr-1789179473) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:33Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1789179473
time="2020-08-11T18:19:34Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:34Z" level=info msg="Setting node dag-hang-pcwmr-1789179473 outputs"
time="2020-08-11T18:19:34Z" level=info msg="Updating node dag-hang-pcwmr-1789179473 status Pending -> Running"
time="2020-08-11T18:19:34Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946296 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="Updating node dag-hang-pcwmr-1789179473 status Running -> Succeeded"
time="2020-08-11T18:19:35Z" level=info msg="TaskGroup node dag-hang-pcwmr-1738428083 initialized Running (message: )" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="node dag-hang-pcwmr-1738428083 phase Running -> Succeeded" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="node dag-hang-pcwmr-1738428083 finished: 2020-08-11 18:19:35.424262116 +0000 UTC" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946305 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="All of node dag-hang-pcwmr.postprocess dependencies [children] completed" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Pod node dag-hang-pcwmr-1415348083 initialized Pending" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Labeled pod argo/dag-hang-pcwmr-1789179473 completed"
time="2020-08-11T18:19:36Z" level=info msg="Created pod: dag-hang-pcwmr.postprocess (dag-hang-pcwmr-1415348083)" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946318 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:37Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1415348083
time="2020-08-11T18:19:37Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:37Z" level=info msg="Labeled pod argo/dag-hang-pcwmr-1789179473 completed"
time="2020-08-11T18:19:37Z" level=info msg="Updating node dag-hang-pcwmr-1415348083 message: ContainerCreating"
time="2020-08-11T18:19:37Z" level=info msg="Skipped pod dag-hang-pcwmr.postprocess (dag-hang-pcwmr-1415348083) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:37Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946328 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:38Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:38Z" level=info msg="Skipped pod dag-hang-pcwmr.postprocess (dag-hang-pcwmr-1415348083) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:40Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1415348083
time="2020-08-11T18:19:41Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:41Z" level=info msg="Setting node dag-hang-pcwmr-1415348083 outputs"
time="2020-08-11T18:19:41Z" level=info msg="Updating node dag-hang-pcwmr-1415348083 status Pending -> Running"
time="2020-08-11T18:19:41Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946358 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:42Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:42Z" level=info msg="Updating node dag-hang-pcwmr-1415348083 status Running -> Succeeded"
time="2020-08-11T18:19:42Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946368 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:43Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:43Z" level=info msg="Labeled pod argo/dag-hang-pcwmr-1415348083 completed"
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
@simster7 created a fix, which was released in v2.9.5.
Can you please run the controller with the environment variable ALL_POD_CHANGES_SIGNIFICANT=true and see if that addresses this issue?
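If it helps, here is one way to set that, as a sketch assuming the controller runs as the standard workflow-controller Deployment in the argo namespace (adjust the name/namespace to your install):

# Assumption: controller deployed as "workflow-controller" in namespace "argo"
kubectl set env deployment/workflow-controller -n argo ALL_POD_CHANGES_SIGNIFICANT=true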
I'll investigate this today
Confirmed this as a bug, working on a fix
Thank you @simster7
Note: this bug is a very unusual edge case. For this bug to happen, all of these conditions must be satisfied:
1. A DAG task must use parameter expansion (withItems, withParams, etc.)
2. The expansion must be empty (e.g. withParams: "")
3. A subsequent task must depend on the task with the empty expansion

Just happened to run into the same bug, where my generator script is returning an empty JSON payload. Here's the YAML I used for reproducing it:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: reproduce-bug-
spec:
  entrypoint: start
  templates:
  - name: start
    dag:
      tasks:
      - name: first-step
        template: first-step
      - name: second-step
        dependencies: [first-step]
        template: second-step
        arguments:
          parameters:
          - name: chunk
            value: "{{item}}"
        withParam: "{{tasks.first-step.outputs.result}}"
      - name: third-step
        dependencies: [second-step]
        template: third-step
  - name: first-step
    script:
      image: python:3
      command: [bash]
      source: |
        set -e
        python3 -c 'import os, json; print(json.dumps([]))'
      resources:
        requests:
          cpu: 250m
  - name: second-step
    inputs:
      parameters:
      - name: chunk
    script:
      image: python:3
      command: [bash]
      source: |
        set -e
        echo hi {{inputs.parameters.chunk}}
      resources:
        requests:
          cpu: 100m
  - name: third-step
    script:
      image: busybox
      command: [sh]
      source: |
        echo fin
      resources:
        requests:
          cpu: 100m
I return empty JSON on purpose: my real workflow has a finish-marker detection step, and I want to skip the whole DAG if the finish marker file is already in place in cloud storage. A JSON result is required even when the when condition doesn't pass, and it turned out that returning empty JSON makes the DAG hang forever. So I changed it to return a dummy value and skip it in the downstream task with a when condition as well.
I guess it might make sense to have a feature that skips the whole DAG without requiring a JSON result at all. Anyway, I just wanted to add a case where people may want to return empty JSON on purpose.
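For clarity, here's a minimal sketch of that workaround (all template and parameter names here are made up for illustration, not from my real workflow): the generator always emits at least a dummy entry so the expansion is never empty, and the fan-out task skips the dummy with a when condition. Since a Skipped dependency still satisfies dependencies, the downstream task runs and the DAG can finish.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dummy-entry-workaround-   # hypothetical name
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: generator
        template: generator
      - name: children
        dependencies: [generator]
        template: worker
        # {{item}} is substituted during expansion, so the dummy entry is skipped
        when: "{{item.job_name}} != dummy"
        withParam: "{{tasks.generator.outputs.result}}"
        arguments:
          parameters:
          - name: job_name
            value: "{{item.job_name}}"
      - name: postprocess
        dependencies: [children]
        template: postprocess
  - name: generator
    script:
      image: python:3
      command: [python]
      source: |
        import json
        # Emit a dummy placeholder instead of an empty list
        print(json.dumps([{"job_name": "dummy"}]))
  - name: worker
    inputs:
      parameters:
      - name: job_name
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo processing {{inputs.parameters.job_name}}"]
  - name: postprocess
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo done"]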
I'm using v2.11.7 and running into the same issue again.
All DAGs keep running indefinitely and hang.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
Can you please try :latest?
@tarunpantqb I just ran the sample workflow you posted with no issues. I also see that it's one of our examples, which means that if there was a bug that prevented it from running successfully, our testing infrastructure would have caught it already.
Lastly, as you can see in https://github.com/argoproj/argo/issues/3739#issuecomment-672286102, this issue is only relevant for DAGs that contain a specific edge case when using withItems, etc. The workflow you posted does not use them.
Can you please ensure that you don't have an environment or similar issue that's preventing you from running workflows? If you are sure that it's not an issue on your end, feel free to open a new issue with a detailed bug report.