Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
Using a retryStrategy on a container template causes the retry step to fail with the error `inputs.parameters.image was not supplied`, even though the resulting child pod succeeds.

What you expected to happen:
The retry step should succeed if the child pod succeeds within the retry limit.
How to reproduce it (as minimally and precisely as possible):
Workflow: wf-strip.yaml.txt
Anything else we need to know?:
Removing the last two lines from the attached workflow (i.e. removing the retryStrategy) makes the workflow succeed. The same workflow also runs without issue on Argo 2.3.0. A sketch of the workflow's shape is shown below.
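For reference, a minimal workflow of the same shape (a sketch reconstructed from the node names in the status dump below, not the attached wf-strip.yaml.txt) is roughly the following: a DAG task passes an `image` parameter to a container template that also declares a retryStrategy, and the last two lines are the retryStrategy in question.

```yaml
# Hypothetical minimal reproduction, not the attached file.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: argo24-retry-
spec:
  entrypoint: dag
  templates:
  - name: dag
    dag:
      tasks:
      - name: task1
        template: container1
        arguments:
          parameters:
          - name: image
            value: python:alpine3.6
  - name: container1
    inputs:
      parameters:
      - name: image
    container:
      image: "{{inputs.parameters.image}}"
      command: [python, -c]
      args: ["print('ok')"]
    # Removing these last two lines makes the workflow succeed.
    retryStrategy:
      limit: 2
```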
Environment:
clientVersion:
  buildDate: "2019-08-19T11:13:49Z"
  compiler: gc
  gitCommit: 96fac5cd13a5dc064f7d9f4f23030a6aeface6cc
  gitTreeState: clean
  gitVersion: v1.14.6
  goVersion: go1.12.9
  major: "1"
  minor: "14"
  platform: windows/amd64
serverVersion:
  buildDate: "2019-02-28T13:30:26Z"
  compiler: gc
  gitCommit: c27b913fddd1a6c480c229191a087698aa92f0b1
  gitTreeState: clean
  gitVersion: v1.13.4
  goVersion: go1.11.5
  major: "1"
  minor: "13"
  platform: linux/amd64
Other debugging information (if applicable):
Nodes:
  Argo 24 - Retry:
    Children:
      argo24-retry-2505001171
    Display Name:   argo24-retry
    Finished At:    2019-10-08T15:22:46Z
    Id:             argo24-retry
    Name:           argo24-retry
    Phase:          Error
    Started At:     2019-10-08T15:22:41Z
    Template Name:  dag
    Type:           DAG
  Argo 24 - Retry - 1352431966:
    Boundary ID:    argo24-retry
    Display Name:   task1(0)
    Finished At:    2019-10-08T15:22:45Z
    Id:             argo24-retry-1352431966
    Inputs:
      Parameters:
        Name:   image
        Value:  python:alpine3.6
    Name:           argo24-retry.task1(0)
    Phase:          Succeeded
    Started At:     2019-10-08T15:22:41Z
    Template Name:  container1
    Type:           Pod
  Argo 24 - Retry - 2505001171:
    Boundary ID:    argo24-retry
    Children:
      argo24-retry-1352431966
    Display Name:   task1
    Finished At:    2019-10-08T15:22:42Z
    Id:             argo24-retry-2505001171
    Message:        inputs.parameters.image was not supplied
    Name:           argo24-retry.task1
    Phase:          Error
    Started At:     2019-10-08T15:22:41Z
    Template Name:  container1
    Type:           Retry
Phase:  Error
time="2019-10-08T15:22:41Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Updated phase -> Running" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="DAG node argo24-retry (argo24-retry) initialized Pending" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="node argo24-retry (argo24-retry) phase Pending -> Running" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="All of node argo24-retry.task1 dependencies [] completed" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Retry node argo24-retry.task1 (argo24-retry-2505001171) initialized Running" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Pod node argo24-retry.task1(0) (argo24-retry-1352431966) initialized Pending" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Created pod: argo24-retry.task1(0) (argo24-retry-1352431966)" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="Updating node argo24-retry.task1(0) (argo24-retry-1352431966) message: ContainerCreating"
time="2019-10-08T15:22:42Z" level=info msg="node argo24-retry.task1 (argo24-retry-2505001171) phase Running -> Error" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="node argo24-retry.task1 (argo24-retry-2505001171) message: inputs.parameters.image was not supplied" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="node argo24-retry.task1 (argo24-retry-2505001171) finished: 2019-10-08 15:22:42.510629459 +0000 UTC" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:43Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:43Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:45Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:45Z" level=info msg="Updating node argo24-retry.task1(0) (argo24-retry-1352431966) status Pending -> Running"
time="2019-10-08T15:22:45Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Updating node argo24-retry.task1(0) (argo24-retry-1352431966) status Running -> Succeeded"
time="2019-10-08T15:22:46Z" level=info msg="node argo24-retry (argo24-retry) phase Running -> Error" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="node argo24-retry (argo24-retry) finished: 2019-10-08 15:22:46.806242053 +0000 UTC" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Checking daemoned children of argo24-retry" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Updated phase Running -> Error" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Marking workflow completed" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Checking daemoned children of " namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:47Z" level=info msg="Labeled pod dev-personal/argo24-retry-1352431966 completed"
After upgrading to 2.4.1 in our dev cluster we are also seeing a very similar regression. The difference in our case is that the workflow doesn't actually stop with an Error; it gets stuck:
Status:
  Finished At:  <nil>
  Nodes:
    Example - Pipeline - Hbnbf:
      Children:
        example-pipeline-hbnbf-1001031843
        example-pipeline-hbnbf-1754353090
      Display Name:   example-pipeline-hbnbf
      Finished At:    <nil>
      Id:             example-pipeline-hbnbf
      Name:           example-pipeline-hbnbf
      Phase:          Running   <<<<--- Stuck
      Started At:     2019-10-09T21:32:35Z
      Template Name:  example-pipeline
      Type:           DAG
    Example - Pipeline - Hbnbf - 1001031843:
      Boundary ID:    example-pipeline-hbnbf
      Children:
        example-pipeline-hbnbf-1450750190
      Display Name:   launch-cluster
      Finished At:    2019-10-09T21:32:36Z
      Id:             example-pipeline-hbnbf-1001031843
      Message:        inputs.parameters.flavor was not supplied   <<<<----- Same
      Name:           example-pipeline-hbnbf.launch-cluster
      Phase:          Error
      Started At:     2019-10-09T21:32:35Z
      Template Name:  create-cluster
      Type:           Retry
    Example - Pipeline - Hbnbf - 1450750190:
      Boundary ID:    example-pipeline-hbnbf
      Display Name:   launch-cluster(0)
      Finished At:    2019-10-09T21:40:43Z
      Id:             example-pipeline-hbnbf-1450750190
      Inputs:
        Parameters:
          Name:   flavor
          Value:  ds-dev
          Name:   role-arn
          Value:  arn:aws:iam::111111111111111:role/xxx
          Name:   cluster-name
          Value:  ExampleCluster
          Name:   app-bundle-name
          Value:  example-pipeline
argo -n argo get example-pipeline-hbnbf
Name:             example-pipeline-hbnbf
Namespace:        argo
ServiceAccount:   argo
Status:           Running
Created:          Thu Oct 10 08:32:35 +1100 (2 hours ago)
Started:          Thu Oct 10 08:32:35 +1100 (2 hours ago)
Duration:         2 hours 7 minutes
Parameters:
  dsFlavor:         ds-dev
  roleArn:          arn:aws:iam::11111111111111:role/xxx
  artifactVersion:  add6ce40
  clustersConfig:   { .... }

STEP                                       PODNAME                            DURATION  MESSAGE
 ● example-pipeline-hbnbf (example-pipeline)
 ├-✔ launch-cluster(0) (create-cluster)    example-pipeline-hbnbf-1450750190  8m
 └-✔ notify-started (notify-started)       example-pipeline-hbnbf-1754353090  9s
_Further info:_
Every workflow using a retryStrategy that got scheduled in our dev cluster got stuck in Running, so the behaviour is consistent. Once stuck, the workflow does not respond to `argo terminate`.
Can you provide the `init` and `wait` container logs? You can use `kubectl logs`.
For which pod?
In my example above, both pods associated with the succeeded steps (launch-cluster(0), notify-started) complete normally. I don't think there is a pod associated with the parent of launch-cluster(0), which is the component that is in the Error state.
I am able to reproduce in my dev environment. I will work on the fix. Thanks for finding it.