What happened:
When using retryStrategy with activeDeadlineSeconds the pod that got terminated as a result of timeout gets retried.
What you expected to happen:
The pod should not be retried when the termination is due to timeouts.
How to reproduce it (as minimally and precisely as possible):
Submit the following workflow.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeouts-
spec:
  entrypoint: step-with-retries
  templates:
  - name: step-with-retries
    activeDeadlineSeconds: 10
    retryStrategy:
      limit: 5
      retryPolicy: "Always"
    container:
      image: alpine
      command: [sh, -c]
      args:
      - |
        sleep 30
The result of the workflow should be something like
Name: retry-timeouts-mzcw8
Namespace: default
ServiceAccount: default
Status: Failed
Message: No more retries left
Conditions:
Completed True
Created: Wed Jun 17 16:47:19 +0530 (22 seconds ago)
Started: Wed Jun 17 16:47:19 +0530 (22 seconds ago)
Finished: Wed Jun 17 16:48:19 +0530 (37 seconds from now)
Duration: 1 minute 0 seconds
STEP                           TEMPLATE           PODNAME                          DURATION  MESSAGE
 ✖ retry-timeouts-mzcw8        step-with-retries                                            No more retries left
 ├-✖ retry-timeouts-mzcw8(0)   step-with-retries  retry-timeouts-mzcw8-3754190965  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(1)   step-with-retries  retry-timeouts-mzcw8-1405574568  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(2)   step-with-retries  retry-timeouts-mzcw8-3552565347  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(3)   step-with-retries  retry-timeouts-mzcw8-3082939110  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(4)   step-with-retries  retry-timeouts-mzcw8-3083777753  10s       Pod was active on the node longer than the specified deadline
 └-✖ retry-timeouts-mzcw8(5)   step-with-retries  retry-timeouts-mzcw8-3687919132  10s       Pod was active on the node longer than the specified deadline
Anything else we need to know?:
Environment:
$ argo version
argo: v2.8.0
BuildDate: 2020-05-11T22:55:16Z
GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
GitTreeState: clean
GitTag: v2.8.0
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
$ kubectl version -o yaml
clientVersion:
  buildDate: "2020-04-23T22:11:11Z"
  compiler: gc
  gitCommit: 52c56ce7a8272c798dbc29846288d7cd9fbae032
  gitTreeState: archive
  gitVersion: v1.18.2
  goVersion: go1.14.2
  major: "1"
  minor: "18"
  platform: linux/amd64
serverVersion:
  buildDate: "2020-04-06T16:33:17Z"
  compiler: gc
  gitCommit: 34a615f32e9a0c9e97cdb9f749adb392758349a6
  gitTreeState: clean
  gitVersion: v1.14.10-gke.36
  goVersion: go1.12.12b4
  major: "1"
  minor: 14+
  platform: linux/amd64
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
> The pod should not be retried when the termination is due to timeouts.
Hmmm... Could you elaborate more on why you think so? Maybe include your use case?
I would actually argue the opposite. activeDeadlineSeconds is there to provide a maximum limit for a task to complete, if a pod doesn't complete its task, then it failed and must be retried.
Say I have a low-priority task that should run on preemptible/spot instances. I want to use the retryStrategy to retry preempted tasks. I generally expect this task to take no more than x hours; if it is not completed in x hours, I want to mark it as failed and not spend any more compute resources on it.
I tried doing something like this as well.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeouts-
spec:
  entrypoint: timeout-task
  templates:
  - name: timeout-task
    activeDeadlineSeconds: 10
    steps:
    - - name: retry-task
        template: step-with-retries
  - name: step-with-retries
    retryStrategy:
      limit: 5
      retryPolicy: "Always"
    container:
      image: alpine
      command: [sh, -c]
      args:
      - |
        sleep 30
        exit 1
This results in an error saying
Failed to submit workflow: templates.timeout-task.activeDeadlineSeconds is only valid for leaf templates
So I'm not sure how exactly to make this happen.
How about backoff.maxDuration?
https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml
activeDeadlineSeconds is meant to apply to a single instance. backoff.maxDuration applies to the retryStrategy. You will need to have both activeDeadlineSeconds (to make sure that a long-running pod is stopped) and backoff.maxDuration (to make sure another try is not started) set to the same value.
There is this issue now:
Workflow yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeouts-
spec:
  entrypoint: step-with-retries
  templates:
  - name: step-with-retries
    activeDeadlineSeconds: 40
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
      backoff:
        duration: "0s"
        factor: 1
        maxDuration: "40s"
    container:
      image: alpine
      command: [sh, -c]
      args:
      - |
        sleep 50
        exit 1
Results in
Name: retry-timeouts-9g62v
Namespace: default
ServiceAccount: default
Status: Failed
Message: Max duration limit exceeded
Conditions:
Completed True
Created: Wed Jun 17 23:19:50 +0530 (2 minutes ago)
Started: Wed Jun 17 23:19:50 +0530 (2 minutes ago)
Finished: Wed Jun 17 23:21:10 +0530 (1 minute ago)
Duration: 1 minute 20 seconds
STEP                           TEMPLATE           PODNAME                          DURATION  MESSAGE
 ✖ retry-timeouts-9g62v        step-with-retries                                            Max duration limit exceeded
 ├-✖ retry-timeouts-9g62v(0)   step-with-retries  retry-timeouts-9g62v-2256888070  40s       Pod was active on the node longer than the specified deadline
 └-✖ retry-timeouts-9g62v(1)   step-with-retries  retry-timeouts-9g62v-578979075   40s       Pod was active on the node longer than the specified deadline
The second retry is not expected.
I tried decreasing backoff.maxDuration by a few seconds, but the same thing keeps happening. Am I doing something wrong? According to this, it's not supposed to work this way, right?
EDIT: Looks like this has been fixed in v2.8.2
@simster7 Should I close this issue and create an enhancement one? Using backoff.maxDuration is useful. But for that to work the way I want it to, I'll have to start another task that monitors the time for which the retry node (not its children) has been active and kills the child pods when the retry node runs for longer than activeDeadlineSeconds. Or am I missing something?
> I'll have to start another task that monitors the time for which the retry node (not its children) has been active
The retryNode and its first child get created in the same operation of the Workflow, so both their start times should essentially be the same.
> and kills the child pods when the retry node runs for longer than activeDeadlineSeconds
I see what you mean here. If, for example, you have a maxDuration and activeDeadlineSeconds of 40m and the first node fails in 35m, the second node will start. However, since activeDeadlineSeconds only applies to the individual child node, the second node will not be failed after the remaining 5 minutes; it will continue for up to its own 40m, at which point no new attempts will be made since maxDuration will be exceeded.
> Should I close this issue and create an enhancement one?
I think a potential way to solve this issue is to add the start time of the retry node as an argo variable. Something like {{retries.startedAt}}. That way your individual containers could implement the logic that you're looking for and fail if the time is exceeded. What do you think? No need to create a new issue.
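For illustration, a minimal sketch of what that might look like. Note that {{retries.startedAt}} is only proposed at this point; both the variable and the assumption that it resolves to a Unix epoch timestamp are hypothetical:

- name: step-with-retries
  retryStrategy:
    limit: 5
    retryPolicy: "Always"
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      # Hypothetical: assumes the proposed {{retries.startedAt}} variable
      # exists and resolves to a Unix epoch timestamp.
      started={{retries.startedAt}}
      elapsed=$(( $(date +%s) - started ))
      # Fail fast if the retry group as a whole has exceeded its deadline,
      # instead of starting another full-length attempt.
      if [ "$elapsed" -gt 40 ]; then
        echo "retry group exceeded its 40s deadline, giving up" >&2
        exit 1
      fi
      sleep 30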
> Looks like this has been fixed in v2.8.2
Yes, sorry, I meant to say: the change that counts maxDuration from the beginning of the first node is only available in 2.8.2+.
> I think a potential way to solve this issue is to add the start time of the retry node as an argo variable. Something like {{retries.startedAt}}. That way your individual containers could implement the logic that you're looking for and fail if the time is exceeded.
Yes, this might work in most cases. But say the code that runs in the retry pod is something you can't or shouldn't modify (for example, a Docker image coming from a CI/CD pipeline). In that case, using {{retries.startedAt}} will require something like a sidecar to track the running time. Is handling this with a sidecar really a good approach?
In my case, I might have ~50 such retry nodes in my workflow. Would it be better to run a single pod/task (say, a daemon task) to track the running time of the retry nodes instead of having ~50 sidecars? (I'm not really sure what the right thing to do here is :sweat_smile:)
I might be biased here, but I think this is something the workflow manager should do rather than handling it at the task level. I'd like to know your thoughts on this, @simster7.
I have another idea: a flag in retryStrategy (such as maxDurationStrict: true) that automatically sets activeDeadlineSeconds to the time remaining in your maxDuration. Should be easy to implement and makes good sense. I'll get a PoC done today.
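For illustration, a sketch of what the spec from above might look like with such a flag. maxDurationStrict is the proposed field here, not something that exists yet:

- name: step-with-retries
  retryStrategy:
    limit: 10
    retryPolicy: "Always"
    # Proposed (hypothetical) flag: cap each attempt's activeDeadlineSeconds
    # to the time remaining in maxDuration, so no attempt outlives the window.
    maxDurationStrict: true
    backoff:
      duration: "0s"
      factor: 1
      maxDuration: "40s"
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      sleep 50
      exit 1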
That would be perfect! Thanks!