What happened:
When using retryStrategy with activeDeadlineSeconds the pod that got terminated as a result of timeout gets retried.
What you expected to happen:
The pod should not be retried when the termination is due to timeouts.
How to reproduce it (as minimally and precisely as possible):
Submit the following workflow.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeouts-
spec:
  entrypoint: step-with-retries
  templates:
  - name: step-with-retries
    activeDeadlineSeconds: 10
    retryStrategy:
      limit: 5
      retryPolicy: "Always"
    container:
      image: alpine
      command: [sh, -c]
      args:
      - |
        sleep 30
The result of the workflow should be something like
Name: retry-timeouts-mzcw8
Namespace: default
ServiceAccount: default
Status: Failed
Message: No more retries left
Conditions:
Completed True
Created: Wed Jun 17 16:47:19 +0530 (22 seconds ago)
Started: Wed Jun 17 16:47:19 +0530 (22 seconds ago)
Finished: Wed Jun 17 16:48:19 +0530 (37 seconds from now)
Duration: 1 minute 0 seconds
STEP                           TEMPLATE           PODNAME                          DURATION  MESSAGE
 ✖ retry-timeouts-mzcw8        step-with-retries                                            No more retries left
 ├-✖ retry-timeouts-mzcw8(0)   step-with-retries  retry-timeouts-mzcw8-3754190965  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(1)   step-with-retries  retry-timeouts-mzcw8-1405574568  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(2)   step-with-retries  retry-timeouts-mzcw8-3552565347  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(3)   step-with-retries  retry-timeouts-mzcw8-3082939110  10s       Pod was active on the node longer than the specified deadline
 ├-✖ retry-timeouts-mzcw8(4)   step-with-retries  retry-timeouts-mzcw8-3083777753  10s       Pod was active on the node longer than the specified deadline
 └-✖ retry-timeouts-mzcw8(5)   step-with-retries  retry-timeouts-mzcw8-3687919132  10s       Pod was active on the node longer than the specified deadline
Anything else we need to know?:
Environment:
$ argo version
argo: v2.8.0
BuildDate: 2020-05-11T22:55:16Z
GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
GitTreeState: clean
GitTag: v2.8.0
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
$ kubectl version -o yaml
clientVersion:
  buildDate: "2020-04-23T22:11:11Z"
  compiler: gc
  gitCommit: 52c56ce7a8272c798dbc29846288d7cd9fbae032
  gitTreeState: archive
  gitVersion: v1.18.2
  goVersion: go1.14.2
  major: "1"
  minor: "18"
  platform: linux/amd64
serverVersion:
  buildDate: "2020-04-06T16:33:17Z"
  compiler: gc
  gitCommit: 34a615f32e9a0c9e97cdb9f749adb392758349a6
  gitTreeState: clean
  gitVersion: v1.14.10-gke.36
  goVersion: go1.12.12b4
  major: "1"
  minor: 14+
  platform: linux/amd64
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
> The pod should not be retried when the termination is due to timeouts.
Hmmm... Could you elaborate more on why you think so? Maybe include your use case?
I would actually argue the opposite. activeDeadlineSeconds is there to provide a maximum limit for a task to complete, if a pod doesn't complete its task, then it failed and must be retried.
Say I have a low-priority task that should run on preemptible/spot instances. I want to use the retryStrategy to retry preempted tasks. I generally expect this task to take no more than x hours; if it is not completed in x hours, I want to mark it as failed and not spend any more compute resources on it.
I tried doing something like this as well.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeouts-
spec:
  entrypoint: timeout-task
  templates:
  - name: timeout-task
    activeDeadlineSeconds: 10
    steps:
    - - name: retry-task
        template: step-with-retries
  - name: step-with-retries
    retryStrategy:
      limit: 5
      retryPolicy: "Always"
    container:
      image: alpine
      command: [sh, -c]
      args:
      - |
        sleep 30
        exit 1
This results in an error saying
Failed to submit workflow: templates.timeout-task.activeDeadlineSeconds is only valid for leaf templates
So I'm not sure how exactly to make this happen.
How about backoff.maxDuration?
https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml
activeDeadlineSeconds is meant to apply to a single instance. backoff.maxDuration applies to the retryStrategy. You will need to have both activeDeadlineSeconds (to make sure that a long-running pod is stopped) and backoff.maxDuration (to make sure another try is not started) set to the same value.
There is this issue now:
Workflow yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeouts-
spec:
  entrypoint: step-with-retries
  templates:
  - name: step-with-retries
    activeDeadlineSeconds: 40
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
      backoff:
        duration: "0s"
        factor: 1
        maxDuration: "40s"
    container:
      image: alpine
      command: [sh, -c]
      args:
      - |
        sleep 50
        exit 1
Results in
Name: retry-timeouts-9g62v
Namespace: default
ServiceAccount: default
Status: Failed
Message: Max duration limit exceeded
Conditions:
Completed True
Created: Wed Jun 17 23:19:50 +0530 (2 minutes ago)
Started: Wed Jun 17 23:19:50 +0530 (2 minutes ago)
Finished: Wed Jun 17 23:21:10 +0530 (1 minute ago)
Duration: 1 minute 20 seconds
STEP                           TEMPLATE           PODNAME                          DURATION  MESSAGE
 ✖ retry-timeouts-9g62v        step-with-retries                                            Max duration limit exceeded
 ├-✖ retry-timeouts-9g62v(0)   step-with-retries  retry-timeouts-9g62v-2256888070  40s       Pod was active on the node longer than the specified deadline
 └-✖ retry-timeouts-9g62v(1)   step-with-retries  retry-timeouts-9g62v-578979075   40s       Pod was active on the node longer than the specified deadline
The second retry is not expected.
I tried decreasing backoff.maxDuration by a few seconds, but the same thing keeps happening. Am I doing something wrong? According to this, it's not supposed to work this way, right?
EDIT: Looks like this has been fixed in v2.8.2
@simster7 Should I close this issue and create an enhancement one? Using backoff.maxDuration is useful. But for that to work the way I want it to, I'll have to start another task that monitors the time for which the retry node (not its children) has been active and kills the child pods when the retry node runs for longer than activeDeadlineSeconds. Or am I missing something?
> I'll have to start another task that monitors the time for which the retry node (not its children) has been active
The retryNode and its first child get created in the same operation of the Workflow, so both their start times should essentially be the same.
> and kills the child pods when the retry node runs for longer than activeDeadlineSeconds
I see what you mean here. If, for example, you have a maxDuration and activeDeadlineSeconds of 40m and the first node fails in 35m, the second node will start. However, since activeDeadlineSeconds only applies to the individual child node, the second node will not be failed after the remaining 5 minutes; it will continue for up to its own 40m, at which point no new attempts will be made since maxDuration will be exceeded.
> Should I close this issue and create an enhancement one?
I think a potential way to solve this issue is to add the start time of the retry node as an argo variable. Something like {{retries.startedAt}}. That way your individual containers could implement the logic that you're looking for and fail if the time is exceeded. What do you think? No need to create a new issue.
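For illustration, a minimal sketch of what that might look like. Note that {{retries.startedAt}} is only proposed at this point; both the variable and the assumption that it resolves to a Unix epoch timestamp are hypothetical:

- name: step-with-retries
  retryStrategy:
    limit: 5
    retryPolicy: "Always"
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      # Hypothetical: assumes the proposed {{retries.startedAt}} variable
      # exists and resolves to a Unix epoch timestamp.
      started={{retries.startedAt}}
      elapsed=$(( $(date +%s) - started ))
      # Fail fast if the retry group as a whole has exceeded its deadline,
      # instead of starting another full-length attempt.
      if [ "$elapsed" -gt 40 ]; then
        echo "retry group exceeded its 40s deadline, giving up" >&2
        exit 1
      fi
      sleep 30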
> Looks like this has been fixed in v2.8.2
Yes, sorry, I meant to say: the change that counts maxDuration from the beginning of the first node is only available in 2.8.2+.
> I think a potential way to solve this issue is to add the start time of the retry node as an argo variable. Something like {{retries.startedAt}}. That way your individual containers could implement the logic that you're looking for and fail if the time is exceeded.
Yes, this might work in most cases. But say the code that runs in the retry pod is something you can't or shouldn't modify (for example, a Docker image coming from a CI/CD pipeline). In that case, using {{retries.startedAt}} will require something like a sidecar to track the running time. Is handling this with a sidecar really a good approach?
In my case, I might have ~50 such retry nodes in my workflow. Would it be better to run a single pod/task (say, a daemon task) to track the running time of the retry nodes instead of having ~50 sidecars? (I'm not really sure what the right thing to do here is :sweat_smile:)
I might be biased here, but I think this is something the workflow manager should do rather than handling it at the task level. I'd like to know your thoughts on this, @simster7.
I have another idea: a flag in retryStrategy (such as maxDurationStrict: true) that automatically sets activeDeadlineSeconds to the time remaining in your maxDuration. Should be easy to implement and makes good sense. I'll get a PoC done today.
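For illustration, a sketch of what the spec from above might look like with such a flag. maxDurationStrict is the proposed field here, not something that exists yet:

- name: step-with-retries
  retryStrategy:
    limit: 10
    retryPolicy: "Always"
    # Proposed (hypothetical) flag: cap each attempt's activeDeadlineSeconds
    # to the time remaining in maxDuration, so no attempt outlives the window.
    maxDurationStrict: true
    backoff:
      duration: "0s"
      factor: 1
      maxDuration: "40s"
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      sleep 50
      exit 1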
That would be perfect! Thanks!