Argo: Configurable pod / resource reclamation

Created on 20 Nov 2018 · 13 Comments · Source: argoproj/argo

Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST

What happened:
Our workflows require significant disk storage for each step (on the order of hundreds of GB). Because Argo does not delete pods or free resources while the workflow runs, the PVCs remain bound until the workflow is deleted after completion, so we cannot reclaim that storage until the whole workflow is done.

What you expected to happen:
There should be a podReclaimation setting on the workflow spec that controls whether resources are reclaimed as each step completes.

enhancement

waxmittmann

👍9

All 13 comments

I think "podReclaimation" could have several options:

Retain: always retain completed pods
Delete: delete a pod immediately when it completes
DeleteLater: delete completed pods when their workflow completes
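
The three options above can be read as a small decision rule. The following is only a sketch of that reading, not Argo's implementation: the type `PodReclaimPolicy` and the function `ShouldDeletePod` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// PodReclaimPolicy is a hypothetical type for the three proposed options.
type PodReclaimPolicy string

const (
	Retain      PodReclaimPolicy = "Retain"
	Delete      PodReclaimPolicy = "Delete"
	DeleteLater PodReclaimPolicy = "DeleteLater"
)

// ShouldDeletePod reports whether a pod may be deleted right now under the
// given policy. Unknown policy values are treated like Retain.
func ShouldDeletePod(p PodReclaimPolicy, podCompleted, workflowCompleted bool) bool {
	switch p {
	case Delete:
		// Delete as soon as the pod itself has completed.
		return podCompleted
	case DeleteLater:
		// Wait until the whole workflow has completed as well.
		return podCompleted && workflowCompleted
	default:
		// Retain (or anything unrecognized): never delete.
		return false
	}
}

func main() {
	// A completed pod in a workflow that is still running.
	fmt.Println(ShouldDeletePod(Delete, true, false))
	fmt.Println(ShouldDeletePod(DeleteLater, true, false))
	fmt.Println(ShouldDeletePod(Retain, true, false))
}
```

Under this reading, only Delete frees the pod (and whatever it holds) while the workflow is still running, which is the case the original request is about.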

@jessesuen @alexmt, inviting you two to discuss ^_^

jackywu on 18 Dec 2018

👍2

The pod reclamation/GC policy could be better expressed as a duration after the pod finishes, rather than as keywords. This would make it symmetrical with Job and Workflow GC. Also, I'm not sure podReclaimation is the right terminology, since we're not reusing pods, just deleting them. Here is what I propose:

Examples:

spec:
  podGarbageCollection:
    ttlSecondsAfterFinished: 0

spec:
  podGarbageCollection:
    ttlSecondsAfterFinished: 600

In the future, the status of the step or workflow might also be a consideration when deleting pods. For example, one may only want to delete successful steps, but keep around the failed ones for debugging. Something like:

(delete all successful pods 10 minutes after step was successful)

spec:
  podGarbageCollection:
    upon: PodSuccess
    ttlSecondsAfterFinished: 600

> DeleteLater: delete completed pod when its workflow is completed

This could be captured with the predicate. For example:

(delete all pods 10 minutes after the workflow was completed)

spec:
  podGarbageCollection:
    upon: WorkflowCompletion
    ttlSecondsAfterFinished: 600

jessesuen on 18 Dec 2018

Oh yeah, I also think podGarbageCollection is a more meaningful term, but I still don't understand whether ttlSecondsAfterFinished makes sense. Why not delete pods immediately upon any *Complete event or *Success event?

jackywu on 19 Dec 2018

@alexmt, the keyword "DeleteLater" means deleting pods once the workflow has finished executing, whether it succeeded or failed. Your question inspired me: more options should be supplied to make it more meaningful, following jessesuen's terminology, such as:

upon: WorkflowCompletion (this includes both "WorkflowSuccessful" and "WorkflowFail")
upon: WorkflowSuccessful
upon: WorkflowFail

jackywu on 19 Dec 2018

Considering this strategy again, and following YAML naming conventions, I think "podgcstrategy" is the better term, as follows:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit-
spec:
  entrypoint: parallelism-limit
  podgcstrategy: {strategy-options}
  parallelism: 2
  templates:
  - name: parallelism-limit
    steps:
    - - name: sleep
        template: sleep
        withItems:
        - arg1
        - arg2
        - arg3
        - arg4
  - name: sleep
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo failed; exit 1"]

I think there are four strategies, as follows:

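// PodGCStrategy names the event upon which a workflow's pods are deleted.
type PodGCStrategy string

const (
    PodGCUponPodCompleted      PodGCStrategy = "upon-pod-completed"      // pod finished, success or failure
    PodGCUponPodSucceeded      PodGCStrategy = "upon-pod-succeeded"      // pod finished successfully
    PodGCUponWorkflowCompleted PodGCStrategy = "upon-workflow-completed" // workflow finished, success or failure
    PodGCUponWorkflowSucceeded PodGCStrategy = "upon-workflow-succeeded" // workflow finished successfully
)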
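
For illustration, the four strategies could drive a decision like the one sketched here. This is not the code from the PR mentioned in this thread: the `Phase` type and the `ShouldGC` function are hypothetical, and the string values assume the spelling "upon-pod-completed".

```go
package main

import "fmt"

// PodGCStrategy and its constants follow the proposal in this comment.
type PodGCStrategy string

const (
	PodGCUponPodCompleted      PodGCStrategy = "upon-pod-completed"
	PodGCUponPodSucceeded      PodGCStrategy = "upon-pod-succeeded"
	PodGCUponWorkflowCompleted PodGCStrategy = "upon-workflow-completed"
	PodGCUponWorkflowSucceeded PodGCStrategy = "upon-workflow-succeeded"
)

// Phase is a simplified, hypothetical stand-in for pod and workflow phases.
type Phase string

const (
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// completed is true for both successful and failed terminal phases.
func (p Phase) completed() bool { return p == Succeeded || p == Failed }

// ShouldGC reports whether a pod should be deleted now, given the strategy
// and the current phases of the pod and of its workflow.
func ShouldGC(s PodGCStrategy, pod, workflow Phase) bool {
	if !pod.completed() {
		return false // never delete a pod that is still running
	}
	switch s {
	case PodGCUponPodCompleted:
		return true
	case PodGCUponPodSucceeded:
		return pod == Succeeded
	case PodGCUponWorkflowCompleted:
		return workflow.completed()
	case PodGCUponWorkflowSucceeded:
		return workflow == Succeeded
	default:
		return false // unknown strategy: keep the pod
	}
}

func main() {
	// A failed step, like the sleep template above, in a failed workflow.
	for _, s := range []PodGCStrategy{
		PodGCUponPodCompleted,
		PodGCUponPodSucceeded,
		PodGCUponWorkflowCompleted,
		PodGCUponWorkflowSucceeded,
	} {
		fmt.Printf("%s: %v\n", s, ShouldGC(s, Failed, Failed))
	}
}
```

With a failing step such as the one in the example workflow, the two "succeeded" strategies keep the pod around, which matches the debugging use case raised earlier in the thread.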

I have finished coding this. @jessesuen @alexmt, if you think it looks reasonable, I will submit a PR for review.

jackywu on 20 Dec 2018

👍1

> Oh yeah, I also think podGarbageCollection is a more meaningful term, but I still don't understand whether ttlSecondsAfterFinished makes sense. Why not delete pods immediately upon any *Complete event or *Success event?

It makes a lot of sense because it is very useful for forensic analysis. Sometimes people want to see what happened in the pods in a workflow.

tigerwings on 4 May 2019

> Sometimes people want to see what happened in the pods in a workflow

Perhaps an alternative solution is to _artifact_ the logs? Though, depending on how that is implemented, it may not appropriately capture the pod lifecycle / events.

MrSaints on 14 May 2019

Fixed

jessesuen on 15 Aug 2019

@jessesuen I assume this was fixed in https://github.com/argoproj/argo/pull/1234 - that should probably be mentioned in the changelog, right?

discordianfish on 26 Aug 2019

How do we do this now that 2.4 is released? It's not in the docs?

mildewey on 15 Jan 2020

I was looking for an example as well. Found it here: https://github.com/argoproj/argo/blob/c7e5cba14a835fbfd0aba88b99197675ce1f0c66/examples/pod-gc-strategy.yaml#L9-L16

alexmt on 14 Mar 2020

Is there a strategy for not deleting pods? Or for keeping a history?

ghostsquad on 16 Mar 2020

The default strategy is to not delete the pods; simply leave off the podGC section. Alternatively, you can persist the run data while cleaning up the pods if you enable persistence: https://github.com/argoproj/argo/blob/master/docs/offloading-large-workflows.md

mildewey on 17 Mar 2020

👍1
