Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
Our workflows require significant disk storage for each step (on the order of hundreds of GB). Argo does not delete pods or free resources during the workflow run, so the PVCs remain bound until the workflow is deleted after completion. As a result, we cannot reclaim that storage until the whole workflow is done.
What you expected to happen:
A podReclaimation setting on the workflow spec that controls whether resources are reclaimed as each step completes.
I think "podReclaimation" can have several options
@jessesuen @alexmt, I'd like to invite you two to discuss this. ^_^
The pod reclamation/GC policy could be better expressed as a duration after the pod finishes, rather than as keywords. This would make it symmetrical with Job and Workflow GC. Also, I'm not sure podReclaimation is the right terminology, since we're not reusing pods; we're just deleting them. Here is what I propose:
Examples:
spec:
  podGarbageCollection:
    ttlSecondsAfterFinished: 0

spec:
  podGarbageCollection:
    ttlSecondsAfterFinished: 600
In the future, the status of the step or workflow might also be a consideration when deleting pods. For example, one may only want to delete successful steps, but keep around the failed ones for debugging. Something like:
(delete all successful pods 10 minutes after step was successful)
spec:
  podGarbageCollection:
    upon: PodSuccess
    ttlSecondsAfterFinished: 600
DeleteLater: delete completed pods when their workflow is completed
This could be captured with the predicate. For example:
(delete all pods 10 minutes after the workflow was completed)
spec:
  podGarbageCollection:
    upon: WorkflowCompletion
    ttlSecondsAfterFinished: 600
Oh yeah, I also think podGarbageCollection is the more meaningful term, but I still don't understand whether ttlSecondsAfterFinished makes sense. Why not delete pods immediately upon any *Complete or *Success event?
@alexmt, the keyword "DeleteLater" means deleting pods once the workflow has finished, whether it succeeded or failed. Your question inspired me: more options should be supplied to make it more meaningful, following jessesuen's terminology, such as:
Having considered this strategy again, and following YAML naming conventions, I think "podgcstrategy" is the better term, as follows:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit-
spec:
  entrypoint: parallelism-limit
  podgcstrategy: {strategy-options}
  parallelism: 2
  templates:
  - name: parallelism-limit
    steps:
    - - name: sleep
        template: sleep
        withItems:
        - arg1
        - arg2
        - arg3
        - arg4
  - name: sleep
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo failed; exit 1"]
I think there are four strategies, as follows:
const (
	PodGCUponPodCompleted      PodGCStrategy = "upon-pod-completed"
	PodGCUponPodSucceeded      PodGCStrategy = "upon-pod-succeeded"
	PodGCUponWorkflowCompleted PodGCStrategy = "upon-workflow-completed"
	PodGCUponWorkflowSucceeded PodGCStrategy = "upon-workflow-succeeded"
)
I have finished the coding. @jessesuen @alexmt, if you think this looks reasonable, I will submit a PR for review.
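As a rough illustration of how the four proposed strategies might translate into a deletion decision, here is a self-contained Go sketch. The `phase` type and the `shouldDeletePod` function are hypothetical, and the meaning given to each strategy is one plausible reading of its name, not the behavior of the actual implementation.

```go
package main

import "fmt"

// PodGCStrategy and its values follow the constants proposed in the thread.
type PodGCStrategy string

const (
	PodGCUponPodCompleted      PodGCStrategy = "upon-pod-completed"
	PodGCUponPodSucceeded      PodGCStrategy = "upon-pod-succeeded"
	PodGCUponWorkflowCompleted PodGCStrategy = "upon-workflow-completed"
	PodGCUponWorkflowSucceeded PodGCStrategy = "upon-workflow-succeeded"
)

// phase is a hypothetical, simplified status shared by pods and workflows.
type phase string

const (
	running   phase = "Running"
	succeeded phase = "Succeeded"
	failed    phase = "Failed"
)

// shouldDeletePod sketches one reading of each strategy (an assumption):
// "completed" means finished with any result, "succeeded" means finished successfully.
func shouldDeletePod(s PodGCStrategy, pod, wf phase) bool {
	podDone := pod == succeeded || pod == failed
	wfDone := wf == succeeded || wf == failed
	switch s {
	case PodGCUponPodCompleted:
		return podDone
	case PodGCUponPodSucceeded:
		return pod == succeeded
	case PodGCUponWorkflowCompleted:
		return podDone && wfDone
	case PodGCUponWorkflowSucceeded:
		return podDone && wf == succeeded
	}
	return false // unknown or empty strategy: keep the pod
}

func main() {
	// The example workflow's pods exit 1, so both pod and workflow end up failed.
	for _, s := range []PodGCStrategy{
		PodGCUponPodCompleted, PodGCUponPodSucceeded,
		PodGCUponWorkflowCompleted, PodGCUponWorkflowSucceeded,
	} {
		fmt.Printf("%s: delete=%v\n", s, shouldDeletePod(s, failed, failed))
	}
}
```

Under this reading, the two "succeeded" strategies keep failed pods around, which preserves them for debugging.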
Oh yeah, I also think podGarbageCollection is the more meaningful term, but I still don't understand whether ttlSecondsAfterFinished makes sense. Why not delete pods immediately upon any *Complete or *Success event?
It makes a lot of sense because it is very useful for forensic analysis. Sometimes people want to see what happened in the pods in a workflow.
Sometimes people want to see what happened in the pods in a workflow
Perhaps an alternative solution is to _artifact_ the logs? Though, depending on how that is implemented, it may not appropriately capture the pod lifecycle / events.
Fixed
@jessesuen I assume this was fixed in https://github.com/argoproj/argo/pull/1234. That should probably be mentioned in the changelog, right?
How do we do this now that 2.4 is released? It's not in the docs?
I was looking for an example as well. Found it here: https://github.com/argoproj/argo/blob/c7e5cba14a835fbfd0aba88b99197675ce1f0c66/examples/pod-gc-strategy.yaml#L9-L16
Is there a strategy for not deleting pods? Or keeping a history?
The default strategy is to not delete the pods; simply leave off the podGC section. Alternatively, you can persist the run data while cleaning up the pods if you enable persistence: https://github.com/argoproj/argo/blob/master/docs/offloading-large-workflows.md