Argo: Handle pod deletes outside workflow lifecycle

Created on 11 Jun 2019  路  4Comments  路  Source: argoproj/argo

Possibly a suggestion/feature request. I ran into an issue similar to #893 when Kured restarted a node on which a pod was executing a workflow step. This triggered the handling mechanism here which marked the workflow step as failed with the message pod deleted.

Can't this scenario be augmented with a pre-stop hook injected into the pod-spec to notify workflow-controller to better handle cases where a pod has been deleted outside of the workflow lifecycle?

https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/

bug

Most helpful comment

I found the issue. The problem was in my workflow after I upgraded argo from 2.4.3 to 2.6.0. I did't changed retry parameters: before it was {retryStrategy: limit: 10} (only supported by that version). And after upgrade I got message: level=info msg="Node not set to be retried after status: Error". My fault, didn't noticed this message in workflow-controller logs. The fix for workflow to change {retryStrategy: limit: 10, retryPolicy: Always, backoff: ...} backoff i added because limit 10 can be reached quickly. Now it works. Thanks! Maybe this information will help for others

All 4 comments

I run in to same issue with v2.6.0-rc1. To reproduce just delete node running one or several workflow pods. Whole workflow stuck in running state as some pods had "pod deleted" and newer retried. Expected behaviour is pod rescheduled with retry. Found it works with 2.4.3

@audriusrudalevicius - Could you give more detail about your case? Argo has the ability to handle the situation that pod is deleted outside of the wf lifecycle, in general if the POD is deleted, wf will retry (if it's there in your spec) or marked as failed. I want to see if there's a bug to make it not work in your case.

I found the issue. The problem was in my workflow after I upgraded argo from 2.4.3 to 2.6.0. I did't changed retry parameters: before it was {retryStrategy: limit: 10} (only supported by that version). And after upgrade I got message: level=info msg="Node not set to be retried after status: Error". My fault, didn't noticed this message in workflow-controller logs. The fix for workflow to change {retryStrategy: limit: 10, retryPolicy: Always, backoff: ...} backoff i added because limit 10 can be reached quickly. Now it works. Thanks! Maybe this information will help for others

Closing, feel free to reopen if necessary

Was this page helpful?
0 / 5 - 0 ratings