Argo: Set a maximum duration between submission time and workflow start time

Created on 10 Jul 2020 · 8 comments · Source: argoproj/argo

Sometimes, when no suitable node is available, the cluster needs to start one before a workflow can begin. During that time the workflow is in a pending state. It can happen that cluster scaling does not work (for example, when the workflow asks for requirements that cannot be satisfied), and the workflow then stays in the pending state forever until an admin cleans it up.

Does an option similar to activeDeadlineSeconds exist for such a case? Or maybe there is an easy-to-set-up Kubernetes configuration?

In short, we would like to set a maximum duration between the submission time and the workflow start time.

enhancement


All 8 comments

@simster7 to reply with a workaround.

Hi @hadim. We think this is a valid use case; however, it should be possible to handle it without any new functionality. Possible workarounds:

  1. Add a step to your Workflow that runs before all other steps. This step can compare {{workflow.creationTimestamp}} (an Argo variable) with the current time to determine whether too much time has passed between the two. If so, the step can fail itself, thereby failing the Workflow as a whole.

  2. Create a CronWorkflow that periodically lists all workflows in the cluster and deletes the ones that have been pending for too long. Containers started by Argo that have kubectl installed should automatically be given the same access as the service account that Argo uses.
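Workaround 1 can be sketched as a small script run by the first step in the Workflow. This is an illustrative sketch, not Argo API: the function name `check_start_delay` and the threshold variable are made up here, GNU `date` is assumed in the container image, and the timestamp argument would be injected via the real Argo variable `{{workflow.creationTimestamp}}`.

```shell
#!/bin/sh
# Sketch of workaround 1: fail the first step (and thus the Workflow)
# if too much time elapsed between submission and this step starting.
# Assumes GNU date (supports -d with RFC3339 timestamps).

# check_start_delay CREATION_TS [MAX_WAIT_SECONDS]
# Returns 0 if the delay is acceptable, 1 otherwise.
check_start_delay() {
  creation_ts=$1           # e.g. "2020-07-10T12:00:00Z"
  max_wait=${2:-600}       # default threshold: 10 minutes
  created=$(date -u -d "$creation_ts" +%s)
  now=$(date -u +%s)
  elapsed=$((now - created))
  if [ "$elapsed" -gt "$max_wait" ]; then
    echo "workflow waited ${elapsed}s (> ${max_wait}s) before starting; failing" >&2
    return 1
  fi
  echo "started after ${elapsed}s"
}

# In the Workflow template, the step's container would run something like:
#   check_start_delay "{{workflow.creationTimestamp}}" 600 || exit 1
```

A step whose script exits non-zero is marked Failed by Argo, which then fails the Workflow as a whole, as the comment above describes.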

Thanks for the workarounds. That's interesting.

For 1): That's my preferred solution, but to make it even better it would be great to be able to do it with an onStart handler, so we wouldn't have to manually add a new node to a DAG and could set this in the workflowDefault setting. See https://github.com/argoproj/argo/issues/3428 about onStart.

For 2): Good idea. I think we'll do that while waiting for onStart (if that's something you'll consider).
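The cleanup logic of workaround 2 could look like the sketch below. The selection is done with jq so it can be reasoned about on its own; the kubectl invocation is shown as a comment because it depends on the cluster. The function name `select_stale` and the `MAX_PENDING_SECONDS` variable are hypothetical names for this sketch, not anything Argo provides.

```shell
#!/bin/sh
# Sketch of workaround 2: from `kubectl get wf -o json` output, pick the
# names of workflows that have been Pending longer than a threshold.
# Requires jq.

MAX_PENDING_SECONDS=${MAX_PENDING_SECONDS:-600}   # 10 minutes by default

# Reads a workflow list (JSON) on stdin, prints one stale workflow name per line.
select_stale() {
  jq -r --argjson max "$MAX_PENDING_SECONDS" '
    .items[]
    | select(.status.phase == "Pending")
    | select((now - (.metadata.creationTimestamp | fromdateiso8601)) > $max)
    | .metadata.name'
}

# Inside a CronWorkflow step with kubectl and cluster access, roughly:
#   kubectl get wf -o json | select_stale | xargs -r kubectl delete wf
```

Run on a schedule (e.g. every few minutes via a CronWorkflow), this deletes workflows stuck in Pending while leaving Running ones alone.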

Great, I'll be closing this as solved. We can continue the discussion of onStart in #3428.

@simster7 I just realized that adding a handler to onStart does not solve the original issue.

We are running our cluster on GKE. Sometimes a workflow has not been configured correctly and the requested CPU limits + nodeSelector cannot be satisfied. As a result the workflow stays in a pending state forever and never fails, with this kind of message:

```
Unschedulable: 0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory, 3 node(s) didn't match node selector.
```

This is confusing because the same pending state and message also appear when the workflow is waiting for the cluster to scale up (which usually takes a couple of minutes). This is why we would like to put a time threshold on the pending state: if the workflow hasn't started after xx minutes, it should fail, since after that point we consider that it will never be able to start.

Adding a step at the beginning of the workflow or an onStart handler will not work, since the workflow never starts.

Can we re-open this issue given the message above? If you prefer, I can open a new one. Let me know.

Fixes #3581

Thank you. activeDeadlineSeconds seems to apply to both pending and running workflows. Our workflows should be considered failed after 10 minutes in the pending state, but they usually run for ~20 hours. Could this parameter be separated for pending and running workflows?
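To illustrate the problem with the existing knob: activeDeadlineSeconds is set on the Workflow spec and bounds the workflow's whole lifetime, so a value small enough to catch a stuck-pending workflow (e.g. 600) would also kill legitimate ~20-hour runs. A minimal sketch (the workflow name and container are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: long-job-
spec:
  # Counts pending time AND running time together; there is no separate
  # pending-only deadline, which is what this issue asks for.
  activeDeadlineSeconds: 72000   # ~20h, too coarse to catch a 10min stuck Pending
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:3
      command: [sh, -c, "echo hello"]
```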

