Argo: Proposal: Max Parallel execution at workflow level and template level

Created on 8 Jan 2018 · 5 comments · Source: argoproj/argo

Summary:

If a workflow contains many steps to run in parallel, either because it declares many explicit parallel steps or because a dynamic withItems loop generates an arbitrary number of them, we may want to limit the maximum number of pods executing concurrently within that workflow. This prevents a workflow with heavy fan-out from starving other, more sequential workflows (assuming resource limits are set at the cluster level; if resource limits are not set, it prevents the workflow from starving other workloads on the cluster).

Proposal:

1) Introduce spec.maxParallelExecution to limit parallelization of a given workflow.

2) Also introduce spec.templates.steps.step.withItems.maxParallelExecution to limit parallelization for generated or explicitly listed parallel steps within a loop.

Intended Behavior:

1) When spec.maxParallelExecution is set, at most that many pods run in parallel over the course of the whole workflow.
2) When spec.templates.steps.step.withItems.maxParallelExecution is set, the parallel fan-out steps of that loop run at most that many pods in parallel, while still obeying the spec.maxParallelExecution limit if it is also set.
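As a rough sketch, the proposed fields might look like this in a workflow spec. Field names follow this proposal and are illustrative only, not final syntax; the loop-level limit is shown on the step for readability:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: max-parallel-demo-
spec:
  entrypoint: main
  # Proposed: at most 4 pods from this workflow run at any one time
  maxParallelExecution: 4
  templates:
  - name: main
    steps:
    - - name: process
        template: process
        withItems: [a, b, c, d, e, f]
        # Proposed: this fan-out runs at most 2 pods at a time,
        # still subject to the workflow-wide limit above
        maxParallelExecution: 2
  - name: process
    container:
      image: alpine:3.7
      command: [sh, -c, "echo processing {{item}}"]
```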

/cc @javierbq

Most helpful comment

Hi, does this feature exist? I couldn't find anything in the documentation about limiting the number of pods running concurrently in a withParam loop.

Thanks

All 5 comments

The JobSpec uses the term Parallelism to control this. Let's reuse the terminology.

This alone is unlikely to guarantee that workflow pods will not be starved.

  • If the Kubernetes cluster is used for running other workloads, it is possible that other pods consume so many resources that Argo's workflow pods don't get a chance to run and starve.

  • Whether or not a pod will actually run is a function of available resources. Even with lower parallelism, some new pods might not actually run if the already-running pods consume too many resources.

Baking parallelism into the workflow could also lead to inadequate use of the cluster (e.g. a workflow created and tested on minikube with a certain parallelism, when run on, say, GKE with a larger number of nodes, will either underuse the cluster or require modifying the workflow YAML).

Instead, can the controller look at the load and available resources in the Kubernetes cluster to decide whether to run a pod or not?

There are use cases for limiting parallelism that are not strictly related to resources. For example, today we support withItems, and the implementation submits all pods as soon as possible. But if I want sequential iteration of withItems (e.g. logically my application cannot support more than one concurrent pod), then I would restrict parallelism to 1. Today we have no way to use withItems sequentially, but parallelism would allow us to achieve that.
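For instance, the sequential case could be expressed by setting the limit to 1 on the looping template. This is a hypothetical sketch using a template-level parallelism field of the kind discussed here:

```yaml
  - name: sequential-loop
    # A limit of 1 forces the withItems fan-out below to run
    # one pod at a time instead of all at once
    parallelism: 1
    steps:
    - - name: step
        template: one-job
        withItems: [1, 2, 3]
```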

The starvation case could also arise in a multi-tenant cluster. If a cluster has a finite number of nodes and a single user submits a highly parallelized job, it is likely that no other workflows could run. Parallelism acts similarly to a "niceness" setting to prevent this.

Implemented parallelism controls at both workflow and template level in commit 0517096c32cd4f2443ae4208012c6110fbd07ab6. Examples:

examples/parallelism-limit.yaml
examples/parallelism-template-limit.yaml
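A minimal workflow using the implemented workflow-level field looks roughly like this. This is a sketch, not a verbatim copy; see the example files above for the authoritative versions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit-
spec:
  entrypoint: parallelism-limit
  # Implemented field: caps concurrently running pods in this workflow
  parallelism: 2
  templates:
  - name: parallelism-limit
    steps:
    - - name: sleep
        template: sleep
        withItems: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  - name: sleep
    container:
      image: alpine:latest
      command: [sh, -c, sleep 10]
```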


