Argo: Run steps sequentially / retry when parallel steps exceed memory quota

Created on 6 Feb 2018 · 8 comments · Source: argoproj/argo

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened:

We set a requests.memory quota for the workflows namespace. Then we ran a workflow with parallel steps which together (but not singly) exceeded the quota. The workflow failed these steps with the following error in the workflow-controller:

time="2018-02-06T14:51:34Z" level=info msg="Updated phase Running -> Error" namespace=argo-workflows workflow=steps-6kgn6
time="2018-02-06T14:51:34Z" level=info msg="Updated message  -> pods \"steps-6kgn6-2822283003\" is forbidden: exceeded quota: argo-workflows-quota, requested: requests.memory=8256Mi, used: requests.memory=8256Mi, limited: requests.memory=12Gi" namespace=argo-workflows workflow=steps-6kgn6

What you expected to happen:

The steps should have been run sequentially and succeeded. If separate processes (i.e. other argo workflows) are causing the quota to be exceeded, argo should be able to retry.

How to reproduce it (as minimally and precisely as possible):

Set a requests.memory quota for the workflow namespace...

apiVersion: v1
kind: ResourceQuota
metadata:
  name: argo-workflows-quota
spec:
  hard:
    requests.memory: 12Gi

Submit the following workflow...

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello1"
    - - name: hello2a
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2a"
      - name: hello2b
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2b"

  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
      resources:
        requests:
          memory: 8Gi

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version

argo: v2.0.0-beta1
BuildDate: 2018-01-18T22:06:03Z
GitCommit: 549870c1ee08138b20b8a4b0c026569cf1e6c19a
GitTreeState: clean
GitTag: v2.0.0-beta1
GoVersion: go1.9.1
Compiler: gc
Platform: linux/amd64

  • Kubernetes version :
$ kubectl version -o yaml

clientVersion:
buildDate: 2018-01-18T10:09:24Z
compiler: gc
gitCommit: 5fa2db2bd46ac79e5e00a4e6ed24191080aa463b
gitTreeState: clean
gitVersion: v1.9.2
goVersion: go1.9.2
major: "1"
minor: "9"
platform: linux/amd64
serverVersion:
buildDate: 2017-12-15T20:55:30Z
compiler: gc
gitCommit: 925c127ec6b946659ad0fd596fa959be43f0cc05
gitTreeState: clean
gitVersion: v1.9.0
goVersion: go1.9.2
major: "1"
minor: "9"
platform: linux/amd64

enhancement

All 8 comments

There is a retryStrategy field which will allow container steps to be retried in the event of a failure (unfortunately this does not have a backoff policy yet). Also, in 2.1, there will be a parallelism feature to limit the parallelism of a workflow and/or template. Will these satisfy your use case?
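As a rough illustration of the retryStrategy suggestion, here is a sketch of how it could be attached to the whalesay template from the report above; the limit value is illustrative, and at the time of this thread there was no backoff policy to configure:

```yaml
# Sketch: retry the whalesay step up to 3 times on failure.
# retryStrategy.limit is the illustrative knob here; no backoff
# policy existed yet when this issue was filed.
- name: whalesay
  retryStrategy:
    limit: 3
  inputs:
    parameters:
    - name: message
  container:
    image: docker/whalesay
    command: [cowsay]
    args: ["{{inputs.parameters.message}}"]
```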

I don't think scheduling algorithms based on resources belong in the controller, because this is supposed to be the responsibility of an admission controller.

Thanks, hmmmm. I didn't know about retryStrategy or k8s Admission Controllers. From a quick read, I don't think that an Admission Controller would help much here. What we're trying to do is limit the resources that a particular namespace can use at one time, without blocking a workflow unless a single step claims too many resources. This _does_ sound more like the job a (custom?) scheduler should be doing, perhaps in combination with an Admission Controller which stops you submitting a pod which could never run given the scheduler quotas.

My team is in a similar boat right now too.

Jobs, StatefulSets, Deployments, etc. all check the CPU and memory quotas and retry until pods can be scheduled.

Parallelism doesn't satisfy our use case as we want to limit the total sum of work that a user has running vs limiting single workflows to a specified amount.

I believe PR https://github.com/argoproj/argo/pull/1096 is trying to solve this. Will try to revive this PR based on current master.

Hi, PR #1096 is closed, but not because it was merged. I'm guessing the idea of limiting the parallel executions didn't take off? Thanks

I closed that PR because it was inactive for 1 year. I believe that you can use the parallelism feature to achieve the same goal?

Indeed you can, it just took me a while to figure out how.

For anyone else searching :

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-workflow-
spec:
  parallelism: 2
  ...
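Since the comments above mention that parallelism can apply to a workflow and/or a template, a hedged sketch of the per-template variant is below; the template and step names are taken from the reproduction workflow, but the placement of the field here is an assumption:

```yaml
# Sketch: cap concurrent children of this steps template at 2,
# independently of any workflow-wide spec.parallelism setting.
- name: hello-hello-hello
  parallelism: 2
  steps:
  - - name: hello2a
      template: whalesay
      arguments:
        parameters:
        - name: message
          value: "hello2a"
    - name: hello2b
      template: whalesay
      arguments:
        parameters:
        - name: message
          value: "hello2b"
```

With the quota in this report (12Gi) and 8Gi steps, a parallelism of 1 would be needed to keep concurrent requests under the limit.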