Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
What happened:
We set a requests.memory quota on the workflows namespace. Then we ran a workflow with parallel steps that together (but not individually) exceeded the quota. The workflow failed those steps, with the following error in the workflow-controller logs:
time="2018-02-06T14:51:34Z" level=info msg="Updated phase Running -> Error" namespace=argo-workflows workflow=steps-6kgn6
time="2018-02-06T14:51:34Z" level=info msg="Updated message -> pods \"steps-6kgn6-2822283003\" is forbidden: exceeded quota: argo-workflows-quota, requested: requests.memory=8256Mi, used: requests.memory=8256Mi, limited: requests.memory=12Gi" namespace=argo-workflows workflow=steps-6kgn6
What you expected to happen:
The steps should have been run sequentially and succeeded. If separate processes (e.g. other Argo workflows) cause the quota to be exceeded, Argo should be able to retry.
How to reproduce it (as minimally and precisely as possible):
Set a requests.memory quota for the workflow namespace...
apiVersion: v1
kind: ResourceQuota
metadata:
  name: argo-workflows-quota
spec:
  hard:
    requests.memory: 12Gi
Submit the following workflow...
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello1"
    - - name: hello2a
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2a"
      - name: hello2b
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2b"
  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
      resources:
        requests:
          memory: 8Gi
Anything else we need to know?:
Environment:
$ argo version
argo: v2.0.0-beta1
  BuildDate: 2018-01-18T22:06:03Z
  GitCommit: 549870c1ee08138b20b8a4b0c026569cf1e6c19a
  GitTreeState: clean
  GitTag: v2.0.0-beta1
  GoVersion: go1.9.1
  Compiler: gc
  Platform: linux/amd64
$ kubectl version -o yaml
clientVersion:
  buildDate: 2018-01-18T10:09:24Z
  compiler: gc
  gitCommit: 5fa2db2bd46ac79e5e00a4e6ed24191080aa463b
  gitTreeState: clean
  gitVersion: v1.9.2
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64
serverVersion:
  buildDate: 2017-12-15T20:55:30Z
  compiler: gc
  gitCommit: 925c127ec6b946659ad0fd596fa959be43f0cc05
  gitTreeState: clean
  gitVersion: v1.9.0
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64
There is a retryStrategy field which will allow container steps to be retried in the event of a failure (unfortunately this does not have a backoff policy yet). Also, in 2.1, there will be a parallelism feature to limit the parallelism of a workflow and/or template. Will these satisfy your use case?
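For reference, a rough sketch of the repro workflow above with both fields added (field placement per the workflow spec; the retry limit and parallelism values are only illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  parallelism: 1                  # 2.1+: run at most one pod of this workflow at a time
  entrypoint: hello-hello-hello
  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello1"
    - - name: hello2a
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2a"
      - name: hello2b
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2b"
  - name: whalesay
    retryStrategy:
      limit: 3                    # retry a failed step up to 3 times (no backoff policy yet)
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
      resources:
        requests:
          memory: 8Gi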
I don't think scheduling algorithms based on resources belong in the controller, because this is supposed to be the responsibility of an admission controller.
Thanks, hmmmm. I didn't know about retryStrategy or k8s Admission Controllers. From a quick read, I don't think an Admission Controller would help much here. What we're trying to do is limit the resources that a particular namespace can use at one time, without blocking a workflow unless a single step claims too many resources. This _does_ sound more like the job a (custom?) scheduler should be doing, perhaps in combination with an Admission Controller which stops you from submitting a pod that could never run given the scheduler quotas.
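For completeness, the closest built-in mechanism to that last point appears to be a LimitRange, which the LimitRanger admission plugin enforces per container; a minimal sketch, with an illustrative name and caps:

apiVersion: v1
kind: LimitRange
metadata:
  name: per-container-cap        # illustrative name
spec:
  limits:
  - type: Container
    max:
      memory: 8Gi                # reject containers whose memory limit exceeds this
    default:
      memory: 1Gi                # default limit applied when a container sets none
    defaultRequest:
      memory: 512Mi              # default request applied when a container sets none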
My team is in a similar boat right now too.
Jobs, StatefulSets, Deployments, etc. all check the CPU and memory quotas and retry until their pods can be scheduled.
Parallelism doesn't satisfy our use case, as we want to limit the total amount of work a user has running rather than limit single workflows to a specified amount.
I believe PR https://github.com/argoproj/argo/pull/1096 is trying to solve this. Will try to revive this PR based on current master.
Hi, PR #1096 is closed, but not because it was merged. I'm guessing the idea of limiting parallel executions didn't take off? Thanks
I closed that PR because it was inactive for 1 year. I believe that you can use the parallelism feature to achieve the same goal?
Indeed you can, it just took me a while to figure out how.
For anyone else searching:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
  generateName: my-workflow-
spec:
  parallelism: 2
  ...
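Per the earlier comment, parallelism can also be set on an individual template to limit how many of that template's children run at once; a small sketch (the names and values here are just for illustration):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: template-parallelism-    # illustrative name
spec:
  entrypoint: fan-out
  templates:
  - name: fan-out
    parallelism: 1                        # run the children of this template one at a time
    steps:
    - - name: a
        template: whalesay
      - name: b
        template: whalesay
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello"]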