Hello there! Some time back I asked a question on the Slack channel and still haven't received any advice on the problem.
I need some help and advice about the way Argo handles resource quotas. We're hitting the problem repeatedly in our namespace: a workflow fails because of quota limits and is not retried later on.
An example Workflow result:
pods "workflow-test-1578387522-82e06118" is forbidden: exceeded quota: diamand-quota, requested: limits.cpu=2, used: limits.cpu=23750m, limited: limits.cpu=24
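For context, the rejection comes from a namespace `ResourceQuota` object. A hypothetical manifest that would produce this kind of error (names and values are illustrative, not our actual quota):

```yaml
# Illustrative ResourceQuota; name and limits are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-quota
  namespace: example-ns
spec:
  hard:
    limits.cpu: "24"
    limits.memory: 20Gi
```

Any pod creation that would push the namespace's aggregate `limits.cpu` past 24 is rejected by the API server with a `Forbidden: exceeded quota` error like the one above.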
Is there any advice with respect to the Workflow reconciliation? Any existing solutions? Does / should workflow-controller take care of that?
All in all, I need to know
a) whether the problem is on our side
b) whether there is an easy way to work around failed Workflows due to resource quotas
c) whether somebody else is hitting the issue
d) whether there are any plans from the Argo side regarding this and/or how I can contribute
I am ready and willing to go ahead and see to the implementation myself; I'm just not experienced enough to tell whether this is something that can be implemented and how to go about it. Again, any pointers are welcome! :)
Cheers,
Marek
As far as I'm aware, resource quotas are a K8s concept that Argo does not know about. They are managed by a cluster admin and restrict how many resources a specific namespace can use. If running workflows produces this error, might it be that your specific namespace in your specific cluster is running out of quota? If so, this is not a problem that Argo could solve.
Hello @simster7 , thank you for the answer!
Correct, it is a K8s concept.
If running workflows is providing this error, might it be that your specific namespace in your specific cluster is running out of quotas?
That is precisely the issue: the namespace is temporarily out of quota, for example because many pods are currently running or memory is exhausted.
this is not a problem that Argo could solve
I disagree. I believe the correct behaviour would be to wait for the resource quota to become available before executing the next step of a Workflow. Argo is, after all, a workload management engine, isn't it? How am I supposed to use Argo for container orchestration if it fails instantly instead of following the very basic Kubernetes rule, which is that pods are created when there are resources to do so?
Consider the following example:
I have a workflow in which I only submit resources. In a Workflow step, I submit a resource (let's say a Job), and then the Workflow immediately fails because it is not able to create the Pod for it. However, one second later, the Job is run by K8s because the quota has been freed for it. The Workflow, however, stays failed nevertheless. That, in my opinion, is quite a problem, and it makes it nearly impossible to use Argo Workflows in an environment with strict quotas. I do want to use it though, because Argo's wicked, if it weren't for this damn thing! :D
Cheers,
Marek
(I edited your comment above to distinguish the quotes and your responses)
Ah, I see what you're saying now. If that is the case as you've described, I agree that we need some sort of spec/logic to fix this.
I haven't tried it here yet, but would using retryStrategy here help at all? If not, what do you think we could add to it to solve this issue?
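For reference, a `retryStrategy` stanza looks roughly like this (a sketch; the limit and backoff values are made up, and the `backoff` block may require a newer Argo release than the one originally discussed here):

```yaml
templates:
  - name: resource-submission
    retryStrategy:
      limit: 10              # hypothetical retry budget
      backoff:
        duration: "30s"      # hypothetical initial delay
        factor: 2
        maxDuration: "10m"
```

The catch discussed below is that this retries on *any* failure, with no way to retry only on quota rejections.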
Thanks for the edit! :)
I actually do see your point now as well, after giving it deeper thought, and I think you might be right in the sense that Argo is probably not to blame (but it might be the saviour :) )
I'll get back to the retryStrategy in a sec. Please correct me if I am wrong in my reasoning. Let's consider the case of k8s resources first (which is the trickiest one, I suppose, because it's basically decoupled from the Workflow):
kubectl create command -> and here one of two things happens:
a) the create call itself is rejected because the quota is exceeded, or
b) the resource is created, but the pods it spawns later fail because of the quota.
In both cases, the Workflow is considered failed (unless treated otherwise).
Now, in the case of a):
This is, in my opinion, something that should be handled on the Argo side. The reason should be detected, and there should be a mechanism (which may or may not be configurable) to wait for the required resources.
The b) case is where we're doomed, because it's completely decoupled, and I agree that in that particular case Argo has done its job and submitted the resource. However, what's missing here is the ability to act upon these failures. I think that successCondition and failureCondition are not sufficient, and neither is retryStrategy, because I have no way to detect why the failure happened, and I end up retrying workflow steps whose failure was justified by application misbehaviour (also, it is quite confusing, even from the UI, because it looks like the step was failing for some time when, in fact, it is just waiting for the quota).
I think a potential solution to combat the b) problem is to introduce a better decision mechanism. I can elaborate a little more if you like, but the rough idea would be to have onSuccess and onFailure branching like:
steps:
  - - name: resource-submission
      onFailure:
        template: retry-if-exceeded-quota
and probably to improve successCondition (resp. failureCondition) to allow more fine-grained control over these?
I am looking forward to your response and comments,
Marek
What sort of condition would {success, failure}Condition use to determine if the issue is caused by resource quotas?
I think this falls in the domain of retryStrategy although I agree it might be a bit off, because this is technically a failure while scheduling, and not a failure of the pod. Any ideas for what would be a good change to retryStrategy to alleviate this?
I second that. We use Argo Workflows with GPUs, and as of now, having a limit on the namespace causes workflows to fail. To make it work, one needs to set the number of retries to unlimited, and this is very bad -- suppose there is an error in the pod itself.
I'd like the Argo controller to be a bit more mindful with respect to resource allocation and to take pod requests and namespace limits into consideration.
I'm relatively new to Argo, but I've already seen similar issues. I'm running computational jobs. I know that each of these will occupy, for example, 1 CPU. If I have a cluster with, say, 10 CPUs but my Workflow will create 100 pods, I'd like to be able to specify at the Workflow level, in the spec, that the work described will require 1 CPU.
I see these resource-based constraints as important because the current constraints on parallelism (as far as I know) don't really address this issue. I can set a limit on the total number of Workflows, but that is difficult because I don't know what resources each of the workflows will require. I can set parallelism in the Workflow itself, but then I don't know what the cluster is capable of.
What is really needed to scale things is for the workflows themselves to specify their resource constraints and have the scheduler prioritize these nodes and then schedule what the cluster can handle. Is there some way to accomplish this currently?
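For what it's worth, the closest existing knob is the workflow-level `parallelism` cap, which can approximate a resource budget when pod sizes are uniform. A sketch under that assumption (the cap, image, and resource values are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fanout-
spec:
  entrypoint: main
  parallelism: 10            # hypothetical cap: at most 10 pods of this workflow at once
  templates:
    - name: main
      steps:
        - - name: work
            template: cpu-task
            withSequence:
              count: "100"   # fan out 100 tasks, throttled by the cap above
    - name: cpu-task
      container:
        image: busybox
        command: [sh, -c, "sleep 5"]
        resources:
          requests:
            cpu: "1"
```

This only bounds concurrency per workflow, though; it does not make the controller aware of what the cluster or the namespace quota can actually accommodate.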
@xogeny I'm going to implement it; if you'd like, you can be one of the beta-testers :) Basically, the idea is to alter how pods are scheduled: not to fail when there are no resources, but rather to indicate a "Pending" state.
@jamhed I'd be very happy to try it out. This will be relatively important for us. BTW, I see you are in Prague. Turns out I'm in Prague this week. Let me know if you want to grab lunch this week and talk Argo, my treat. I'm near Karlin, BTW.
@xogeny sure, would love to, check your mail. actually, i have more things i'd like to discuss with this regard :)
@jamhed count me in for the testing :)
@CermakM sure :)
@CermakM jamhed/workflow-controller:v2.4.3-1
If it can't schedule the pod, it just keeps it in Pending state.
Details: https://github.com/argoproj/argo/compare/v2.4.3...jamhed:v2.4.3-1
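The gist of the patch, as I understand it, is to classify quota rejections as transient and leave the node Pending instead of failing it. A minimal sketch of that classification, assuming simple string matching on the API error message (the actual controller code works with the typed Kubernetes `Forbidden` error, and the function name here is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// isExceededQuota reports whether a pod-creation error looks like a
// ResourceQuota rejection. Such errors are transient: the quota may free
// up later, so a controller can keep the node Pending and retry creation
// instead of marking the step Failed.
// (Sketch only -- not the actual workflow-controller implementation.)
func isExceededQuota(errMsg string) bool {
	return strings.Contains(errMsg, "forbidden") &&
		strings.Contains(errMsg, "exceeded quota")
}

func main() {
	msg := `pods "x" is forbidden: exceeded quota: some-quota, requested: limits.cpu=2`
	fmt.Println(isExceededQuota(msg)) // prints "true"
}
```

On a match, the node stays Pending and pod creation is simply attempted again on the next reconciliation pass.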
@jamhed that sounds interesting! Will give it a shot :)
@CermakM, this is pretty much the change you need to make in the Argo helm chart:
images:
  namespace: jamhed
  tag: v2.4.3-1
@jamhed Just tested it out. Works like a charm! What a relief to see that ... are there any plans for merging this upstream?!
@simster7 :pray:
@CermakM let me backport it to 2.6.1, and I'll open a pull request.
@CermakM @simster7 https://github.com/argoproj/argo/pull/2385
We are considering making this default behaviour in v2.11. Thoughts?
I would say it's much better behavior in comparison to the current one. So +1 on my side for this usability improvement.
Thank you. I've created a new image for testing if you would like to try it: argoproj/workflow-controller:fix-3791.
I've created another test image: argoproj/workflow-controller:fix-3791.
Can you please try it out to confirm it fixes your problem?
Hey whats the current status on this? Has this feature been released? I can see the flag in the API reference but the default behaviour isn't documented.
We've installed your build argoproj/workflow-controller:fix-3791 (sha256:2cc4166ce). I can confirm the workflow acts much more stably with respect to resources in comparison to v2.9.5 (I haven't tested releases newer than that).
Is there anything to observe in logs (I didn't see any relevant messages)?
Thank you. I wanted to verify it worked better.
Thank you for this fix.
In what release of Argo can we expect this feature to be present?
v2.11
@alexec After some time, we spotted an issue. Workflows fail (interestingly, they do not get deleted based on the TTL strategy configuration) and stay in the cluster. I can see `pod deleted` as a message. This happens for a pod that requires a relatively large amount of resources that are not available because other workflows are using them.
Checking cluster events, there was nothing suspicious. The workflow controller produces the following log:
time="2020-09-01T08:43:41Z" level=info msg="Processing workflow" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Updated phase -> Running" namespace=thoth-backend-stage workflow=adviser-c9267fd5
E0901 08:43:41.029779 1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"adviser-c9267fd5.16309c64000c72ab", GenerateName:"", Namespace:"thoth-backend-stage", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"thoth-backend-stage", Name:"adviser-c9267fd5", UID:"b44d8e83-14f6-482e-b001-42b0c8e1a09b", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"195448328", FieldPath:""}, Reason:"WorkflowRunning", Message:"Workflow Running", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcba04f41ab50ab, ext:508109826557890, loc:(*time.Location)(0x2994080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcba04f41ab50ab, ext:508109826557890, loc:(*time.Location)(0x2994080)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
time="2020-09-01T08:43:41Z" level=info msg="DAG node adviser-c9267fd5 initialized Running" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="All of node adviser-c9267fd5.advise dependencies [] completed" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Pod node adviser-c9267fd5-3669748408 initialized Pending" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Mark node adviser-c9267fd5.advise as Pending, due to: pods \"adviser-c9267fd5-3669748408\" is forbidden: exceeded quota: thoth-backend-stage-quota, requested: limits.memory=6400Mi, used: limits.memory=14824Mi, limited: limits.memory=20Gi" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="node adviser-c9267fd5-3669748408 message: Pending 27.047885ms" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Released all acquired locks" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Workflow update successful" namespace=thoth-backend-stage phase=Running resourceVersion=195448332 workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Processing workflow" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=warning msg="pod adviser-c9267fd5-3669748408 deleted" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-2425184343 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-1685984280 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-1448319234 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-3104014563 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Outbound nodes of adviser-c9267fd5 set to [adviser-c9267fd5-1685984280 adviser-c9267fd5-1448319234 adviser-c9267fd5-3104014563]" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="node adviser-c9267fd5 phase Running -> Error" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="node adviser-c9267fd5 finished: 2020-09-01 08:43:51.160005601 +0000 UTC" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Checking daemoned children of adviser-c9267fd5" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Updated phase Running -> Error" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Marking workflow completed" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Checking daemoned children of " namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Released all acquired locks" namespace=thoth-backend-stage workflow=adviser-c9267fd5
E0901 08:43:51.161571 1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"adviser-c9267fd5.16309c665bf6cacc", GenerateName:"", Namespace:"thoth-backend-stage", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"workflows.argoproj.io/node-name":"adviser-c9267fd5", "workflows.argoproj.io/node-type":"DAG"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"thoth-backend-stage", Name:"adviser-c9267fd5", UID:"b44d8e83-14f6-482e-b001-42b0c8e1a09b", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"195448332", FieldPath:""}, Reason:"WorkflowNodeError", Message:"Error node adviser-c9267fd5", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c989c4cc, ext:508119958577099, loc:(*time.Location)(0x2994080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c989c4cc, ext:508119958577099, loc:(*time.Location)(0x2994080)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
E0901 08:43:51.166772 1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"adviser-c9267fd5.16309c665bf7e5f2", GenerateName:"", Namespace:"thoth-backend-stage", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"thoth-backend-stage", Name:"adviser-c9267fd5", UID:"b44d8e83-14f6-482e-b001-42b0c8e1a09b", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"195448332", FieldPath:""}, Reason:"WorkflowFailed", Message:"", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c98adff2, ext:508119958649585, loc:(*time.Location)(0x2994080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c98adff2, ext:508119958649585, loc:(*time.Location)(0x2994080)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
time="2020-09-01T08:43:51Z" level=info msg="Workflow update successful" namespace=thoth-backend-stage phase=Error resourceVersion=195448520 workflow=adviser-c9267fd5
Not sure if this line can have any impact on the issue (I suspect not?!):
'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
Pods may be deleted manually, due to scale-down events in your cluster, or for other reasons. When this happens, unless you have `resubmitPendingPods: true`, the node fails. See #3918
Thanks! It looks like setting resubmitPendingPods: true on the template level did the trick.
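For anyone landing here later, the setting that worked sits on the template; a sketch (the template name and container are hypothetical):

```yaml
templates:
  - name: advise                  # hypothetical template name
    resubmitPendingPods: true     # recreate the pod if it is deleted while still Pending
    container:
      image: busybox
      command: [sh, -c, "echo done"]
```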