Argo: Need to better handle large numbers of workflows and warn users to clean them up

Created on 7 Feb 2020  ·  11 comments  ·  Source: argoproj/argo

Checklist:

  • [x] I've included the version.
  • [x] I've included reproduction steps.
  • [ ] I've included the workflow YAML.
  • [x] I've included the logs.

What happened:
On one of our K8s clusters (not all of them), the workflow controller logs this error repeatedly:

E0206 22:46:53.953386       1 reflector.go:126] github.com/argoproj/argo/workflow/controller/controller.go:156: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request

which maps to https://github.com/argoproj/argo/blob/v2.4.3/workflow/controller/controller.go#L156

It turns out we had 13450 workflows built up over the last year sitting in the 'workflows' namespace, because there is still no automatic garbage collection in Argo. Running argo list -n workflows would actually crash the workflow-controller pod (causing it to go into an evicted state), after it returned:
macbook-pro-2:argo [email protected]$ argo list -n workflows
2020/02/06 18:43:48 rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2151705869 vs. 2147483647)

I'm assuming the logic inside the workflow controller tries to fetch all workflows at once, rather than in chunks or with a maximum upper bound, which is causing the timeout and crashes. (Note that 2147483647 in the error above is math.MaxInt32, a 2 GiB message-size ceiling, and the list response exceeded it.)

What you expected to happen:
I expected the wfInformer to report that it was encountering too many workflows and to recommend manual pruning of them in the controller logs. Similarly, in the error path that returns "Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request", I would expect it to report which URL/REST request it was issuing.

How to reproduce it (as minimally and precisely as possible):
You just need roughly 13000+ workflows to stick around in the 'workflows' namespace. The failure happens with both v2.3.0 and v2.4.3, just with a different error line number for v2.3.0, of course.

Anything else we need to know?:

Environment:

  • Argo version: v2.3.0 and v2.4.3
$ argo version
  • Kubernetes version :
clientVersion:
  buildDate: "2019-12-07T21:20:10Z"
  compiler: gc
  gitCommit: 70132b0f130acc0bed193d9ba59dd186f0e634cf
  gitTreeState: clean
  gitVersion: v1.17.0
  goVersion: go1.13.4
  major: "1"
  minor: "17"
  platform: darwin/amd64
serverVersion:
  buildDate: "2020-01-16T04:08:27Z"
  compiler: gc
  gitCommit: 18e8565daf60eb3a20c0ac29a7d3a93622659e4d
  gitTreeState: clean
  gitVersion: v1.14.10+IKS
  goVersion: go1.12.12
  major: "1"
  minor: "14"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
argo get <workflowname>
  • executor logs:
kubectl logs <failedpodname> -c init
kubectl logs <failedpodname> -c wait
  • workflow-controller logs:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)


Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

bug

Most helpful comment

Hi,
+1 this.

I am experiencing an issue where the workflow-controller eats up too much memory (up to 5 GB), reaching the node's full capacity and causing it to be evicted in a loop.

I am running Argo as part of a Kubeflow deployment, and the number of pipelines running is modest (tens, not thousands).

All 11 comments

Interesting problem.

One solution, obviously, is to delete old workflows. We could always increase timeouts or change code to deal with this - but then we get to 20k workflows and the problem re-appears.

Any solution therefore must involve deleting old workflows.

  • What is the use case for 10k+ workflows in your system?
  • Have you considered trying out the workflow archive feature in v2.5?
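
For anyone who needs to delete old workflows in bulk, one approach is a label-selector delete: the controller labels finished workflows with workflows.argoproj.io/completed=true, so a selector-based delete only touches completed ones and leaves running workflows alone. A sketch (adjust the namespace to yours):

```shell
# Confirm the build-up: count Workflow objects in the namespace.
kubectl get workflows -n workflows --no-headers | wc -l

# Delete only finished workflows; running ones are untouched because
# the controller only sets this label once a workflow completes.
kubectl delete workflows -n workflows \
  -l workflows.argoproj.io/completed=true
```

If the list itself is too large for a single request, deleting in batches (for example with --field-selector or by date via a script) avoids hitting the same timeout the controller does.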

I don't think 10k+ workflows is healthy; it's just what happens when the person who set up Argo in pipeline automation moves to another team/project before configuring cleanup.

Our complaint is more about the way it breaks when it gets there: the log messages don't tell the user that they need to delete workflows. It took us several hours to figure out this was the problem, and we would have expected "argo list" to fail gracefully and warn the user there were too many workflows to work properly.

We ran into this issue too. We now must set ttlSecondsAfterFinished on every workflow we run; otherwise the workflow objects build up and you eventually can't get info out of the argo CLI.
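
For anyone landing here, that field goes directly in the workflow spec. A minimal sketch (image and timings are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-
spec:
  entrypoint: main
  # Delete the Workflow object 24h after it finishes,
  # so objects never accumulate indefinitely.
  ttlSecondsAfterFinished: 86400
  templates:
    - name: main
      container:
        image: alpine:3.7
        command: [echo, "hello"]
```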

Even worse, because we weren't cleaning up our workflow objects, any PersistentVolumes associated with them weren't getting deleted in AWS. At one point we had ~1000 EBS volumes sitting around doing nothing, their workflows having terminated long ago.

Hi,
+1 this.

I am experiencing an issue where the workflow-controller eats up too much memory (up to 5 GB), reaching the node's full capacity and causing it to be evicted in a loop.

I am running Argo as part of a Kubeflow deployment, and the number of pipelines running is modest (tens, not thousands).

Would a warning in the user interface be useful?

Hi alexec, not sure. Is there a way to estimate the required memory usage per Kubeflow pipeline/Argo workflow?

I might have snatched the wrong issue, but my issue happened while initiating 10 workflows together, each containing about a hundred pods or so.

Have you enabled podGC on your workflows?

If by that you mean garbage collecting finished pods, yes.

There were no pods up when this happened; I think I can reproduce this easily.
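
For reference, podGC is configured per workflow in the spec; a minimal sketch (other strategies include OnPodSuccess, OnWorkflowCompletion, and OnWorkflowSuccess):

```yaml
spec:
  # Delete each pod as soon as it completes. Note this only cleans up
  # pods; the Workflow object itself still needs ttlSecondsAfterFinished
  # or manual cleanup.
  podGC:
    strategy: OnPodCompletion
```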

Fixed in #3089.

