On 2.7.1, the workflow-controller is getting OOMKilled by Kubernetes, causing constant restarts. This was with the memory limit and request both set to 1Gi.
I'm also getting this error, which then doesn't get retried (presumably because the pod is down and the controller isn't able to observe the pod change as it happens): https://github.com/argoproj/argo/blob/v2.7.1/workflow/controller/operator.go#L792
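For context, the "retries" discussed in this thread are the per-template `retryStrategy` in the Workflow spec. A minimal sketch of what having retries enabled looks like (the image and command are placeholders); per the behavior described here, this covers failed pods but not errors hit inside the controller itself, like the one linked above:

```yaml
# Sketch: per-template retries via retryStrategy. This retries failed
# pods, but (per this thread) not errors raised inside the controller.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-example-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 3              # retry the step up to 3 times
      container:
        image: alpine:3.11    # placeholder image
        command: [sh, -c, "exit 1"]
```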
It seems to be happier with 4Gi, though I'll let it run for a while.
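For reference, this is roughly what the bump looks like in the workflow-controller Deployment; a minimal sketch, assuming a default install in an `argo` namespace (names may differ in your cluster):

```yaml
# Sketch: the resources stanza of the workflow-controller Deployment,
# with memory raised from 1Gi to 4Gi. Names/namespace are assumptions
# from a default install; adjust to match your manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          resources:
            requests:
              memory: 4Gi   # was 1Gi, which kept getting OOMKilled
            limits:
              memory: 4Gi
```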
Questions:

Are there recommendations for running a highly available Argo in production? We have had other workflows fail when the Argo workflow controller restarts for some reason (e.g. a dropped Postgres connection, or an underlying node issue). We currently alert and re-run manually in these situations, since workflows are not retried in these scenarios even though we have retries enabled.

I believe the instance ID would alleviate load on a particular controller, but it wouldn't address the need for the controller to always be available.

For other critical applications we run more than one copy, with a PodDisruptionBudget (PDB) to ensure copies are always running even when pods are rescheduled to another node (a sketch follows below).

Any tips would be appreciated.

This is a really good pair of questions. @jessesuen @sarabala1979, thoughts?
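For reference, a minimal sketch of the PDB approach mentioned above, assuming the controller pods carry an `app: workflow-controller` label in an `argo` namespace (both are assumptions from a default install). Note that a PDB only guards against voluntary disruptions such as node drains, not crashes or OOMKills, so this is a partial mitigation rather than a full HA recipe:

```yaml
# Sketch only: keep at least one controller pod up during voluntary
# disruptions (e.g. node drains). Label and namespace are assumptions;
# use apiVersion policy/v1 on Kubernetes 1.21+.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: workflow-controller-pdb
  namespace: argo
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: workflow-controller
```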
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.