On 2.7.1, the workflow-controller is getting OOMKilled by Kubernetes, causing constant restarts. This was with the memory limit and request both set to 1Gi.
I'm also getting this error, which then doesn't get retried (presumably because the pod is down and the controller isn't able to observe the pod change as it happens): https://github.com/argoproj/argo/blob/v2.7.1/workflow/controller/operator.go#L792
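For context, the "retries" discussed in this thread are the per-template `retryStrategy` in the Workflow spec. A minimal sketch of what having retries enabled looks like (the image and command are placeholders); per the behavior described here, this covers failed pods but not errors hit inside the controller itself, like the one linked above:

```yaml
# Sketch: per-template retries via retryStrategy. This retries failed
# pods, but (per this thread) not errors raised inside the controller.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-example-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 3              # retry the step up to 3 times
      container:
        image: alpine:3.11    # placeholder image
        command: [sh, -c, "exit 1"]
```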
It seems to be happier with 4Gi, though I'll let it run for a while.
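For reference, this is roughly what the bump looks like in the workflow-controller Deployment; a minimal sketch, assuming a default install in an `argo` namespace (names may differ in your cluster):

```yaml
# Sketch: the resources stanza of the workflow-controller Deployment,
# with memory raised from 1Gi to 4Gi. Names/namespace are assumptions
# from a default install; adjust to match your manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          resources:
            requests:
              memory: 4Gi   # was 1Gi, which kept getting OOMKilled
            limits:
              memory: 4Gi
```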
Questions:

Are there recommendations for running a highly available Argo in production? We have had other workflows fail when the Argo workflow controller restarts for some reason (e.g. a dropped Postgres connection, or an underlying node issue). We currently alert and re-run manually in these situations, since workflows are not retried in these scenarios even though we have retries enabled.

I believe the instance ID would alleviate load on a particular controller, but it wouldn't address the need for the controller to always be available.

For other critical applications we run more than one copy, with a PodDisruptionBudget (PDB) to ensure copies are always running even when pods are rescheduled to another node (a sketch follows below).

Any tips would be appreciated.

This is a really good pair of questions. @jessesuen @sarabala1979, thoughts?
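For reference, a minimal sketch of the PDB approach mentioned above, assuming the controller pods carry an `app: workflow-controller` label in an `argo` namespace (both are assumptions from a default install). Note that a PDB only guards against voluntary disruptions such as node drains, not crashes or OOMKills, so this is a partial mitigation rather than a full HA recipe:

```yaml
# Sketch only: keep at least one controller pod up during voluntary
# disruptions (e.g. node drains). Label and namespace are assumptions;
# use apiVersion policy/v1 on Kubernetes 1.21+.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: workflow-controller-pdb
  namespace: argo
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: workflow-controller
```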
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.