Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
Workflow controller continually tries to create the same workflow pod, which counts against resource quotas.
What you expected to happen:
The workflow controller should create the workflow pod once, and not try again if the first attempt succeeded.
How to reproduce it (as minimally and precisely as possible):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: duplicate-pods-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      name: main
      image: alpine
      command: [/bin/sh]
      args: [-c, sleep 30]
      resources:
        requests:
          cpu: 100m
          memory: 1Gi
        limits:
          cpu: 400m
          memory: 1Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-argo
  namespace: argo
spec:
  hard:
    cpu: '32'
    memory: 64Gi
    pods: '12'
Launch the workflow, then monitor the workflow controller logs and the resource quota status using kubectl get quota quota-argo -n argo -o yaml --watch.
You will see the workflow controller logs repeatedly outputting Skipped pod <the pod name> (<the pod name>) creation: already exists. Meanwhile, due to bugs in Kubernetes where invalid pod creation requests count toward resource quotas (see https://github.com/kubernetes/kubernetes/issues/70563 and https://github.com/kubernetes/kubernetes/issues/51476), these duplicate requests count against the pod limit of our resource quota. Ultimately this results in the workflow failing once the quota is exceeded.
Anything else we need to know?:
While this seems to involve a bug in Kubernetes, there should be a way for the workflow controller to avoid repeatedly submitting the same pod creation request. I can see that this scenario is actually expected in the code (https://github.com/argoproj/argo/blob/master/workflow/controller/workflowpod.go#L271), though I'm not sure why. Hopefully there is a way to avoid this scenario completely.
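For reference, the pattern at that line is roughly the following. This is my own paraphrase rather than the actual Argo source; the function and variable names are made up, and it assumes the pre-1.18 client-go signatures that go with Kubernetes 1.14.

package controller

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
)

// createWorkflowPod submits the pod unconditionally and treats an
// AlreadyExists error as success. Workflow pod names are deterministic, so a
// repeat Create from a later reconciliation pass is a logical no-op for the
// controller, but the request still reaches the API server and is counted
// there.
func createWorkflowPod(clientset kubernetes.Interface, namespace string, pod *corev1.Pod) error {
	_, err := clientset.CoreV1().Pods(namespace).Create(pod)
	if apierrors.IsAlreadyExists(err) {
		// Created on a previous pass; tolerate the error to keep the
		// reconciliation loop idempotent.
		log.Printf("Skipped pod %s creation: already exists", pod.Name)
		return nil
	}
	return err
}

The deterministic pod name is what makes the repeat Create safe from the controller's point of view, even though the API server still sees and counts every duplicate request.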
Environment:
Argo version: 2.4.2
Kubernetes version: 1.14
workflow-controller logs:
time="2019-10-31T12:37:58Z" level=info msg="Skipped pod pod-spec-patch-wrvrr (pod-spec-patch-wrvrr) creation: already exists" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:37:58Z" level=info msg="Workflow update successful" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:37:59Z" level=info msg="Processing workflow" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:37:59Z" level=info msg="Skipped pod pod-spec-patch-wrvrr (pod-spec-patch-wrvrr) creation: already exists" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:38:01Z" level=info msg="Processing workflow" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:38:01Z" level=info msg="Skipped pod pod-spec-patch-wrvrr (pod-spec-patch-wrvrr) creation: already exists" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:38:07Z" level=info msg="Processing workflow" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:38:07Z" level=info msg="Skipped pod pod-spec-patch-wrvrr (pod-spec-patch-wrvrr) creation: already exists" namespace=argo workflow=pod-spec-patch-wrvrr
time="2019-10-31T12:38:11Z" level=info msg="Processing workflow" namespace=argo workflow=pod-spec-patch-wrvrr
I've only been working with Argo recently, so take this with a grain of salt, but it seems to me that we repeatedly submit the pod creation request in order to fit the operator's idempotent paradigm. (We operate workflows by running the template over and over, so we ensure that individual steps are idempotent.) Since attempting to create a pod that already exists is basically a no-op, this is the way to fulfill the step's idempotency.
A potential solution would be to check whether the pod exists (by running a CoreV1().Pods(...).Get(...)) before attempting to create it, instead of creating it unconditionally and failing gracefully when it already exists.
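Roughly something like the sketch below. The helper name is made up, and it assumes the pre-1.18 client-go Get/Create signatures that match Kubernetes 1.14, so take it as an illustration of the idea rather than a patch.

package controller

import (
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureWorkflowPod checks whether the pod already exists before creating it,
// so repeat reconciliation passes never send a duplicate Create to the API
// server (and never count against the quota).
func ensureWorkflowPod(clientset kubernetes.Interface, namespace string, pod *corev1.Pod) (*corev1.Pod, error) {
	existing, err := clientset.CoreV1().Pods(namespace).Get(pod.Name, metav1.GetOptions{})
	if err == nil {
		// Pod was created on a previous pass; reuse it.
		return existing, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// First pass: the pod does not exist yet, so create it. This costs at
	// most two API calls instead of the single unconditional Create.
	return clientset.CoreV1().Pods(namespace).Create(pod)
}

There is still a small race (the pod could appear between the Get and the Create), so the AlreadyExists case would probably still need to be tolerated, but the steady-state duplicate Create on every pass would go away.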
Since this issue seems to be an edge case involving a K8s bug, and I don't know why we handle idempotency the way we currently do, I'll defer any opinion as to the worthiness of the potential solution.
Edit: Just had another thought. It seems minor, but it might be that we create pods the current way to save on API calls: the current approach is guaranteed to be one call, versus potentially two calls with the proposed solution.
Checking if the pod exists sounds like a good plan to me. Any chance of this being addressed in an upcoming release? I can attempt to make a PR too if that would be better.
We have also encountered this and it's blocking our upgrade to 2.4.2. Is there anything we can do to help (like providing logs or running tests)?
Hello @Syps and @benabineri. Sorry I missed your comment a couple of weeks ago. Feel free to make a PR (it should be a small change) and run tests to see if it solves your problem. We can see and discuss the trade-offs then.
@benabineri Which version of Argo are you currently running? Does that version not have the behavior described here?
We're running Argo 2.3.0 in production at the moment, and the ResourceQuota usage it causes is more accurate.
We are planning a patch release soon. Go ahead and create the PR; we will review it and include it in the coming patch release.
I found the offending change for this issue to be https://github.com/argoproj/argo/pull/1552. Looking for a fix right now.