Argo: v2.11: workflow-controller-configmap parallelism not being honored.

Created on 21 Sep 2020  ·  25 comments  ·  Source: argoproj/argo

Summary

I set the across-workflow parallelism in the workflow-controller-configmap to 5:

```yaml
apiVersion: v1
data:
  config: |
    parallelism: 5
    nodeEvents:
      enabled: true
    workflowDefaults:
      spec:
        ttlStrategy:
          secondsAfterSuccess: 5
        parallelism: 5
kind: ConfigMap
```

Then I submitted 10 workflows at once, and all of them went into the Running state:

```
abmallic@MININT-8H2H4GR:~/argo/argo_workflows$ kg workflow
NAME                        STATUS    AGE
dag-diamon-coinflip-5c55t   Running   7s
dag-diamon-coinflip-fczh5   Running   13s
dag-diamon-coinflip-hr4k7   Running   8s
dag-diamon-coinflip-m46cv   Running   11s
dag-diamon-coinflip-mq4rl   Running   10s
dag-diamon-coinflip-rx2nv   Running   12s
dag-diamon-coinflip-vcx66   Running   12s
dag-diamon-coinflip-wkf6r   Running   6s
dag-diamon-coinflip-xvjpj   Running   6s
dag-diamon-coinflip-zsmqs   Running   9s
```

What happened/what you expected to happen?
Only 5 of them should have gone into the Running state.

Diagnostics

What version of Argo Workflows are you running?
2.11

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2020-09-21T08:28:24Z"
  generateName: dag-diamon-coinflip-
  generation: 4
  labels:
    workflows.argoproj.io/phase: Running
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
      f:spec:
        .: {}
        f:arguments: {}
        f:entrypoint: {}
        f:podGC:
          .: {}
          f:strategy: {}
        f:templates: {}
      f:status:
        .: {}
        f:finishedAt: {}
    manager: argo
    operation: Update
    time: "2020-09-21T08:28:24Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:workflows.argoproj.io/phase: {}
      f:spec:
        f:parallelism: {}
        f:ttlStrategy:
          .: {}
          f:secondsAfterSuccess: {}
      f:status:
        f:nodes:
          .: {}
          f:dag-diamon-coinflip-5c55t:
            .: {}
            f:children: {}
            f:displayName: {}
            f:finishedAt: {}
            f:id: {}
            f:name: {}
            f:phase: {}
            f:startedAt: {}
            f:templateName: {}
            f:templateScope: {}
            f:type: {}
          f:dag-diamon-coinflip-5c55t-758823825:
            .: {}
            f:boundaryID: {}
            f:children: {}
            f:displayName: {}
            f:finishedAt: {}
            f:id: {}
            f:name: {}
            f:phase: {}
            f:startedAt: {}
            f:templateName: {}
            f:templateScope: {}
            f:type: {}
          f:dag-diamon-coinflip-5c55t-2710044213:
            .: {}
            f:boundaryID: {}
            f:children: {}
            f:displayName: {}
            f:finishedAt: {}
            f:id: {}
            f:name: {}
            f:phase: {}
            f:startedAt: {}
            f:templateName: {}
            f:templateScope: {}
            f:type: {}
          f:dag-diamon-coinflip-5c55t-3805981268:
            .: {}
            f:boundaryID: {}
            f:displayName: {}
            f:finishedAt: {}
            f:hostNodeName: {}
            f:id: {}
            f:name: {}
            f:phase: {}
            f:startedAt: {}
            f:templateName: {}
            f:templateScope: {}
            f:type: {}
        f:phase: {}
        f:startedAt: {}
    manager: workflow-controller
    operation: Update
    time: "2020-09-21T08:28:28Z"
  name: dag-diamon-coinflip-5c55t
  namespace: argo
  resourceVersion: "2958114"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflows/dag-diamon-coinflip-5c55t
  uid: 12cb8123-a6be-4676-8902-b8b6b4fdc0cc
spec:
  arguments: {}
  entrypoint: diamond
  parallelism: 5
  podGC:
    strategy: OnWorkflowSuccess
  templates:
  - arguments: {}
    dag:
      tasks:
      - arguments: {}
        name: A
        template: coinflip
      - arguments: {}
        dependencies:
        - A
        name: B
        template: coinflip
      - arguments: {}
        dependencies:
        - A
        name: C
        template: coinflip
      - arguments: {}
        dependencies:
        - B
        - C
        name: D
        template: coinflip
    inputs: {}
    metadata: {}
    name: diamond
    nodeSelector:
      agentpool: argoworkload
    outputs: {}
  - arguments: {}
    inputs: {}
    metadata: {}
    name: coinflip
    nodeSelector:
      agentpool: argoworkload
    outputs: {}
    steps:
    - - arguments: {}
        name: flip-coin
        template: flip-coin
    - - arguments: {}
        name: heads
        template: found-heads
        when: '{{steps.flip-coin.outputs.result}} == heads'
      - arguments: {}
        name: tails
        template: coinflip
        when: '{{steps.flip-coin.outputs.result}} == tails'
  - arguments: {}
    inputs: {}
    metadata: {}
    name: flip-coin
    nodeSelector:
      agentpool: argoworkload
    outputs: {}
    script:
      command:
      - python
      image: python:alpine3.6
      name: ""
      resources: {}
      source: |
        import random
        import time
        time.sleep(60)
        result = "heads" if random.randint(0,1) == 0 else "tails"
        print(result)
  - arguments: {}
    container:
      args:
      - echo "it was heads"
      command:
      - sh
      - -c
      image: alpine:3.6
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: found-heads
    nodeSelector:
      agentpool: argoworkload
    outputs: {}
  ttlStrategy:
    secondsAfterSuccess: 5
status:
  finishedAt: null
  nodes:
    dag-diamon-coinflip-5c55t:
      children:
      - dag-diamon-coinflip-5c55t-758823825
      displayName: dag-diamon-coinflip-5c55t
      finishedAt: null
      id: dag-diamon-coinflip-5c55t
      name: dag-diamon-coinflip-5c55t
      phase: Running
      startedAt: "2020-09-21T08:28:24Z"
      templateName: diamond
      templateScope: local/dag-diamon-coinflip-5c55t
      type: DAG
    dag-diamon-coinflip-5c55t-758823825:
      boundaryID: dag-diamon-coinflip-5c55t
      children:
      - dag-diamon-coinflip-5c55t-2710044213
      displayName: A
      finishedAt: null
      id: dag-diamon-coinflip-5c55t-758823825
      name: dag-diamon-coinflip-5c55t.A
      phase: Running
      startedAt: "2020-09-21T08:28:24Z"
      templateName: coinflip
      templateScope: local/dag-diamon-coinflip-5c55t
      type: Steps
    dag-diamon-coinflip-5c55t-2710044213:
      boundaryID: dag-diamon-coinflip-5c55t-758823825
      children:
      - dag-diamon-coinflip-5c55t-3805981268
      displayName: '[0]'
      finishedAt: null
      id: dag-diamon-coinflip-5c55t-2710044213
      name: dag-diamon-coinflip-5c55t.A[0]
      phase: Running
      startedAt: "2020-09-21T08:28:24Z"
      templateName: coinflip
      templateScope: local/dag-diamon-coinflip-5c55t
      type: StepGroup
    dag-diamon-coinflip-5c55t-3805981268:
      boundaryID: dag-diamon-coinflip-5c55t-758823825
      displayName: flip-coin
      finishedAt: null
      hostNodeName: aks-argoworkload-29252569-vmss00001o
      id: dag-diamon-coinflip-5c55t-3805981268
      name: dag-diamon-coinflip-5c55t.A[0].flip-coin
      phase: Running
      startedAt: "2020-09-21T08:28:24Z"
      templateName: flip-coin
      templateScope: local/dag-diamon-coinflip-5c55t
      type: Pod
  phase: Running
  startedAt: "2020-09-21T08:28:24Z"
```



Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Labels: bug · reliability · more-information-needed

All 25 comments

Can you provide the workflow-controller log? (`kubectl logs`)

I have been tracking down ttlStrategy and serviceAccount not being honored on the workflowDefaults either. Perhaps this goes further than just parallelism? Although this appears to be intermittent and/or workflow-dependent.

Workflow Controller Logs for failing to apply workflowDefaults.spec.serviceAccountName:

time="2020-09-21T14:52:59Z" level=info msg="Processing workflow" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="Updated phase  -> Running" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="Steps node steps-zjxns initialized Running" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="StepGroup node steps-zjxns-3343772756 initialized Running" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="Pod node steps-zjxns-3740556429 initialized Pending" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="Created pod: steps-zjxns[0].hello1 (steps-zjxns-3740556429)" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="Workflow step group node steps-zjxns-3343772756 not yet completed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:52:59Z" level=info msg="Workflow update successful" namespace=sandbox phase=Running resourceVersion=2251432 workflow=steps-zjxns
time="2020-09-21T14:53:00Z" level=info msg="Processing workflow" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:00Z" level=info msg="Updating node steps-zjxns-3740556429 message: ContainerCreating"
time="2020-09-21T14:53:00Z" level=info msg="Skipped pod steps-zjxns[0].hello1 (steps-zjxns-3740556429) creation: already exists" namespace=sandbox podPhase=Pending workflow=steps-zjxns
time="2020-09-21T14:53:00Z" level=info msg="Workflow step group node steps-zjxns-3343772756 not yet completed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:00Z" level=info msg="Workflow update successful" namespace=sandbox phase=Running resourceVersion=2251441 workflow=steps-zjxns
time="2020-09-21T14:53:01Z" level=info msg="Processing workflow" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:01Z" level=info msg="Skipped pod steps-zjxns[0].hello1 (steps-zjxns-3740556429) creation: already exists" namespace=sandbox podPhase=Pending workflow=steps-zjxns
time="2020-09-21T14:53:01Z" level=info msg="Workflow step group node steps-zjxns-3343772756 not yet completed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:03Z" level=info msg="Processing workflow" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:03Z" level=info msg="Updating node steps-zjxns-3740556429 status Pending -> Running"
time="2020-09-21T14:53:03Z" level=info msg="Workflow step group node steps-zjxns-3343772756 not yet completed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:03Z" level=info msg="Workflow update successful" namespace=sandbox phase=Running resourceVersion=2251459 workflow=steps-zjxns
time="2020-09-21T14:53:04Z" level=info msg="Processing workflow" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:04Z" level=info msg="Workflow step group node steps-zjxns-3343772756 not yet completed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Processing workflow" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Updating node steps-zjxns-3740556429 status Running -> Error"
time="2020-09-21T14:53:09Z" level=info msg="Updating node steps-zjxns-3740556429 message: failed to save outputs: Failed to establish pod watch: timed out waiting for the condition"
time="2020-09-21T14:53:09Z" level=info msg="Step group node steps-zjxns-3343772756 deemed failed: child 'steps-zjxns-3740556429' failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="node steps-zjxns-3343772756 phase Running -> Failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="node steps-zjxns-3343772756 message: child 'steps-zjxns-3740556429' failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="node steps-zjxns-3343772756 finished: 2020-09-21 14:53:09.334926131 +0000 UTC" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="step group steps-zjxns-3343772756 was unsuccessful: child 'steps-zjxns-3740556429' failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Outbound nodes of steps-zjxns-3740556429 is [steps-zjxns-3740556429]" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Outbound nodes of steps-zjxns is [steps-zjxns-3740556429]" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="node steps-zjxns phase Running -> Failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="node steps-zjxns message: child 'steps-zjxns-3740556429' failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="node steps-zjxns finished: 2020-09-21 14:53:09.335050744 +0000 UTC" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Checking daemoned children of steps-zjxns" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Updated phase Running -> Failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Updated message  -> child 'steps-zjxns-3740556429' failed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Marking workflow completed" namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Checking daemoned children of " namespace=sandbox workflow=steps-zjxns
time="2020-09-21T14:53:09Z" level=info msg="Workflow update successful" namespace=sandbox phase=Failed resourceVersion=2251485 workflow=steps-zjxns
time="2020-09-21T14:53:10Z" level=info msg="Labeled pod completed" namespace=sandbox pod=steps-zjxns-3740556429

Related to #4044

@sarabala1979 do you want to own?

Workflow controller logs @sarabala1979
workflow_controller.log

> I have been tracking down ttlStrategy and serviceAccount not being honored on the workflowDefaults either. Perhaps this goes further than just parallelism? Although this appears to be intermittent and/or workflow-dependent.
>
> Workflow Controller Logs for failing to apply workflowDefaults.spec.serviceAccountName

ttlStrategy, however, seems to be working fine.

Okay, an update. I deleted the workflow-controller deployment and re-deployed Argo, which created the workflow-controller again. I also deleted the couple of pods for the workflow-controller and argo-server, and left the ConfigMap as-is. I submitted 10 workflows again, but this time only 7 workflows ran at a time, so there was some throttling.

Are there steps to be done after the ConfigMap update? That might be what was missing.
@willgleich @sarabala1979

@AbhishekMallick The above workflow log doesn't have any parallelism logs. It looks like the controller is not seeing the configmap update.
Here is the checklist:

  1. Update the parallelism on workflow-controller-configmap in the argo namespace.
  2. Once you have updated the configmap, you should see the updated configmap in the workflow-controller log.
  3. You should see the `Workflow processing has been postponed due to max parallelism limit` log message when the controller reaches the limit.

Please provide the latest controller log with the configmap update.
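Conceptually, the postponement in step 3 is an admission gate: at most N workflows may be processed at once, and the rest wait for a free slot. The Go sketch below is only an illustration of that idea (the `Throttler` type and its methods are hypothetical, not Argo's actual throttler code):

```go
package main

import "fmt"

// Throttler is a minimal sketch of a controller-level parallelism gate:
// at most `limit` workflows may run at once; the rest are postponed.
type Throttler struct {
	limit   int
	running map[string]bool
}

func NewThrottler(limit int) *Throttler {
	return &Throttler{limit: limit, running: map[string]bool{}}
}

// Admit reports whether the workflow may start processing now.
func (t *Throttler) Admit(key string) bool {
	if t.running[key] {
		return true // already admitted earlier
	}
	if t.limit > 0 && len(t.running) >= t.limit {
		// This is where the controller would emit the warning seen in the logs.
		fmt.Printf("Workflow processing has been postponed due to max parallelism limit key=%s\n", key)
		return false
	}
	t.running[key] = true
	return true
}

// Release frees a slot when a workflow completes.
func (t *Throttler) Release(key string) {
	delete(t.running, key)
}

func main() {
	th := NewThrottler(2)
	fmt.Println(th.Admit("wf-1")) // admitted
	fmt.Println(th.Admit("wf-2")) // admitted
	fmt.Println(th.Admit("wf-3")) // postponed: limit reached
	th.Release("wf-1")            // wf-1 finishes, freeing a slot
	fmt.Println(th.Admit("wf-3")) // admitted now
}
```

With a gate like this, seeing no "postponed" messages at all in the controller log is a strong signal that the configured limit never reached the throttler, which is what the checklist above is probing for.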

I am also facing this issue.

My Argo workflow controller config-map

```yaml
apiVersion: v1
data:
  # More configs here https://github.com/argoproj/argo/blob/master/docs/workflow-controller-configmap.yaml
  config: |
    parallelism: 100
    nodeEvents:
      enabled: false
    workflowDefaults:
      spec:
        ttlStrategy:
          secondsAfterSuccess: 3600
        parallelism: 20
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
```

Argo controller logs

workflow-controller-5d956bfb6d-d74t2.log

Grafana metrics (screenshot: graph of running workflows exceeding the limit)

I can see that it respects this limit for a while, but after some time it stops honoring the parallelism limit.

Sample workflow logs

time="2020-09-21T16:46:34Z" level=warning msg="Workflow processing has been postponed due to max parallelism limit" key=argo/internal-data-ingestioncs7nk
time="2020-09-21T16:46:43Z" level=warning msg="Workflow processing has been postponed due to max parallelism limit" key=argo/internal-data-ingestiongbt7k
time="2020-09-21T16:46:43Z" level=warning msg="Workflow processing has been postponed due to max parallelism limit" key=argo/internal-data-ingestioncbzcd
time="2020-09-21T16:46:48Z" level=warning msg="Workflow processing has been postponed due to max parallelism limit" key=argo/internal-data-ingestionmwf7b
time="2020-09-21T16:46:48Z" level=warning msg="Workflow processing has been postponed due to max parallelism limit" key=argo/internal-data-ingestioncs7nk

Suggestion
It would be better if the log message carried additional information, like the current running-workflow count and the configured limit, to avoid confusion.
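As a sketch of that suggestion, the warning could carry both the count and the limit. The function name and exact wording below are hypothetical, not Argo's actual logging code:

```go
package main

import "fmt"

// postponeMessage builds a more informative version of the warning: it
// includes the current running-workflow count and the configured limit.
func postponeMessage(key string, running, limit int) string {
	return fmt.Sprintf(
		"Workflow processing has been postponed due to max parallelism limit (%d/%d running) key=%s",
		running, limit, key)
}

func main() {
	fmt.Println(postponeMessage("argo/internal-data-ingestioncs7nk", 100, 100))
}
```

A message in this shape would let a reader tell at a glance whether the controller thinks it is at 100/100 or is postponing for some other reason.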

Thanks for the detailed information. Is it possible to get the log for 21:30 to 22:30? I want to see what happened after the dip in the graph, where the controller didn't honor the limit. Did you install a new controller or update the configmap?

@sarabala1979, the logs are attached in his message, under "Argo controller logs". I can see that the logs match what both of you are saying: the parallelism limit was being honored for a while, and that's in the logs, but then it suddenly stops happening.

There is a bug in the controller: if you change the controller-level parallelism (i.e. in the config map), we do not actually update the controller. You must restart the controller if you change the controller parallelism.

You can change the workflow-level item.

To fix this:

Copy this line: https://github.com/argoproj/argo/blob/master/workflow/controller/controller.go#L140 to here https://github.com/argoproj/argo/blob/master/workflow/controller/config.go#L34.
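In other words, the fix amounts to refreshing the throttler's limit inside the config-reload path, not only at controller construction. A minimal sketch of that shape (the `Config`/`Controller` types and `updateConfig` below are hypothetical stand-ins, not Argo's actual code):

```go
package main

import "fmt"

// Config is a stand-in for the controller configuration.
type Config struct{ Parallelism int }

// Controller holds the parallelism limit the throttler actually enforces.
type Controller struct {
	config      Config
	parallelism int // limit currently in effect
}

// updateConfig mirrors the proposed fix: when the ConfigMap changes, the
// enforced limit must be refreshed here too, not only in the constructor.
func (c *Controller) updateConfig(cfg Config) {
	c.config = cfg
	// The bug: without this line, the old limit stays in effect until restart.
	c.parallelism = cfg.Parallelism
}

func main() {
	c := &Controller{}
	c.updateConfig(Config{Parallelism: 5})
	fmt.Println(c.parallelism)
	c.updateConfig(Config{Parallelism: 2})
	fmt.Println(c.parallelism)
}
```

The point of the one-line change is exactly this: the value read from the ConfigMap must be propagated to the component that enforces it on every reload, not just at start-up.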

@alexec I noticed that bug earlier too, so as a precautionary step I restarted my Argo workflow controller before starting my actual workload. I confirmed that the controller picked up the latest config-map with parallelism=100 by looking at the controller's log of the config being used. Even after the restart, the controller does not always respect the parallelism.

Argo controller logs

```
level=info msg="Configuration:\nartifactRepository: {}\nmetricsConfig: {}\nnodeEvents:\n  enabled: false\nparallelism: 100\npodSpecLogStrategy: {}\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\ntelemetryConfig: {}\nworkflowDefaults:\n  metadata:\n    creationTimestamp: null\n  spec:\n    arguments: {}\n    parallelism: 20\n    ttlStrategy:\n      secondsAfterSuccess: 3600\n  status:\n    finishedAt: null\n    startedAt: null\n"
```

So I suspect the restart bug you are talking about is separate from what I am facing.

I've created a fix in #4132. If I create a dev build, could I please ask you to run it?

The dev build is argoproj/workflow-controller:throttler

Do you have an ETA on when this would land? We have run into this issue as well and would love to use workflow parallelism. We are running many massive workflows concurrently, which runs into scalability issues with Kubernetes, so setting workflow parallelism and pods-per-workflow is essential for us to prevent the system from grinding to a halt.

Our testing steps were:

  • Set parallelism to 2 in the configmap
  • Restart the workflow-controller
  • Launch 3 workflows: the 3rd one was Pending (expected)
  • Set parallelism to 5 in the configmap
  • Restart the workflow-controller
  • Launch 7 workflows: all 7 went into the Running state (expected: only 5 go into the Running state)

These symptoms seem very similar to what others described above.

> Our testing steps were: ...

Are you saying that argoproj/workflow-controller:throttler did not work for you?

@lynxoid I think your issue may be better solved by #4207 which is now on latest.

> Are you saying that argoproj/workflow-controller:throttler did not work for you?

Is this a Docker image? Where do I pull it from? I did not see a project with this name / release.

```
(⎈ |k3s-default:argo)➜  argo git:(int64str) docker pull argoproj/workflow-controller:throttler
throttler: Pulling from argoproj/workflow-controller
Digest: sha256:5c65218e4c5857fe9b2bdf1954c59f630ca245ce674b9ea7965197f85f2514ec
Status: Image is up to date for argoproj/workflow-controller:throttler
docker.io/argoproj/workflow-controller:throttler
(⎈ |k3s-default:argo)➜  argo git:(int64str) docker run argoproj/workflow-controller:throttler version
workflow-controller: v2.3.0-rc3
  BuildDate: 2020-09-28T22:59:32Z
  GitCommit: 83ced5299381d7d7bab24c41879c5f110fd62b57
  GitTreeState: clean
  GitTag: v2.3.0-rc3
  GoVersion: go1.13.15
  Compiler: gc
  Platform: linux/amd64
```

We updated the workflow-controller to use argoproj/workflow-controller:throttler. The controller restarted, picking up the parallelism: 3 setting. We kicked off 4 workflows: 3 went to Running right away, 1 to Pending. All good.

Next, we edited the config and set parallelism: 2, then kicked off 4 workflows again. Expected behavior: 2 Running, 2 Pending. Observed: 3 Running, 1 Pending, so it seems the config update was not picked up. We deleted the controller pod, causing it to restart and pick up the config setting. Kicked off 4 again: 2 Running, 2 Pending this time. It seems that with this change you still need to restart the controller for changes to be picked up.

I'd like to test the issue where the limit is no longer respected after some time -- is there a faster test than what is described above by other reporters?

Next step for us is to try the 2.11.2 release since it has #4207.

> Seems that with this change you still need to restart the controller for changes to be picked up.

Correct. This config item is not used by many people, so I want to keep it simple to reduce bugs, and not having the ability to re-configure at runtime is part of that.

@lynxoid I'm waiting for you to say the fix works for you before getting it reviewed and merged. Do you think it works for you?

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
