- master at commit 5c538d7a918e41029d3911a92c6ac615f04d3b80
- parallelism: 800, otherwise we observed the EKS control plane becoming unresponsive
- containerRuntimeExecutor: kubelet on AWS Bottlerocket instances
We monitored the two controller work queues (wfc.wfQueue and wfc.podQueue in controller/controller.go). The workflow queue oscillates between 1000 and 1500 items during our test. However, the pod queue consistently stays at 0.
Even after all of its pods are Completed, the workflow lingers in the Running state (fig. 5).
In trying to address these issues, we changed the values of the following parameters without much success:
- pod-workers
- workflow-workers (the default of 32 was a bottleneck, but anything over 128 didn't make a difference)
- INFORMER_WRITE_BACK=false
- --qps, --burst
- workflowResyncPeriod and podResyncPeriod
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep-test-template
  generateName: sleep-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 600
  podGC:
    strategy: OnPodCompletion
  arguments:
    parameters:
      - name: friendly-name
        value: sleep_test # Use underscores, not hyphens
      - name: cpu-limit
        value: 2000m
      - name: mem-limit
        value: 1024Mi
      - name: step-count
        value: "200"
      - name: sleep-seconds
        value: "8"
  metrics:
    prometheus:
      - name: "workflow_duration" # Metric name (will be prepended with "argo_workflows_")
        help: "Duration gauge by name" # A help doc describing your metric. This is required.
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
        gauge: # The metric type. Available are "gauge", "histogram", and "counter".
          value: "{{workflow.duration}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      - name: "workflow_processed"
        help: "Workflow processed count"
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
          - key: status
            value: "{{workflow.status}}"
        counter:
          value: "1"
  templates:
    - name: sleep
      nodeSelector:
        intent: task-workers
      steps:
        - - name: generate
            template: gen-number-list
        - - name: "sleep"
            template: snooze
            arguments:
              parameters: [{name: input_asset, value: "{{workflow.parameters.sleep-seconds}}", id: "{{item}}"}]
            withParam: "{{steps.generate.outputs.result}}"
    # Generate a list of numbers in JSON format
    - name: gen-number-list
      nodeSelector:
        intent: task-workers
      script:
        image: python:3.8.5-alpine3.12
        imagePullPolicy: IfNotPresent
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(0, {{workflow.parameters.step-count}})], sys.stdout)
    - name: snooze
      metrics:
        prometheus:
          - name: "resource_duration_cpu" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration CPU" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.cpu}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
          - name: "resource_duration_memory" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration Memory" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.memory}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      nodeSelector:
        intent: task-workers
      inputs:
        parameters:
          - name: input_asset
      podSpecPatch: '{"containers":[{"name":"main", "resources":{"requests":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}"}, "limits":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}" }}}]}'
      container:
        image: alpine
        imagePullPolicy: IfNotPresent
        command: [sleep]
        args: ["{{workflow.parameters.sleep-seconds}}"]
#!/usr/bin/env bash
set -euo pipefail
while true; do
  for i in {1..3}; do
    argo submit \
      -n argo-workflows \
      --from workflowtemplate/sleep-test-template \
      -p step-count="1" \
      -p sleep-seconds="60" &>/dev/null &
  done
  sleep 1
  echo -n "."
done
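Alongside the Prometheus dashboards, a crude way to watch the backlog this script creates is to count workflows by phase from the CLI; for example (column positions may differ between argo CLI versions):
argo -n argo-workflows list | awk 'NR > 1 { print $2 }' | sort | uniq -c   # counts per phase (Pending/Running/Succeeded/...)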

❯ argo -n argo-workflows get sleep-fanout-test-template-6dtjp
Name: sleep-fanout-test-template-6dtjp
Namespace: argo-workflows
ServiceAccount: default
Status: Running
Created: Wed Dec 02 15:39:59 -0500 (6 minutes ago)
Started: Wed Dec 02 15:39:59 -0500 (6 minutes ago)
Duration: 6 minutes 21 seconds
ResourcesDuration: 42m21s*(1 cpu),2h30m41s*(100Mi memory)
Parameters:
step-count: 100
sleep-seconds: 8
STEP TEMPLATE PODNAME DURATION MESSAGE
● sleep-fanout-test-template-6dtjp sleep
├---✔ generate gen-number-list sleep-fanout-test-template-6dtjp-2151903814 7s
├-·-✔ sleep(0:0) snooze sleep-fanout-test-template-6dtjp-1189074090 14s
| ├-✔ sleep(1:1) snooze sleep-fanout-test-template-6dtjp-1828931302 25s
...
| └-✔ sleep(99:99) snooze sleep-fanout-test-template-6dtjp-1049774502 16s
└---◷ followup snooze sleep-fanout-test-template-6dtjp-1490893639 5m
❯ kubectl -n argo-workflows get pod/sleep-fanout-test-template-6dtjp-1490893639
NAME READY STATUS RESTARTS AGE
sleep-fanout-test-template-6dtjp-1490893639 0/2 Completed 0 5m43s
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep-fanout-test-template
  generateName: sleep-fanout-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 600
  podGC:
    strategy: OnPodCompletion
  arguments:
    parameters:
      - name: friendly-name
        value: sleep_fanout_test # Use underscores, not hyphens
      - name: cpu-limit
        value: 2000m
      - name: mem-limit
        value: 1024Mi
      - name: step-count
        value: "200"
      - name: sleep-seconds
        value: "8"
  metrics:
    prometheus:
      - name: "workflow_duration" # Metric name (will be prepended with "argo_workflows_")
        help: "Duration gauge by name" # A help doc describing your metric. This is required.
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
        gauge: # The metric type. Available are "gauge", "histogram", and "counter".
          value: "{{workflow.duration}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      - name: "workflow_processed"
        help: "Workflow processed count"
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
          - key: status
            value: "{{workflow.status}}"
        counter:
          value: "1"
  templates:
    - name: sleep
      nodeSelector:
        intent: task-workers
      steps:
        - - name: generate
            template: gen-number-list
        - - name: "sleep"
            template: snooze
            withParam: "{{steps.generate.outputs.result}}"
        - - name: "followup"
            template: snooze
    # Generate a list of numbers in JSON format
    - name: gen-number-list
      nodeSelector:
        intent: task-workers
      script:
        image: python:3.8.5-alpine3.12
        imagePullPolicy: IfNotPresent
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(0, {{workflow.parameters.step-count}})], sys.stdout)
    - name: snooze
      metrics:
        prometheus:
          - name: "resource_duration_cpu" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration CPU" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.cpu}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
          - name: "resource_duration_memory" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration Memory" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.memory}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      nodeSelector:
        intent: task-workers
      podSpecPatch: '{"containers":[{"name":"main", "resources":{"requests":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}"}, "limits":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}" }}}]}'
      container:
        image: alpine
        imagePullPolicy: IfNotPresent
        command: [sleep]
        args: ["{{workflow.parameters.sleep-seconds}}"]
#!/usr/bin/env bash
set -euo pipefail
while true; do
  argo submit \
    -n argo-workflows \
    --from workflowtemplate/sleep-fanout-test-template \
    -p step-count="100" \
    -p sleep-seconds="8" &>/dev/null
  echo -n "."
  sleep 10
done
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
- I'm assuming you've read #4560 and you're running :latest?
Indeed we have read the thread, and we followed all of the workarounds suggested there. We have been keeping up to date with the latest builds from master, yes.
- Do you see any controller crashes or restarts?
We haven't seen a controller crash on the most recent builds.
- How many nodes does each workflow have?
In our first test, we submitted three workflows per second, with two steps each.
In the second test, we submitted one 102-step workflow every 10 seconds.
- How many pods are running concurrently?
The highest number of pods we've seen run concurrently in our steady state tests has been around 200.
In those tests, the number of workflows in the Pending state accumulates, at which point we fall behind, and the controller never catches up with the incoming submitted workflows.
In the case of the few workflows with many steps each, the number of pods in the Completed state increases, and we observe "zombie" workflows stuck in the Running state.
What TZ are you in? Can we get on a Zoom in the next few days?
Your actions:
- Try :no-sig and :easyjson.
My actions:
- I'll check to see if we have a memory leak.
@alexec Thanks again for your time today. We're looking forward to getting to the bottom of this :)
Try using no parallelism.
We ran this test for about 30 mins using the script in Fig 2 (submits 3 wf/sec, two steps total, about 60-70 seconds expected total time per workflow). Workflows seemed to start and complete smoothly with a steady accumulation of "succeeded" workflows. No workflow GC ever appeared to run, which is unexpected and different from our prior tests. Workflow throughput dropped by 50% about 20 minutes into the run, which is what we've normally seen when the GC operation starts, but the number of succeeded workflows never decreased. Pending workflows hovered around 300 throughout the test. When we submitted a couple of workflows in the middle of the test using the CLI tool, it took around 3.5 minutes for the first step to run. We're not sure what to conclude here, but running without parallelism doesn't seem significantly different from our test this morning except for the lack of workflow GC.
Can you check the memory settings? Should not max out at 1GiB.
We think this was a red herring. In later tests, memory usage seemed to increase in proportion to the number of workflows that were being tracked, which seems reasonable.
Our current manifest reserves 1GiB of memory for the Argo controller but doesn't set a limit. The node has significant memory available, so we shouldn't be bumping into any physical or OS limits.
Separately, try using semaphores instead.
We tried using semaphores while parallelism was disabled and saw very strange behavior. About 12 workflows would run to completion at the start of the test, and then none would complete. The number of running workflows increased to 500 (the limit I had set in the configmap), and pending workflows increased until I ended the test. No pods were created, though. When we tried to submit a job in the middle of the test, we saw this:
Message: Waiting for argo-workflows/ConfigMap/semaphores/workflow lock. Lock status: 0/500
Only when we canceled the script and cleaned up the workflows (delete --all) did the locks seem to get released.
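For reference, the cleanup amounted to roughly the following, after stopping the submission script so new workflows didn't immediately re-acquire the semaphore:
argo -n argo-workflows delete --all   # deleting the workflows is what released the semaphore locks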
This is our configmap and the relevant part of the workflow spec, which is adapted from the synchronization-wf-level example:
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: argo-workflows
  name: semaphores
data:
  workflow: "500"

spec:
  ...
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphores
        key: workflow
Try :no-sig and :easyjson.
We'll give these a try tomorrow and will report back.
I'll check to see if we have a memory leak.
BTW, we ran pprof against the controller a few weeks ago and didn't see any interesting CPU hot spots (aside from processNextItem and JSON processing like you'd mentioned, which is understandable) or memory leaks. We're happy to run more tests against the latest controller if that would be useful.
Forgot to mention that we've been seeing a lot of "Deadline exceeded" warnings in our controller logs during recent runs (not just today). I've added a metric to our controller build to count them, so we should have hard numbers to share next time if that would be a useful signal.
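Even without the custom metric, a rough count of those warnings can be pulled straight from the controller logs; something like this, assuming the controller Deployment is named workflow-controller and lives in the argo namespace:
kubectl -n argo logs deploy/workflow-controller | grep -c "Deadline exceeded"   # count of warnings so far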
@acj
I've done some analysis and can't see any memory leaks. That ties with your analysis.
I've created a dev build intended to reduce the number of Kubernetes API requests the controller makes by limiting each workflow to one reconciliation every 15s (configurable). The details are in the linked PR. Would you like to try that out?
Okay, we have a few more test results. Short version: no significant improvement.
Using :ratel, we still saw oscillations in the key metrics after ~15 minutes and slowly fell behind. Queue depth, time in queue, and the running/pending workflow counts slowly increased. We had a fairly large backlog of completed pods (several hundred) near the end of the run, but there was no backlog of succeeded workflows. This run seemed mostly stable, but we still fell behind on workflow processing.
Using :easyjson (~20 days old now), we saw steep growth in the number of pending workflows. The running pod count hovered around 160, which isn't enough to keep pace with our incoming workflows. The "time in queue" metric increased much more slowly than in the first test, but it did slowly increase. After about 30 minutes, many of the metrics seemed to become unstable. A few hundred workflows were GC'd at roughly the same time, which is a pattern that we've seen fairly often. (Unsure of cause/effect here, but they happen together quite often.) After interrupting our WF submission script and letting the controller return to idle, there were still ~300 succeeded workflows and ~800 pending workflows according to the CLI. The prometheus metrics reported -3200 (not a typo) succeeded workflows. I suspect that many of these quirks have been fixed since this image was built, but they seemed worth sharing.
Using :no-sig (similarly old), we also saw steep growth in the queue depth and pending workflow count. No obvious change in behavior compared to :easyjson.
We also tried using a controller built from master (with our custom controller metrics added) with MAX_OPERATION_TIME=90s to see if it would resolve the "Deadline exceeded" warnings. It did seem to resolve the warnings, but we still slowly fell behind in workflow processing.
The close timing between succeeded workflows getting GC'd and performance becoming unstable seems interesting. Would it be worthwhile to try disabling GC or making the GC period very long to see if there's actually a connection? Anything else we should try?
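If we do try that, one low-effort option would be to patch the test template's TTL so succeeded workflows stick around instead of being deleted immediately; a sketch (the 86400 value is just illustrative):
kubectl -n argo-workflows patch workflowtemplate sleep-test-template \
  --type merge \
  -p '{"spec": {"ttlStrategy": {"secondsAfterSuccess": 86400}}}'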
large backlog of completed pods (several hundred) near the end of the run, but there was no backlog of succeeded workflows.
Do you mean you had zombies?
Completed and GC pods are bounded to 512 before blocking:
completedPods: make(chan string, 512),
gcPods: make(chan string, 512),
Do you mean you had zombies?
I don't think so, at least by the definition given in #4560. The workflows seemed to reliably complete and get GC'd, but we had a lot of completed pods hanging around (visible using kubectl) until the end of the test.
Completed and GC pods are bounded to 512 before blocking
We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full?
:ratel should reduce controller CPU notably
I didn't pay close attention to CPU usage yesterday, but I can confirm that it was about 6x lower compared to other recent runs. Nice!
Good news about the CPU. I think there are some design issues in pod GC (see #4693).
We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full?
When the channel gets full, it will block when new entries are added to it. This means that reconciliation will take longer.
I'm creating a new build with a configurable fix for #4693.
@acj I've created a new dev build. Can you run this firstly without any env var changes as a baseline? I expect you to see an improvement with default settings. Then can you try the new env vars as listed in PR please?
Thank you again.
Will do. We ran the latter test (today's :ratel + the new env vars related to GC and queuing) this afternoon and had the smoothest run so far. We still slowly fell behind (increasing time in queue, pending workflows, etc), but the workflow completion rate was very consistent throughout the hour-long run, which is great to see. We'll run it again with default settings tomorrow morning and then post our findings.
Do you think we should also include the zombie env vars and RECENTLY_STARTED_POD_DURATION? They didn't seem to apply to us, so we left them out.
Environment:
INLINE_POD_GC: true
COMPLETED_PODS_QUEUE: 2048
GC_PODS_QUEUE: 2048
LEADER_ELECTION_IDENTITY: workflow-controller-66d46fb787-gvzcm (v1:metadata.name)
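For reference, applying those env vars to the controller Deployment amounts to something like the following (assuming the controller runs in the argo namespace):
kubectl -n argo set env deployment/workflow-controller \
  INLINE_POD_GC=true COMPLETED_PODS_QUEUE=2048 GC_PODS_QUEUE=2048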
Do you think we should also include the zombie env vars and RECENTLY_STARTED_POD_DURATION?
I think it is best to test one thing at a time :)
Napkin math:
Can you try:
- DEFAULT_REQUEUE_TIME=30s
- DEFAULT_REQUEUE_TIME=60s
I think you should get more workflow throughput (i.e. the queue should not grow).
I think it is best to test one thing at a time :)
Completely agree. I was trying to get clarity on what you meant here:
Then can you try the new env vars as listed in PR please?
Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.
Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.
I've updated the PR description to make this clearer.
We see a pretty big difference between :ratel with the defaults vs. :ratel with INLINE_POD_GC, COMPLETED_PODS_QUEUE, and GC_PODS_QUEUE set to the values I mentioned above. With the defaults, there's steep growth in queue depth (especially for wf_ttl_queue), succeeded workflow count, and time in queue (again for wf_ttl_queue more than workflow_queue). CPU usage is also noticeably higher, though still much lower than what we were seeing with :latest. With the customized env var values, we see much slower queue and pending workflow growth (though still some). In both cases, workflow throughput is unsteady until about the 10-minute mark when it levels off. The k8s API request rate (per the new metric) was steady after the first couple of minutes. Still not able to keep up with the workflow queue.
The customized env vars from the previous test seemed to give us the best results, so we carried those into the next tests. Changing DEFAULT_REQUEUE_TIME to 30s and then 60s didn't seem to make a difference. The queue growth rate was slightly slower with requeue time set to 60s.
On a whim, we reran that last test (requeue time at 60s) with a modified script that submits 2 WF/sec instead of 3. The queue depth, time in queue, and pending workflow count all stayed near zero. It took around 30 seconds to submit a workflow (sleeps for 1s and exits) using the CLI and see it fully complete, which is a little sluggish, but much better than we saw in our pre-:ratel tests.
We also tried making the script submit 5 WF/sec. Notably, the k8s API request rate seemed to plateau at the same point whether we submitted 3/sec or 5/sec, which makes me think we're being throttled either by the API server or by the rate limiters in the controller (e.g. workqueue or similar). Maybe that's our next bottleneck?
Can I ask you to execute another test? INLINE_POD_GC=false instead of on. To back up this solution, you should see a metric named gcPod and it should top out at 2048 (or 512). That would be strong evidence of this being a change we should pursue.
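A quick way to watch that metric would be something like the following, assuming the controller's default metrics settings (port 9090, path /metrics) and the argo namespace:
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &
sleep 2
curl -s localhost:9090/metrics | grep -iE 'gc.?pod'   # watch whether it tops out at 512 or 2048
kill $!   # stop the port-forward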
@acj I think we're getting closer to a solution, so I've created a series of PRs, each with one fix (:ratel is stuffed with many fixes). This should make it easier to understand the impact of each change in isolation.
Assumptions:
Rejected hypothesis:
- Too many Kubernetes API requests (shown by k8s_request_total).
- … (:easyjson did not help).
- … (:no-sig did not help).
Accepted hypothesis:
- … (DEFAULT_REQUEUE_TIME)
I've created dev builds specifically to address these items:
- argoproj/workflow-controller:rate adds configurable rate-limiting (a short 2s by default)
- argoproj/workflow-controller:gc-queue has 4 workers (as opposed to 1) to GC pods.
Would you be able to get onto a Zoom early next week please?
Can I ask you to execute another test? INLINE_POD_GC=false instead of on.
Yep, we'll give this a try tomorrow
Rejected hypothesis:
- Too many Kubernetes API requests (shown by k8s_request_total).
Given the plateauing behavior we were seeing last week (bottom of my last comment), is it safe to reject this hypothesis? I'm still wondering about a possible bottleneck there.
Would you be able to get onto a Zoom early next week please?
Sure. We'll ping you in Slack once we get our schedule sorted.
We ran a test with INLINE_POD_GC set to true (fig. 1) and observed relatively steady processing of workflows for over an hour. It was the smoothest run thus far.
When we ran with INLINE_POD_GC: false (fig. 2), this was the outcome:
- Workflows in the Succeeded state accumulated and never cleared
Stack trace from the controller crash:
E1215 17:38:02.451933 1 leaderelection.go:307] Failed to release lock: Lease.coordination.k8s.io "workflow-controller-lease" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0
12:38:02.514 test-cluster-2 workflow-controller time="2020-12-15T17:38:02.514Z" level=info msg="stopped leading" id=workflow-controller-688877bc7c-cjqrr
12:38:02.527 test-cluster-2 workflow-controller panic: http: Server closed
12:38:02.527 test-cluster-2 workflow-controller app='workflow-controller' logfile='/containerd/workflow-controller.log' pod_uid='a50c89d2-97b2-41c8-ba8f-e3274fd7e682' monitor='agentKubernetes' message='' serverHost='scalyr-agent-2-rfrrn' pod_name='workflow-controller-688877bc7c-cjqrr' pod_namespace='argo' scalyr-category='log' parser='docker' stream='stderr' pod-template-hash='688877bc7c' k8s_node='ip-10-101-110-41.ec2.internal' raw_timestamp='1608053882527506081' container_id='04bbaed6801942e9a7a2aee2b8050d527882c876697fbb2d795184656a2d2027'
12:38:02.527 test-cluster-2 workflow-controller goroutine 708 [running]:
12:38:02.527 test-cluster-2 workflow-controller github.com/argoproj/argo/workflow/metrics.runServer.func1(0x1, 0x1aac957, 0x8, 0x2382, 0x0, 0x0, 0xc001072000)
12:38:02.527 test-cluster-2 workflow-controller /Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:53 +0x117
12:38:02.527 test-cluster-2 workflow-controller created by github.com/argoproj/argo/workflow/metrics.runServer
12:38:02.527 test-cluster-2 workflow-controller /Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:50 +0x246


Test results from using the :gc-queue image:
workflow_ttl_queue grew throughout and its behavior mirrored the "Succeeded" workflow count
I'm pretty sure we have a new bug in TTL.
Testing with :latest (argoproj/workflow-controller@sha256:4bccc84a407d275b6b7f0e5072341cdfd293fd098b8d2d10465ecb85c6265e49) - fig. 1:
Testing with :rate - fig. 2:
Testing with :latest (same image SHA as above), with --workflow-ttl-workers set to 16 instead of the default 4 - fig. 3:



I've been doing some exploratory testing and it is clear TTL often just does not happen. This is a functional bug, not in fact a scaling issue. I'll fix this and get back to you as I'm currently trying to improve the performance of something fundamentally broken.
Accidentally closed.
@tomgoren I've merged #4728 into master. I'd like you to test my bug fix for TTL, :fix-ttl (based on master).
I should note that I've run a test submitting 600 workflows at once on my MacBook. That peaked at 60 concurrent per second. I'm going to launch this on a test cluster tomorrow.
With :fix-ttl (still with --workflow-ttl-workers at 16) we observed the following:
workflow_ttl_queue and the Succeeded count stayed very low, but then began to climb again
As per our conversation, we are going to run a similar test, this time skipping Argo to verify that the cluster can sustain the desired capacity.
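For that Argo-free baseline, what we have in mind is roughly the following: create bare pods directly at a similar rate and see whether the scheduler, kubelets, and API server keep up on their own (names, rate, and namespace are illustrative, and the pods need cleaning up afterwards):
#!/usr/bin/env bash
# Submit roughly 10 short-lived pods per second without involving Argo.
while true; do
  for i in $(seq 1 10); do
    kubectl -n argo-workflows run "bare-sleep-$(date +%s)-$i-$RANDOM" \
      --image=alpine --restart=Never -- sleep 8 &
  done
  wait
  sleep 1
done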
Using the latest images and recommendations:
spec:
  containers:
  - args:
    - --configmap
    - workflow-controller-configmap
    - --executor-image
    - argoproj/argoexec:fix-ttl
    # --qps >= 2x pods created per second
    - --qps=1024
    - --pod-cleanup-workers=32
    - --workflow-workers=256
    - --workflow-ttl-workers=16
    command:
    - workflow-controller
    # WISTIA: pinned to `:fix-ttl` as of 2020-12-21
    image: argoproj/workflow-controller@sha256:847b5117cb83f02d9bef9d17b459e733f9761d279ce64edc012ebe3c7a634f38
    name: workflow-controller
    resources:
      requests:
        cpu: 7000m
        memory: 1G
    env:
    - name: LEADER_ELECTION_IDENTITY
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
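As a sanity check on the --qps comment above: the fan-out test creates roughly 102 pods every 10 seconds (about 10 pods/sec), so the "2x pods created per second" rule of thumb calls for --qps of at least ~20, and the steady-state test (3 workflows/sec at 2 pods each, about 6 pods/sec) for at least ~12, so 1024 leaves a wide margin over both.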
We observed the following:

We are seeing far more consistently running workflows, but now we are also seeing a lot of failed workflows. Unfortunately, the failed pods were reaped immediately, so we have only the following information:
❯ argo -n argo-workflows get sleep-test-template-tjh27
Name: sleep-test-template-tjh27
Namespace: argo-workflows
ServiceAccount: default
Status: Failed
Message: child 'sleep-test-template-tjh27-4134794983' failed
Conditions:
Completed True
Created: Tue Dec 22 14:19:04 -0500 (3 minutes ago)
Started: Tue Dec 22 14:19:04 -0500 (3 minutes ago)
Finished: Tue Dec 22 14:20:19 -0500 (2 minutes ago)
Duration: 1 minute 15 seconds
ResourcesDuration: 3m4s*(1 cpu),11m18s*(100Mi memory)
Parameters:
step-count: 1
sleep-seconds: 60
STEP TEMPLATE PODNAME DURATION MESSAGE
✖ sleep-test-template-tjh27 sleep child 'sleep-test-template-tjh27-4134794983' failed
├---✔ generate gen-number-list sleep-test-template-tjh27-2499020563 5s
└---✖ sleep(0:0) snooze sleep-test-template-tjh27-4134794983 1m failed with exit code 1
Thank you!
Good news! I have a branch named test. Can I ask you to try :test for both the executor and the controller? We'll need to understand why the pods failed. Is it pod deleted? Can you try:
DEFAULT_REQUEUE_TIME: 20s // defaults to 2s, but causes `pod deleted` under high load
MAX_OPERATION_TIME: 1m // defaults to 30s
Running on :test:

Adding the environment variables as recommended:

We'll likely try out v2.12.3 as well and report back.
Very similar results with 2.12.3:
