- master at commit 5c538d7a918e41029d3911a92c6ac615f04d3b80
- parallelism: 800, otherwise we observed the EKS control plane becoming unresponsive
- containerRuntimeExecutor: kubelet on AWS Bottlerocket instances
We monitored the two controller work queues (wfc.wfQueue and wfc.podQueue in controller/controller.go). The workflow queue oscillates between 1000 and 1500 items during our test. However, the pod queue consistently stays at 0.
Even after all of its pods are Completed, the workflow lingers in the Running state (fig. 5).
In trying to address these issues, we changed the values of the following parameters without much success:
- pod-workers
- workflow-workers (the default of 32 was a bottleneck, but anything over 128 didn't make a difference)
- INFORMER_WRITE_BACK=false
- --qps, --burst
- workflowResyncPeriod and podResyncPeriod
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep-test-template
  generateName: sleep-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 600
  podGC:
    strategy: OnPodCompletion
  arguments:
    parameters:
      - name: friendly-name
        value: sleep_test # Use underscores, not hyphens
      - name: cpu-limit
        value: 2000m
      - name: mem-limit
        value: 1024Mi
      - name: step-count
        value: "200"
      - name: sleep-seconds
        value: "8"
  metrics:
    prometheus:
      - name: "workflow_duration" # Metric name (will be prepended with "argo_workflows_")
        help: "Duration gauge by name" # A help doc describing your metric. This is required.
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
        gauge: # The metric type. Available are "gauge", "histogram", and "counter".
          value: "{{workflow.duration}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      - name: "workflow_processed"
        help: "Workflow processed count"
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
          - key: status
            value: "{{workflow.status}}"
        counter:
          value: "1"
  templates:
    - name: sleep
      nodeSelector:
        intent: task-workers
      steps:
        - - name: generate
            template: gen-number-list
        - - name: "sleep"
            template: snooze
            arguments:
              parameters: [{name: input_asset, value: "{{workflow.parameters.sleep-seconds}}", id: "{{item}}"}]
            withParam: "{{steps.generate.outputs.result}}"
    # Generate a list of numbers in JSON format
    - name: gen-number-list
      nodeSelector:
        intent: task-workers
      script:
        image: python:3.8.5-alpine3.12
        imagePullPolicy: IfNotPresent
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(0, {{workflow.parameters.step-count}})], sys.stdout)
    - name: snooze
      metrics:
        prometheus:
          - name: "resource_duration_cpu" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration CPU" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.cpu}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
          - name: "resource_duration_memory" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration Memory" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.memory}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      nodeSelector:
        intent: task-workers
      inputs:
        parameters:
          - name: input_asset
      podSpecPatch: '{"containers":[{"name":"main", "resources":{"requests":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}"}, "limits":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}" }}}]}'
      container:
        image: alpine
        imagePullPolicy: IfNotPresent
        command: [sleep]
        args: ["{{workflow.parameters.sleep-seconds}}"]
#!/usr/bin/env bash
set -euo pipefail
while true; do
  for i in {1..3}; do
    argo submit \
      -n argo-workflows \
      --from workflowtemplate/sleep-test-template \
      -p step-count="1" \
      -p sleep-seconds="60" &>/dev/null &
  done
  sleep 1
  echo -n "."
done
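Alongside the Prometheus dashboards, a crude way to watch the backlog this script creates is to count workflows by phase from the CLI; for example (column positions may differ between argo CLI versions):
argo -n argo-workflows list | awk 'NR > 1 { print $2 }' | sort | uniq -c   # counts per phase (Pending/Running/Succeeded/...)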

❯ argo -n argo-workflows get sleep-fanout-test-template-6dtjp
Name: sleep-fanout-test-template-6dtjp
Namespace: argo-workflows
ServiceAccount: default
Status: Running
Created: Wed Dec 02 15:39:59 -0500 (6 minutes ago)
Started: Wed Dec 02 15:39:59 -0500 (6 minutes ago)
Duration: 6 minutes 21 seconds
ResourcesDuration: 42m21s*(1 cpu),2h30m41s*(100Mi memory)
Parameters:
step-count: 100
sleep-seconds: 8
STEP TEMPLATE PODNAME DURATION MESSAGE
● sleep-fanout-test-template-6dtjp sleep
├---✔ generate gen-number-list sleep-fanout-test-template-6dtjp-2151903814 7s
├-·-✔ sleep(0:0) snooze sleep-fanout-test-template-6dtjp-1189074090 14s
| ├-✔ sleep(1:1) snooze sleep-fanout-test-template-6dtjp-1828931302 25s
...
| └-✔ sleep(99:99) snooze sleep-fanout-test-template-6dtjp-1049774502 16s
└---◷ followup snooze sleep-fanout-test-template-6dtjp-1490893639 5m
❯ kubectl -n argo-workflows get pod/sleep-fanout-test-template-6dtjp-1490893639
NAME READY STATUS RESTARTS AGE
sleep-fanout-test-template-6dtjp-1490893639 0/2 Completed 0 5m43s
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep-fanout-test-template
  generateName: sleep-fanout-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 600
  podGC:
    strategy: OnPodCompletion
  arguments:
    parameters:
      - name: friendly-name
        value: sleep_fanout_test # Use underscores, not hyphens
      - name: cpu-limit
        value: 2000m
      - name: mem-limit
        value: 1024Mi
      - name: step-count
        value: "200"
      - name: sleep-seconds
        value: "8"
  metrics:
    prometheus:
      - name: "workflow_duration" # Metric name (will be prepended with "argo_workflows_")
        help: "Duration gauge by name" # A help doc describing your metric. This is required.
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
        gauge: # The metric type. Available are "gauge", "histogram", and "counter".
          value: "{{workflow.duration}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      - name: "workflow_processed"
        help: "Workflow processed count"
        labels:
          - key: workflow_template
            value: "{{workflow.parameters.friendly-name}}"
          - key: status
            value: "{{workflow.status}}"
        counter:
          value: "1"
  templates:
    - name: sleep
      nodeSelector:
        intent: task-workers
      steps:
        - - name: generate
            template: gen-number-list
        - - name: "sleep"
            template: snooze
            withParam: "{{steps.generate.outputs.result}}"
        - - name: "followup"
            template: snooze
    # Generate a list of numbers in JSON format
    - name: gen-number-list
      nodeSelector:
        intent: task-workers
      script:
        image: python:3.8.5-alpine3.12
        imagePullPolicy: IfNotPresent
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(0, {{workflow.parameters.step-count}})], sys.stdout)
    - name: snooze
      metrics:
        prometheus:
          - name: "resource_duration_cpu" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration CPU" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.cpu}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
          - name: "resource_duration_memory" # Metric name (will be prepended with "argo_workflows_")
            help: "Resource Duration Memory" # A help doc describing your metric. This is required.
            labels:
              - key: workflow_template
                value: "{{workflow.parameters.friendly-name}}"
            gauge: # The metric type. Available are "gauge", "histogram", and "counter".
              value: "{{resourcesDuration.memory}}" # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      nodeSelector:
        intent: task-workers
      podSpecPatch: '{"containers":[{"name":"main", "resources":{"requests":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}"}, "limits":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}" }}}]}'
      container:
        image: alpine
        imagePullPolicy: IfNotPresent
        command: [sleep]
        args: ["{{workflow.parameters.sleep-seconds}}"]
#!/usr/bin/env bash
set -euo pipefail
while true; do
  argo submit \
    -n argo-workflows \
    --from workflowtemplate/sleep-fanout-test-template \
    -p step-count="100" \
    -p sleep-seconds="8" &>/dev/null
  echo -n "."
  sleep 10
done
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
- I'm assuming you've read #4560 and you're running :latest?
Indeed we have read the thread, and we followed all of the workarounds suggested there. We have been keeping up to date with the latest builds from master, yes.
- Do you see any controller crashes or restarts?
We haven't seen a controller crash on the most recent builds.
- How many nodes does each workflow have?
In our first test, we submitted three workflows per second, with two steps each.
In the second test, we submitted one 102-step workflow every 10 seconds.
- How many pods are running concurrently?
The highest number of pods we've seen run concurrently in our steady state tests has been around 200.
In those tests, the number of workflows in the Pending state accumulates, at which point we fall behind, and the controller never catches up with the incoming submitted workflows.
In the case of the few workflows with many steps each, the number of pods in the Completed state increases, and we observe "zombie" workflows stuck in the Running state.
What TZ are you in? Can we get on a Zoom in the next few days?
Your actions:
- Try :no-sig and :easyjson.
My actions:
- I'll check to see if we have a memory leak.
@alexec Thanks again for your time today. We're looking forward to getting to the bottom of this :)
Try using no parallelism.
We ran this test for about 30 mins using the script in Fig 2 (submits 3 wf/sec, two steps total, about 60-70 seconds expected total time per workflow). Workflows seemed to start and complete smoothly with a steady accumulation of "succeeded" workflows. No workflow GC ever appeared to run, which is unexpected and different from our prior tests. Workflow throughput dropped by 50% about 20 minutes into the run, which is what we've normally seen when the GC operation starts, but the number of succeeded workflows never decreased. Pending workflows hovered around 300 throughout the test. When we submitted a couple of workflows in the middle of the test using the CLI tool, it took around 3.5 minutes for the first step to run. We're not sure what to conclude here, but running without parallelism doesn't seem significantly different from our test this morning except for the lack of workflow GC.
Can you check the memory settings? Should not max out at 1GiB.
We think this was a red herring. In later tests, memory usage seemed to increase in proportion to the number of workflows that were being tracked, which seems reasonable.
Our current manifest reserves 1GiB of memory for the Argo controller but doesn't set a limit. The node has significant memory available, so we shouldn't be bumping into any physical or OS limits.
Separately, try using semaphores instead.
We tried using semaphores while parallelism was disabled and saw very strange behavior. About 12 workflows would run to completion at the start of the test, and then none would complete. The number of running workflows increased to 500 (the limit I had set in the configmap), and pending workflows increased until I ended the test. No pods were created, though. When we tried to submit a job in the middle of the test, we saw this:
Message: Waiting for argo-workflows/ConfigMap/semaphores/workflow lock. Lock status: 0/500
Only when we canceled the script and cleaned up the workflows (delete --all) did the locks seem to get released.
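For reference, the cleanup amounted to roughly the following, after stopping the submission script so new workflows didn't immediately re-acquire the semaphore:
argo -n argo-workflows delete --all   # deleting the workflows is what released the semaphore locks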
This is our configmap and the relevant part of the workflow spec, which is adapted from the synchronization-wf-level example:
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: argo-workflows
  name: semaphores
data:
  workflow: "500"

spec:
  ...
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphores
        key: workflow
Try :no-sig and :easyjson.
We'll give these a try tomorrow and will report back.
I'll check to see if we have a memory leak.
BTW, we ran pprof against the controller a few weeks ago and didn't see any interesting CPU hot spots (aside from processNextItem and JSON processing like you'd mentioned, which is understandable) or memory leaks. We're happy to run more tests against the latest controller if that would be useful.
Forgot to mention that we've been seeing a lot of "Deadline exceeded" warnings in our controller logs during recent runs (not just today). I've added a metric to our controller build to count them, so we should have hard numbers to share next time if that would be a useful signal.
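Even without the custom metric, a rough count of those warnings can be pulled straight from the controller logs; something like this, assuming the controller Deployment is named workflow-controller and lives in the argo namespace:
kubectl -n argo logs deploy/workflow-controller | grep -c "Deadline exceeded"   # count of warnings so far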
@acj
I've done some analysis and can't see any memory leaks. That ties with your analysis.
I've created a dev build intended to reduce the number of Kubernetes API requests the controller makes by limiting each workflow to one reconciliation every 15s (configurable). The details are in the linked PR. Would you like to try that out?
Okay, we have a few more test results. Short version: no significant improvement.
Using :ratel, we still saw oscillations in the key metrics after ~15 minutes and slowly fell behind. Queue depth, time in queue, and the running/pending workflow counts slowly increased. We had a fairly large backlog of completed pods (several hundred) near the end of the run, but there was no backlog of succeeded workflows. This run seemed mostly stable, but we still fell behind on workflow processing.
Using :easyjson (~20 days old now), we saw steep growth in the number of pending workflows. The running pod count hovered around 160, which isn't enough to keep pace with our incoming workflows. The "time in queue" metric increased much more slowly than in the first test, but it did slowly increase. After about 30 minutes, many of the metrics seemed to become unstable. A few hundred workflows were GC'd at roughly the same time, which is a pattern that we've seen fairly often. (Unsure of cause/effect here, but they happen together quite often.) After interrupting our WF submission script and letting the controller return to idle, there were still ~300 succeeded workflows and ~800 pending workflows according to the CLI. The prometheus metrics reported -3200 (not a typo) succeeded workflows. I suspect that many of these quirks have been fixed since this image was built, but they seemed worth sharing.
Using :no-sig (similarly old), we also saw steep growth in the queue depth and pending workflow count. No obvious change in behavior compared to :easyjson.
We also tried using a controller built from master (with our custom controller metrics added) with MAX_OPERATION_TIME=90s to see if it would resolve the "Deadline exceeded" warnings. It did seem to resolve the warnings, but we still slowly fell behind in workflow processing.
The close timing between succeeded workflows getting GC'd and performance becoming unstable seems interesting. Would it be worthwhile to try disabling GC or making the GC period very long to see if there's actually a connection? Anything else we should try?
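If we do try that, one low-effort option would be to patch the test template's TTL so succeeded workflows stick around instead of being deleted immediately; a sketch (the 86400 value is just illustrative):
kubectl -n argo-workflows patch workflowtemplate sleep-test-template \
  --type merge \
  -p '{"spec": {"ttlStrategy": {"secondsAfterSuccess": 86400}}}'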
large backlog of completed pods (several hundred) near the end of the run, but there was no backlog of succeeded workflows.
Do you mean you had zombies?
Completed and GC pods are bounded to 512 before blocking:
completedPods: make(chan string, 512),
gcPods: make(chan string, 512),
Do you mean you had zombies?
I don't think so, at least by the definition given in #4560. The workflows seemed to reliably complete and get GC'd, but we had a lot of completed pods hanging around (visible using kubectl) until the end of the test.
Completed and GC pods are bounded to 512 before blocking
We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full?
:ratel should reduce controller CPU notably
I didn't pay close attention to CPU usage yesterday, but I can confirm that it was about 6x lower compared to other recent runs. Nice!
Good news about the CPU. I think there are some design issues in pod GC (see #4693).
We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full?
When the channel gets full, it will block when new entries are added to it. This means that reconciliation will take longer.
I'm creating a new build with a configurable fix for #4693.
@acj I've created a new dev build. Can you run this firstly without any env var changes as a baseline? I expect you to see an improvement with default settings. Then can you try the new env vars as listed in PR please?
Thank you again.
Will do. We ran the latter test (today's :ratel + the new env vars related to GC and queuing) this afternoon and had the smoothest run so far. We still slowly fell behind (increasing time in queue, pending workflows, etc), but the workflow completion rate was very consistent throughout the hour-long run, which is great to see. We'll run it again with default settings tomorrow morning and then post our findings.
Do you think we should also include the zombie env vars and RECENTLY_STARTED_POD_DURATION? They didn't seem to apply to us, so we left them out.
Environment:
INLINE_POD_GC: true
COMPLETED_PODS_QUEUE: 2048
GC_PODS_QUEUE: 2048
LEADER_ELECTION_IDENTITY: workflow-controller-66d46fb787-gvzcm (v1:metadata.name)
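For reference, applying those env vars to the controller Deployment amounts to something like the following (assuming the controller runs in the argo namespace):
kubectl -n argo set env deployment/workflow-controller \
  INLINE_POD_GC=true COMPLETED_PODS_QUEUE=2048 GC_PODS_QUEUE=2048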
Do you think we should also include the zombie env vars and RECENTLY_STARTED_POD_DURATION?
I think it is best to test one thing at a time :)
Napkin math:
Can you try:
- DEFAULT_REQUEUE_TIME=30s
- DEFAULT_REQUEUE_TIME=60s
I think you should get more workflow throughput (i.e. the queue should not grow).
I think it is best to test one thing at a time :)
Completely agree. I was trying to get clarity on what you meant here:
Then can you try the new env vars as listed in PR please?
Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.
Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.
I've updated the PR description to make this clearer.
We see a pretty big difference between :ratel with the defaults vs. :ratel with INLINE_POD_GC, COMPLETED_PODS_QUEUE, and GC_PODS_QUEUE set to the values I mentioned above. With the defaults, there's steep growth in queue depth (especially for wf_ttl_queue), succeeded workflow count, and time in queue (again for wf_ttl_queue more than workflow_queue). CPU usage is also noticeably higher, though still much lower than what we were seeing with :latest. With the customized env var values, we see much slower queue and pending workflow growth (though still some). In both cases, workflow throughput is unsteady until about the 10-minute mark when it levels off. The k8s API request rate (per the new metric) was steady after the first couple of minutes. Still not able to keep up with the workflow queue.
The customized env vars from the previous test seemed to give us the best results, so we carried those into the next tests. Changing DEFAULT_REQUEUE_TIME to 30s and then 60s didn't seem to make a difference. The queue growth rate was slightly slower with requeue time set to 60s.
On a whim, we reran that last test (requeue time at 60s) with a modified script that submits 2 WF/sec instead of 3. The queue depth, time in queue, and pending workflow count all stayed near zero. It took around 30 seconds to submit a workflow (sleeps for 1s and exits) using the CLI and see it fully complete, which is a little sluggish, but much better than we saw in our pre-:ratel tests.
We also tried making the script submit 5 WF/sec. Notably, the k8s API request rate seemed to plateau at the same point whether we submitted 3/sec or 5/sec, which makes me think we're being throttled either by the API server or by the rate limiters in the controller (e.g. workqueue or similar). Maybe that's our next bottleneck?
Can I ask you to execute another test? INLINE_POD_GC=false instead of on. To back up this solution, you should see a metric named gcPod and it should top out at 2048 (or 512). That would be strong evidence of this being a change we should pursue.
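A quick way to watch that metric would be something like the following, assuming the controller's default metrics settings (port 9090, path /metrics) and the argo namespace:
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &
sleep 2
curl -s localhost:9090/metrics | grep -iE 'gc.?pod'   # watch whether it tops out at 512 or 2048
kill $!   # stop the port-forward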
@acj I think we're getting closer to a solution, so I've created a series of PRs, each with one fix (:ratel is stuffed with many fixes). This should make it easier to understand the impact of each change in isolation.
Assumptions:
Rejected hypothesis:
- Too many Kubernetes API requests (shown by k8s_request_total).
- … (:easyjson did not help).
- … (:no-sig did not help).
Accepted hypothesis:
- … (DEFAULT_REQUEUE_TIME)
I've created dev builds specifically to address these items:
- argoproj/workflow-controller:rate adds configurable rate-limiting (a short 2s by default)
- argoproj/workflow-controller:gc-queue has 4 workers (as opposed to 1) to GC pods.
Would you be able to get onto a Zoom early next week please?
Can I ask you to execute another test? INLINE_POD_GC=false instead of on.
Yep, we'll give this a try tomorrow
Rejected hypothesis:
- Too many Kubernetes API requests (shown by k8s_request_total).
Given the plateauing behavior we were seeing last week (bottom of my last comment), is it safe to reject this hypothesis? I'm still wondering about a possible bottleneck there.
Would you be able to get onto a Zoom early next week please?
Sure. We'll ping you in Slack once we get our schedule sorted.
We ran a test with INLINE_POD_GC set to true (fig. 1) and observed relatively steady processing of workflows for over an hour. It was the smoothest run thus far.
When we ran with INLINE_POD_GC: false (fig. 2), this was the outcome:
- Workflows in the Succeeded state accumulated and never cleared
Stack trace from the controller crash:
E1215 17:38:02.451933 1 leaderelection.go:307] Failed to release lock: Lease.coordination.k8s.io "workflow-controller-lease" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0
12:38:02.514 test-cluster-2 workflow-controller time="2020-12-15T17:38:02.514Z" level=info msg="stopped leading" id=workflow-controller-688877bc7c-cjqrr
12:38:02.527 test-cluster-2 workflow-controller panic: http: Server closed
12:38:02.527 test-cluster-2 workflow-controller app='workflow-controller' logfile='/containerd/workflow-controller.log' pod_uid='a50c89d2-97b2-41c8-ba8f-e3274fd7e682' monitor='agentKubernetes' message='' serverHost='scalyr-agent-2-rfrrn' pod_name='workflow-controller-688877bc7c-cjqrr' pod_namespace='argo' scalyr-category='log' parser='docker' stream='stderr' pod-template-hash='688877bc7c' k8s_node='ip-10-101-110-41.ec2.internal' raw_timestamp='1608053882527506081' container_id='04bbaed6801942e9a7a2aee2b8050d527882c876697fbb2d795184656a2d2027'
12:38:02.527 test-cluster-2 workflow-controller goroutine 708 [running]:
12:38:02.527 test-cluster-2 workflow-controller github.com/argoproj/argo/workflow/metrics.runServer.func1(0x1, 0x1aac957, 0x8, 0x2382, 0x0, 0x0, 0xc001072000)
12:38:02.527 test-cluster-2 workflow-controller /Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:53 +0x117
12:38:02.527 test-cluster-2 workflow-controller created by github.com/argoproj/argo/workflow/metrics.runServer
12:38:02.527 test-cluster-2 workflow-controller /Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:50 +0x246


Test results from using the :gc-queue image:
workflow_ttl_queue grew throughout and its behavior mirrored the "Succeeded" workflow count
I'm pretty sure we have a new bug in TTL.
Testing with :latest (argoproj/workflow-controller@sha256:4bccc84a407d275b6b7f0e5072341cdfd293fd098b8d2d10465ecb85c6265e49) - fig. 1:
Testing with :rate - fig. 2:
Testing with :latest (same image SHA as above), with --workflow-ttl-workers set to 16 instead of the default 4 - fig. 3:



I've been doing some exploratory testing and it is clear TTL often just does not happen. This is a functional bug, not in fact a scaling issue. I'll fix this and get back to you as I'm currently trying to improve the performance of something fundamentally broken.
Accidentally closed.
@tomgoren I've merged #4728 into master. I'd like you to test my bug fix for TTL, :fix-ttl (based on master).
I should note that I've run a test submitting 600 workflows at once on my MacBook. That peaked at 60 concurrent per second. I'm going to launch this on a test cluster tomorrow.
With :fix-ttl (still with --workflow-ttl-workers at 16) we observed the following:
workflow_ttl_queue and the Succeeded count stayed very low, but then began to climb again
As per our conversation, we are going to run a similar test, this time skipping Argo to verify that the cluster can sustain the desired capacity.
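For that Argo-free baseline, what we have in mind is roughly the following: create bare pods directly at a similar rate and see whether the scheduler, kubelets, and API server keep up on their own (names, rate, and namespace are illustrative, and the pods need cleaning up afterwards):
#!/usr/bin/env bash
# Submit roughly 10 short-lived pods per second without involving Argo.
while true; do
  for i in $(seq 1 10); do
    kubectl -n argo-workflows run "bare-sleep-$(date +%s)-$i-$RANDOM" \
      --image=alpine --restart=Never -- sleep 8 &
  done
  wait
  sleep 1
done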
Using the latest images and recommendations:
spec:
  containers:
  - args:
    - --configmap
    - workflow-controller-configmap
    - --executor-image
    - argoproj/argoexec:fix-ttl
    # --qps >= 2x pods created per second
    - --qps=1024
    - --pod-cleanup-workers=32
    - --workflow-workers=256
    - --workflow-ttl-workers=16
    command:
    - workflow-controller
    # WISTIA: pinned to `:fix-ttl` as of 2020-12-21
    image: argoproj/workflow-controller@sha256:847b5117cb83f02d9bef9d17b459e733f9761d279ce64edc012ebe3c7a634f38
    name: workflow-controller
    resources:
      requests:
        cpu: 7000m
        memory: 1G
    env:
    - name: LEADER_ELECTION_IDENTITY
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
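As a sanity check on the --qps comment above: the fan-out test creates roughly 102 pods every 10 seconds (about 10 pods/sec), so the "2x pods created per second" rule of thumb calls for --qps of at least ~20, and the steady-state test (3 workflows/sec at 2 pods each, about 6 pods/sec) for at least ~12, so 1024 leaves a wide margin over both.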
We observed the following:

We are seeing far more consistently running workflows, but now we are also seeing a lot of failed workflows. Unfortunately, the failed pods were reaped immediately, so we have only the following information:
❯ argo -n argo-workflows get sleep-test-template-tjh27
Name: sleep-test-template-tjh27
Namespace: argo-workflows
ServiceAccount: default
Status: Failed
Message: child 'sleep-test-template-tjh27-4134794983' failed
Conditions:
Completed True
Created: Tue Dec 22 14:19:04 -0500 (3 minutes ago)
Started: Tue Dec 22 14:19:04 -0500 (3 minutes ago)
Finished: Tue Dec 22 14:20:19 -0500 (2 minutes ago)
Duration: 1 minute 15 seconds
ResourcesDuration: 3m4s*(1 cpu),11m18s*(100Mi memory)
Parameters:
step-count: 1
sleep-seconds: 60
STEP TEMPLATE PODNAME DURATION MESSAGE
✖ sleep-test-template-tjh27 sleep child 'sleep-test-template-tjh27-4134794983' failed
├---✔ generate gen-number-list sleep-test-template-tjh27-2499020563 5s
└---✖ sleep(0:0) snooze sleep-test-template-tjh27-4134794983 1m failed with exit code 1
Thank you!
Good news! I have a branch named test. Can I ask you to try :test for both the executor and the controller? We'll need to understand why the pods failed. Is it pod deleted? Can you try:
DEFAULT_REQUEUE_TIME: 20s // defaults to 2s, but causes `pod deleted` under high load
MAX_OPERATION_TIME: 1m // defaults to 30s
Running on :test:

Adding the environment variables as recommended:

We'll likely try out v2.12.3 as well and report back.
Very similar results with 2.12.3:
