Argo: [Imp] Workflows becoming latent under high workloads (performance/scale runs) !!

Created on 30 May 2020 · 4 comments · Source: argoproj/argo

Background
We are trying to simulate CI using Argo workflows. We create our workflows as DAGs of tasks, with the number of templates varying by use case.

Our performance/scale tests indicate that workflow execution becomes significantly latent when we run many workflows at the same time. We would like to highlight some of our observations and hear your opinion on how to improve workflow execution performance, which is critical to scaling the platform.

What happened

  • Whenever we simultaneously run a large number of workflows (~300), we see a large increase in workflow execution duration.
  • Even when we run a small number of workflows simultaneously (~20), we see an increase in workflow execution duration.
  • We see better execution duration when we delete all the Completed/Failed/Succeeded workflows.

Our findings with respect to latency

  • We don't see latency in the main containers.
  • We see increasing latency in the wait containers and in the time a DAG takes to end after all of its tasks have completed (refer - Sheet Link).
  • The Completed/Failed/Succeeded workflows present in the namespace are somehow adding to the overall workflow duration.

Why is this a Blocker?
With every additional load (more workflows per second), the time to complete the workflows increases significantly. Since the workflow submission rate is constantly high, the load has grown further by the time a workflow is in the middle of its execution, so the second half takes even longer than the first half to complete. Eventually, a workflow that is expected to complete in 5 minutes takes more than 30 minutes, and it may even fail because we end up consuming all of our allocated resources in the namespace (the workflows are not releasing resources when they ideally should).

This problem affects overall platform stability and has implications for scaling the platform.

Note on Wait Container
The wait container makes a watch call (to the control plane) to identify whether the main container has started within the pod. So when there is a large number of wait containers, every watch call from the wait containers becomes latent.
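
For context, the pattern we are describing looks roughly like the client-go sketch below. This is a minimal illustration only, not the actual executor code; the namespace, pod name, and the "main" container name are placeholders.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// watchMainContainer waits, via an API-server watch, until the container
// named "main" in the given pod reports a ContainerID. Names here are
// placeholders for illustration.
func watchMainContainer(ctx context.Context, namespace, podName string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// One long-lived watch connection to the control plane per pod.
	w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + podName,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.Name == "main" && cs.ContainerID != "" {
				fmt.Println("main container started:", cs.ContainerID)
				return nil
			}
		}
	}
	return fmt.Errorf("watch closed before the main container started")
}

func main() {
	_ = watchMainContainer(context.Background(), "argo", "example-pod")
}
```

Every wait container holding a watch like this means one long-lived connection to the API server per running pod, which is why we suspect latency grows with the number of concurrent workflows.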

Clarifications

  • Why does the wait container always contact the control plane to know whether the main container has started?
  • Once the pods are scheduled, isn't it an anti-pattern to determine another container's status within the same pod via a control-plane watch call? Or is this expected? Are there alternatives for finding container status within a pod without taking a control-plane dependency?
  • During high workloads (many workflows per second), why do DAGs take extra time even after all of their tasks have completed? The wait containers and DAGs taking longer is only observed during high-workload runs.

Attachments
Here is the baseline for an example workflow

How to replicate the issue

  • This is observed with all the workflows we ran in scale testing, so any workflow run at scale should show the same behaviour. We submitted 1 workflow every 2 seconds for 1000 seconds until we reached 500 workflows (see the submission sketch below).
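
For reference, a minimal sketch of the submission loop we used, assuming the `argo` CLI is on the path; the workflow file name and namespace are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// Submit one workflow every 2 seconds until 500 workflows have been
// submitted, mirroring the load pattern described above.
// "workflow.yaml" and the "argo" namespace are placeholders.
func main() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for i := 1; i <= 500; i++ {
		<-ticker.C
		out, err := exec.Command("argo", "submit", "workflow.yaml", "-n", "argo").CombinedOutput()
		if err != nil {
			fmt.Printf("submit %d failed: %v\n%s\n", i, err, out)
			continue
		}
		fmt.Printf("submitted workflow %d\n", i)
	}
}
```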

All 4 comments

  1. Make sure you are running v2.8.1
  2. Limit your total number of workflows/pods as per this doc: https://github.com/argoproj/argo/blob/master/docs/cost-optimisation.md (see the sketch after this list)
  3. Try different executor: https://github.com/argoproj/argo/blob/master/docs/workflow-executors.md
  4. Try work-avoidance: https://github.com/argoproj/argo/blob/master/docs/work-avoidance.md
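
As a rough illustration of point 2 (a sketch only, using the v1alpha1 API types as we understand them; field names should be double-checked against the version in use), completed workflows and their pods can be garbage-collected so they do not accumulate in the namespace:

```go
package scaletest

import (
	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// buildSpec shows the two spec fields in question: a workflow TTL and pod GC.
// In a manifest these correspond to spec.ttlStrategy and spec.podGC.
func buildSpec() wfv1.WorkflowSpec {
	ttl := int32(300) // delete the Workflow object 5 minutes after it finishes
	return wfv1.WorkflowSpec{
		// Garbage-collect the Workflow object itself after completion.
		TTLStrategy: &wfv1.TTLStrategy{SecondsAfterCompletion: &ttl},
		// Delete pods as soon as they complete, so finished pods do not pile up.
		PodGC: &wfv1.PodGC{Strategy: wfv1.PodGCOnPodCompletion},
	}
}
```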

v2.9 will include better throughput for large numbers of concurrent workflows: #2921

Note on Wait Container
The wait container makes a watch call (to the control plane) to get the main container's ContainerID. So when there is a large number of wait containers, every watch call from the wait containers becomes latent.

@alexec, regarding the wait container calling the API server, we wanted to check with the Argo team on why this pattern was adopted. We believe that workload pods making calls to the control plane may be an anti-pattern. Could you explain more about this?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
