Pipeline: Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up

Created on 12 Nov 2020  ·  17 Comments  ·  Source: tektoncd/pipeline

Hi,

After upgrading Tekton Pipelines to v0.18.0, the reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:

[screenshot: tekton_workqueue_depth metric]

The controller log is full of repeated "pod not found" messages like the following.

{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:528","msg":"Reconcile error","commit":"8eaaeaa","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:528\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:514\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"taskrun/reconciler.go:294","msg":"Returned an error","commit":"8eaaeaa","knative.dev/traceid":"bd9fd972-9191-44b3-b040-028215d651d2","knative.dev/key":"site/honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgithub.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:513\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}

We do have a cleanup job running in the cluster that deletes the Pods of finished TaskRuns after some time. Before 0.18.0 this did not seem to be an issue for the controller.

Thanks,
Fabian

Additional Info

  • Kubernetes version: v1.18.9
  • Tekton Pipeline version: v0.18.0
kind/bug priority/important-soon

All 17 comments

Looks like this was eventually processed after 3 hours. Unfortunately, after every controller restart this repeats.

[screenshot: tekton_workqueue_depth metric, queue drained after ~3 hours]

This seems a pretty important issue to get sorted, setting the priority higher.

@Fabian-K can you please provide your environment details and YAMLs if they are shareable? We just fixed scaling issues with larger pipelines in PR #3524, and it has been cherry-picked into the 0.18 branch. Can you please run with those changes and see if you are still running into issues?

I notice you have uploaded a PipelineRun in another issue, #3126.

@pritidesai there are multiple different pipelines used in the cluster. All of them, however, are really simple (fewer than 5 tasks, only a few parameters). My understanding is that https://github.com/tektoncd/pipeline/pull/3524 is mostly relevant for pipelines with many more tasks.
Are images for https://github.com/tektoncd/pipeline/tree/release-v0.18.x already available somewhere? I could deploy and verify if that already fixes the issue.

Hey @Fabian-K, I am cutting 0.18.1, will update you once it's available.

@Fabian-K 0.18.1 is here, please give it a try 🙏

Unfortunately, 0.18.1 does not fix the issue :(

@Fabian-K to make sure I understand your pipeline: you have five independent tasks, all running in parallel and operating on a common git PipelineResource:

  1. assemble
  2. unittest
  3. kustomize
  4. push-docker-images
  5. push-jib-images

I ran a similar pipeline on my local cluster multiple times, and all five pods are created as expected. I am having a hard time reproducing this.

Can you please check whether the pods exist before running the cleanup job?

Hi @pritidesai, this issue is not related to #3126, sorry for the confusion caused by linking it. To reproduce it:

  • create a cluster and create some PipelineRuns in it. Any pipeline should work.
  • wait for all PipelineRuns to complete
  • delete the pods of the executed tasks, but keep the PipelineRuns and TaskRuns (see the sketch after this list)
  • restart the tekton pipelines controller
  • check the tekton_workqueue_depth metric: it will take some time to reach 0 (much longer than with versions before 0.18.0)
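
For the pod-deletion step, any mechanism works (kubectl delete included). As a rough illustration, a minimal client-go sketch that removes the pods of finished TaskRuns while leaving the TaskRun and PipelineRun objects in place could look like this (the tekton.dev/taskRun label selector and the in-cluster config are assumptions on my side, adjust them to your setup):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; use clientcmd with a kubeconfig otherwise.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pods created for TaskRuns carry a tekton.dev/taskRun label (assumed here;
	// adjust the selector or namespace to whatever your cleanup targets).
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "tekton.dev/taskRun",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Only delete pods of finished TaskRuns; the TaskRuns and PipelineRuns stay.
		if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
			continue
		}
		if err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Printf("failed to delete %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
}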

@Fabian-K I will try to reproduce this on my local cluster. In the meantime, it might help if you could:

  • turn the log level of the controller to debug
  • grab the logs during the time when no PipelineRuns are running and the pods have been deleted

To change the log level, you can run kubectl edit cm/config-logging -n tekton-pipelines and then change:

   # Common configuration for all knative codebase
   zap-logger-config: |
     {
-      "level": "info",
+      "level": "debug",
       "development": false,
       "sampling": {
         "initial": 100,
@@ -49,5 +49,5 @@ data:
     }

   # Log level overrides
-  loglevel.controller: "info"
+  loglevel.controller: "debug"
   loglevel.webhook: "info"

I created a fresh kind cluster, ran 17 pipelines with 40 tasks each, let everything finish, and then deleted all the completed pods.
I collected the work queue metrics for PipelineRun and TaskRun, starting during the pipeline run execution and continuing through the pod deletion:

$ while true; do curl -s http://localhost:9090/metrics | egrep '^tekton_work_queue_depth' | awk '{ print "'$(date +%F__%H:%M:%S)',"$1","$2 }' | sed 's/__/ /g' >> metrics.log; sleep 5; done

And plotted the result:

set xdata time
set timefmt '%Y-%m-%d %H:%M:%S'
set datafile separator ','
plot "<(grep taskrun metrics.log)" using 1:3 notitle with lines
replot "<(grep pipelinerun metrics.log)" using 1:3 notitle with lines

The resulting graph shows that the pipeline run controller is not affected by the pod deletion (as expected).
The taskrun graph spikes up because I ran all the deletions together and each deletion triggers a reconcile; the line, however, goes down really slowly, as described in this issue:

[graph: taskrun and pipelinerun work queue depth over time]

{"level":"info","ts":"2020-11-19T12:03:26.817Z","logger":"tekton.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"default\", Name:\"pipeline-40-run-c4zcp-t9-vt7wk\", UID:\"703f55a9-7c24-4c13-8d21-1e12af8d633f\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"24686\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' pods \"pipeline-40-run-c4zcp-t9-vt7wk-pod-tcnsq\" not found","commit":"473e3f3"}
{"level":"debug","ts":"2020-11-19T12:03:26.817Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:536","msg":"Requeuing key default/pipeline-40-run-c4zcp-t9-vt7wk due to non-permanent error (depth: 609)","commit":"473e3f3"}
{"level":"info","ts":"2020-11-19T12:03:26.818Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:515","msg":"Reconcile failed. Time taken: 300.5814ms","commit":"473e3f3","knative.dev/traceid":"3cd7b15c-3350-4f93-8365-53e865c7576a","knative.dev/key":"default/pipeline-40-run-c4zcp-t9-vt7wk"}

I believe the issue happens when trying to stop sidecars. If the pod is not found, we return a non-permanent (!) error: https://github.com/tektoncd/pipeline/blob/473e3f3cc74b215f8c22cb007fcd7413f93f3917/pkg/reconciler/taskrun/taskrun.go#L215-L217

Since at this point the TaskRun is marked as done, I think it is safe to assume that if the pod was not found we can just ignore the error and finish the reconcile. I will make a PR; it should be an easy fix.
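
Something along these lines (just a sketch of the idea, not the actual patch; the function name and wiring are illustrative, the real change belongs in the taskrun reconciler's sidecar-stopping path):

package sketch

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// stopSidecarsIfPodExists sketches the proposed behaviour: for a TaskRun that is
// already done, a missing pod (e.g. deleted by a cleanup job) means there are no
// sidecars left to stop, so we finish the reconcile instead of returning an error
// that the controller would requeue as non-permanent.
func stopSidecarsIfPodExists(ctx context.Context, kubeClient kubernetes.Interface, namespace, podName string) error {
	pod, err := kubeClient.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if k8serrors.IsNotFound(err) {
		// Pod was cleaned up externally; nothing to do, and no requeue needed.
		return nil
	}
	if err != nil {
		return fmt.Errorf("getting pod %q: %w", podName, err)
	}
	// ... existing logic that nops the sidecar containers in `pod` goes here ...
	_ = pod
	return nil
}

With the missing pod treated as a no-op, the reconcile of a done TaskRun completes on the first pass and the key is forgotten instead of being requeued with backoff.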

Graph before applying the patch: the queue went down in about 30 minutes.

[graph: work queue depth before the patch]

Controller restart to apply the patch:

[graph: work queue depth after the controller restart with the patch applied]

Thanks @Fabian-K for the further details, it was very helpful 👍

Thanks @afrittoli for the fix 🙏

Does this warrant 0.18.2?
