Pipeline: Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up

Created on 12 Nov 2020  ·  17 Comments  ·  Source: tektoncd/pipeline

Hi,

After upgrading Tekton Pipelines to v0.18.0, the reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:

[screenshot: tekton_workqueue_depth metric]

The controller log is full of repeated "pod not found" messages like the following.

{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:528","msg":"Reconcile error","commit":"8eaaeaa","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:528\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:514\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"taskrun/reconciler.go:294","msg":"Returned an error","commit":"8eaaeaa","knative.dev/traceid":"bd9fd972-9191-44b3-b040-028215d651d2","knative.dev/key":"site/honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgithub.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:513\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}

We do have a cleanup job running in the cluster that deletes the Pods of finished TaskRuns after some time. Before 0.18.0 this did not seem to be an issue for the controller.

Thanks,
Fabian

Additional Info

  • Kubernetes version: v1.18.9
  • Tekton Pipeline version: v0.18.0
kind/bug priority/important-soon

All 17 comments

Looks like this was eventually processed after 3 hours. Unfortunately, after every controller restart this repeats.

[screenshot: tekton_workqueue_depth metric, queue drained after ~3 hours]

This seems a pretty important issue to get sorted, setting the priority higher.

@Fabian-K can you please provide your environment details and YAMLs if they are shareable? We just fixed scaling issues with larger pipelines in PR #3524, and it has been cherry-picked into the 0.18 branch. Can you please run with those changes and see if you are still running into issues?

I notice you have uploaded a PipelineRun in another issue, #3126.

@pritidesai there are multiple different pipelines used in the cluster. All of them, however, are really simple (fewer than 5 tasks, only a few parameters). My understanding is that https://github.com/tektoncd/pipeline/pull/3524 is mostly relevant for pipelines with many more tasks.
Are images for https://github.com/tektoncd/pipeline/tree/release-v0.18.x already available somewhere? I could deploy and verify if that already fixes the issue.

Hey @Fabian-K, I am cutting 0.18.1, will update you once it's available.

@Fabian-K 0.18.1 is here, please give it a try 🙏

Unfortunately, 0.18.1 does not fix the issue :(

@Fabian-K to make sure I understand your pipeline: you have five independent tasks, all running in parallel and operating on a common git PipelineResource:

  1. assemble
  2. unittest
  3. kustomize
  4. push-docker-images
  5. push-jib-images

I ran a similar pipeline on my local cluster multiple times, and all five pods are created as expected. I am having a hard time reproducing this.

Can you please check whether the pods exist before running the cleanup job?

Hi @pritidesai, this issue is not related to #3126, sorry for the confusion caused by linking it. To reproduce it:

  • create a cluster and create some PipelineRuns in it. Any pipeline should work.
  • wait for all PipelineRuns to complete
  • delete the pods of the executed tasks, but keep the PipelineRuns and TaskRuns (see the sketch after this list)
  • restart the tekton pipelines controller
  • check the tekton_workqueue_depth metric: it will take some time to reach 0 (much longer than with versions before 0.18.0)
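
For the pod-deletion step, any mechanism works (kubectl delete included). As a rough illustration, a minimal client-go sketch that removes the pods of finished TaskRuns while leaving the TaskRun and PipelineRun objects in place could look like this (the tekton.dev/taskRun label selector and the in-cluster config are assumptions on my side, adjust them to your setup):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; use clientcmd with a kubeconfig otherwise.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pods created for TaskRuns carry a tekton.dev/taskRun label (assumed here;
	// adjust the selector or namespace to whatever your cleanup targets).
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "tekton.dev/taskRun",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Only delete pods of finished TaskRuns; the TaskRuns and PipelineRuns stay.
		if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
			continue
		}
		if err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Printf("failed to delete %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
}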

@Fabian-K I will try to reproduce this on my local cluster. In the meantime, it might help if you could:

  • turn the log level of the controller to debug
  • grab the logs during the time when no PipelineRuns are running and the pods have been deleted

To change the log level, you can run kubectl edit cm/config-logging -n tekton-pipelines and then change:

   # Common configuration for all knative codebase
   zap-logger-config: |
     {
-      "level": "info",
+      "level": "debug",
       "development": false,
       "sampling": {
         "initial": 100,
@@ -49,5 +49,5 @@ data:
     }

   # Log level overrides
-  loglevel.controller: "info"
+  loglevel.controller: "debug"
   loglevel.webhook: "info"

I created a fresh kind cluster, ran 17 pipelines with 40 tasks each, let everything finish, and then deleted all the completed pods.
I collected the work queue metrics for PipelineRun and TaskRun, starting during the pipeline run execution and continuing through the pod deletion:

$ while true; do curl -s http://localhost:9090/metrics | egrep '^tekton_work_queue_depth' | awk '{ print "'$(date +%F__%H:%M:%S)',"$1","$2 }' | sed 's/__/ /g' >> metrics.log; sleep 5; done

And plotted the result:

set xdata time
set timefmt '%Y-%m-%d %H:%M:%S'
set datafile separator ','
plot "<(grep taskrun metrics.log)" using 1:3 notitle with lines
replot "<(grep pipelinerun metrics.log)" using 1:3 notitle with lines

The resulting graph shows that the pipeline run controller is not affected by the pod deletion (as expected).
The taskrun graph spikes up because I ran all the deletions together and each deletion triggers a reconcile; the line, however, goes down really slowly, as described in this issue:

[graph: taskrun and pipelinerun work queue depth over time]

{"level":"info","ts":"2020-11-19T12:03:26.817Z","logger":"tekton.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"default\", Name:\"pipeline-40-run-c4zcp-t9-vt7wk\", UID:\"703f55a9-7c24-4c13-8d21-1e12af8d633f\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"24686\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' pods \"pipeline-40-run-c4zcp-t9-vt7wk-pod-tcnsq\" not found","commit":"473e3f3"}
{"level":"debug","ts":"2020-11-19T12:03:26.817Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:536","msg":"Requeuing key default/pipeline-40-run-c4zcp-t9-vt7wk due to non-permanent error (depth: 609)","commit":"473e3f3"}
{"level":"info","ts":"2020-11-19T12:03:26.818Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:515","msg":"Reconcile failed. Time taken: 300.5814ms","commit":"473e3f3","knative.dev/traceid":"3cd7b15c-3350-4f93-8365-53e865c7576a","knative.dev/key":"default/pipeline-40-run-c4zcp-t9-vt7wk"}

I believe the issue happens when trying to stop sidecars. If the pod is not found, we return a non-permanent (!) error: https://github.com/tektoncd/pipeline/blob/473e3f3cc74b215f8c22cb007fcd7413f93f3917/pkg/reconciler/taskrun/taskrun.go#L215-L217

Since at this point the TaskRun is marked as done, I think it is safe to assume that if the pod was not found we can just ignore the error and finish the reconcile. I will make a PR; it should be an easy fix.
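
Something along these lines (just a sketch of the idea, not the actual patch; the function name and wiring are illustrative, the real change belongs in the taskrun reconciler's sidecar-stopping path):

package sketch

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// stopSidecarsIfPodExists sketches the proposed behaviour: for a TaskRun that is
// already done, a missing pod (e.g. deleted by a cleanup job) means there are no
// sidecars left to stop, so we finish the reconcile instead of returning an error
// that the controller would requeue as non-permanent.
func stopSidecarsIfPodExists(ctx context.Context, kubeClient kubernetes.Interface, namespace, podName string) error {
	pod, err := kubeClient.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if k8serrors.IsNotFound(err) {
		// Pod was cleaned up externally; nothing to do, and no requeue needed.
		return nil
	}
	if err != nil {
		return fmt.Errorf("getting pod %q: %w", podName, err)
	}
	// ... existing logic that nops the sidecar containers in `pod` goes here ...
	_ = pod
	return nil
}

With the missing pod treated as a no-op, the reconcile of a done TaskRun completes on the first pass and the key is forgotten instead of being requeued with backoff.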

Graph before applying the patch: the queue went down in about 30 minutes.

[graph: work queue depth before the patch]

Controller restart to apply the patch:

[graph: work queue depth after the controller restart with the patch applied]

Thanks @Fabian-K for the further details, it was very helpful 👍

Thanks @afrittoli for the fix 🙏

Does this warrant 0.18.2?
