Pipeline: Big log activity when upgrading from 0.14.2 to 0.16.2

Created on 15 Sep 2020 · 34 comments · Source: tektoncd/pipeline

Expected Behavior

Applying the YAML with kubectl should result in a working Tekton installation.

I'm deploying tekton pipelines (0.16.2) and triggers (0.8.1) on an AKS cluster (1.17.9).

The images are downloaded from an internal container registry, so I patch the YAML deployment file to point the image references at that registry.
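For reference, the patch is essentially a registry-prefix rewrite over the release manifest, along these lines (a sketch only; `registry.internal.example.com` is a placeholder for the internal registry, and a tiny sample manifest stands in for the real release.yaml):

```shell
# Sketch of the image-registry patch (placeholder registry name).
# A small sample manifest stands in for the real release.yaml here.
cat > /tmp/release-sample.yaml <<'EOF'
      containers:
        - name: webhook
          image: gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/webhook:v0.16.2
EOF

# Rewrite the public registry prefix to the internal one.
sed 's#gcr.io/tekton-releases#registry.internal.example.com/tekton-releases#g' \
    /tmp/release-sample.yaml > /tmp/release-patched.yaml

grep 'image:' /tmp/release-patched.yaml
```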

Actual Behavior

The pod for the Tekton Pipelines webhook generates an enormous amount of log activity, complaining that it is not able to update the webhook configuration.

Here's an extract of these logs:

...
{"level":"error","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:480","msg":"Reconcile error","commit":"4e23d50","error":"failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"webhook.pipeline.tekton.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:480\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:466\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:403"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:467","msg":"Reconcile failed. Time taken: 78.005978ms","commit":"4e23d50","knative.dev/traceid":"216cb50e-b397-4629-8088-62bd7c262e13","knative.dev/key":"webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"defaulting/defaulting.go:201","msg":"Updating webhook","commit":"4e23d50","knative.dev/traceid":"04300eae-021a-4eed-bba7-43fe0a6f865c","knative.dev/key":"webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:474","msg":"Reconcile succeeded. Time taken: 81.482712ms","commit":"4e23d50","knative.dev/traceid":"2c97b2b5-2ce0-4e43-bf49-a87f0872cf3a","knative.dev/key":"validation.webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.ValidationWebhook","caller":"validation/reconcile_config.go:164","msg":"Updating webhook","commit":"4e23d50","knative.dev/traceid":"3b7610d3-e89f-49f9-87a1-daf5d8cd5b95","knative.dev/key":"validation.webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:474","msg":"Reconcile succeeded. Time taken: 80.989207ms","commit":"4e23d50","knative.dev/traceid":"04300eae-021a-4eed-bba7-43fe0a6f865c","knative.dev/key":"webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"defaulting/defaulting.go:201","msg":"Updating webhook","commit":"4e23d50","knative.dev/traceid":"76fccf87-fc1c-4210-ba30-8b3d7b35b80c","knative.dev/key":"webhook.pipeline.tekton.dev"}
{"level":"error","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:480","msg":"Reconcile error","commit":"4e23d50","error":"failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.pipeline.tekton.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:480\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:466\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:403"}
{"level":"info","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:467","msg":"Reconcile failed. Time taken: 77.814876ms","commit":"4e23d50","knative.dev/traceid":"3b7610d3-e89f-49f9-87a1-daf5d8cd5b95","knative.dev/key":"validation.webhook.pipeline.tekton.dev"}
...

The problem is that this pod emits this complaint around ten times per second.

Steps to Reproduce the Problem

Deploy v0.16.2 on my AKS cluster.

I tried to deploy it on minikube, but the problem does not show up there.

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-26T20:32:49Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.9", GitCommit:"f862ce9a809b9832be36717bfc4efae470f59fbf", GitTreeState:"clean", BuildDate:"2020-08-24T18:41:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Same for minikube:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-26T20:32:49Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    v0.16.2

Labels: kind/bug, priority/critical-urgent

All 34 comments

I did try to remove Tekton completely, but the problem is still there.

Here are the commands I used to do that:

    kubectl delete apiservices.apiregistration.k8s.io v1beta1.tekton.dev v1alpha1.tekton.dev
    kubectl delete validatingwebhookconfigurations -l app.kubernetes.io/part-of=tekton-pipelines
    kubectl delete validatingwebhookconfigurations -l app.kubernetes.io/part-of=tekton-triggers
    kubectl delete mutatingwebhookconfigurations -l app.kubernetes.io/part-of=tekton-triggers
    kubectl delete mutatingwebhookconfigurations -l app.kubernetes.io/part-of=tekton-pipelines
    kubectl delete all --all-namespaces -l triggers.tekton.dev/release
    kubectl delete all --all-namespaces -l pipeline.tekton.dev/release
    kubectl delete ns tekton-pipelines
    kubectl delete crd -l app.kubernetes.io/part-of=tekton-pipelines
    kubectl delete crd -l app.kubernetes.io/part-of=tekton-triggers

In fact, the problem also exists in version 0.15.2.

Hey @Yannig! Thanks for reporting this! I don't know why this is happening; I don't see it in my own clusters or our dogfood clusters (but it looks like something very similar happened in a dashboard integration test https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_dashboard/1371/pull-tekton-dashboard-integration-tests/1260292944707981314/ 🤔)

Looking at the code that is causing this, it looks like our knative pkg libs are trying to create validating and mutating webhooks on the fly https://github.com/tektoncd/pipeline/blob/master/vendor/knative.dev/pkg/webhook/resourcesemantics/defaulting/defaulting.go#L206

It might help to look at the mutatingwebhookconfigurations and validatingwebhookconfigurations in your cluster; maybe something else is trying to modify them at the same time?

( @mattmoor in case you have some expertise here)

@bobcatfish Thanks for your feedback !

I will do some experiments with the mutating/validating webhook configurations and keep you posted.

Sounds good! Thanks @Yannig please let us know what you find.

The only other thing I can think of is that those specific webhooks might be global (I think @chhsia0 could fill in more details), i.e. one webhook is used for all namespaces, so if you had multiple Tekton installations in multiple namespaces, maybe they'd all try to update the same webhooks and conflict?

I don't know if it helps, but here is the list of mutating webhook configurations:

$ kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io  
NAME                                 CREATED AT
aks-webhook-admission-controller     2020-08-27T09:26:02Z
istio-sidecar-injector               2020-08-28T09:40:30Z
prom-prometheus-operator-admission   2020-08-28T10:14:50Z
webhook.pipeline.tekton.dev          2020-09-15T15:18:07Z
webhook.triggers.tekton.dev          2020-09-15T15:18:09Z

The same thing with the validating webhook configurations:

kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                                     CREATED AT
config.webhook.pipeline.tekton.dev       2020-09-15T15:18:07Z
config.webhook.triggers.tekton.dev       2020-09-15T15:18:09Z
istiod-istio-system                      2020-08-28T09:40:28Z
prom-prometheus-operator-admission       2020-08-28T10:15:06Z
validation.webhook.pipeline.tekton.dev   2020-09-15T15:18:07Z
validation.webhook.triggers.tekton.dev   2020-09-15T15:18:09Z

By comparing validation.webhook.pipeline.tekton.dev between the AKS and minikube clusters, I found the following difference:

$ diff /tmp/validation.webhook.pipeline.tekton.dev-bad.json \
       /tmp/validation.webhook.pipeline.tekton.dev-minikube.json
23,26d22
<         },
<         {
<           "key": "control-plane",
<           "operator": "DoesNotExist"
113a110,126
>         ],
>         "scope": "*"
>       },
>       {
>         "apiGroups": [
>           "tekton.dev"
>         ],
>         "apiVersions": [
>           "v1alpha1"
>         ],
>         "operations": [
>           "CREATE",
>           "UPDATE",
>           "DELETE"
>         ],
>         "resources": [
>           "runs/*"

The same differences appear for the other objects of this kind.

Looking at the logs of a "working" version of Tekton, I found that this kind of message was already present (here with v0.14.2):

{"level":"error","logger":"webhook","caller":"controller/controller.go:449","msg":"Reconcile error","commit":"9168151","error":"failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.pipeline.tekton.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:449\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:435\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func2\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:372"}
{"level":"info","logger":"webhook","caller":"controller/controller.go:436","msg":"Reconcile failed. Time taken: 34.235836ms","commit":"9168151","knative.dev/traceid":"d0a34500-8309-422c-850f-d373c5b28f0d","knative.dev/key":"tekton-pipelines/webhook-certs"}
{"level":"error","logger":"webhook","caller":"controller/controller.go:449","msg":"Reconcile error","commit":"9168151","error":"failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"webhook.pipeline.tekton.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:449\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:435\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func2\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:372"}
{"level":"info","logger":"webhook","caller":"controller/controller.go:436","msg":"Reconcile failed. Time taken: 49.87539ms","commit":"9168151","knative.dev/traceid":"b5727dda-0a0a-455a-b9ec-6e014a64027d","knative.dev/key":"tekton-pipelines/webhook-certs"}

The difference is that the pod did not keep trying to update the configuration in an infinite loop after that.

Maybe related to #3206?

@coryrc I checked and removed everything in the cluster, and the problem is still there. It is also present with the previous version of Tekton (0.15.2).

I think there's an issue with the webhook. If I try to apply a modification directly, changing only the timeout for example, I get the same error:

$ kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.pipeline.tekton.dev  -o yaml | \
    sed 's/timeoutSeconds: .*/timeoutSeconds: 29/' | kubectl apply -f -
Error from server (Conflict): error when applying patch:
{"$setElementOrder/webhooks":[{"name":"webhook.pipeline.tekton.dev"}],"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"admissionregistration.k8s.io/v1\",\"kind\":\"MutatingWebhookConfiguration\",\"metadata\":{\"annotations\":{},\"creationTimestamp\":\"2020-09-16T09:30:09Z\",\"generation\":1154,\"labels\":{\"app.kubernetes.io/component\":\"webhook\",\"app.kubernetes.io/instance\":\"default\",\"app.kubernetes.io/part-of\":\"tekton-pipelines\",\"pipeline.tekton.dev/release\":\"v0.16.3\"},\"name\":\"webhook.pipeline.tekton.dev\",\"resourceVersion\":\"6320487\",\"selfLink\":\"/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/webhook.pipeline.tekton.dev\",\"uid\":\"02422862-df25-4fd0-9b81-3e758cb5be91\"},\"webhooks\":[{\"admissionReviewVersions\":[\"v1beta1\"],\"clientConfig\":{\"caBundle\":\".........econds\":29}]}\n"},"generation":1154,"resourceVersion":"6320487"},"webhooks":[{"name":"webhook.pipeline.tekton.dev","timeoutSeconds":29}]}
to:
Resource: "admissionregistration.k8s.io/v1, Resource=mutatingwebhookconfigurations", GroupVersionKind: "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration"
Name: "webhook.pipeline.tekton.dev", Namespace: ""
for: "STDIN": Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io "webhook.pipeline.tekton.dev": the object has been modified; please apply your changes to the latest version and try again

If I remove the resourceVersion field, however, the modification works:

$ kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.pipeline.tekton.dev  -o yaml | \
    sed 's/timeoutSeconds: .*/timeoutSeconds: 30/' | \
    grep -v resourceVersion | \
    kubectl apply -f -
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.pipeline.tekton.dev configured

I have reproduced the problem on a clean, new AKS cluster with nothing else on it.

Command used to create the cluster:

az aks create -g my-group  --name test --kubernetes-version 1.17.9 --network-plugin azure --network-policy calico

Retrieving the credentials:

az aks get-credentials --admin -g my-group -n test

Next, apply the tekton definition:

kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.16.3/release.yaml

Then the error/retry loop starts:

...
{"level":"info","logger":"webhook.ValidationWebhook","caller":"validation/reconcile_config.go:164","msg":"Updating webhook","commit":"61e4167","knative.dev/traceid":"f97c95ee-4209-4445-8378-087fcde7ccdb","knative.dev/key":"validation.webhook.pipeline.tekton.dev"}
{"level":"error","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:480","msg":"Reconcile error","commit":"61e4167","error":"failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"webhook.pipeline.tekton.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:480\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:466\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:403"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:467","msg":"Reconcile failed. Time taken: 77.275686ms","commit":"61e4167","knative.dev/traceid":"ed67e4f2-66ee-4fba-ae4c-fc4609c16ffe","knative.dev/key":"webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"defaulting/defaulting.go:201","msg":"Updating webhook","commit":"61e4167","knative.dev/traceid":"014f013d-ba0d-493d-af27-a53189868d9c","knative.dev/key":"webhook.pipeline.tekton.dev"}
{"level":"error","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:480","msg":"Reconcile error","commit":"61e4167","error":"failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.pipeline.tekton.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:480\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:466\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:403"}
{"level":"info","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:467","msg":"Reconcile failed. Time taken: 79.844338ms","commit":"61e4167","knative.dev/traceid":"f97c95ee-4209-4445-8378-087fcde7ccdb","knative.dev/key":"validation.webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.ValidationWebhook","caller":"validation/reconcile_config.go:164","msg":"Updating webhook","commit":"61e4167","knative.dev/traceid":"7907f412-77c1-4123-bdb7-5361f2e86286","knative.dev/key":"validation.webhook.pipeline.tekton.dev"}
{"level":"info","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:474","msg":"Reconcile succeeded. Time taken: 81.368329ms","commit":"61e4167","knative.dev/traceid":"014f013d-ba0d-493d-af27-a53189868d9c","knative.dev/key":"webhook.pipeline.tekton.dev"}
...

Interesting!! Thanks @Yannig. I wonder if it actually has something to do with AKS; maybe they are making some modifications to webhooks, or it's something specific to 1.17? 🤔

@bobcatfish I'll run some experiments with older/newer versions to see if the problem is still there.

I will keep you posted if I make any progress.

Same behaviour with Kubernetes version 1.18.8.

Same thing with 1.16.13 with or without calico.

I'm wondering if this issue has something to do with the following AKS issue: https://github.com/Azure/AKS/issues/1771

Oh wow nice find, that could definitely be it, since we use the knative libs!

I will try to play with some labels to check if things get better

Yes! Adding the following label did the trick: admissions.enforcer/disabled: "true"

I'll do some other experiments to be sure I'm not missing anything else :)

So, to get a working Tekton on AKS, run the kubectl apply command and then the following commands:

kubectl label mutatingwebhookconfigurations webhook.pipeline.tekton.dev admissions.enforcer/disabled=true
kubectl label validatingwebhookconfigurations validation.webhook.pipeline.tekton.dev admissions.enforcer/disabled=true
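For the record, after labelling, the webhook configuration's metadata should look roughly like this (a fragment only, other fields omitted; as I understand the AKS issue, the admissions enforcer leaves objects carrying this label alone):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: webhook.pipeline.tekton.dev
  labels:
    # The AKS admissions enforcer skips objects carrying this label,
    # so it stops reverting the changes the Tekton webhook makes.
    admissions.enforcer/disabled: "true"
```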

Looking at https://github.com/knative/pkg/pull/1696/files, I think a bump to 0.18 (which is not released yet) will automatically fix the problem?

Also related to https://github.com/Azure/AKS/issues/1771

Yeah, Tekton needs to grab the latest knative/pkg for our webhook controllers to stop fighting with AKS.

Could already be fixed by #3319

@afrittoli I will try to install the last version and keep you in touch.

This fix is on master only so far; it will automatically get picked up by the 0.18.0 release.

This should be fixed with 0.18 (releasing on 11/9); moving it to 0.19 so that it can be verified against 0.18.

Not a big issue. I have a workaround for the moment :)

I am gonna go ahead and close this as it should be fixed. If it is not fixed with 0.18, please comment here so we can re-open it.
/close

@vdemeester: Closing this issue.

In response to this:

I am gonna go ahead and close this as it should be fixed. If it is not fixed with 0.18, please comment here so we can re-open it.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

No problem. I will try this asap.

I was able to deploy 0.18.1 and the issue seems to be gone.

Thanks for the hard work!
