OpenShift Pipelines 1.0.1 installed on OpenShift 4.4.17 fails to run the TaskRun below in certain namespaces. The same TaskRun works in some namespaces but fails in others, with the following error reported by tekton-pipelines-controller:
{
"level": "error",
"logger": "tekton.taskrun-controller",
"caller": "taskrun/taskrun.go:163",
"msg": "Reconcile error: error adding ready annotation to Pod \"example-taskrun-pod-bl2cg\": Pod \"example-taskrun-pod-bl2cg\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n core.PodSpec{\n \t... // 14 identical fields\n \tHostname: \"\",\n \tSubdomain: \"\",\n \tAffinity: &core.Affinity{\n \t\tNodeAffinity: &core.NodeAffinity{\n \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: nil,\n \t\t\tPreferredDuringSchedulingIgnoredDuringExecution: []core.PreferredSchedulingTerm{\n \t\t\t\t{Weight: 100, Preference: core.NodeSelectorTerm{MatchExpressions: []core.NodeSelectorRequirement{{Key: \"…/workload\", Operator: \"In\", Values: []string{\"dynamic\"}}}}},\n- \t\t\t\t{\n- \t\t\t\t\tWeight: 100,\n- \t\t\t\t\tPreference: core.NodeSelectorTerm{\n- \t\t\t\t\t\tMatchExpressions: []core.NodeSelectorRequirement{{Key: \"…\", Operator: \"In\", Values: []string{\"dynamic\"}}},\n- \t\t\t\t\t},\n- \t\t\t\t},\n \t\t\t},\n \t\t},\n \t\tPodAffinity: nil,\n \t\tPodAntiAffinity: nil,\n \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: []core.Toleration{{Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/memory-pressure\", Operator: \"Exists\", Effect: \"NoSchedule\"}, {Key: \"workload\", Operator: \"Equal\", Value: \"dynamic\", Effect: \"NoSchedule\"}},\n \t... // 10 identical fields\n }\n",
"knative.dev/controller": "taskrun-controller",
"stacktrace": "github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).Reconcile\n\t/opt/app-root/src/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:163\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/opt/app-root/src/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:361\nknative.dev/pkg/controller.(*Impl).Run.func2\n\t/opt/app-root/src/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:310"
}
It always works in the other namespaces; in the affected ones the failure reproduces consistently.
The error comes from validation of the PodSpec, which should not be changing at all: we only add or update an annotation.
Looking into https://github.com/tektoncd/pipeline/blob/9c37fea9c19c7d4ed3bf222b45bb9019788e656c/pkg/pod/entrypoint.go#L163 we do an Update, which can be a problem if the object was modified between the time we fetched it and the time we try to update the annotation (the spec was enhanced by another mutating webhook, a LimitRange was applied, …). It would be safer to use Patch instead, to make sure we only update the annotations.
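A minimal, self-contained sketch of what the Patch-based approach could look like. This only builds the JSON merge patch body; the annotation key and value (`tekton.dev/ready` / `READY`) and the function names are illustrative, not a claim about the actual fix. Because the patch body contains only `metadata.annotations`, it cannot conflict with spec fields injected by webhooks or LimitRange defaulting, since the spec is simply absent from the request:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// readyAnnotationPatch builds an RFC 7386 JSON merge patch body that
// touches only metadata.annotations. Unlike a full Update, applying this
// with PATCH cannot trip the "pod updates may not change fields other
// than ..." validation, because no spec field appears in the request.
func readyAnnotationPatch(key, value string) ([]byte, error) {
	patch := map[string]map[string]map[string]string{
		"metadata": {
			"annotations": {key: value},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := readyAnnotationPatch("tekton.dev/ready", "READY")
	if err != nil {
		panic(err)
	}
	// Prints: {"metadata":{"annotations":{"tekton.dev/ready":"READY"}}}
	fmt.Println(string(body))
}
```

The resulting body would then be sent through the Kubernetes client's Pods Patch call with the JSON merge patch type, rather than resubmitting the whole Pod object.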
/kind bug
cc @ImJasonH @pritidesai
I'm not sure what difference the namespace makes, that might be a red herring.
In any case, the issue seems to be that we should update annotations with a Patch instead of an Update, to prevent errors from concurrent updates. Is that correct?
I can take a look at this if no one else is
> I'm not sure what difference the namespace makes, that might be a red herring.
I think some namespaces had LimitRange and some did not (in the cluster it happened).
> In any case, the issue seems to be that we should update annotations with a Patch instead of an Update, to prevent errors from concurrent updates. Is that correct?
@ImJasonH exactly :+1:
Thanks @ywluogg and possibly replace one more Update with Patch:
Patch takes a JSON object describing the change you want to make. For example, when a PipelineRun is cancelled, all of its TaskRuns are cancelled using Patch; code here: https://github.com/tektoncd/pipeline/blob/5005e818919a819f27d9418cb4b37250b27ff5ef/pkg/reconciler/pipelinerun/cancel.go#L52
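To make that concrete, here is a self-contained sketch of the kind of patch body the cancel path sends. This one uses an RFC 6902 JSON Patch operation list; the `/spec/status` path and `TaskRunCancelled` value match the linked source as of that revision, but treat the helper names here as illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonPatchOp is a single RFC 6902 JSON Patch operation. A list of these,
// marshalled to JSON, is the request body for a JSON-patch-type Patch call.
type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// cancelPatchBody builds the body that marks a TaskRun as cancelled by
// setting spec.status, without touching any other field of the object.
func cancelPatchBody() ([]byte, error) {
	ops := []jsonPatchOp{{
		Op:    "add",
		Path:  "/spec/status",
		Value: "TaskRunCancelled",
	}}
	return json.Marshal(ops)
}

func main() {
	body, err := cancelPatchBody()
	if err != nil {
		panic(err)
	}
	// Prints: [{"op":"add","path":"/spec/status","value":"TaskRunCancelled"}]
	fmt.Println(string(body))
}
```

The same pattern (describe only the fields you intend to change) is what would make the ready-annotation write safe against concurrent spec mutations.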
@mattmoor if I remember correctly using PATCH would require changes to the unit test infra to handle revision numbers properly?
The reactor logic around generation handling won't work properly. This is mainly relevant for "primary key" types IIRC, so patching a child resource might be alright.