Pipeline: sometimes next Task in Pipeline will start before previous Task completes, causing PVC/PV permission issues

Created on 9 Nov 2020 · 22 comments · Source: tektoncd/pipeline

Expected Behavior

When a Pipeline has sequential Tasks that all use the same PVC/PV created via a volumeClaimTemplate definition in a TriggerTemplate, I would expect that by the time a Task starts running, all previous Tasks' Pods have fully terminated. If not, the shared PVC/PV, when mounted into the follow-on Task, will not have write permissions when using RWO PVCs.
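For context, the shape of the Pipeline is roughly this (a minimal sketch; the Task and workspace names here are illustrative, not my real ones):

```yaml
# Hypothetical sketch: two sequential Tasks sharing one workspace.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: example-pipeline
spec:
  workspaces:
    - name: shared-home
  tasks:
    - name: task-1
      taskRef:
        name: workflow-task
      workspaces:
        - name: home
          workspace: shared-home
    - name: task-2
      taskRef:
        name: workflow-task
      runAfter:
        - task-1   # task-2 is expected to start only after task-1's Pod is fully gone
      workspaces:
        - name: home
          workspace: shared-home
```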

Actual Behavior

Sometimes, intermittently, Task 1's Pod will still be terminating when Task 2 starts, and then the file permissions for the shared mount will be read-only in Task 2, which can cause it to fail if it expected write permissions.

Most annoyingly, this only happens sometimes, so you never know whether on Task 15 of a PipelineRun, Task 14 will still be terminating; your PipelineRun will then fail due to a file permissions issue and you need to kick off the pipeline again.

Steps to Reproduce the Problem

  1. Create a TriggerTemplate using a volumeClaimTemplate (though I suspect this could be reproduced with any PVC) that triggers a Pipeline with multiple sequential Tasks, all of which use that PVC to perform write operations in the mounted directory
  2. Run the pipeline a few times until you see the write permissions issue
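The workspace binding in the TriggerTemplate looks roughly like this (an illustrative sketch, not my actual template; names and sizes are made up):

```yaml
# Hypothetical sketch: a TriggerTemplate that creates a fresh PVC per
# PipelineRun via volumeClaimTemplate.
apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerTemplate
metadata:
  name: example-trigger-template
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: example-run-
      spec:
        pipelineRef:
          name: example-pipeline
        workspaces:
          - name: shared-home
            volumeClaimTemplate:
              spec:
                accessModes:
                  - ReadWriteOnce
                resources:
                  requests:
                    storage: 1Gi
```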

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0+d59ce34", GitCommit:"d59ce3486ae3ca3a0c36e5498e56f51594076596", GitTreeState:"clean", BuildDate:"2020-10-08T15:58:07Z", GoVersion:"go1.15.0", Compiler:"gc", Platform:"linux/amd64"}
    
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.16.3

Workaround

Still trying to come up with one beyond adding a sleep to the beginning of my Tasks to lower the chance that the previous step's Pod is still running.
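i.e. something like this at the top of each step's script (the image name and duration are arbitrary placeholders, not validated values):

```yaml
steps:
  - name: workflow-step
    image: example-workflow-image   # illustrative image name
    script: |
      # crude workaround: give the previous Task's Pod time to fully
      # terminate before touching the shared workspace
      sleep 30
      # ...actual step logic...
```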

kind/bug

All 22 comments

I'm not sure I understand the issue here.

Expected Behavior
If not, then the shared PVC/PV when mounted into the follow on Task will not have write permissions if stuck using RWO PVCs

Both Tasks/Pods will have write permissions. RWO means that the PV can be mounted in writable mode by only one Node at a time; multiple Pods on that Node can all write to the volume. The example Parallel Tasks using PVC shows two tasks concurrently writing to the same PVC - in RWO mode.
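For reference, the access mode is declared on the claim itself; ReadWriteOnce restricts node attachment, not per-Pod write access (an illustrative fragment, not from this issue):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-shared-pvc
spec:
  accessModes:
    - ReadWriteOnce   # one Node at a time may attach read-write;
                      # all Pods on that Node can still write
  resources:
    requests:
      storage: 1Gi
```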

Actual Behavior
Task 1's pod will still be terminating when Task 2 starts and then the file permissions for the shared mount will be read-only in Task 2, which can then cause it to fail if it expected write permissions.

I would not expect that file permissions change depending on whether a Pod/process has terminated or not. In my view, it is up to the Pipeline designer to make sure that the Tasks/Pods in the Pipeline can be executed in the declared order, e.g. that the expected files are available. I think this depends on how the Tasks handle the files? If a subsequent Task depends on files from a previous Task, make sure to use runAfter.

I'm not 100% sure of my cause analysis; I took a best guess that it's because a previous step's Pod still had RW access to the PVC in question. All I know for sure is that, indeterminately, a pipeline run will fail because an inconsistent step hits Permission denied errors when writing to the workspace that is mounted to the home dir.

Like, I just added the following to the beginning of my ClusterTask (which gets used multiple times in the same Pipeline with different input params):

      echo "DEBUG"
      ls -al $HOME
      ls -al $HOME/.ssh

and on random steps I will get:

step-workflow-step
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
ls: cannot open directory '/home/tssc': Permission denied

where the ClusterTask has a workspace of

  workspaces:
  - name: home
    mountPath: /home/tssc

and that workspace is mapped to a PVC/PV created via volumeClaimTemplate.

I wasn't having these issues pre-upgrade to 1.2; I had to upgrade to 1.2 to get ahold of the when statement.

some more info:

Updated the Task to run this first:

      echo "DEBUG"
      id
      whoami
      ls -al /home
      ls -al $HOME
      ls -al $HOME/.ssh

Pipeline:

  • Task N -> Task N+1

Task N

/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
uid=1001(tssc) gid=0(root) groups=0(root),1001
tssc
total 4
drwxr-xr-x. 1 root root 18 Oct 23 17:40 .
drwxr-xr-x. 1 root root 37 Nov 10 14:14 ..
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 tssc
total 36
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 .
drwxr-xr-x. 1 root root 18 Oct 23 17:40 ..
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .docker
drwxrws---. 3 tssc 1001 4096 Nov 10 14:14 .gnupg
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .ssh
drwxrws---. 2 root 1001 16384 Nov 10 14:13 lost+found
drwxrwsr-x. 5 tssc 1001 4096 Nov 10 14:13 ploigos
total 20
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 ..
-rw-rw----. 1 tssc 1001 348 Nov 10 14:14 config
-rw-rw----. 1 tssc 1001 2610 Nov 10 14:14 id_git-ssh-keys-ploigos-workflow-reference-quarkus-mvn-foobar-frui
-rw-rw----. 1 tssc 1001 869 Nov 10 14:14 known_hosts

Task N+1

step-workflow-step
2020/11/10 14:15:19 unsuccessful cred copy: ".docker" from "/tekton/creds" to "/home/tssc": unable to create destination directory: mkdir /home/tssc/.docker: permission denied
2020/11/10 14:15:19 unsuccessful cred copy: ".ssh" from "/tekton/creds" to "/home/tssc": unable to create destination directory: mkdir /home/tssc/.ssh: permission denied
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
uid=1001(tssc) gid=0(root) groups=0(root),1001
tssc
total 4
drwxr-xr-x. 1 root root 18 Oct 23 17:40 .
drwxr-xr-x. 1 root root 37 Nov 10 14:15 ..
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 tssc
ls: cannot open directory '/home/tssc': Permission denied

interpretation

?????

I am confused. It looks like in both Tasks, one happening right after the other, the directory permissions in question are exactly the same, yet in one case I get a permission denied and in the other I don't.

Okay, I have learned new things.

      echo "DEBUG"
      ls -al /home
      ls -al $HOME || (echo "PAUSE"; sleep 10000)
      ls -al $HOME/.ssh

For ClusterTasks with multiple steps, Tekton seems to be spinning up duplicate Pods, each containing a container for every step.

When I caught the Task in the sleep and looked at the running Pods, there were two Pods running, both with the same two step containers in them; but then, while it was in the sleep, the second Pod terminated on its own. "Something" is creating duplicate Pods for the same Task.

Actually, it's not limited to Tasks with multiple steps. My pipeline is running one long-running Task at the moment; look at the output from oc get pods -o wide:

NAME                                                              READY   STATUS      RESTARTS   AGE     IP             NODE                                  NOMINATED NODE   READINESS GATES
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050dcskj   0/1     Completed   0          15m     10.128.5.15    ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050mnt4h   0/1     Completed   0          39s     10.128.5.56    ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050q92xw   0/1     Completed   0          30m     10.128.4.253   ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050r56j6   0/1     Completed   0          60m     10.128.4.193   ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050v287s   0/1     Completed   0          45m     10.128.4.215   ip-host1.us-west-2.compute.internal   <none>           <none>
el-ploigos-workflow-reference-quarkus-mvn-foobar-frui-5b5brctrd   1/1     Running     0          25h     10.128.4.15    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--4wpkq   0/1     Completed   0          12m     10.128.5.26    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--9p82j   0/1     Completed   0          13m     10.128.5.20    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--bxxms   0/1     Completed   0          13m     10.128.5.21    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--kt6bn   0/1     Completed   0          12m     10.128.5.30    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--m74q7   0/1     Completed   0          12m     10.128.5.28    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--nk57p   0/1     Completed   0          14m     10.128.5.19    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--ptt2v   0/1     Completed   0          12m     10.128.5.27    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--twqnq   0/2     Completed   0          13m     10.128.5.22    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-qv4pt--j8ccv   0/1     Completed   0          84s     10.128.5.54    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--285fg   0/1     Completed   0          3m34s   10.128.5.36    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--59dwl   0/1     Completed   0          2m13s   10.128.5.43    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--5pn2t   0/1     Completed   0          2m4s    10.128.5.45    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--7xpbc   2/2     Running     0          2m42s   10.128.5.40    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--ckw7p   0/1     Completed   0          85s     10.128.5.53    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--csfsk   0/1     Completed   0          115s    10.128.5.49    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--fsld5   0/1     Completed   0          2m4s    10.128.5.44    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--g2njp   0/1     Completed   0          3m19s   10.128.5.38    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--hk4xq   1/1     Running     0          2m14s   10.128.5.42    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--hzmqq   1/1     Running     0          16s     10.128.5.58    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--kq6c9   0/1     Completed   0          3m29s   10.128.5.37    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--lbkwb   1/1     Running     0          116s    10.128.5.50    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--lgtlp   1/1     Running     0          16s     10.128.5.57    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--ppn8m   0/1     Completed   0          44s     10.128.5.55    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--t4w7l   1/1     Running     0          86s     10.128.5.52    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--tkpqj   0/2     Completed   0          2m41s   10.128.5.41    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--wh967   0/1     Completed   0          96s     10.128.5.51    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4rmp--2zc7p   0/1     Completed   0          8m5s    10.128.5.31    ip-host1.us-west-2.compute.internal   <none>

There is a huge set of duplicative Pods, all still running even though their steps ended long ago. It looks like all of them are in my "PAUSE" sleep state. They failed to get permissions, and then something spun up a duplicate Pod?

Looking at the logs for the tekton-pipelines-controller, I can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?

Reading those more, it seems this is a dupe of those issues, but this is happening to me on every single run (on one step or another) now that I have upgraded to 0.16. I was hoping to find a workaround in those other items but I don't see one.

Looking at the logs for the tekton-pipelines-controller, I can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?

A github gist is my usual go-to for longer logs. https://gist.github.com/

Can you share a copy of the Pipeline, Tasks, and a Pod? If they're sensitive for some reason, maybe reduce them to a simplified example that exhibits the problem?

I have a lot of questions but no clear solutions to offer atm:

  • You mentioned "sequential Tasks in a Pipeline": what did you mean by this? They're set to runAfter one another?
  • What do the volumes and volumeMounts sections of the containers in one of the duplicate pods look like?
  • How many nodes do you have in your cluster? It looks like all these pods are being scheduled to the same node, which is interesting.
  • What is the Status of the completed pods, and are there any interesting events being emitted by them?
  • What is the state of the volume claim and underlying pv during pipeline execution?

You mentioned "sequential Tasks in a Pipeline": what did you mean by this? They're set to runAfter one another?

yes

What do the volumes and volumeMounts sections of the containers in one of the duplicate pods look like?

Ummm... like, you want the Pod YAML when it comes up? https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a

How many nodes do you have in your cluster? It looks like all these pods are being scheduled to the same node, which is interesting.

3 worker nodes.
Isn't it expected that they all get scheduled on the same node due to the magic labeling stuff that goes on? I thought I read that somewhere.

What is the Status of the completed pods, and are there any interesting events being emitted by them?

This gist has the pod.yml from one of the failed pods: https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a

What is the state of the volume claim and underlying pv during pipeline execution?

Umm... good? It's bound and everything, and each Pod is mounting the PVC (even the duplicate TaskRun Pods).

Can you share a copy of the Pipeline, Tasks, and a Pod?

https://gist.github.com/itewk/edbac1dab03e25bc874195369306e845

This gist has the Tekton pipeline controller logs from just after a reboot of the controller Pod, where at least one instance of duplicate TaskRuns was created. The two in there that I know are duplicates are:

  • lzj7g--jhlj4
  • lzj7g--qhz2x

OK, thanks for sharing all that. I don't immediately see a root cause from reading through the logs. It appears that there's no clear message as to why ploigos-workflow-reference-quarkus-mvn-foobar-frui-lzj7g--qhz2x is started so soon after ploigos-workflow-reference-quarkus-mvn-foobar-frui-lzj7g--jhlj4. I agree that https://github.com/tektoncd/pipeline/issues/3126 seems like a very likely duplicate of this one (or vice-versa). It definitely seems like the PipelineRun reconciler is incorrectly starting multiple TaskRuns for a single Task.

I've attached a priority to this of critical-urgent so that it gets discussed ASAP at the next owners planning session.

For the record, I think I'm hitting the same bug. As most tasks mount a shared workspace, many pods are stuck initialising because they can't all mount at the same time ...

To have a successful pipeline run, I first need to manually cleanup all related pods, which looks like this:

→ oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
pod "build-and-deploy-go-run-pjjts-fetch-repository-n2g72-pod-hmxfr" deleted
pod "build-and-deploy-go-run-pjjts-fetch-repository-n2g72-pod-s4vqc" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--5976r" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--j57q5" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--xdbvj" deleted

It can also be different numbers of simultaneous pods:

→ oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-pcw5x" deleted
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-qqgft" deleted
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-wc99l" deleted
pod "build-and-deploy-go-run-l4f7c-set-bitbucket-build-status--bx7tx" deleted
pod "build-and-deploy-go-run-l4f7c-set-bitbucket-build-status--fztfl" deleted

I'm moving the priority label to #3126 since it was created first and will hopefully be a little easier to condense into a reproducible case, at least initially.

oc delete pod -l tekton.dev/pipeline=build-and-deploy-go

It's interesting that in both of the cases described here OpenShift appears to be involved. @michaelsauter I'm not super familiar with OpenShift, but does this mean you're using cri-o as the container runtime too?

This might not be related at all, especially since https://github.com/tektoncd/pipeline/issues/3126 doesn't appear to involve OpenShift, but flagging it here in case it becomes relevant in future.

Yes, this is on OpenShift. Unfortunately I'm super new to this cluster and not a cluster admin for it, so I don't know the exact details. It's a 4.5 cluster, which, according to https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html, is using cri-o. Kubernetes is v1.18.3+2fbd7c7. I can't even figure out what the Tekton version is ... I'll follow up on this, assuming the cluster admins can easily see it from the operator management section.

I'll close this as a dupe of https://github.com/tektoncd/pipeline/issues/3126 now that I understand what I am seeing is simply a symptom of that issue.

FWIW, I figured out we're on OpenShift Pipelines Technology Preview 1.1, which means we're on Tekton Pipelines 0.14.3.

