When having sequential Tasks in a Pipeline that all use the same PVC/PV created via volumeClaimTemplate definitions in a TriggerTemplate I would expect that by the time a Task starts running all previous Task's Pods have fully termininated. If not, then the shared PVC/PV when mounted into the follow on Task will not have write permissions if stuck using RWO PVCs.
Sometimes, intermittently, Task's 1 pod will still be terminating when Task 2 starts and then the file permissions for the shared mount will be read only in Task 2 which can then causes it to fail if it expected write permissions.
Most annoyingly, this happens 'sometimes' so you never know if on Task 15 of a pipeline run Task 14 will still be terminating and your Pipeline run will fail due to a file permisions issue and then you need to kick off the pipeline again.
volumeClaimTemplate (though i suspect this could be done with any PVC) triggering a Pipeline with multiple sequential tasks that all use that PVC performing write operations in the mounted directoryKubernetes version:
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0+d59ce34", GitCommit:"d59ce3486ae3ca3a0c36e5498e56f51594076596", GitTreeState:"clean", BuildDate:"2020-10-08T15:58:07Z", GoVersion:"go1.15.0", Compiler:"gc", Platform:"linux/amd64"}
Tekton Pipeline version:
Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'
v0.16.3%
Still trying to come up with one beyond just adding a sleep to the beginging of my tasks to lower chance that previous steps Pod is still running
I'm not sure I understand the issue here.
Expected Behavior
If not, then the shared PVC/PV when mounted into the follow on Task will not have write permissions if stuck using RWO PVCs
Both Tasks/Pods will have write permissions. The RWO means that the PV can be mounted only at a Node in writable mode at a time - multiple Pods on that Node can write to the volume. The example Parallel Tasks using PVC shows two tasks concurrently writing to the same PVC - in RWO mode.
Actual Behavior
Task's 1 pod will still be terminating when Task 2 starts and then the file permissions for the shared mount will be read only in Task 2 which can then causes it to fail if it expected write permissions.
I would not expect that file permissions is changed wether a Pod/process has terminated or not. In my view, it is up to the Pipeline-designer to make sure that the Tasks/Pods in the Pipeline can be executed in the declared order, e.g. that the expected files as available. I think this depends on how the Tasks handles the files? If a subsequent Task depends on files from a previous task, make sure to use runAfter.
I'm not 100% sure on my cause analysis, i took a best guess that its because a previous steps pod still had the RW access to the PVC in question. All I know for usre is that indeterminately a pipeline run will fail due to an inconsistent step failing due to Permission denied errors when writing to the workspace that is mounted to the home dir.
like i just added the following to the begining of my ClusterTask (that gets used multiple times in the same Pipeline with different input params).
echo "DEBUG"
ls -al $HOME
ls -al $HOME/.ssh
and on random steps will get:
step-workflow-step
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
ls: cannot open directory '/home/tssc': Permission denied
where the ClusterTask has a worksapce of
workspaces:
- name: home
mountPath: /home/tssc
and that workspace is mapped to a PVC/PV created via volumeClaimTemplate
i wasn't having these issues pre upgrade to 1.2. i had to upgrade to 1.2 to get ahold of the when statement
some more info:
updated task to run this first:
echo "DEBUG"
id
whoami
ls -al /home
ls -al $HOME
ls -al $HOME/.ssh
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
uid=1001(tssc) gid=0(root) groups=0(root),1001
tssc
total 4
drwxr-xr-x. 1 root root 18 Oct 23 17:40 .
drwxr-xr-x. 1 root root 37 Nov 10 14:14 ..
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 tssc
total 36
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 .
drwxr-xr-x. 1 root root 18 Oct 23 17:40 ..
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .docker
drwxrws---. 3 tssc 1001 4096 Nov 10 14:14 .gnupg
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .ssh
drwxrws---. 2 root 1001 16384 Nov 10 14:13 lost+found
drwxrwsr-x. 5 tssc 1001 4096 Nov 10 14:13 ploigos
total 20
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 ..
-rw-rw----. 1 tssc 1001 348 Nov 10 14:14 config
-rw-rw----. 1 tssc 1001 2610 Nov 10 14:14 id_git-ssh-keys-ploigos-workflow-reference-quarkus-mvn-foobar-frui
-rw-rw----. 1 tssc 1001 869 Nov 10 14:14 known_hosts
step-workflow-step
2020/11/10 14:15:19 unsuccessful cred copy: ".docker" from "/tekton/creds" to "/home/tssc": unable to create destination directory: mkdir /home/tssc/.docker: permission denied
2020/11/10 14:15:19 unsuccessful cred copy: ".ssh" from "/tekton/creds" to "/home/tssc": unable to create destination directory: mkdir /home/tssc/.ssh: permission denied
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
uid=1001(tssc) gid=0(root) groups=0(root),1001
tssc
total 4
drwxr-xr-x. 1 root root 18 Oct 23 17:40 .
drwxr-xr-x. 1 root root 37 Nov 10 14:15 ..
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 tssc
ls: cannot open directory '/home/tssc': Permission denied
?????
I am confused. it looks like in both Tasks, one happening right after the other, the directory permissions in question are the exact same but in one case get a permission denied and in the other dont
okay. i have learned new things.
echo "DEBUG"
ls -al /home
ls -al $HOME || (echo "PAUSE"; sleep 10000)
ls -al $HOME/.ssh
for ClusterTasks with multiple steps Tekton seems to be spinning up duplicate pods both with both containers, one for each step.
when i caught the task in the sleep and looked at the running pods, there were two pods running both with the same two step containers in them but then while it was in the sleep the second Pod terminated on its own. "Something" is creating duplicate Pods for the same Task.
actully, its not limtied to Tasks with multipel steps. my pipeline is running one long run Task at the moment and look at the output from oc pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
READINESS GATES
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050dcskj 0/1 Completed 0 15m 10.128.5.15 ip-host1.us-west-2.compute.internal <none>
<none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050mnt4h 0/1 Completed 0 39s 10.128.5.56 ip-host1.us-west-2.compute.internal <none>
<none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050q92xw 0/1 Completed 0 30m 10.128.4.253 ip-host1.us-west-2.compute.internal <none>
<none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050r56j6 0/1 Completed 0 60m 10.128.4.193 ip-host1.us-west-2.compute.internal <none>
<none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050v287s 0/1 Completed 0 45m 10.128.4.215 ip-host1.us-west-2.compute.internal <none>
<none>
el-ploigos-workflow-reference-quarkus-mvn-foobar-frui-5b5brctrd 1/1 Running 0 25h 10.128.4.15 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--4wpkq 0/1 Completed 0 12m 10.128.5.26 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--9p82j 0/1 Completed 0 13m 10.128.5.20 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--bxxms 0/1 Completed 0 13m 10.128.5.21 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--kt6bn 0/1 Completed 0 12m 10.128.5.30 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--m74q7 0/1 Completed 0 12m 10.128.5.28 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--nk57p 0/1 Completed 0 14m 10.128.5.19 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--ptt2v 0/1 Completed 0 12m 10.128.5.27 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--twqnq 0/2 Completed 0 13m 10.128.5.22 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-qv4pt--j8ccv 0/1 Completed 0 84s 10.128.5.54 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--285fg 0/1 Completed 0 3m34s 10.128.5.36 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--59dwl 0/1 Completed 0 2m13s 10.128.5.43 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--5pn2t 0/1 Completed 0 2m4s 10.128.5.45 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--7xpbc 2/2 Running 0 2m42s 10.128.5.40 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--ckw7p 0/1 Completed 0 85s 10.128.5.53 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--csfsk 0/1 Completed 0 115s 10.128.5.49 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--fsld5 0/1 Completed 0 2m4s 10.128.5.44 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--g2njp 0/1 Completed 0 3m19s 10.128.5.38 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--hk4xq 1/1 Running 0 2m14s 10.128.5.42 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--hzmqq 1/1 Running 0 16s 10.128.5.58 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--kq6c9 0/1 Completed 0 3m29s 10.128.5.37 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--lbkwb 1/1 Running 0 116s 10.128.5.50 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--lgtlp 1/1 Running 0 16s 10.128.5.57 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--ppn8m 0/1 Completed 0 44s 10.128.5.55 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--t4w7l 1/1 Running 0 86s 10.128.5.52 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--tkpqj 0/2 Completed 0 2m41s 10.128.5.41 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--wh967 0/1 Completed 0 96s 10.128.5.51 ip-host1.us-west-2.compute.internal <none>
<none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4rmp--2zc7p 0/1 Completed 0 8m5s 10.128.5.31 ip-host1.us-west-2.compute.internal <none>
there is a huge set of duplciative pods all running even though their steps already long since ended. looks like all of them are in my "PAUSE" sleep state. they failed to get permisisons and then something spun up a duplicate pod?
looking at the logs for the tekton-pipeline-controller i can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?
maybe realted / cause / dupe of:
reading those more it seems this is a dupe of those, but like this is happening to me now on every single run (on one step or another) now that I have upgraded to 0.16. was hoping to find a work around in those other items but i dont see one
looking at the logs for the tekton-pipeline-controller i can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?
A github gist is my usual go-to for longer logs. https://gist.github.com/
Can you share a copy of the Pipeline, Tasks, and a Pod? If they're sensitive for some reason, maybe reduce them to a simplified example that exhibits the problem?
I have a lot of questions but no clear solutions to offer atm:
runAfter one another?volumes and volumeMounts sections of the containers in one of the duplicate pods look like?You mentioned "sequential Tasks in a Pipeline": what did you mean by this? They're set to runAfter one another?
yes
What do the volumes and volumeMounts sections of the containers in one of the duplicate pods look like?
ummm.....like you want the pod yml when it comes up? https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a
How many nodes do you have in your cluster? It looks like all these pods are being scheduled to the same node, which is interesting.
3 worker nodes
isn't it expected they all get scheduled on the same node due to the magic labeling stuff that goes on. i thought i read it somewhere.
What is the Status of the completed pods, and are there any interesting events being emitted by them?
This gist has the pod.yml from one of the failed pods: https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a
What is the state of the volume claim and underlying pv during pipeline execution?
umm....good? its bound and everything and each pod is mounting the PVC (even the duplicate TaskRun pods).
Can you share a copy of the Pipeline, Tasks, and a Pod?
https://gist.github.com/itewk/edbac1dab03e25bc874195369306e845
This gist has tekton pipeline controller logs just after a reboot of the controller pod where at least one instance of duplicate TaskRuns was created. the two in there I know are duplicates are:
lzj7g--jhlj4lzj7g--qhz2xOK, thanks for sharing all that. I don't immediately see a root cause from reading through the logs. It appears that there's no clear message as to why ploigos-workflow-reference-quarkus-mvn-foobar-frui-lzj7g--qhz2x is started so soon after ploigos-workflow-reference-quarkus-mvn-foobar-frui-lzj7g--jhlj4. I agree that https://github.com/tektoncd/pipeline/issues/3126 seems like a very likely duplicate of this one (or vice-versa). It definitely seems like the PipelineRun reconciler is incorrectly starting multiple TaskRuns for a single Task.
I've attached a priority to this of critical-urgent so that it gets discussed ASAP at the next owners planning session.
For the record, I think I'm hitting the same bug. As most tasks mount a shared workspace, many pods are stuck initialising because they can't all mount at the same time ...
To have a successful pipeline run, I first need to manually cleanup all related pods, which looks like this:
→ oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
pod "build-and-deploy-go-run-pjjts-fetch-repository-n2g72-pod-hmxfr" deleted
pod "build-and-deploy-go-run-pjjts-fetch-repository-n2g72-pod-s4vqc" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--5976r" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--j57q5" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--xdbvj" deleted
It can also be different numbers of simultaneous pods:
→ oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-pcw5x" deleted
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-qqgft" deleted
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-wc99l" deleted
pod "build-and-deploy-go-run-l4f7c-set-bitbucket-build-status--bx7tx" deleted
pod "build-and-deploy-go-run-l4f7c-set-bitbucket-build-status--fztfl" deleted
I'm moving the priority label to #3126 since it was created first and will hopefully be a little easier to condense into a reproducible case, at least initially.
oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
It's interesting that in both of the cases described here it appears to be open shift involved. @michaelsauter I'm not super familiar with open shift but does this mean you're using cri-o as the container runtime too?
This might not be related at all, especially since https://github.com/tektoncd/pipeline/issues/3126 doesn't appear to be related to open shift, but flagging here in case it becomes relevant in future.
Yes this is on OpenShift. Unfortunately I'm super new to this cluster and not an cluster admin for it so I don't know the exact details. It's a 4.5 cluster, which, according to https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html is using cri-o. Kubernetes is v1.18.3+2fbd7c7. I can't even figure out what the Tekton version is ... I'll follow up with this, assuming cluster admins can easily see this from the operator management section.
i'll close this as a dupe of https://github.com/tektoncd/pipeline/issues/3126 now that i understand what I am seeing is simply a symptom of that issue.
FWIW, I figured out we're on OpenShift Pipelines Technology Preview 1.1, which means we're on Tekton Pipelines 0.14.3.