Pipeline: yaml tests seem to be consistently timing out

Created on 4 May 2020 · 12 Comments · Source: tektoncd/pipeline

Expected Behavior

"yaml tests" should only fail if something is actually wrong

Actual Behavior

All of the runs for #2531 have failed:

https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/pull/tektoncd_pipeline/2531/pull-tekton-pipeline-integration-tests/

[screenshot]

And looking at recent runs across PRs, it seems most are failing too:
https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/directory/pull-tekton-pipeline-integration-tests

Steps to Reproduce the Problem

Not sure what's going on yet

Additional Info

I can't decipher 1..60 sleep 10 for the life of me:

https://github.com/tektoncd/pipeline/blob/a4065de3e0b434c10bfa1fc2edd64eda5d13387a/test/e2e-common.sh#L65-L77

Labels: area/testing, kind/bug


All 12 comments

I added the example pipelinerun-with-parallel-tasks-using-pvc.yaml in #2521 a few days ago.

Things look worse after that, but I don't really understand what is causing it. Maybe the volumes take time and there are some timeouts? I find it hard to see which test is causing trouble.

> I added the example pipelinerun-with-parallel-tasks-using-pvc.yaml in #2521 a few days ago.
>
> Things look worse after that, but I don't really understand what is causing it. Maybe the volumes take time and there are some timeouts? I find it hard to see which test is causing trouble.

Yeah, that's my guess :sweat:

> I can't decipher 1..60 sleep 10 for the life of me:

it's gonna do 60 loops of 10s to check the status of the pipelinerun (or taskrun), meaning it times out after 10min.
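In case it helps future readers, here is a rough sketch of that polling pattern, an approximation rather than the exact contents of test/e2e-common.sh; the run name and the kubectl jsonpath query are just illustrative:

```bash
#!/usr/bin/env bash
# Rough approximation of the wait loop: poll up to 60 times, sleeping 10s
# between attempts, i.e. give up after roughly 10 minutes.
run_name="pipelinerun-with-parallel-tasks-using-pvc"   # hypothetical run name

for i in {1..60}; do
  # Read the "Succeeded" condition of the PipelineRun.
  status=$(kubectl get pipelinerun "$run_name" \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}')
  if [ "$status" = "True" ]; then
    echo "PipelineRun $run_name succeeded after ~$((i * 10))s"
    exit 0
  fi
  sleep 10
done

echo "PipelineRun $run_name did not succeed within ~10 minutes" >&2
exit 1
```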

/kind bug
/area testing

There are a few ways to fix this:

  • the quick one: add more time (i.e. do 90 loops)
  • the longer one: migrate those tests to Go

#2541 does the latter.

I've bumped the timeout in #2534 (90 loops instead of 60). It should fix the CI while #2541 gets worked on.

@vdemeester did it work better?

If it is a _regional cluster_ and the PVCs are _zonal_, the two parallel tasks may be executing in different zones, and the third task that mounts both PVCs is deadlocked, since it can't mount two _zonal_ PVCs from different zones in one pod. I propose that I remove the example, since it depends so much on what kind of storage and cluster is used. The intention was to document PVC _access modes_, but it is not strictly necessary to have an example.

> @vdemeester did it work better?

Not entirely sure. There are fewer failures, but I still see some.

> If it is a _regional cluster_ and the PVCs are _zonal_, the two parallel tasks may be executing in different zones, and the third task that mounts both PVCs is deadlocked, since it can't mount two _zonal_ PVCs from different zones in one pod. I propose that I remove the example, since it depends so much on what kind of storage and cluster is used. The intention was to document PVC _access modes_, but it is not strictly necessary to have an example.

Yeah, having it in a no-ci folder would work

It does appear this might have been related. Just spotted this in one of our release clusters:

[screenshot]

And drilling down it does appear to be related to volume / node affinity.

@sbwsg thanks. It was exactly that task I was worried about. But that example does not provide much value, and it would need to be adapted to each environment anyway. So I think it is best to remove it.

But a similar problem may occur for other pipelines that use the same PVC in more than one task. We could move those to the no-ci folder as @vdemeester suggested.

I apologize for the flaky tests the last few days.

> But a similar problem may occur for other pipelines that use the same PVC in more than one task.

Yeah, this might be a good area we can add docs around at some point. I wonder how much of it is platform-specific and how much Tekton can describe in a cross-platform way.

> I apologize for the flaky tests the last few days.

No worries, thanks for making the PR to resolve it, and for all the contributions around Workspaces! We were bound to hit this issue eventually.

I am curious if we can use some kind of pod affinity to get tasks co-located on the same node.

Possibly co-locate all pods belonging to a single PipelineRun, so they can use the same PVC as a workspace and still execute in parallel just fine (this is essentially what any single-node CI/CD system does).

We would still be a distributed system, with different PipelineRuns possibly scheduled to different nodes. Using different PVCs is "easier" for fan-out, but not for fan-in (e.g. git-clone and then parallel tasks using the same files).
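For what it's worth, a minimal sketch of what that co-location could look like at the pod level, assuming the tekton.dev/pipelineRun label that Tekton puts on TaskRun pods; the run name below is made up, and how this stanza would actually get injected (via a pod template or by the controller itself) is an open question:

```yaml
# Illustrative only: pin every pod of one PipelineRun to the node where
# the first of them was scheduled, using inter-pod affinity.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          tekton.dev/pipelineRun: pipelinerun-with-parallel-tasks-using-pvc  # hypothetical run name
      topologyKey: kubernetes.io/hostname
```

With something like this, all pods of one PipelineRun (and therefore the PVC they share) would end up on a single node, which sidesteps the zonal-PVC problem described above at the cost of some scheduling freedom.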

I don't think we've seen any evidence of this since @jlpettersson's fixes, closing!
