Pipeline: Optimize step signalling in entrypoint

Created on 14 Nov 2019 · 10 comments · Source: tektoncd/pipeline

The entrypoint signaling mechanism currently wakes up every second to check for the file that the previous step writes when it finishes. This is simple but, in our experience, slow. We have a synthetic test that runs a 20-step do-nothing task in something like 60s, while a comparable raw Pod without the entrypoint runs in 10s. We should see if we can get those times down.

We could reduce the sleep interval to 500ms, but another option is to use fsnotify to make the signaling immediate. A further option, described in #1569, is to use a sidecar as a signaling hub.
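For concreteness, here is a minimal Go sketch of the polling approach described above: the waiter stats a wait file on a fixed interval, and every miss costs up to one interval, paid once per step. The file path, function name, and the `.err` failure convention are illustrative assumptions, not Tekton's actual entrypoint code.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// waitForFile polls for the named file at the given interval and returns
// once it exists, or an error if a "<path>.err" marker appears first.
// (Hypothetical sketch; path and error convention are assumptions.)
func waitForFile(path string, interval time.Duration) error {
	for {
		if _, err := os.Stat(path + ".err"); err == nil {
			return fmt.Errorf("previous step failed: %s.err exists", path)
		}
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		// Each empty iteration burns up to `interval` of dead time,
		// which is what adds up linearly with the number of steps.
		time.Sleep(interval)
	}
}

func main() {
	// Illustrative wait-file path; the real path is whatever the
	// controller passes to the entrypoint.
	if err := waitForFile("/tekton/tools/0", time.Second); err != nil {
		log.Fatal(err)
	}
}
```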

Labels: area/performance, help wanted, kind/feature, lifecycle/rotten

All 10 comments

/kind feature
/priority important-longterm

/assign

Ok... so first off, my initial numbers were totally incorrect. My imagePullPolicy was just not right, so that accounted for a good part of what I was seeing. Redoing the numbers in my cluster, I see 6s for the vanilla pod case and 17s for the TaskRun case with 20 steps.

So I played with the entrypoint wait time...:
raw pod -- 6s (lower limit...)
1ms -- 12s (burning laptop... power percentage going down in real time)
50ms -- 10-11s
100ms -- 11-12s
200ms -- 11-12s
250ms -- 12-15s (sudden jump here -- not sure why -- might be specific to my test)
300ms -- 14-15s
500ms -- 14-16s
750ms -- 15-16s
1000ms -- 15-17s


The point here is not to pick a magic number like 200ms, but to show that the first big problem in optimizing the entrypoint is that we currently spend a significant chunk of time waiting, and that overhead grows more or less linearly with the number of steps. fsnotify might bring the waiting overhead down to roughly zero, so I'll try that out next.
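A rough sketch of how fsnotify could replace the sleep loop, assuming the same illustrative wait-file convention as the sketch above; this is an outline of the idea, not the actual Tekton implementation.

```go
package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// waitForFileNotify blocks until the named file is created, waking up on
// filesystem events instead of a fixed polling interval.
func waitForFileNotify(path string) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	// Watch the parent directory; create events fire for files inside it.
	if err := watcher.Add(filepath.Dir(path)); err != nil {
		return err
	}

	// Re-check after registering the watch to avoid a race where the file
	// was written before the watch existed.
	if _, err := os.Stat(path); err == nil {
		return nil
	}

	for {
		select {
		case ev := <-watcher.Events:
			if ev.Name == path && ev.Op&fsnotify.Create == fsnotify.Create {
				return nil
			}
		case err := <-watcher.Errors:
			return err
		}
	}
}

func main() {
	// Illustrative wait-file path, as in the polling sketch.
	if err := waitForFileNotify("/tekton/tools/0"); err != nil {
		log.Fatal(err)
	}
}
```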

Later I think it would be good to do a bit of analysis on the initial sync, and maybe on what the init containers are doing...

pod gist
taskrun gist

Thanks for that data Simon! This makes me think we should have a metric for "overhead" time -- time spent between step[n].finish and step[n+1].start. That would let us gather data across a bunch of runs before and after (and during) tweaks to the poll interval and while moving to something better.

This is also something an operator might want to monitor, in case they want to precache popular step images for instance.

Unfortunately, today we don't have a strong signal for when a step actually started executing, due to entrypoint rewriting. Tackling that first could help here and probably in other places as well.
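For illustration, a hedged sketch of how such an "overhead" metric could be computed from the step pod's container statuses. The field names come from the Kubernetes core/v1 API, but the helper itself (`stepOverheads`) is hypothetical, and as noted above, `StartedAt` reflects when the rewritten entrypoint started rather than when the user's command actually ran, so this is only an approximation.

```go
package metrics

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stepOverheads returns the idle gap between step[n]'s finish and
// step[n+1]'s start, assuming statuses are ordered by step and terminated.
func stepOverheads(statuses []corev1.ContainerStatus) []time.Duration {
	var gaps []time.Duration
	for i := 1; i < len(statuses); i++ {
		prev := statuses[i-1].State.Terminated
		cur := statuses[i].State.Terminated
		if prev == nil || cur == nil {
			continue // step not finished yet; skip
		}
		gaps = append(gaps, cur.StartedAt.Time.Sub(prev.FinishedAt.Time))
	}
	return gaps
}
```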

I've been working with Kata containers a fair bit lately and... inotify does not work there. I guess that means our advanced sleep technology is a really good choice for now.

(remove/re-add labels to check if project automation bot is working, plz ignore)

@skaegi feel free to bring this back to the API WG for discussion if it needs priority attention

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
