Test-infra: Support run_after_failure jobs

Created on 7 May 2018 · 14 Comments · Source: kubernetes/test-infra

Brought up during KubeCon.

@krzyzacy @cjwagner

/area prow
/kind feature

area/prow help wanted kind/feature lifecycle/rotten

kargakis

Most helpful comment

I don't think that is the consensus. Like Ben is saying, we really need something more general than tacking on another triggering mechanism.
run_after_success is already a problem because it has to be specifically handled in a number of places and introduces dependencies between jobs. Here are some examples:

Tide doesn't always work properly with run_after_success jobs because it doesn't yet have logic to correctly trigger parent jobs of required run_after_success child jobs.
https://github.com/kubernetes/test-infra/pull/8492#issuecomment-401125711
https://github.com/kubernetes/test-infra/pull/8415#issue-196287044

I'm worried that adding a new triggering mechanism without generalizing and exposing a better interface to Prow components will result in more technical debt and bugs like these.
This could probably use a design discussion breakout session?

cjwagner on 2 Jul 2018

👍2

All 14 comments

/help

kargakis on 27 May 2018

@kargakis:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot on 27 May 2018

We should think through a clear boundary for the logic that we want to support here; this sort of logic causes the majority of correctness bugs in prow.

stevekuznetsov on 31 May 2018

Simply run_after, plus some condition like success?
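
For illustration, a minimal Go sketch of what a single run_after with a condition might look like. Everything here (the `RunAfter` and `Condition` types, the `childrenToRun` helper, and the job names) is hypothetical and not part of prow's config; it only shows the shape of the idea.

```go
package main

import "fmt"

// Condition is a hypothetical enum naming the parent outcome that triggers a child job.
type Condition string

const (
	OnSuccess Condition = "success"
	OnFailure Condition = "failure"
	OnAlways  Condition = "always"
)

// RunAfter is a hypothetical config stanza: run Job once the parent finishes
// with an outcome matching When.
type RunAfter struct {
	Job  string
	When Condition
}

// childrenToRun returns the names of the child jobs whose condition matches
// the parent's final result.
func childrenToRun(children []RunAfter, parentSucceeded bool) []string {
	var out []string
	for _, c := range children {
		switch {
		case c.When == OnAlways,
			c.When == OnSuccess && parentSucceeded,
			c.When == OnFailure && !parentSucceeded:
			out = append(out, c.Job)
		}
	}
	return out
}

func main() {
	children := []RunAfter{
		{Job: "publish-artifacts", When: OnSuccess},
		{Job: "collect-debug-logs", When: OnFailure},
		{Job: "teardown-cluster", When: OnAlways},
	}
	// The parent failed, so only the failure and always children are selected.
	fmt.Println(childrenToRun(children, false)) // prints [collect-debug-logs teardown-cluster]
}
```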

krzyzacy on 31 May 2018

If we're going to do that, I think something a tad more general might be worth thinking about. The coupling of our triggering to job definitions is a bit awkward right now. We also have a couple of mutually exclusive prowjob fields already.

BenTheElder on 31 May 2018

👍2

yeah, also that :-)

krzyzacy on 31 May 2018

While using prow for Prometheus, I really missed this feature.
Also discussed a bit on slack with @cjwagner & @fejta.
If there is a consensus, I would like to implement this.
It seems we just need to add handling similar to runAfterSuccess for kube.PodFailed & kube.PodPending.

sipian on 2 Jul 2018

Tide doesn't always work properly with run_after_success jobs because it doesn't yet have logic to correctly trigger parent jobs of required run_after_success child jobs.
https://github.com/kubernetes/test-infra/pull/8492#issuecomment-401125711
https://github.com/kubernetes/test-infra/pull/8415#issue-196287044
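
To make the Tide problem concrete, here is a rough Go sketch under simplified assumptions: a required child job can only ever report if its top-level parent is triggered first, so the trigger set has to be computed by walking the job tree. The `Job` type and both helpers are hypothetical stand-ins, not Tide's actual code.

```go
package main

import "fmt"

// Job is a simplified, hypothetical stand-in for a presubmit definition with
// nested run_after_success children.
type Job struct {
	Name            string
	Required        bool
	RunAfterSuccess []Job
}

// hasRequired reports whether j or any of its descendants is required.
func hasRequired(j Job) bool {
	if j.Required {
		return true
	}
	for _, child := range j.RunAfterSuccess {
		if hasRequired(child) {
			return true
		}
	}
	return false
}

// rootsToTrigger returns the top-level jobs that must be started so that every
// required job, including nested children, eventually gets a chance to run.
func rootsToTrigger(jobs []Job) []string {
	var roots []string
	for _, j := range jobs {
		if hasRequired(j) {
			roots = append(roots, j.Name)
		}
	}
	return roots
}

func main() {
	jobs := []Job{
		// "build" is optional itself, but its child "e2e" is required.
		{Name: "build", RunAfterSuccess: []Job{{Name: "e2e", Required: true}}},
		{Name: "lint"},
	}
	fmt.Println(rootsToTrigger(jobs)) // prints [build]
}
```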

cjwagner on 2 Jul 2018

👍2

Managing run_after_success jobs has indeed been very problematic. Today I was thinking of splitting creation of run_after_success jobs into its own service, which has some advantages over the current state of things. Namely, the prow controllers (plank, jenkins operator) would be simplified:

- we can remove the github client entirely once reporting is its own service
- we can trim rbac for the agent controllers down, since they won't need access to create prowjobs anymore
- there is no need to extend the agent controllers to handle run_after_whatever anymore, and there is less code to maintain

I think the extra service can also handle creation of run_after_success jobs for tide, which means that tide would also be slightly simplified?
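
A minimal sketch of the core loop such a service could run, assuming it can list prowjobs and create new ones. The `ProwJob` struct, `Spawner`, and its `create` callback are hypothetical simplifications; a real version would talk to the Kubernetes API and would need durable bookkeeping rather than an in-memory map.

```go
package main

import "fmt"

// ProwJob is a hypothetical, trimmed-down view of a prowjob.
type ProwJob struct {
	Name     string
	State    string   // e.g. "success", "failure", "pending"
	Children []string // names of run_after_success jobs to create
}

// Spawner creates child jobs for finished parents. It is the only component
// that needs permission to create prowjobs in this sketch.
type Spawner struct {
	handled map[string]bool   // parents whose children were already created
	create  func(name string) // hypothetical hook that creates a prowjob
}

// NewSpawner returns a Spawner that calls create for every child job.
func NewSpawner(create func(name string)) *Spawner {
	return &Spawner{handled: map[string]bool{}, create: create}
}

// Sync creates children for every successful parent exactly once, so it is
// safe to call repeatedly with the same list of jobs.
func (s *Spawner) Sync(jobs []ProwJob) {
	for _, pj := range jobs {
		if pj.State != "success" || s.handled[pj.Name] {
			continue
		}
		for _, child := range pj.Children {
			s.create(child)
		}
		s.handled[pj.Name] = true
	}
}

func main() {
	var created []string
	s := NewSpawner(func(name string) { created = append(created, name) })
	jobs := []ProwJob{
		{Name: "build", State: "success", Children: []string{"deploy"}},
		{Name: "unit", State: "failure", Children: []string{"coverage"}},
	}
	s.Sync(jobs)
	s.Sync(jobs) // the second pass is a no-op
	fmt.Println(created) // prints [deploy]
}
```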

kargakis on 12 Jul 2018

I like this idea a lot.

I'd love to see a refactor someday (not necessarily worth the effort, but it could be nice...) where we manage to totally decouple triggering from job definitions so anyone can more easily integrate triggers for, say, "run if a release is tagged on github", "run a downgrade job if my cluster is unresponsive", etc...
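
One way to picture that decoupling, as a hedged Go sketch: triggers become small pluggable predicates over events, and job definitions are bound to them from the outside. The `Event`, `Trigger`, and `Binding` types and the event kinds are all made up for illustration; nothing like this exists in prow as described in this thread.

```go
package main

import "fmt"

// Event is a hypothetical, generic notification (a github webhook, a finished
// job, a cluster health probe, ...).
type Event struct {
	Kind    string
	Payload map[string]string
}

// Trigger decides whether an event should start a job. Job definitions know
// nothing about triggers in this sketch.
type Trigger interface {
	Matches(e Event) bool
}

// Binding attaches a trigger to a job by name.
type Binding struct {
	Job     string
	Trigger Trigger
}

// ReleaseTagged fires when a release is tagged.
type ReleaseTagged struct{}

func (ReleaseTagged) Matches(e Event) bool { return e.Kind == "release_tagged" }

// JobFinished fires when the named parent job ends in the given state, which
// covers both the run_after_success and run_after_failure cases.
type JobFinished struct {
	Parent string
	State  string
}

func (t JobFinished) Matches(e Event) bool {
	return e.Kind == "job_finished" &&
		e.Payload["job"] == t.Parent &&
		e.Payload["state"] == t.State
}

// jobsFor returns the jobs whose trigger matches the event.
func jobsFor(e Event, bindings []Binding) []string {
	var out []string
	for _, b := range bindings {
		if b.Trigger.Matches(e) {
			out = append(out, b.Job)
		}
	}
	return out
}

func main() {
	bindings := []Binding{
		{Job: "publish-release", Trigger: ReleaseTagged{}},
		{Job: "collect-logs", Trigger: JobFinished{Parent: "e2e", State: "failure"}},
	}
	e := Event{Kind: "job_finished", Payload: map[string]string{"job": "e2e", "state": "failure"}}
	fmt.Println(jobsFor(e, bindings)) // prints [collect-logs]
}
```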

BenTheElder on 12 Jul 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 10 Oct 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 9 Nov 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 9 Dec 2018

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close