What happened: Tide spent many hours retesting a failing batch of github.com/kubernetes/enhancements PRs 1111, 1134, 1151, 1153, 1155. One PR caused a test to fail, and no PRs merged during that time.
https://prow.k8s.io/tide-history?repo=kubernetes%2Fenhancements
https://prow.k8s.io/?repo=kubernetes%2Fenhancements&type=batch
What you expected to happen: Tide should merge one of the passing PRs instead of retesting the exact same batch more than 270 times without merging anything or trying a different batch (and many _more_ times before that, when fewer PRs were ready to merge).
How to reproduce it (as minimally and precisely as possible): ¯\_(ツ)_/¯
Please provide links to example occurrences, if any: In description above.
Anything else we need to know?:
/area prow
/area prow/tide
After kicking out the PR causing the batch failure with an /lgtm cancel, the other PRs merged, so this particular instance is mitigated for the moment.
Thanks for the issue, Ben.
Tide triggers a serial test whenever it sees a batch running and no up-to-date pending or passing serial test. Based on the timestamps, it looks like Tide was continually triggering a new batch because the existing one was failing before the next Tide sync. Since we prioritize triggering batches over triggering serial tests, we never got the chance to trigger a serial test.
I think this can be addressed by prioritizing serial tests, or by making Tide trigger both a batch test and a serial test in the same sync loop when appropriate.
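A minimal sketch of that second option, using invented names (`subpool`, `triggerBatch`, `triggerSerial` are stand-ins, not Tide's actual types or functions): in a single sync pass, the serial retest for the top PR is started even when a batch is (re)triggered, so a batch that fails before the next sync can't starve serial retests forever.

```go
// Illustrative only; subpool, triggerBatch and triggerSerial are invented
// stand-ins, not Tide's real code.
package tide

type PR struct{ Number int }

type subpool struct {
	batchPendingOrPassing  bool
	serialPendingOrPassing bool
	prs                    []PR // sorted by merge priority
}

func triggerBatch(sp *subpool)         { sp.batchPendingOrPassing = true }
func triggerSerial(sp *subpool, pr PR) { sp.serialPendingOrPassing = true }

// syncOnce triggers both kinds of retests when appropriate, instead of
// "batch OR serial", so a batch that keeps failing cannot block all progress.
func syncOnce(sp *subpool) {
	if !sp.batchPendingOrPassing && len(sp.prs) > 1 {
		triggerBatch(sp)
	}
	// Previously the sync would stop after triggering the batch; also starting
	// the serial retest for the top PR guarantees forward progress.
	if !sp.serialPendingOrPassing && len(sp.prs) > 0 {
		triggerSerial(sp, sp.prs[0])
	}
}
```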
/assign @stevekuznetsov
WDYT?
We should probably trigger both.
/sig testing
IMHO we should just stop silently swallowing batch test errors and re-testing forever, and instead report them to the PRs, so Tide doesn't consider the PRs again until someone issues a /retest to make their contexts green again: https://github.com/kubernetes/test-infra/issues/12216#issuecomment-529192971
Isn't that against the premise of Tide, though? A flake in a batch shouldn't require human intervention IMO. Even the /retest on LGTM + approve + flake is onerous and has been ~automated.
The /retest on LGTM+approve+flake _is_ automated though? So if it did report the failure we'd auto retest it, serially.
But that would still introduce some jitter, as the /retest command gets posted after some delay.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This came up again today; we're seeing an ever-expanding batch in k/k.
/lifecycle frozen
~One failure mode where this happens is if a batch over a given set of PRs failed but, in the meantime, a new PR became eligible for retesting and merging. We will then start a batch with the new set rather than merging the one PR. This is something I guess we could fix.~
This statement was wrong. We prioritize merging a single PR over creating a batch test, but when we trigger a re-test, we trigger either a batch or a single PR, not both. It would probably be good to do both, to at least make some progress when batches keep failing.
Completely regardless of that, I think we must introduce some jitter. If a batch fails, kick one PR out of it. People can still /retest if they think the failure was not caused by that PR, and the retesting can also be triggered automatically via commenter.
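A rough sketch of that idea with made-up names (nothing below is existing Tide code): report the batch failure against one PR so it drops out of the pool until a /retest makes its contexts green again; the remaining PRs form a different batch next time, which gives the jitter.

```go
// Hypothetical sketch, not Tide's implementation.
package tide

type PR struct{ Number int }

// onBatchFailure evicts a single PR from a failed batch by reporting the
// failure to it (so it leaves the pool until someone issues /retest) and
// returns the rest, which will form a different batch on the next sync.
func onBatchFailure(batch []PR, reportFailure func(PR)) []PR {
	if len(batch) == 0 {
		return batch
	}
	evicted := batch[len(batch)-1] // arbitrary pick; could also be random
	reportFailure(evicted)
	return batch[:len(batch)-1]
}
```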
The original intent here is that:
IMO any behavior other than this is a bug -- such as there being multiple approved PRs with no batch, and/or not scheduling a serial run.
It seems OpenShift disabled batching to work around this in https://github.com/openshift/release/pull/9786 (x-ref'd this issue above).
@BenTheElder very briefly; it was subsequently re-activated in https://github.com/openshift/release/pull/9831.
Apparently present again in https://github.com/kubernetes/node-problem-detector/issues/495: use of PULL_NUMBER breaks batch runs, so nothing has merged and there are 7 PRs stuck in endless batch testing.
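For context on why PULL_NUMBER is a problem in batches: my understanding is that a batch ProwJob carries multiple pulls, so a single PULL_NUMBER isn't meaningful there. A purely illustrative guard a job could use (not code from that repo):

```go
// Illustrative guard only; assumes PULL_NUMBER is absent (or not meaningful)
// in batch runs because the job's refs contain multiple pulls.
package main

import (
	"fmt"
	"os"
)

func main() {
	if pn := os.Getenv("PULL_NUMBER"); pn != "" {
		fmt.Printf("presubmit run for PR #%s\n", pn)
		return
	}
	// Batch (Tide) run: avoid deriving per-PR paths, tags, etc. from PULL_NUMBER.
	fmt.Println("batch run: skipping PR-specific setup")
}
```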
The problem there is that the batch tests fail very quickly. We always start either a batch (if available and none is running yet) or a serial retest. Since the batch always fails before Tide's next sync, we never get to the point of starting a serial retest there. You can see that nicely on https://prow.k8s.io/tide-history?repo=kubernetes%2Fnode-problem-detector, where a batch test is started every two minutes, which matches prow.k8s.io's Tide sync period.
This doesn't seem like a particularly unlikely failure mode, though; we've seen stuff like this before. I think Tide should run a serial test in the background to prevent infinitely spamming broken batches with no progress.
Even if it weren't done concurrently as a matter of course, since we do record history, we could detect repeated batches and opt to start a serial job instead.
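One possible shape for that, as a sketch only (the `record` type and the "TriggerBatch" action name are simplified stand-ins for what Tide's history actually records):

```go
// Sketch only; record and the "TriggerBatch" action name are simplified
// stand-ins for Tide's recorded history entries.
package tide

import (
	"fmt"
	"sort"
)

type record struct {
	action string // e.g. "TriggerBatch"
	prs    []int
}

// batchKey builds an order-independent key for a batch's PR set.
func batchKey(prs []int) string {
	nums := append([]int(nil), prs...)
	sort.Ints(nums)
	return fmt.Sprint(nums)
}

// sameBatchRepeated reports whether the last n history entries all triggered a
// batch over the identical PR set; if so, the caller could start a serial
// retest instead of yet another doomed batch.
func sameBatchRepeated(history []record, n int) bool {
	if n <= 0 || len(history) < n {
		return false
	}
	var key string
	for i, r := range history[len(history)-n:] {
		if r.action != "TriggerBatch" {
			return false
		}
		k := batchKey(r.prs)
		if i == 0 {
			key = k
		} else if k != key {
			return false
		}
	}
	return true
}
```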