What happened: Tide spent many hours retesting a failing batch of github.com/kubernetes/enhancements PRs 1111, 1134, 1151, 1153, 1155. One PR caused a test to fail, and no PRs merged during that time.
https://prow.k8s.io/tide-history?repo=kubernetes%2Fenhancements
https://prow.k8s.io/?repo=kubernetes%2Fenhancements&type=batch
What you expected to happen: Tide should merge one of the passing PRs instead of retesting the exact same batch more than 270 times without merging anything or trying a different batch (and many _more_ times before that, when fewer PRs were ready to merge).
How to reproduce it (as minimally and precisely as possible): ¯\_(ツ)_/¯
Please provide links to example occurrences, if any: In description above.
Anything else we need to know?:
/area prow
/area prow/tide
After kicking out the PR causing the batch failure with an /lgtm cancel, the other PRs merged, so this particular instance is mitigated for the moment.
Thanks for the issue, Ben.
Tide triggers a serial test whenever it sees a batch running and no up-to-date pending or passing serial test. Based on the timestamps, it looks like Tide was continually triggering a new batch because the existing one was failing before the next Tide sync. Since we prioritize triggering batches over triggering serial tests, we never got the chance to trigger a serial test.
I think this can be addressed by prioritizing serial tests, or by making Tide trigger both a batch test and a serial test in the same sync loop when appropriate.
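A minimal sketch of that second option, using invented names (`subpool`, `triggerBatch`, `triggerSerial` are stand-ins, not Tide's actual types or functions): in a single sync pass, the serial retest for the top PR is started even when a batch is (re)triggered, so a batch that fails before the next sync can't starve serial retests forever.

```go
// Illustrative only; subpool, triggerBatch and triggerSerial are invented
// stand-ins, not Tide's real code.
package tide

type PR struct{ Number int }

type subpool struct {
	batchPendingOrPassing  bool
	serialPendingOrPassing bool
	prs                    []PR // sorted by merge priority
}

func triggerBatch(sp *subpool)         { sp.batchPendingOrPassing = true }
func triggerSerial(sp *subpool, pr PR) { sp.serialPendingOrPassing = true }

// syncOnce triggers both kinds of retests when appropriate, instead of
// "batch OR serial", so a batch that keeps failing cannot block all progress.
func syncOnce(sp *subpool) {
	if !sp.batchPendingOrPassing && len(sp.prs) > 1 {
		triggerBatch(sp)
	}
	// Previously the sync would stop after triggering the batch; also starting
	// the serial retest for the top PR guarantees forward progress.
	if !sp.serialPendingOrPassing && len(sp.prs) > 0 {
		triggerSerial(sp, sp.prs[0])
	}
}
```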
/assign @stevekuznetsov
WDYT?
We should probably trigger both.
/sig testing
IMHO we should just stop silently swallowing batch test errors and re-testing forever, and instead report them to the PRs, so Tide doesn't consider the PRs again until someone issues a /retest to make their contexts green again: https://github.com/kubernetes/test-infra/issues/12216#issuecomment-529192971
Isn't that against the premise of Tide, though? A flake in a batch shouldn't require human intervention IMO. Even the /retest on LGTM + approve + flake is onerous and has been ~automated.
The /retest on LGTM+approve+flake _is_ automated though? So if it did report the failure we'd auto retest it, serially.
But that would still introduce some jitter, as the /retest command gets posted after some delay.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This came up again today; we're seeing an ever-expanding batch in k/k.
/lifecycle frozen
~One failure mode where this happens is if a batch over a given set of PRs failed but, in the meantime, a new PR became eligible for retesting and merging. We will then start a batch with the new set rather than merging the one PR. This is something I guess we could fix.~
This statement was wrong. We prioritize merging a single PR over creating a batch test, but when we trigger a re-test, we trigger either a batch or a single PR, not both. It would probably be good to do both, to at least make some progress when batches keep failing.
Completely regardless of that, I think we must introduce some jitter. If a batch fails, kick one PR out of it. People can still /retest if they think the failure was not caused by that PR, and the retesting can also be triggered automatically via commenter.
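A rough sketch of that idea with made-up names (nothing below is existing Tide code): report the batch failure against one PR so it drops out of the pool until a /retest makes its contexts green again; the remaining PRs form a different batch next time, which gives the jitter.

```go
// Hypothetical sketch, not Tide's implementation.
package tide

type PR struct{ Number int }

// onBatchFailure evicts a single PR from a failed batch by reporting the
// failure to it (so it leaves the pool until someone issues /retest) and
// returns the rest, which will form a different batch on the next sync.
func onBatchFailure(batch []PR, reportFailure func(PR)) []PR {
	if len(batch) == 0 {
		return batch
	}
	evicted := batch[len(batch)-1] // arbitrary pick; could also be random
	reportFailure(evicted)
	return batch[:len(batch)-1]
}
```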
The original intent here is that:
IMO any behavior other than this is a bug -- such as there being multiple approved PRs with no batch, and/or not scheduling a serial run.
It seems OpenShift disabled batching to work around this in https://github.com/openshift/release/pull/9786 (x-ref'd this issue above).
@BenTheElder very briefly; it was subsequently re-activated in https://github.com/openshift/release/pull/9831.
Apparently present again in https://github.com/kubernetes/node-problem-detector/issues/495: use of PULL_NUMBER breaks batch runs, so nothing has merged and there are 7 PRs stuck in endless batch testing.
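For context on why PULL_NUMBER is a problem in batches: my understanding is that a batch ProwJob carries multiple pulls, so a single PULL_NUMBER isn't meaningful there. A purely illustrative guard a job could use (not code from that repo):

```go
// Illustrative guard only; assumes PULL_NUMBER is absent (or not meaningful)
// in batch runs because the job's refs contain multiple pulls.
package main

import (
	"fmt"
	"os"
)

func main() {
	if pn := os.Getenv("PULL_NUMBER"); pn != "" {
		fmt.Printf("presubmit run for PR #%s\n", pn)
		return
	}
	// Batch (Tide) run: avoid deriving per-PR paths, tags, etc. from PULL_NUMBER.
	fmt.Println("batch run: skipping PR-specific setup")
}
```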
The problem there is that the batch tests fail very quickly. We always start either a batch (if available and none is running yet) or a serial retest. Since the batch always fails before Tide's next sync, we never get to the point of starting a serial retest there. You can see that nicely on https://prow.k8s.io/tide-history?repo=kubernetes%2Fnode-problem-detector, where a batch test is started every two minutes, which matches prow.k8s.io's Tide sync period.
This doesn't seem like a particularly unlikely failure mode, though; we've seen stuff like this before. I think Tide should run a serial test in the background to prevent infinitely spamming broken batches with no progress.
Even if it weren't done concurrently as a matter of course, since we do record history, we could detect repeated batches and opt to start a serial job instead.
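One possible shape for that, as a sketch only (the `record` type and the "TriggerBatch" action name are simplified stand-ins for what Tide's history actually records):

```go
// Sketch only; record and the "TriggerBatch" action name are simplified
// stand-ins for Tide's recorded history entries.
package tide

import (
	"fmt"
	"sort"
)

type record struct {
	action string // e.g. "TriggerBatch"
	prs    []int
}

// batchKey builds an order-independent key for a batch's PR set.
func batchKey(prs []int) string {
	nums := append([]int(nil), prs...)
	sort.Ints(nums)
	return fmt.Sprint(nums)
}

// sameBatchRepeated reports whether the last n history entries all triggered a
// batch over the identical PR set; if so, the caller could start a serial
// retest instead of yet another doomed batch.
func sameBatchRepeated(history []record, n int) bool {
	if n <= 0 || len(history) < n {
		return false
	}
	var key string
	for i, r := range history[len(history)-n:] {
		if r.action != "TriggerBatch" {
			return false
		}
		k := batchKey(r.prs)
		if i == 0 {
			key = k
		} else if k != key {
			return false
		}
	}
	return true
}
```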