This is on kubernetes/kubernetes#55794
For example https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/55794/pull-kubernetes-bazel-build/16450/ fails with:
I1116 08:53:09.571] Call: git checkout -B test 6e950cc629981ad34e28cdc1a32834b930d6f679
W1116 08:53:09.896] Reset branch 'test'
I1116 08:53:09.899] process 34 exited with code 0 after 0.0m
I1116 08:53:09.900] Call: git merge --no-ff -m 'Merge +refs/pull/55794/head:refs/pr/55794' bd76307d9340ded350b3fb3fb616e1c095bba8be
W1116 08:53:09.918] merge: bd76307d9340ded350b3fb3fb616e1c095bba8be - not something we can merge
E1116 08:53:09.918] Command failed
I1116 08:53:09.919] process 55 exited with code 1 after 0.0m
When I run the same commands on my machine, they work fine:
porridge@kielonek:~/projects/go/src/k8s.io/kubernetes$ git checkout -B test 6e950cc629981ad34e28cdc1a32834b930d6f679
Switched to a new branch 'test'
porridge@kielonek:~/projects/go/src/k8s.io/kubernetes$ git merge --no-ff -m 'Merge +refs/pull/55794/head:refs/pr/55794' bd76307d9340ded350b3fb3fb616e1c095bba8be
Merge made by the 'recursive' strategy.
cluster/gce/gci/master-helper.sh | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
porridge@kielonek:~/projects/go/src/k8s.io/kubernetes$
I suspect this earlier step is not fetching the correct commit?
I1116 08:53:04.779] Call: git fetch --quiet --tags https://github.com/kubernetes/kubernetes master +refs/pull/55794/head:refs/pr/55794
I1116 08:53:09.570] process 25 exited with code 0 after 0.1m
FWIW, a /retest issued later got past this.
Yeah, this looks to be GitHub serving us stale git data -- unless it is pervasive I'm not sure there's any action we can take.
/cc @BenTheElder
/area bootstrap
Retries? I think there's a high probability that retrying for a minute or so will just work.
This is by design.
/reopen
Sorry, what's by design? The fact that a test triggered by a pushed event is expected to fail, and you are expected to go and manually say /retest to re-trigger it?
I suspect this earlier step is not fetching the correct commit?
I1116 08:53:04.779] Call: git fetch --quiet --tags https://github.com/kubernetes/kubernetes master +refs/pull/55794/head:refs/pr/55794
I1116 08:53:09.570] process 25 exited with code 0 after 0.1m
That's fetching the repo's master branch and your PR's head; those are definitely correct...
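If you want to verify locally what that fetch produced, something like the following works; the SHA and ref name are just the ones from the log above, nothing here is part of the test tooling:
# The refspec maps the PR's head onto a local ref, refs/pr/55794:
git show-ref refs/pr/55794
# And the merge target from the failing log should resolve after the fetch:
git cat-file -t bd76307d9340ded350b3fb3fb616e1c095bba8be   # prints "commit" on success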
Sorry, what's by design? The fact that a test triggered by a pushed event is expected to fail, and you are expected to go and manually say /retest to re-trigger it?
It's by design that bootstrap attempts to merge against master.
There's no reason for every test run to keep retrying the fetch & merge against master in the hope that it will eventually work... We get a large number of tests with genuinely un-mergeable code, and I don't think we want all our builders futilely trying to merge PRs that have merge conflicts.
If you force push an update to your PR there's a chance there will be stale data served by GitHub which will cause the merge to fail.
The fact that a test triggered by a pushed event is expected to fail
No, to be clear, the behavior you saw was most likely the GitHub API and GitHub's git servers being inconsistent. The design is: the infrastructure determines the exact revision that is to be tested, and failure to locate it is a failure to test it.
There's not much we should do about GitHub API flakes; this is GitHub informing our services of an update to your PR before it is prepared to serve the updated data. This only occurs occasionally.
Also, the far more common case is that tests were triggered but a force push has removed the commits. It's impossible to tell which situation caused the clone to fail, and retries are not going to be useful in the common case.
I may be missing some details, but it seems to me that running "git fetch" a few times in a loop, breaking out early once "git cat-file -t $sha" exits successfully, would not be a big cost.
I suppose GitHub already knows about these occasional inconsistencies and can provide an upper bound on how long it makes sense to wait?
Surely, we don't want developers to do the repetitive job of clicking "retry" that machines are better at?
Having administered a Prow cluster supporting about 100 developers on many projects for close to six months now, I have never seen a developer hit the issue you describe in this specific report. We _could_ add the logic; it just seems like overkill. Every other part of the system falls over if GitHub misbehaves, so it seems fine if this part does, too.
I'm somewhat surprised that building a robust system does not seem to be a priority.
If you'd like to provide a PR, I am sure we will review it. That being said, one of the critical assumptions for the majority of the Prow system is that GitHub is responsive and correct. I'm not sure it's always worth the effort to engineer the system to be resilient in the face of that sort of failure. Not doing so helps keep our scope small and lets us focus on providing more features.
Let's wait for more evidence of this happening before investing effort. It could be that I was just extremely unlucky, but in my experience the more likely explanation is that these kinds of inconsistencies come and go in waves.
Leaving this issue open for others to find easily.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
This does happen from time to time, but I think we're replacing the git checkout logic entirely; we should consider having some minimal amount of retry in it, @stevekuznetsov. However, letting the job fail does let it be rescheduled, and when I see this it's generally after a force push has left us with incorrect refs. In that case letting the job be rescheduled works correctly, so we probably don't want jobs to spend very long retrying checkout.
Retry would only help in the case where the GitHub API gave us a PR HEAD that a git fetch could not resolve from their git servers, right? And that retry would just extend the time to failure for every job that hits this condition by being started in the middle of a series of rebases. I feel like the case where someone does rebases in quick succession is much more common than the case where GitHub flakes out and gives us bad git data, but I don't have numbers on that. What retry strategy do you think makes the most sense, @BenTheElder?
It might be reasonable to retry immediately once or twice, or after a rather short delay? I agree that my immediate thinking is that jobs should generally fail and be re-run, since usually we seem to just get bad refs / genuinely un-mergeable code (not rebased yet), but we should perhaps take a look at options now that the old checkout scripts are hopefully on the way out.
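To make "once or twice after a short delay" concrete, a rough sketch of what I mean -- none of this is the actual checkout code, the single retry and the 15s pause are made-up numbers, and REPO, REFSPEC and SHA stand for what the job already knows:
# Hypothetical bounded retry: one extra fetch after a short pause, then give up
# and let the job fail / be rescheduled exactly as it is today.
fetch_pr() {
  git fetch --quiet --tags "${REPO}" master "${REFSPEC}" &&
    git cat-file -e "${SHA}^{commit}" 2>/dev/null
}
if ! fetch_pr; then
  sleep 15     # short fixed delay; anything longer just postpones the failure
  fetch_pr     # a non-zero exit here still fails the job
fi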
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close