This is on kubernetes/kubernetes#55794
For example https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/55794/pull-kubernetes-bazel-build/16450/ fails with:
I1116 08:53:09.571] Call: git checkout -B test 6e950cc629981ad34e28cdc1a32834b930d6f679
W1116 08:53:09.896] Reset branch 'test'
I1116 08:53:09.899] process 34 exited with code 0 after 0.0m
I1116 08:53:09.900] Call: git merge --no-ff -m 'Merge +refs/pull/55794/head:refs/pr/55794' bd76307d9340ded350b3fb3fb616e1c095bba8be
W1116 08:53:09.918] merge: bd76307d9340ded350b3fb3fb616e1c095bba8be - not something we can merge
E1116 08:53:09.918] Command failed
I1116 08:53:09.919] process 55 exited with code 1 after 0.0m
When I run the same commands on my machine, they work fine:
porridge@kielonek:~/projects/go/src/k8s.io/kubernetes$ git checkout -B test 6e950cc629981ad34e28cdc1a32834b930d6f679
Switched to a new branch 'test'
porridge@kielonek:~/projects/go/src/k8s.io/kubernetes$ git merge --no-ff -m 'Merge +refs/pull/55794/head:refs/pr/55794' bd76307d9340ded350b3fb3fb616e1c095bba8be
Merge made by the 'recursive' strategy.
cluster/gce/gci/master-helper.sh | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
porridge@kielonek:~/projects/go/src/k8s.io/kubernetes$
I suspect this earlier step is not fetching the correct commit?
I1116 08:53:04.779] Call: git fetch --quiet --tags https://github.com/kubernetes/kubernetes master +refs/pull/55794/head:refs/pr/55794
I1116 08:53:09.570] process 25 exited with code 0 after 0.1m
FWIW, a /retest issued later got past this.
Yeah, this looks to be GitHub serving us stale git data -- unless it is pervasive I'm not sure there's any action we can take.
/cc @BenTheElder
/area bootstrap
Retries? I think there's a high probability that retrying for a minute or so will just work.
This is by design.
/reopen
Sorry, what's by design? The fact that a test triggered by a pushed event is expected to fail, and you are expected to go and manually say /retest to re-trigger it?
I suspect this earlier step is not fetching the correct commit?
I1116 08:53:04.779] Call: git fetch --quiet --tags https://github.com/kubernetes/kubernetes master +refs/pull/55794/head:refs/pr/55794
I1116 08:53:09.570] process 25 exited with code 0 after 0.1m
That's fetching the repo's master branch and your PR's head; those are definitely correct...
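If you want to verify locally what that fetch produced, something like the following works; the SHA and ref name are just the ones from the log above, nothing here is part of the test tooling:
# The refspec maps the PR's head onto a local ref, refs/pr/55794:
git show-ref refs/pr/55794
# And the merge target from the failing log should resolve after the fetch:
git cat-file -t bd76307d9340ded350b3fb3fb616e1c095bba8be   # prints "commit" on success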
Sorry, what's by design? The fact that a test triggered by a pushed event is expected to fail, and you are expected to go and manually say /retest to re-trigger it?
It's by design that bootstrap attempts to merge against master.
There's no reason for every test run to keep retrying the fetch & merge against master in the hope that it will eventually work... We get a large number of tests with genuinely un-mergeable code, and I don't think we want all our builders futilely trying to merge PRs that have merge conflicts.
If you force push an update to your PR there's a chance there will be stale data served by GitHub which will cause the merge to fail.
The fact that a test triggered by a pushed event is expected to fail
No, to be clear, the behavior you saw was most likely the GitHub API and GitHub's git servers being inconsistent. The design is: the infrastructure determines the exact revision that is to be tested, and failure to locate it is a failure to test it.
There's not much we should do about GitHub API flakes; this is GitHub informing our services of an update to your PR before it is prepared to serve the updated data. This only occurs occasionally.
Also, the far more common case is that tests were triggered but a force push has removed the commits. It's impossible to tell which situation caused the clone to fail, and retries are not going to be useful in the common case.
I may be missing some details, but it seems to me that running "git fetch" a few times in a loop, breaking out early once "git cat-file -t $sha" exits successfully, would not be a big cost.
I suppose GitHub already knows about these occasional inconsistencies and can provide an upper bound on how long it makes sense to wait?
Surely, we don't want developers to do the repetitive job of clicking "retry" that machines are better at?
Having administered a Prow cluster supporting about 100 developers on many projects for close to six months now, I have never seen a developer hit the issue you describe in this specific report. We _could_ add the logic; it just seems like overkill. Every other part of the system falls over if GitHub misbehaves, so it seems fine if this part does, too.
I'm somewhat surprised that building a robust system does not seem to be a priority.
If you'd like to provide a PR, I am sure we will review it. That being said, one of the critical assumptions for the majority of the Prow system is that GitHub is responsive and correct. I'm not sure it's always worth the effort to engineer the system to be resilient in the face of that sort of failure. Not doing so helps keep our scope small and lets us focus on providing more features.
Let's wait for more evidence of this happening before investing effort. It could be that I was just extremely unlucky, but in my experience the more likely explanation is that these kinds of inconsistencies come and go in waves.
Leaving this issue open for others to find easily.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
This does happen from time to time, but I think we're replacing the git checkout logic entirely; we should consider having some minimal amount of retry in it, @stevekuznetsov. However, letting the job fail does let it be rescheduled, and when I see this it's generally after a force push has left us with incorrect refs. In that case letting the job be rescheduled works correctly, so we probably don't want jobs to spend very long retrying checkout.
Retry would only help in the case where the GitHub API gave us a PR HEAD that a git fetch could not resolve from their git servers, right? And that retry would just extend the time to failure for every job that hits this condition by being started in the middle of a series of rebases. I feel like the case where someone does rebases in quick succession is much more common than the case where GitHub flakes out and gives us bad git data, but I don't have numbers on that. What retry strategy do you think makes the most sense, @BenTheElder?
It might be reasonable to retry immediately once or twice, or after a rather short delay? I agree that my immediate thinking is that jobs should generally fail and be re-run, since usually we seem to just get bad refs / genuinely un-mergeable code (not rebased yet), but we should perhaps take a look at options now that the old checkout scripts are hopefully on the way out.
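To make "once or twice after a short delay" concrete, a rough sketch of what I mean -- none of this is the actual checkout code, the single retry and the 15s pause are made-up numbers, and REPO, REFSPEC and SHA stand for what the job already knows:
# Hypothetical bounded retry: one extra fetch after a short pause, then give up
# and let the job fail / be rescheduled exactly as it is today.
fetch_pr() {
  git fetch --quiet --tags "${REPO}" master "${REFSPEC}" &&
    git cat-file -e "${SHA}^{commit}" 2>/dev/null
}
if ! fetch_pr; then
  sleep 15     # short fixed delay; anything longer just postpones the failure
  fetch_pr     # a non-zero exit here still fails the job
fi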
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close