Test-infra: Excessive failed batches

Created on 15 Nov 2017 · 15 Comments · Source: kubernetes/test-infra

Deck is showing many recent consecutive batch job failures. The last batch and the current batch look healthier, but Deck currently shows 15 (nearly) consecutive batch failures.
/cc @ixdy
http://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=batch

This error text is common between the failures that I checked:

cmd/kube-proxy/app/server.go:402:19: cannot use config.Burst (type int32) as type int in assignment

area/jobs kind/bug lifecycle/stale

cjwagner

All 15 comments

https://github.com/kubernetes/kubernetes/pull/53787/files changes config.Burst

BenTheElder on 15 Nov 2017

it also appears to be in all the failed batches

BenTheElder on 15 Nov 2017

W1115 03:43:36.244] # k8s.io/kubernetes/cmd/kube-proxy/app
W1115 03:43:36.245] cmd/kube-proxy/app/server.go:402:19: cannot use config.Burst (type int32) as type int in assignment
W1115 03:44:03.848] !!! [1115 03:44:03] Call tree:
W1115 03:44:03.850] !!! [1115 03:44:03]  1: /go/src/k8s.io/kubernetes/hack/lib/golang.sh:707 kube::golang::build_binaries_for_platform(...)
W1115 03:44:03.852] !!! [1115 03:44:03]  2: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
W1115 03:44:03.856] !!! [1115 03:44:03] Call tree:
W1115 03:44:03.858] !!! [1115 03:44:03]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
W1115 03:44:03.862] !!! [1115 03:44:03] Call tree:
W1115 03:44:03.863] !!! [1115 03:44:03]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
W1115 03:44:03.864] make[1]: *** [all] Error 1
I1115 03:44:03.964] Makefile:92: recipe for target 'all' failed
I1115 03:44:03.965] FAILED   hack/make-rules/../../hack/verify-generated-docs.sh    128s

That change is definitely in #53787. Not sure why it passed in single PR mode though.

BenTheElder on 15 Nov 2017

That's because https://github.com/kubernetes/kubernetes/pull/53850 changed the type and was merged just a day ago.

-   Burst int
+   Burst int32

https://github.com/kubernetes/kubernetes/commit/7950609b31df2354f31199dba6706d62959b48b4#diff-96a1b32fdd44ac3b9ab11bfaa36df4cdR40
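A minimal Go sketch of this failure mode, for readers who want to see it in isolation. The types `proxyConfig` and `clientSettings` and the function `applyBurst` are hypothetical stand-ins, not the actual kube-proxy code: one side's `Burst` field is now `int32` while the code consuming it still expects `int`, and Go performs no implicit numeric conversion on assignment.

```go
package main

import "fmt"

// proxyConfig is a hypothetical stand-in for the configuration type
// whose Burst field changed from int to int32.
type proxyConfig struct {
	Burst int32
}

// clientSettings is a hypothetical stand-in for the consumer whose
// Burst field is still a plain int.
type clientSettings struct {
	Burst int
}

// applyBurst copies Burst from the config into the client settings.
func applyBurst(cfg proxyConfig) clientSettings {
	var s clientSettings
	// s.Burst = cfg.Burst
	// The line above does not compile:
	// cannot use cfg.Burst (type int32) as type int in assignment
	s.Burst = int(cfg.Burst) // an explicit conversion is required
	return s
}

func main() {
	fmt.Println(applyBurst(proxyConfig{Burst: 10}).Burst) // prints 10
}
```

Each change compiles on its own; the error only appears once both are present in the same tree, which is consistent with the single-PR run passing and the batches failing.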

Then https://github.com/kubernetes/kubernetes/pull/53787 will not be needed any longer. I will close it.

xiangpengzhao on 15 Nov 2017

We should figure out a way to prevent this sort of batch blocking in the future. This could have been avoided by, e.g., triggering another serial run, or by backing off from including this PR in batches after a certain number of failed batches that included it ... 🤔
@cjwagner @kargakis @spxtr

BenTheElder on 15 Nov 2017

Maybe the combination of two PRs caused the failure? That has happened before.

krzyzacy on 15 Nov 2017

@krzyzacy, yeah, as @xiangpengzhao said, another PR that changed the type went in between the single-PR run and the batch testing, but the submit-queue was happy to keep trying to merge the now-broken PR in nearly every batch all day. We ought to be able to avoid that.

Edit: in particular maybe tide can avoid this failure mode

BenTheElder on 15 Nov 2017

The two PRs had no textual conflict, only a logical one, so needs-rebase wasn't triggered. It would be good to have a way to find which PR is the culprit behind a batch testing failure and exclude it from the batch, or even from the submit-queue.
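One way to look for the culprit is to bisect the failing batch. The sketch below is only an illustration under strong assumptions: exactly one PR in the batch breaks the build, the result of testing a subset is deterministic, and `batchPasses` is a hypothetical callback standing in for a full test run of the given PRs merged onto the base branch. Neither `findCulprit` nor `batchPasses` exists in the submit-queue or Prow.

```go
package main

import "fmt"

// findCulprit bisects a batch that is known to fail and returns the
// single PR assumed to be responsible. batchPasses is a hypothetical
// callback that reports whether the given PRs pass when tested together.
// The second return value is false if the batch is empty.
func findCulprit(batch []int, batchPasses func([]int) bool) (int, bool) {
	if len(batch) == 0 {
		return 0, false
	}
	for len(batch) > 1 {
		mid := len(batch) / 2
		if batchPasses(batch[:mid]) {
			// The first half is healthy, so the culprit is in the second half.
			batch = batch[mid:]
		} else {
			batch = batch[:mid]
		}
	}
	return batch[0], true
}

func main() {
	// Simulated test run: any subset containing PR 53787 fails.
	passes := func(prs []int) bool {
		for _, pr := range prs {
			if pr == 53787 {
				return false
			}
		}
		return true
	}
	// 101, 102 and 103 are made-up PR numbers for illustration.
	fmt.Println(findCulprit([]int{101, 102, 53787, 103}, passes)) // prints 53787 true
}
```

Every call to `batchPasses` costs a full test run, and the single-culprit assumption breaks down when the failure comes from an interaction between two PRs in the same batch.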

xiangpengzhao on 15 Nov 2017

> but the submit-queue was happy to keep trying to merge the now broken PR in nearly every batch all day. We ought to be able to avoid that.

Given that the single PR run passed I think the submit-queue was acting reasonably. It makes sense to kick other PRs out of the batch since the actually broken PR had passed on its own.

cjwagner on 15 Nov 2017

@rmmh

spxtr on 16 Nov 2017

Identifying which PR in a batch has issues is hard to do automatically.

If you identify this happening because of a single PR, you can usually run /test all on it to get it out of the queue.

rmmh on 17 Nov 2017

Or use /lgtm cancel, since the breaking changes will most likely not be merged yet at the point you start the tests; in that case /test all ends up green and the PR is added back to a batch.

kargakis on 17 Nov 2017

/lgtm cancel will also take effect faster than waiting for a serial build to fail.

cjwagner on 17 Nov 2017

Just putting a /hold worked, but first humans had to notice an entire day of failed batches and find the common PR. I think we can rate-limit batch testing a PR.
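A rough sketch of what such rate-limiting could look like, assuming a simple per-PR counter of consecutive failed batches. The `batchLimiter` type, its methods, and the threshold of 3 are all hypothetical; none of this is existing submit-queue, Prow, or Tide code.

```go
package main

import "fmt"

// batchLimiter is a hypothetical tracker that stops including a PR in
// batches after it has been part of too many consecutive failed batches.
type batchLimiter struct {
	maxFailures int
	failures    map[int]int // PR number -> consecutive failed batches containing it
}

func newBatchLimiter(maxFailures int) *batchLimiter {
	return &batchLimiter{maxFailures: maxFailures, failures: map[int]int{}}
}

// recordBatch updates the counters for every PR in a finished batch.
// A passing batch clears the counters of all its members.
func (l *batchLimiter) recordBatch(prs []int, passed bool) {
	for _, pr := range prs {
		if passed {
			delete(l.failures, pr)
		} else {
			l.failures[pr]++
		}
	}
}

// eligible reports whether pr may be included in the next batch.
func (l *batchLimiter) eligible(pr int) bool {
	return l.failures[pr] < l.maxFailures
}

// reset clears a PR's counter, e.g. after a fresh serial run passes.
func (l *batchLimiter) reset(pr int) {
	delete(l.failures, pr)
}

// filter returns the candidates that are still eligible for batching.
func (l *batchLimiter) filter(candidates []int) []int {
	var out []int
	for _, pr := range candidates {
		if l.eligible(pr) {
			out = append(out, pr)
		}
	}
	return out
}

func main() {
	l := newBatchLimiter(3)
	// 101 through 104 are made-up PR numbers; 53787 is in every failed batch.
	l.recordBatch([]int{53787, 101, 102}, false)
	l.recordBatch([]int{53787, 102, 103}, false)
	l.recordBatch([]int{53787, 103, 104}, false)
	fmt.Println(l.filter([]int{53787, 101, 102, 103, 104})) // prints [101 102 103 104]
}
```

The counter does not identify the broken PR; it only bounds how long any one PR can keep failing batches. Innocent PRs that repeatedly share batches with the culprit accumulate failures too, so an excluded PR would presumably need to fall back to a serial run and call `reset` once that run passes.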

BenTheElder on 17 Nov 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale