Deck is showing lots of recent consecutive batch job failures. It looks like the last batch and the current batch are healthier now, but deck currently shows 15 (nearly) consecutive batch failures.
/cc @ixdy
http://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=batch
This error text is common between the failures that I checked:
cmd/kube-proxy/app/server.go:402:19: cannot use config.Burst (type int32) as type int in assignment
https://github.com/kubernetes/kubernetes/pull/53787/files changes config.Burst
That PR also appears to be in all the failed batches.
W1115 03:43:36.244] # k8s.io/kubernetes/cmd/kube-proxy/app
W1115 03:43:36.245] cmd/kube-proxy/app/server.go:402:19: cannot use config.Burst (type int32) as type int in assignment
W1115 03:44:03.848] !!! [1115 03:44:03] Call tree:
W1115 03:44:03.850] !!! [1115 03:44:03] 1: /go/src/k8s.io/kubernetes/hack/lib/golang.sh:707 kube::golang::build_binaries_for_platform(...)
W1115 03:44:03.852] !!! [1115 03:44:03] 2: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
W1115 03:44:03.856] !!! [1115 03:44:03] Call tree:
W1115 03:44:03.858] !!! [1115 03:44:03] 1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
W1115 03:44:03.862] !!! [1115 03:44:03] Call tree:
W1115 03:44:03.863] !!! [1115 03:44:03] 1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
W1115 03:44:03.864] make[1]: *** [all] Error 1
I1115 03:44:03.964] Makefile:92: recipe for target 'all' failed
I1115 03:44:03.965] FAILED hack/make-rules/../../hack/verify-generated-docs.sh 128s
That change is definitely in #53787. Not sure why it passed in single PR mode though.
That's because https://github.com/kubernetes/kubernetes/pull/53850 changed the type and was merged just a day ago.
- Burst int
+ Burst int32
https://github.com/kubernetes/kubernetes/commit/7950609b31df2354f31199dba6706d62959b48b4#diff-96a1b32fdd44ac3b9ab11bfaa36df4cdR40
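For anyone hitting the same compile error, here is a minimal sketch of the mismatch. The struct and function names (`clientConfig`, `proxyConfig`, `applyBurst`) are made up for illustration and are not the real kube-proxy types; the only thing taken from the log is that an `int32` Burst is being assigned to an `int`. Go does not convert between integer types implicitly, so the assignment needs an explicit conversion.

```go
package main

import "fmt"

// clientConfig is a hypothetical stand-in for a config whose Burst field is a plain int.
type clientConfig struct {
	Burst int
}

// proxyConfig is a hypothetical stand-in for a config whose Burst field became int32.
type proxyConfig struct {
	Burst int32
}

// applyBurst copies the burst setting across with an explicit conversion.
func applyBurst(dst *clientConfig, src proxyConfig) {
	// Writing `dst.Burst = src.Burst` here does not compile:
	// cannot use src.Burst (type int32) as type int in assignment
	dst.Burst = int(src.Burst)
}

func main() {
	var c clientConfig
	applyBurst(&c, proxyConfig{Burst: 10})
	fmt.Println(c.Burst) // prints 10
}
```

Each PR compiles against the tree it was tested on, which is why the serial run was green and only the merged combination broke.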
Then https://github.com/kubernetes/kubernetes/pull/53787 will not be needed any longer. I will close it.
We should figure out a way to prevent this sort of batch blocking in the future. This would have been avoided by, e.g., triggering another serial run, or by backing off from including this PR in the batch after so many failed batches including it ...
@cjwagner @kargakis @spxtr
Maybe the combination of two PRs caused the failure? That has happened before.
@krzyzacy, yeah as @xiangpengzhao said another PR went in between the single PR run and the batch testing which changed the type, but the submit-queue was happy to keep trying to merge the now broken PR in nearly every batch all day. We ought to be able to avoid that.
Edit: in particular maybe tide can avoid this failure mode
The two PRs had no textual merge conflict, only a logical one, so needs-rebase wasn't triggered. It'd be good to have a way to find which PR is the culprit behind a batch failure and exclude it from the batch, or even from the submit-queue.
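One rough way to narrow down a culprit, sketched under assumptions: given the PR lists of the failed batches, report the PRs that show up in every one of them. The function name `suspects` and the input shape are hypothetical and not part of the submit-queue or tide, and the PR numbers other than 53787 are placeholders. A PR found this way is only a suspect, since batches can also fail from flakes or from a different pair of PRs.

```go
package main

import (
	"fmt"
	"sort"
)

// suspects returns the PR numbers present in every failed batch, in ascending order.
func suspects(failedBatches [][]int) []int {
	counts := map[int]int{}
	for _, batch := range failedBatches {
		// Count each PR at most once per batch.
		seen := map[int]bool{}
		for _, pr := range batch {
			if !seen[pr] {
				seen[pr] = true
				counts[pr]++
			}
		}
	}
	var out []int
	for pr, n := range counts {
		if n == len(failedBatches) {
			out = append(out, pr)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	// 1001 and 1002 are placeholder PR numbers, not real batch contents.
	failed := [][]int{
		{53787, 1001},
		{1002, 53787},
		{53787},
	}
	fmt.Println(suspects(failed)) // prints [53787]
}
```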
> but the submit-queue was happy to keep trying to merge the now broken PR in nearly every batch all day. We ought to be able to avoid that.
Given that the single PR run passed I think the submit-queue was acting reasonably. It makes sense to kick other PRs out of the batch since the actually broken PR had passed on its own.
@rmmh
Identifying which PR in a batch has issues is hard to do automatically.
If you identify this happening because of a single PR, you can usually run /test all on it to get it out of the queue.
Or use /lgtm cancel: the breaking change will most likely not have merged yet when you start the tests, so /test all ends up green and the PR is added back to a batch.
/lgtm cancel will also take effect faster than waiting for a serial build to fail.
Just putting a /hold worked, but first humans had to notice an entire day of failed batches and find the common PR. I think we can rate-limit batch testing a PR.
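To make the rate-limiting idea concrete, here is a minimal sketch, not how the submit-queue or tide actually works. The `batchLimiter` type, its methods, and the threshold are all hypothetical: it counts consecutive failed batches per PR and stops offering a PR for batching once the count reaches a limit, at which point a fresh serial run would be the natural next step.

```go
package main

import "fmt"

// batchLimiter is a hypothetical tracker of consecutive failed batches per PR.
type batchLimiter struct {
	maxFailures int
	failures    map[int]int
}

// newBatchLimiter creates a limiter that benches a PR after maxFailures consecutive failed batches.
func newBatchLimiter(maxFailures int) *batchLimiter {
	return &batchLimiter{maxFailures: maxFailures, failures: map[int]int{}}
}

// Record notes the outcome of one batch run for every PR that was in it.
func (l *batchLimiter) Record(prs []int, passed bool) {
	for _, pr := range prs {
		if passed {
			delete(l.failures, pr) // a passing batch resets the streak
		} else {
			l.failures[pr]++
		}
	}
}

// Eligible reports whether a PR may still be placed in the next batch.
func (l *batchLimiter) Eligible(pr int) bool {
	return l.failures[pr] < l.maxFailures
}

func main() {
	limiter := newBatchLimiter(3) // the threshold of 3 is an arbitrary choice
	for i := 0; i < 3; i++ {
		limiter.Record([]int{53787}, false)
	}
	fmt.Println(limiter.Eligible(53787))
}
```

One caveat with this design: every PR in a failed batch gets charged, so innocent PRs that keep landing next to the broken one would be benched too and would need their own serial run to get back in.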