Specifically the gke-gpu job. Something is screwy with the cache even though the extremely similar gce-gpu job is acting fine.
see: https://github.com/kubernetes/kubernetes/pull/46662#issuecomment-331321429
discussion here: https://kubernetes.slack.com/archives/C5NV3MT97/p1506038144000171
I've been debugging this myself, and I've already asked @ixdy and @krzyzacy to take a look. We've noticed some oddities (namely, the workspace also seems to be dirty on the bad job), but this still needs solving.
Thankfully, the affected job has so far been an infrequent, manually triggered, non-blocking job, but I suspect there's something not quite right with how we're doing the cache in general.
/assign
Hmm this should really be something like area/bazel. Maybe we need a label @ixdy :-)
Ref: https://github.com/kubernetes/test-infra/pull/4689#issuecomment-331344498
The same problem (?) spotted on @krzyzacy's kops port-to-prow :confused:
Example of dirty tree for the kops job on gubernator.
So the dirty-workspace issue, at least, currently seems to happen consistently with pull-kubernetes-e2e-kops-aws-prow and pull-kubernetes-e2e-gke-gpu but not pull-kubernetes-e2e-gce-gpu (all of which should be running scenarios/kubernetes_e2e with --build=bazel on Prow with very similar job configs...)
pull-kubernetes-e2e-kops-aws-prow is a different issue.
We're downloading the kops binary into the kubernetes tree before running the build:
```
W0922 01:01:47.496] + curl -fsS --retry 3 -o /go/src/k8s.io/kubernetes/kops https://storage.googleapis.com/kops-ci/bin/1.7.1-beta.3+0756ece56/linux/amd64/kops
...
W0922 01:01:47.885] + chmod +x /go/src/k8s.io/kubernetes/kops
```
so it's expected that the tree is reported as dirty.
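For reference, the "dirty" flag here is just git noticing that untracked binary in the checkout. A minimal sketch of the check (the helper name is made up and the path mirrors the log above; this is not the actual scenario code):

```python
# Minimal sketch of why the kops job's tree reads as dirty: dropping a binary
# into the checkout leaves an untracked file, which git then reports.
import subprocess

def tree_is_dirty(repo='/go/src/k8s.io/kubernetes'):
    # --porcelain prints one line per modified or untracked path, so any
    # output at all (e.g. "?? kops") means the workspace is no longer clean.
    out = subprocess.check_output(
        ['git', 'status', '--porcelain'], cwd=repo).decode()
    return bool(out.strip())
```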
I updated my test PR to list the /go/src/k8s.io/kubernetes directory.
Looking at the logs in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46662/pull-kubernetes-e2e-gke-gpu/11/, vendor/ looks fine, but staging/ is missing.
I'm beginning to think we may want to resurrect #2906 or #2910.
So the aws and gke-gpu failures are indeed related, though the exact details are a bit confusing.
On Jenkins, $WORKSPACE is a unique directory on the host for that job (but not for the build), and HOME is by default set to /var/lib/jenkins, which is not unique at all. As a result, in bootstrap we currently set HOME to $WORKSPACE so that anything mucking in $HOME goes somewhere unique-ish. We currently apply this logic regardless of where bootstrap is running.
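Roughly, the current behavior amounts to something like this (a simplified sketch of what's described above, not the actual bootstrap.py code; the function name is made up):

```python
# Simplified sketch of the HOME handling described above.
import os

def repoint_home_at_workspace():
    workspace = os.environ.get('WORKSPACE', os.getcwd())
    # On Jenkins, WORKSPACE is unique per job, so this is mostly harmless.
    # On prow, WORKSPACE can end up being the repo checkout itself, in which
    # case everything that writes to $HOME lands inside the kubernetes tree.
    os.environ['HOME'] = workspace
```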
An additional complication is that when we run the kubekins-e2e container on Jenkins, we set HOME=WORKSPACE=/workspace inside the container, but map the kubernetes repo to /go/src/k8s.io/kubernetes.
On jobs running on prow, WORKSPACE is somehow set to /go/src/k8s.io/kubernetes (not sure how - probably bootstrap, maybe unintentionally?), and then HOME also gets set to this. As a result, many things explode.
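One possible direction, purely as a hypothetical sketch rather than a decided fix, would be to skip the HOME override whenever WORKSPACE is the repo checkout itself:

```python
# Hypothetical guard illustrating the shape of a fix, not the fix itself:
# leave HOME alone if WORKSPACE points at the kubernetes checkout.
import os

def maybe_repoint_home(repo='/go/src/k8s.io/kubernetes'):
    workspace = os.environ.get('WORKSPACE', '')
    # Repointing HOME into the checkout is what lets $HOME writes dirty the tree.
    if workspace and os.path.realpath(workspace) != os.path.realpath(repo):
        os.environ['HOME'] = workspace
```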
Whoops that shouldn't close this just yet :-)
FWIW, I'm seeing the bazel job failing on individual PRs pretty frequently, too (e.g. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/53003/pull-kubernetes-e2e-gce-bazel/31755/)
@abgworrall I don't think that's related. It also looks to be mostly green currently:
https://k8s-testgrid.appspot.com/kubernetes-presubmits#pull-kubernetes-e2e-gce-bazel
https://k8s-testgrid.appspot.com/kubernetes-presubmits#pull-kubernetes-e2e-gce-gpu