Specifically the gke-gpu job. Something is screwy with the cache even though the extremely similar gce-gpu job is acting fine.
see: https://github.com/kubernetes/kubernetes/pull/46662#issuecomment-331321429
discussion here: https://kubernetes.slack.com/archives/C5NV3MT97/p1506038144000171
I've been debugging this myself, and I've already asked @ixdy and @krzyzacy to take a look. We've noticed some oddities (namely, the workspace also seems to be dirty on the bad job), but this still needs solving.
Thankfully, the affected job has so far been an infrequent, manually triggered, non-blocking job, but I suspect there's something not quite right with how we're doing the cache in general.
/assign
Hmm this should really be something like area/bazel. Maybe we need a label @ixdy :-)
Ref: https://github.com/kubernetes/test-infra/pull/4689#issuecomment-331344498
The same problem (?) spotted on @krzyzacy's kops port-to-prow :confused:
Example of dirty tree for the kops job on gubernator.
So the dirty-workspace issue, at least, currently seems to happen consistently with pull-kubernetes-e2e-kops-aws-prow and pull-kubernetes-e2e-gke-gpu but not pull-kubernetes-e2e-gce-gpu (all of which should be running scenarios/kubernetes_e2e with --build=bazel on Prow with very similar job configs...)
pull-kubernetes-e2e-kops-aws-prow is a different issue.
We're downloading the kops binary into the kubernetes tree before running the build:
```
W0922 01:01:47.496] + curl -fsS --retry 3 -o /go/src/k8s.io/kubernetes/kops https://storage.googleapis.com/kops-ci/bin/1.7.1-beta.3+0756ece56/linux/amd64/kops
...
W0922 01:01:47.885] + chmod +x /go/src/k8s.io/kubernetes/kops
```
so it's expected that the tree is reported as dirty.
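For reference, the "dirty" flag here is just git noticing that untracked binary in the checkout. A minimal sketch of the check (the helper name is made up and the path mirrors the log above; this is not the actual scenario code):

```python
# Minimal sketch of why the kops job's tree reads as dirty: dropping a binary
# into the checkout leaves an untracked file, which git then reports.
import subprocess

def tree_is_dirty(repo='/go/src/k8s.io/kubernetes'):
    # --porcelain prints one line per modified or untracked path, so any
    # output at all (e.g. "?? kops") means the workspace is no longer clean.
    out = subprocess.check_output(
        ['git', 'status', '--porcelain'], cwd=repo).decode()
    return bool(out.strip())
```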
I updated my test PR to list the /go/src/k8s.io/kubernetes directory.
Looking at the logs in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46662/pull-kubernetes-e2e-gke-gpu/11/, vendor/ looks fine, but staging/ is missing.
I'm beginning to think we may want to resurrect #2906 or #2910.
So the aws and gke-gpu failures are indeed related, though the exact details are a bit confusing.
On Jenkins, $WORKSPACE is a unique directory on the host for that job (but not for the build), and HOME is by default set to /var/lib/jenkins, which is not unique at all. As a result, in bootstrap we currently set HOME to $WORKSPACE so that anything mucking in $HOME goes somewhere unique-ish. We currently apply this logic regardless of where bootstrap is running.
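Roughly, the current behavior amounts to something like this (a simplified sketch of what's described above, not the actual bootstrap.py code; the function name is made up):

```python
# Simplified sketch of the HOME handling described above.
import os

def repoint_home_at_workspace():
    workspace = os.environ.get('WORKSPACE', os.getcwd())
    # On Jenkins, WORKSPACE is unique per job, so this is mostly harmless.
    # On prow, WORKSPACE can end up being the repo checkout itself, in which
    # case everything that writes to $HOME lands inside the kubernetes tree.
    os.environ['HOME'] = workspace
```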
An additional complication is that when we run the kubekins-e2e container on Jenkins, we set HOME=WORKSPACE=/workspace inside the container, but map the kubernetes repo to /go/src/k8s.io/kubernetes.
On jobs running on prow, WORKSPACE is somehow set to /go/src/k8s.io/kubernetes (not sure how - probably bootstrap, maybe unintentionally?), and then HOME also gets set to this. As a result, many things explode.
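One possible direction, purely as a hypothetical sketch rather than a decided fix, would be to skip the HOME override whenever WORKSPACE is the repo checkout itself:

```python
# Hypothetical guard illustrating the shape of a fix, not the fix itself:
# leave HOME alone if WORKSPACE points at the kubernetes checkout.
import os

def maybe_repoint_home(repo='/go/src/k8s.io/kubernetes'):
    workspace = os.environ.get('WORKSPACE', '')
    # Repointing HOME into the checkout is what lets $HOME writes dirty the tree.
    if workspace and os.path.realpath(workspace) != os.path.realpath(repo):
        os.environ['HOME'] = workspace
```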
Whoops that shouldn't close this just yet :-)
FWIW, I'm seeing the bazel job failing on individual PRs pretty frequently, too (e.g. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/53003/pull-kubernetes-e2e-gce-bazel/31755/)
@abgworrall I don't think that's related. It also looks to be mostly green currently:
https://k8s-testgrid.appspot.com/kubernetes-presubmits#pull-kubernetes-e2e-gce-bazel
https://k8s-testgrid.appspot.com/kubernetes-presubmits#pull-kubernetes-e2e-gce-gpu