Test-infra: Implement Bazel Remote Caching [Tracking Issue]

Created on 13 Feb 2018  ·  13 Comments  ·  Source: kubernetes/test-infra

Bazel 0.10.0 is now in use in test-infra and kubernetes (release-1.10/master). This release contains some nice improvements to the HTTP remote caching system, and we should leverage them instead of our existing "use persistent storage for the local cache" approach.

Why?

  • using a remote cache means the cache is global, so all jobs can share it
  • the remote cache is (mostly) content addressed and designed for sharing; the local cache is much less so
  • we can't have a broken node with a bad cache to hunt down if we don't put the cache on the node

Why not?

  • it's another thing to deploy and keep an eye on
  • bugs / invalid caching can break builds

Action Items

  • [x] configure bazel for remote caching and test this by building and testing test-infra, kubernetes, etc.
  • [x] automate configuration (ref: https://github.com/kubernetes/test-infra/pull/6661, https://github.com/kubernetes/test-infra/pull/6673, https://github.com/kubernetes/test-infra/pull/6682, https://github.com/kubernetes/test-infra/pull/6685)

    • NOTE: we're working around the invalid caching issue by auto configuring jobs to point to a (per-repo, per-host-tools-versions (hashed)) cache, so they simply use a different cache

  • [x] evaluate caching backends (some discussion here: https://github.com/kubernetes/test-infra/pull/6767)

    • GCS is nice and definitely something to consider, but having the cache in-cluster should be faster and we can control the cache size / costs / eviction better.

    • Hazelcast is a fancy distributed cache, but much more complex and comes with JVM baggage

    • nginx with webdav or apache httpd with webdav works, but in particular doesn't handle eviction, stats etc.

    • bazel-remote is great, but we want to host one server for multiple "caches" so we can split up repos etc. to be defensive about incorrect caching

    • Our own minimal implementation: easy enough, the protocol is essentially just HTTP PUT/GET of blobs

  • Improve our implementation

    • [x] implement an eviction strategy (https://github.com/kubernetes/test-infra/pull/6812)

    • [x] document (https://github.com/kubernetes/test-infra/pull/6879)

    • [x] rename from nursery to greenhouse to be less confusing :-) (https://github.com/kubernetes/test-infra/pull/6879)

    • [x] add metrics

  • [x] set up monitoring / dashboard
  • roll out to "real" jobs / integrate with our images [WIP (test-infra)]

    • [x] integrate into our images (https://github.com/kubernetes/test-infra/pull/6859)

    • [x] enable for test-infra (https://github.com/kubernetes/test-infra/pull/6879)

    • [x] enable for everyone else who wants it (talked to cluster-registry, they're now using it) [WIP]

    • [x] enable for kubernetes (rolled out to ci bazel build / test, presubmit test) [WIP]
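
The note above about pointing jobs at a (per-repo, per-host-tools-versions (hashed)) cache can be sketched roughly as follows. This is only an illustration and not the code from the referenced PRs: the function names, the choice of SHA-256, the `repo,hash` layout, and the example tool names are all assumptions.

```python
import hashlib
from typing import Mapping


def cache_id(repo: str, tool_versions: Mapping[str, str]) -> str:
    """Build a cache namespace from a repo name and host tool versions.

    Hypothetical helper: jobs whose host tools differ get different IDs,
    so they never share (possibly incompatible) cached outputs.
    """
    # Sort the tools so the same set always produces the same hash.
    canonical = "\n".join(
        f"{name}={version}" for name, version in sorted(tool_versions.items())
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{repo},{digest[:16]}"


def cache_url(base_url: str, repo: str, tool_versions: Mapping[str, str]) -> str:
    """Join a (hypothetical) cache server base URL with the namespace."""
    return f"{base_url.rstrip('/')}/{cache_id(repo, tool_versions)}"
```

With this shape, upgrading a host tool changes the hash, so affected jobs simply start filling a fresh cache instead of reading entries produced by the old toolchain.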

/area bazel
/area jobs
/assign

area/bazel area/jobs kind/feature kind/velocity-improvement

All 13 comments

Some more notes:

  • The invalid cache sharing workaround seems to work well, at least with the canary jobs
  • Experimental jobs using the cache can be much faster:

    • pull-kubernetes-bazel-test currently takes 25-30m; once the cache is hot, pull-kubernetes-bazel-test-canary typically takes ~5min. There is a lot of variation for both, though, mostly due to the load on the node the job runs on

    • pull-test-infra-bazel currently takes 8-10m; pull-test-infra-bazel-canary takes ~3 min (about two minutes of which is spent installing python deps and running pylint...)

    • pull-kubernetes-bazel-build-canary is not caching well currently; we probably need to mark things like hyperkube and the tarballs as no-cache

FYI @perotinus we can probably look at using this with the cluster-registry soon, test-infra is using it now 😄

Tested eviction a bit more with: https://github.com/BenTheElder/test-infra/blob/20d7d58ac34d59e241eddfb107e3b735398cd8d7/experiment/fill_cache.sh
I will PR some logging changes, but so far it is working as intended (WAI)
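
For readers unfamiliar with cache eviction, here is a minimal sketch of one common approach: drop the least recently used entries until the cache fits under a size limit. The thread does not describe the strategy the eviction PR actually uses, so the least-recently-used policy, the `Entry` type, and `pick_evictions` are assumptions made purely for illustration.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    """One cached blob (hypothetical record, not the real data model)."""
    path: str
    size: int           # bytes on disk
    last_access: float  # e.g. a Unix timestamp


def pick_evictions(entries: Iterable[Entry], max_bytes: int) -> List[str]:
    """Return paths to delete, oldest access first, until total <= max_bytes."""
    entries = list(entries)
    total = sum(e.size for e in entries)
    victims = []
    for entry in sorted(entries, key=lambda e: e.last_access):
        if total <= max_bytes:
            break
        victims.append(entry.path)
        total -= entry.size
    return victims
```

A script that fills the cache past its limit, like the one linked above, is a natural way to exercise this kind of logic end to end.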

Edit: see also the results of turning this on for test-infra:
[image]

Testing a new test-infra PR appears to have 3468 action cache hits, 7 action cache misses, and 1920 CAS hits (!)
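
To make the hit counts above more concrete: Bazel's HTTP cache protocol stores action results under `/ac/<hash>` and output blobs under `/cas/<hash>`, using plain GET and PUT. The following is a minimal in-memory sketch of such a server that also counts hits and misses. It is not the test-infra implementation; the class name, the `serve` helper, the stats layout, and the extra leading path segment for separate caches are assumptions.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class BlobCacheHandler(BaseHTTPRequestHandler):
    """Serves GET/PUT for paths like /<cache-id>/ac/<hash> or /<cache-id>/cas/<hash>."""

    blobs = {}  # request path -> bytes, shared by all requests
    stats = {"ac_hits": 0, "ac_misses": 0, "cas_hits": 0, "cas_misses": 0}
    lock = threading.Lock()

    def _kind(self):
        # The segment before the hash must be "ac" or "cas".
        parts = self.path.strip("/").split("/")
        if len(parts) >= 2 and parts[-2] in ("ac", "cas"):
            return parts[-2]
        return None

    def do_GET(self):
        kind = self._kind()
        if kind is None:
            self.send_error(400, "expected .../ac/<hash> or .../cas/<hash>")
            return
        with self.lock:
            blob = self.blobs.get(self.path)
            outcome = "hits" if blob is not None else "misses"
            self.stats[f"{kind}_{outcome}"] += 1
        if blob is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(blob)))
        self.end_headers()
        self.wfile.write(blob)

    def do_PUT(self):
        if self._kind() is None:
            self.send_error(400, "expected .../ac/<hash> or .../cas/<hash>")
            return
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        with self.lock:
            self.blobs[self.path] = data
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def serve(port=0):
    """Start the cache on localhost in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), BlobCacheHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A real deployment would persist blobs to disk, evict old entries, and export the counters as metrics; the sketch only shows why a minimal implementation is feasible.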

I've now turned this on for the kubernetes CI bazel-build and bazel-test jobs with great results*
[screenshot, 2018-02-24 11:30:28 AM]
[screenshot, 2018-02-24 11:42:01 AM]
* Note: the build job currently runs only in post-submit, and once every 6 hours. Once this job is properly continuous, the results for it will be more obvious.

We also have a monitoring dashboard now:
[screenshot, 2018-02-24 11:31:27 AM]

https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#pull-kubernetes-bazel-test&graph-metrics=test-duration-minutes

As expected, instead of ~25+ minutes we're seeing ~5-6 minutes for pull-kubernetes-bazel-test after switching this on today.

Absolutely amazing work @BenTheElder. Congratulations!

Thanks Jakob :-)

I think once https://github.com/kubernetes/test-infra/pull/7205 is in we can close this. I've significantly upped the cache storage, and we're flipping it on for pretty much all other presubmits that should leverage caching.

/close
We've rolled this out to many more jobs; pull-kubernetes-bazel-build in particular is now trending towards 5-8 minutes instead of 13+
[image]

woohooo!!! 🍾 🎆 🎉

cc @ulfjack
