Bazel: Bazel remote cache is not a clear win

Created on 7 Mar 2019 · 19 Comments · Source: bazelbuild/bazel

Description of the ~problem~ / feature request:

Applying a Bazel HTTP remote cache can make the build _slower_, depending on the project and artefacts being built.

I tried a few projects:

Cartographer ~2.8x _faster_ with cache
Abseil ~1.5x _faster_ with cache
cppitertools ~2x _slower_ with cache
OpenTracing ~2x _slower_ with cache

For the cache server, I tried both bazel-remote and my own Node.js server that I cobbled together. Both yielded similar results.

The cache was hosted on a reasonable Digital Ocean box in the same city:

$ less /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2294.608
...

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3944         119         153           0        3672        3544
Swap:             0           0           0

Internet connection speed for the client was around 10 Mbps, with ~20 ms latency. Not the fastest, but Bazel should be able to adapt to this.

The Bazel client was running on a fairly high-end laptop:

$ less /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
stepping        : 10
microcode       : 0x9a
cpu MHz         : 700.060
cache size      : 8192 KB
...

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15806        4246        7541         755        4018       10582
Swap:         32767           0       32767

Suggestion

Perhaps the HTTP cache should record the time it took to build an artefact (according to the client). This would give Bazel enough information to decide if it is better to build or fetch.

Relevant variables:

Connection speed
Size of artefact
Estimated time to build artefact
Current build workload

Currently the server receives minimal metadata from Bazel.
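The decision described above could be sketched roughly as follows. This is only an illustration of the suggestion, not anything Bazel or bazel-remote implements: the `Artefact` record, `fetch_seconds`, `should_fetch`, and the `local_slowdown` factor are all hypothetical names, and the cost model (one round trip plus transfer time) is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass
class Artefact:
    """Hypothetical cache metadata for one artefact."""
    size_bytes: int        # size of the cached output
    build_seconds: float   # build time reported by the client that produced it


def fetch_seconds(size_bytes: int, bandwidth_bps: float, latency_s: float) -> float:
    """Estimated fetch time: one round trip plus transfer time."""
    return latency_s + (size_bytes * 8) / bandwidth_bps


def should_fetch(artefact: Artefact, bandwidth_bps: float, latency_s: float,
                 local_slowdown: float = 1.0) -> bool:
    """Fetch only if the download is expected to beat a local rebuild.

    local_slowdown > 1 models a machine that is already busy, i.e. the
    "current build workload" variable (an assumption of this sketch).
    """
    local_estimate = artefact.build_seconds * local_slowdown
    return fetch_seconds(artefact.size_bytes, bandwidth_bps, latency_s) < local_estimate
```

With the connection described in this report (10 Mbps, 20 ms), a 2 MB output that compiles in half a second would be rebuilt locally (the fetch is estimated at about 1.62 s), while a 200 KB output that took 30 s to build would be fetched.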

What operating system are you running Bazel on?

Ubuntu 18.10

What's the output of `bazel info release`?

release 0.23.1

Have you found anything relevant by searching the web?

Labels: P2, team-Remote-Exec, feature request

njlr

👍6

All 19 comments

Also https://github.com/bazelbuild/bazel/commit/76370d5453110c494b8066d0006e1986b8b039fa has been implemented which could help with such cases.

Globegitter on 8 Mar 2019

👍2

Ideally that'd use same/similar logic used by Dynamic Scheduling in Remote Execution https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html

artem-zinnatullin on 9 Mar 2019

What is nice about the combined disk/HTTP cache is that it only uses CPU and memory for compiling etc. when actually necessary. But yes, I was also thinking that it would be great to get the dynamic scheduler functionality for the remote caching use case. Is that feasible, @jin @jmmv?

Globegitter on 11 Mar 2019

@Globegitter absolutely. In Bazel we'll probably limit dynamic scheduling to remote caching only (and disallow remote execution for safety reasons).

@njlr thanks a lot for doing these benchmarks. We'll probably be landing https://github.com/bazelbuild/bazel/issues/6862 in Bazel master this week, and I'd love to see these benchmarks run again with this change in.

buchgr on 11 Mar 2019

Will pick this up soon!

buchgr on 29 Mar 2019

🚀3

Fantastic! I would be keen to re-run some benchmarks when you are ready :+1:

njlr on 29 Mar 2019

Any updates on this?

nkoroste on 10 Apr 2020

@buchgr @jin - is anyone planning to work on this soon? Otherwise I'm happy to contribute if you give me pointers or a basic outline, and let me know whether you want a design doc, etc.

jongerrish on 16 Apr 2020

@philwo would know, but he's unavailable currently. Escalating to @jhfield / @dslomov.

jin on 16 Apr 2020

I'm not sure that we have enough information here to decide on a course of action. What's the reason for the cached case to be slower? Is that something that can be fixed?

If it's "just" the network round-trip time, then it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.

However, I'm not sure how to handle cache writes. It is technically possible to make them async, but that'll make it difficult to report errors.
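As an illustration of the race described in this comment, here is a minimal sketch using Python's `asyncio`. It is not Bazel code and does not reflect how the dynamic strategy is implemented; `race` and `simulated` are hypothetical names. A cache miss is modelled as the lookup returning `None`, and cache writes are deliberately left out, since that is exactly the open question raised above.

```python
import asyncio


async def race(cache_lookup, local_build):
    """Run both coroutines concurrently and return (winner, result).

    The first branch to yield a non-None result wins; the other is cancelled.
    A cache miss (None) or an error in one branch leaves the other running.
    """
    tasks = {
        asyncio.create_task(cache_lookup()): "cache",
        asyncio.create_task(local_build()): "local",
    }
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None and task.result() is not None:
                    return tasks[task], task.result()
        raise RuntimeError("neither branch produced a result")
    finally:
        # Cancel whichever branch lost and wait for it to wind down.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)


def simulated(delay_s, value):
    """Hypothetical stand-in for a real cache lookup or local action."""
    async def run():
        await asyncio.sleep(delay_s)
        return value
    return run
```

For example, `asyncio.run(race(simulated(0.2, b"cached"), simulated(0.01, b"built")))` lets the local build win and cancels the slower lookup.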

ulfjack on 16 Apr 2020

👍2

@jmmv has some nice blog posts on this topic.

edbaunton on 16 Apr 2020

> it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.

Big +1 we would love to see this. I asked @jmmv about this very thing on twitter and he replied:

> No plans on that unfortunately, at least from my team at this point... I think this would take a very different implementation than the current dynamic scheduler though.

kastiglione on 16 Apr 2020

👍1

Same here. Huge +1 if we could get dynamic strategy for cache lookup/download. Right now we have to have developers flip on and off their remote cache based on their download speed. For some it changes based on the time of day because of shared internet resources.

brentleyjones on 16 Apr 2020

👍2

The motivation behind jongerrish@'s request is that our build downloads ~2GB of data from the cache, so build time depends heavily on download speed. We also have a per-action breakdown comparing builds with and without the cache, and it shows that on machines with lower network speed it is clearly faster to build locally than to download from the cache. Another interesting data point is that ~3000 actions produce outputs under 1KB, so once latency is taken into account it is probably not even worth checking whether they are present in the cache.
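A back-of-the-envelope estimate makes these numbers concrete. The sketch below is only an assumption-laden model: the function names are hypothetical, "2GB" is taken as 2 × 10^9 bytes, and the bandwidth, latency, and parallelism plugged in afterwards are borrowed from the original report (10 Mbps, 20 ms) or picked for illustration, not measured on this build.

```python
def transfer_seconds(total_bytes: int, bandwidth_bps: float) -> float:
    """Time to move total_bytes over a link of bandwidth_bps (bits per second)."""
    return total_bytes * 8 / bandwidth_bps


def lookup_overhead_seconds(num_actions: int, latency_s: float, parallelism: int) -> float:
    """Round-trip cost of num_actions blocking cache lookups.

    Assumes the lookups are spread evenly over `parallelism` threads and
    that each lookup costs exactly one round trip.
    """
    rounds = -(-num_actions // parallelism)  # ceiling division
    return rounds * latency_s
```

Under those assumptions, `transfer_seconds(2 * 10**9, 10e6)` comes to 1600 s (roughly 27 minutes) for the download alone, while `lookup_overhead_seconds(3000, 0.02, 8)` adds about 7.5 s of pure round-trip waiting for the small actions.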

nkoroste on 16 Apr 2020

Looking for some implementation guidance for this feature... would it be reasonable to have a new mode where we register a RemoteSpawnStrategy() that takes a new class RemoteCacheSpawnRunner that is more or less just an adapter to a RemoteCache? @ulfjack @philwo @buchgr @jin This would be similar to how the existing remote execution strategy is built here: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteActionContextProvider.java#L110

jongerrish on 18 Apr 2020

I was thinking about that, but I'm not sure how to get the results written back to the cache. Right now, the interface requires both lookup and write to be done in the same, err, context.

As much as I like the technical challenge here, can we first confirm that it's due to the lookup overhead? Did you try to increase the number of jobs to see if that helps to hide the latency?

ulfjack on 20 Apr 2020

The CPU, RAM and network are all maxed out during a Bazel run. On my machine it sometimes even dips into swap, so increasing the number of jobs causes an OOM. Alternatively, we can experiment with increasing --remote_max_connections, as the default is 100. However, I doubt it will change much, since I can see my download speed is already reaching its maximum; it would probably only help with checking for cache hits on the many smaller actions.

Regarding cache writes, first of all we don't seed remote cache from local builds. Instead we have a stable machine on CI that does that. So for v1 it's probably acceptable to not support upload to cache. Unless you're talking about local workspace writes? I didn't check the code but in general I'd assume only the "winning" SpawnRunner should be responsible for writing to cache at the very end.

nkoroste on 20 Apr 2020

@nkoroste I'm afraid in that case this won't be much of a win. The proposal here is to trade CPU for latency while using additional threads - that's only going to be an improvement if you have extra CPU, and if you're almost running OOM, your overall build latency might be dominated by gc rather than network round-trip latency.

AFAICT, --remote_max_connections won't do anything without also changing --jobs. The latter is the primary mechanism to control the number of threads, and they perform blocking calls to the cache.

ulfjack on 21 Apr 2020

Sorry for the delay on this. To add more context and visibility from some of the offline conversations:

I'm not suggesting that increasing # of jobs and max connections will improve anything. In fact, we benchmarked various variations of those two flags and the performance is generally worse if you increase the numbers for these 2 flags.

All I'm saying is that Bazel produces GBs of data, especially for Android builds, that have to be downloaded from the remote cache. This is obviously directly correlated with your overall network speed and latency. During a build with a high cache hit rate (85%+) for a big app, the majority of the time is spent downloading bytes from the cache while most of the machine's CPU and RAM are idle.

With the dynamic spawn strategy we can use some of the machine's resources and reduce the number of bytes downloaded from the cache, which will hopefully improve the overall build time for developers with a poor network connection.

In the meantime, we are trying to improve the Android rules themselves to produce less unnecessary data, which will help with this as well. See https://github.com/bazelbuild/bazel/pull/11253 for example.

nkoroste on 13 May 2020
