Applying a Bazel HTTP remote cache can make the build _slower_, depending on the project and artefacts being built.
I tried a few projects:
For the cache server, I tried both bazel-remote and my own Node.js server that I cobbled together. Both yielded similar results.
The cache was hosted on a reasonable Digital Ocean box in the same city:
$ less /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping : 4
microcode : 0x1
cpu MHz : 2294.608
...
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3944         119         153           0        3672        3544
Swap:             0           0           0
Internet connection speed for the client was around 10 Mbps, with ~20 ms latency. Not the fastest, but Bazel should be able to adapt to this.
The Bazel client was running on a fairly high-end laptop:
$ less /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 142
model name : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
stepping : 10
microcode : 0x9a
cpu MHz : 700.060
cache size : 8192 KB
...
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15806        4246        7541         755        4018       10582
Swap:         32767           0       32767
Suggestion
Perhaps the HTTP cache should record the time it took to build an artefact (according to the client). This would give Bazel enough information to decide if it is better to build or fetch.
Relevant variables:
Currently the server receives minimal metadata from Bazel.
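To illustrate the suggestion above, here is a minimal sketch of the build-or-fetch decision. It assumes the cache could return a recorded build time alongside each artefact; the `CacheEntry` type, its fields, and both function names are hypothetical and not part of Bazel or any existing cache server. The fetch estimate is a deliberately simple model: one round trip plus transfer time at a fixed bandwidth.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """Hypothetical metadata a cache server could store per artefact."""
    size_bytes: int        # size of the cached artefact
    build_seconds: float   # build time reported by the uploading client


def estimated_fetch_seconds(entry, bandwidth_bits_per_s, latency_s):
    """One round trip plus transfer time; ignores congestion and parallelism."""
    return latency_s + entry.size_bytes * 8 / bandwidth_bits_per_s


def should_fetch(entry, bandwidth_bits_per_s, latency_s):
    """Fetch only when the estimated download beats the recorded build time."""
    return estimated_fetch_seconds(entry, bandwidth_bits_per_s, latency_s) < entry.build_seconds


if __name__ == "__main__":
    # Link figures from the report above: ~10 Mbps, ~20 ms latency.
    bandwidth, latency = 10e6, 0.02
    small_slow = CacheEntry(size_bytes=100_000, build_seconds=3.0)    # illustrative values
    large_fast = CacheEntry(size_bytes=5_000_000, build_seconds=1.0)  # illustrative values
    print(should_fetch(small_slow, bandwidth, latency))
    print(should_fetch(large_fast, bandwidth, latency))
```

Under this model a small artefact that is slow to build is worth fetching, while a large artefact that is quick to build is not. A real implementation would also need to account for differences between the machine that recorded the build time and the machine making the decision.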
Ubuntu 18.10
$ bazel info release
release 0.23.1
Related discussion: https://github.com/bazelbuild/bazel/issues/6091
Also https://github.com/bazelbuild/bazel/commit/76370d5453110c494b8066d0006e1986b8b039fa has been implemented which could help with such cases.
Ideally that'd use same/similar logic used by Dynamic Scheduling in Remote Execution https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html
What is nice about the combined disk/http cache is that it will only use CPU/memory for compiling etc. when actually necessary but yeah I was also thinking that it would be great to get the dynamic scheduler functionality for the remote caching use-case. Is that something feasible @jin @jmmv ?
@Globegitter absolutely. In Bazel we'll probably limit dynamic scheduling to remote caching only (and disallow remote execution for safety reasons).
@njlr thanks a lot for doing these benchmarks. We'll probably be landing https://github.com/bazelbuild/bazel/issues/6862 in Bazel master this week, and I'd love to run these benchmarks again with this change in.
Will pick this up soon!
Fantastic! I would be keen to re-run some benchmarks when you are ready :+1:
Any updates on this?
@buchgr @jin - is anyone planning to work on this soon? Otherwise I'm happy to contribute if you give me pointers, a basic outline, whether you want a design doc, etc.
@philwo would know, but he's unavailable currently. Escalating to @jhfield / @dslomov.
I'm not sure that we have enough information here to decide on a course of action. What's the reason for the cached case to be slower? Is that something that can be fixed?
If it's "just" the network round-trip time, then it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.
However, I'm not sure how to handle cache writes. It is technically possible to make them async, but that'll make it difficult to report errors.
@jmmv has some nice blog posts on this topic.
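The race described above (start the cache lookup and the local action together, then cancel whichever loses) can be sketched as follows. This is only an illustration using Python's `asyncio`, not how Bazel's dynamic strategy is implemented; `run_action` and the callables passed to it are hypothetical. A lookup that returns `None` is treated as a cache miss, so it never "wins" the race. Cache writes, the open question raised above, are deliberately left out.

```python
import asyncio


async def run_action(cache_lookup, local_action):
    """Race a cache lookup against local execution and cancel the loser.

    Both arguments are zero-argument coroutine functions. The lookup is
    assumed to return None on a cache miss.
    """
    lookup = asyncio.ensure_future(cache_lookup())
    build = asyncio.ensure_future(local_action())
    done, _ = await asyncio.wait({lookup, build}, return_when=asyncio.FIRST_COMPLETED)

    hit_first = (
        lookup in done
        and lookup.exception() is None
        and lookup.result() is not None
    )
    if hit_first:
        # Cache hit arrived first: stop the local action.
        build.cancel()
        await asyncio.gather(build, return_exceptions=True)
        return "cache", lookup.result()

    # The local action finished first, or the lookup missed or failed.
    result = await build
    lookup.cancel()
    await asyncio.gather(lookup, return_exceptions=True)
    return "local", result


async def _demo():
    async def slow_hit():
        await asyncio.sleep(0.2)   # simulated slow download
        return b"cached"

    async def fast_build():
        await asyncio.sleep(0.01)  # simulated quick local action
        return b"built"

    return await run_action(slow_hit, fast_build)


if __name__ == "__main__":
    print(asyncio.run(_demo()))
```

Note that this trades CPU for latency: the local action consumes resources even when the cache eventually wins, so it only helps when the machine has spare capacity.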
it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.
Big +1 we would love to see this. I asked @jmmv about this very thing on twitter and he replied:
No plans on that unfortunately, at least from my team at this point... I think this would take a very different implementation than the current dynamic scheduler though.
Same here. Huge +1 if we could get dynamic strategy for cache lookup/download. Right now we have to have developers flip on and off their remote cache based on their download speed. For some it changes based on the time of day because of shared internet resources.
The motivation behind jongerrish@'s request stems from the fact that we download ~2GB of data from cache for our build, so build time depends heavily on download speed. We also have a per-action breakdown comparing builds with and without the cache, and on machines with lower network speed it is clearly faster to build locally than to download from cache. Another interesting data point is that ~3000 actions are <1KB, so once latency is taken into account it's probably not even worth checking whether they are present in the cache.
Looking for some implementation guidance for this feature... would it be reasonable to have a new mode where we register a RemoteSpawnStrategy() that takes a new class RemoteCacheSpawnRunner that is more or less just an adapter to a RemoteCache? @ulfjack @philwo @buchgr @jin This would be similar to how the existing remote execution strategy is built here: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteActionContextProvider.java#L110
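As a rough back-of-the-envelope model of the latency point above, the sketch below estimates the wall-clock time spent purely on round trips for many tiny lookups. The function name and the batching model are assumptions for illustration; the 20 ms round trip is borrowed from the original report earlier in this thread, not from the measurements described in this comment.

```python
import math


def lookup_overhead_seconds(num_actions, round_trip_s, concurrent_lookups):
    """Round-trip time only, assuming lookups proceed in fixed-size batches."""
    batches = math.ceil(num_actions / concurrent_lookups)
    return batches * round_trip_s


if __name__ == "__main__":
    # ~3000 small actions at an assumed 20 ms round trip.
    print(lookup_overhead_seconds(3000, 0.02, 1))    # fully serial lookups
    print(lookup_overhead_seconds(3000, 0.02, 100))  # 100 lookups in flight
```

The model suggests that the overhead is dominated by how many lookups can be in flight at once, which is why the per-action cost matters most when lookups are effectively serialised.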
I was thinking about that, but I'm not sure how to get the results written back to the cache. Right now, the interface requires both lookup and write to be done in the same, err, context.
As much as I like the technical challenge here, can we first confirm that it's due to the lookup overhead? Did you try to increase the number of jobs to see if that helps to hide the latency?
The CPU, RAM and network are all maxed out during a Bazel run. On my machine it sometimes even dips into swap, so increasing the number of jobs causes an OOM. Alternatively, we can try increasing --remote_max_connections, as the default is 100. However, I doubt that it will change much, as I can see my network download speed reaching its maximum. It would probably help with checking cache hits on many smaller actions, though.
Regarding cache writes, first of all we don't seed remote cache from local builds. Instead we have a stable machine on CI that does that. So for v1 it's probably acceptable to not support upload to cache. Unless you're talking about local workspace writes? I didn't check the code but in general I'd assume only the "winning" SpawnRunner should be responsible for writing to cache at the very end.
@nkoroste I'm afraid in that case this won't be much of a win. The proposal here is to trade CPU for latency while using additional threads - that's only going to be an improvement if you have extra CPU, and if you're almost running OOM, your overall build latency might be dominated by gc rather than network round-trip latency.
AFAICT, --remote_max_connections won't do anything without also changing --jobs. The latter is the primary mechanism to control the number of threads, and they perform blocking calls to the cache.
Sorry for the delay on this. To add more context and visibility from some of the offline conversations:
I'm not suggesting that increasing # of jobs and max connections will improve anything. In fact, we benchmarked various variations of those two flags and the performance is generally worse if you increase the numbers for these 2 flags.
All I'm saying is that Bazel produces GBs of data, especially for Android builds, that have to be downloaded from the remote cache. Download time is directly tied to your overall network speed and latency. During a build with a high cache hit rate (85%+) for a big app, the majority of the time is spent downloading bytes from cache while most of the machine's CPU and RAM sit idle.
With a dynamic spawn strategy we can utilize some of the machine's resources and reduce the number of bytes downloaded from cache, which will hopefully improve the overall build time for developers with a bad network connection.
In the meantime, we are trying to improve the Android rules themselves to produce less unnecessary data, which will help with this as well. See https://github.com/bazelbuild/bazel/pull/11253 for example.