Bazel: Jobs generating large numbers of files drastically slow down building with remote cache.

Created on 6 Sep 2018 · 18 comments · Source: bazelbuild/bazel

Description of the problem / feature request:

Building the go standard library with remote caching enabled is considerably slower than building locally.

An in-person discussion with @buchgr led to the conclusion that the contents of the standard library were uploaded serially within Bazel, which caused a major performance hit.
It was noted that some concurrency exists for the rest of the build, since Bazel executes multiple jobs at once, yielding pseudo-concurrent uploads; but because the Go standard library counts as a single job, its outputs are uploaded serially.

Feature requests: what underlying problem are you trying to solve with this feature?

Decrease build time when populating empty remote caches by allowing concurrent uploads of files in individual actions.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build any Go project using an empty remote cache; on normal internet connections, this can take a _long_ time as the Go standard library files are uploaded one by one.

Example timings, running bazel test on one of our projects with a clean local cache:

With remote caching:

bazel test --remote_http_cache=https://storage.googleapis.com/<our cache>  
0.07s user 0.15s system 0% cpu 14:07.77 total

No remote caching:

bazel test
0.05s user 0.08s system 0% cpu 2:46.71 total
P2 team-Remote-Exec bug

All 18 comments

@buchgr @ola-rozenfeld Has this been fixed in the meantime? I remember seeing a change about implementing parallel uploads flying by.

Yes, I believe Jakob's parallel uploads change made this much better. I also have a change in flight now that adds batched uploads, thank you for reminding me to test it on this use case, I believe it should additionally improve it.

@PaulSonOfLars Do you think we can close this now? :)

@ola-rozenfeld Thanks for the update!

ff008f445905bf6f4601a368782b620f7899d322 implemented parallel downloads, but output upload in SimpleBlobStoreActionCache is still serial.

Shoot, sorry, Benjamin is right! Apologies, I take that back. And my batching change is not relevant either.


Awesome, thanks for looking into this! I had a few looks at recent commits to see if anything had touched the subject before opening the issue and couldn't see anything - but yes, basically need Jakob's change for uploads instead of downloads :)

@buchgr Friendly ping. :)

I won't get to it any time soon, but contributions are certainly welcome. It should be rather straightforward to implement, given that under the hood HttpBlobStore is already fully async. The change should mostly consist of updating blocking interfaces to non-blocking ones.

With v0.18.0, which GitHub tells me includes ff008f4, I'm still seeing significantly slower builds when caching is enabled. A from-clean-local-cache build with no remote cache of rules_haskell on my machine takes 330 seconds. Once the remote cache has been primed, a from-clean-local-cache build with remote cache turned on takes 880 seconds. This is despite the very high cache hit rate, according to top and the status messages.

I'm using a Google Cloud Storage bucket as the cache.

Hm, so upon running bazel build //... in rules_haskell many more times (with remote caching on GCS turned on), it looks like my first number was an aberration. I now use --remote_upload_local_results=false to make sure we're not measuring any uploading (only downloading). I consistently get ~225 seconds on my connection. Only a 30% improvement on total build time, but at least caching is now an improvement, not a pessimization.

I was wondering - what's the algorithm for fetching from the cache? In my use case, I have 329MB of cache data to transfer to my build machine. I notice using ntop that the network fetches are very bursty: sustained 50MB/sec for the first few seconds, then averaging about 1MB/sec until the end of the build (with some big bursts). That is why it takes 200+ seconds to download the full 329MB, which seems slow. Since the dependency graph is fully known by the end of the analysis phase, couldn't we in principle fetch all action outputs for a given target in one go? And if so, shouldn't I expect the build to be fully network-bandwidth limited if the cache hit rate is 100%? Especially if all fetches are properly pipelined (i.e. async)?

Is this still an issue? As far as I can tell, Bazel should be downloading files fully concurrently. For the HTTP cache, it uses HTTP/1.1, but it does reuse the connection.

It's still the case that HTTP cache output uploads are serial.

Are you saying that Bazel is uploading outputs of a locally run action serially?

Yes, the Go standard library mentioned at the top can take 20 minutes to build with remote caching enabled, when there is a remote cache miss. By contrast, it takes about 30s to build from scratch without remote cache involved. The problem is with the upload, not the download.

Are you saying that Bazel is uploading outputs of a locally run action serially?

To the HTTP cache, yes.

Err... so the API uses futures, and the HTTP client uses futures, but apparently the glue code between the two is blocking. That is ... unfortunate, but luckily easy to fix.

Ok, I sent a patch.
