Bazel: remote/performance: If remote cache is inaccessible, fall back to building without the cache, rather than failing the build

Created on 8 May 2017 · 24 comments · Source: bazelbuild/bazel

Description of the problem / feature request / question:

Feature request:

I was able to get caching to work with --spawn_strategy=remote --rest_cache_url=.... It works well, but if the cache is inaccessible for any reason (e.g. I have gone offline and am working while commuting, or the server has gone down), then my builds fail.

Of course, I can change the options I'm using to launch Bazel; but that isn't always a good option. For one thing, my company has quite a lot of developers, and I would prefer that they not all have to learn this workaround. Secondly, in our automated Jenkins builds, launching with different command-line arguments isn't an option.

What I have hacked together for our own use is some changes to Bazel so that:

  • Each time an error occurs trying to read or write the remote cache, it displays a short warning message, but continues the build. (get operations pretend the item was not found in the cache; put operations pretend the operation succeeded.)
  • After ten consecutive such errors with no intervening successful cache accesses, Bazel displays a message that says, "Cache encountered multiple consecutive errors; disabling cache for 5 minutes."

My code for this was pretty quick-and-dirty, so it's not really in a shareable state, but it was pretty easy to write.
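
The patch itself isn't shareable, but a minimal sketch of the behavior described above might look like the following (class and member names are hypothetical, not Bazel's actual internals):

import java.time.Duration;
import java.time.Instant;

// Tracks consecutive remote cache errors and temporarily disables the
// cache once too many pile up, per the behavior described above.
class CacheErrorTracker {
    private static final int MAX_CONSECUTIVE_ERRORS = 10;
    private static final Duration DISABLE_PERIOD = Duration.ofMinutes(5);

    private int consecutiveErrors = 0;
    private Instant disabledUntil = Instant.MIN;

    synchronized boolean cacheEnabled() {
        return Instant.now().isAfter(disabledUntil);
    }

    synchronized void recordSuccess() {
        consecutiveErrors = 0;
    }

    synchronized void recordError() {
        consecutiveErrors++;
        if (consecutiveErrors >= MAX_CONSECUTIVE_ERRORS) {
            System.err.println(
                "Cache encountered multiple consecutive errors; disabling cache for 5 minutes.");
            disabledUntil = Instant.now().plus(DISABLE_PERIOD);
            consecutiveErrors = 0;
        }
    }
}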

It would make sense for this to work regardless of what remote caching protocol is used -- REST API, gRPC, etc.

Environment info

  • Bazel version (output of bazel info release):

0.4.5

P2 team-Remote-Exec feature request

All 24 comments

This is going to be important for my organization. We are replacing sbt with a cached Bazel build, and we do not want our cache server's downtime to block development.

Just leaving this here for documentation. For others who want a workaround while using a release build, there is a documented set of exit codes. The exit code Bazel returns for this failure is 37.

37 - Unhandled Exception / Internal Bazel Error.

To work around cache downtime (or offline builds), I fall back to a build that doesn't use the remote cache if the return code is 37 (see the sketch after the stack trace below).

Ideally, Bazel would gracefully fall back to a local build if the remote cache could not be reached, printing an informative warning that the cache is currently unavailable.

The stack trace looks something like this:

$ bazel build ...
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to https://bogus.cache.com refused
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at com.google.devtools.build.lib.remote.ConcurrentMapFactory$RestUrlCache.get(ConcurrentMapFactory.java:109)
    ... 19 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668)
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:414)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
    ... 26 more

$ echo $?
37
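
A rough sketch of that fallback wrapper, in Java for illustration (the cache URL and target are placeholders; the flags are the ones used elsewhere in this thread):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class BazelWithFallback {
    public static void main(String[] args) throws IOException, InterruptedException {
        // First attempt: build with the remote cache enabled.
        int exit = run(Arrays.asList(
            "bazel", "build",
            "--spawn_strategy=remote",
            "--remote_rest_cache=http://cache.example.com",  // placeholder URL
            "//:mytarget"));                                 // placeholder target
        // 37 is the "unhandled exception" exit code observed above when
        // the cache host is unreachable.
        if (exit == 37) {
            System.err.println("Remote cache unreachable; rebuilding without it.");
            run(Arrays.asList("bazel", "build", "//:mytarget"));
        }
    }

    private static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}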

+1

Ideally also provide:

  1. A console warning that the build cache is configured but the build is failing over to local execution.
  2. A configurable timeout for contacting the build cache.
  3. An option to force use of the build cache and fail if it is unavailable (this is the current behaviour, and useful for checking that the build cache is working).
  4. Finally, parallel execution of the local build, so that Bazel doesn't block waiting on a response from the cache server before starting the local build. In many cases where the Bazel build is very fast, e.g. under 1 second, it may finish before any possible response from the build cache (see the sketch after this list).
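
A sketch of item 4's race, assuming an action result can come either from a remote cache lookup or from local execution (all class and method names below are hypothetical, not Bazel's actual internals):

import java.util.concurrent.CompletableFuture;

class RacingBuild {

    // Races the remote cache lookup against local execution; whichever
    // completes first supplies the result. A real implementation would
    // also cancel the loser and treat a cache miss as "wait for local".
    static byte[] buildAction() {
        CompletableFuture<byte[]> remote =
            CompletableFuture.supplyAsync(RacingBuild::lookupRemoteCache);
        CompletableFuture<byte[]> local =
            CompletableFuture.supplyAsync(RacingBuild::executeLocally);
        return (byte[]) CompletableFuture.anyOf(remote, local).join();
    }

    // Placeholders; in Bazel these would be the cache client and the
    // local spawn runner.
    static byte[] lookupRemoteCache() { return new byte[0]; }
    static byte[] executeLocally() { return new byte[0]; }
}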

Maybe we can fit this into 0.7. One thing I was planning to do (independently of this request) is to try to establish a connection at the start of a build, and have all actions block on that. I suppose if that times out, we could give a warning if -k is enabled (or maybe another flag). I don't think we want to fall back by default.

@ulfjack - thanks for considering this for the 0.7 release. A configurable timeout, e.g. 100 milliseconds, may be a good interim step. Most build caches are local to an office with extremely low ping times. Remote caches are likely rarer and maybe not as helpful due to bandwidth constraints.

Given that Bazel often completes very quickly, I want to avoid a 2-second blocking action slowing down a build that takes 300 milliseconds locally. The ideal approach would run both in parallel and then use the build cache if it turns out to be available. Since the logic for that may be complex, an alternative is: if the build cache is confirmed as available and the local build hasn't yet finished, restart the entire build with the build cache. That would discard any local progress up to that point, but would handle the important common cases of:

  1. No blocking on possible very fast local bazel build
  2. Build cache used if available
  3. Automatic failover to local build if build cache not available

Sure, I suppose we could add a timeout to the initial remote cache request and treat that as failed if it times out. This would be opt-in with a flag. That might be easier than I previously thought.

It's easy to attempt a lookup, timeout, run the action, upload.

It's difficult to start the action cache lookup, wait a bit, then start local execution, and then cancel local execution if the action cache lookup returns successfully when the local process is already running.
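
The easy path above, sketched with illustrative names rather than Bazel's real API:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedLookup {

    // Bounded cache lookup; on timeout, error, or miss, run the action
    // locally and upload the result on a best-effort basis.
    static byte[] buildAction(ExecutorService executor, long timeoutMillis) {
        Future<byte[]> lookup = executor.submit(TimedLookup::lookupRemoteCache);
        try {
            byte[] cached = lookup.get(timeoutMillis, TimeUnit.MILLISECONDS);
            if (cached != null) {
                return cached;                      // cache hit
            }
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            lookup.cancel(true);                    // give up on the cache
        }
        byte[] result = executeLocally();           // run the action
        executor.submit(() -> upload(result));      // upload; failures ignored
        return result;
    }

    // Placeholders for the remote cache client and local spawn runner.
    static byte[] lookupRemoteCache() { return null; }
    static byte[] executeLocally() { return new byte[0]; }
    static void upload(byte[] result) {}
}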

Ulf - I'll defer to your great knowledge of the Bazel internals. For now, just configuring an opt-in short timeout and fallback to a local build would solve 70% of the problem. I'll defer to you on how to approach it beyond that.

There is already a flag --remote_local_fallback, which should control fallback, but the error handling currently isn't very good (#1682), and I don't think it's possible to configure a timeout. I think I'll be tackling #1682 next, which should move us forward.

We've improved the error handling for 0.5.3 (upcoming). I think it's correctly falling back now.

Thanks @ulfjack. Can you explain how this works? My expectation was that if I halt the nginx server that is running my remote cache, then Bazel would still finish the build; but that doesn't seem to be the case.

I think what I see happening is that when trying to download the first action result, Bazel correctly detects that it can't connect to the cache, so it does the build locally; but then it tries to upload that result to the cache, and at that point it aborts with an error message. This is with Bazel built from the head of master (commit 688dbf7a7).

My command line looks a little like this (the --jobs=1 is just for debugging purposes):

bazel \
    --host_jvm_args=-Dbazel.DigestFunction=SHA1 \
    build \
    --jobs=1 \
    --spawn_strategy=remote \
    --remote_rest_cache=... \
    --remote_local_fallback \
    --remote_upload_local_results \
    :mytarget

The output:

ERROR: /Users/mikemorearty/src/bazel/test/BUILD:15:1: Executing genrule //:mytarget failed: Unexpected IO error.: Connect to localhost:8000 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused.
Target //:mytarget failed to build

Thanks for trying. Apparently I missed that part. :-( I'll make sure to add a test before I close this again.

@mmorearty Your analysis is spot on.

@ulfjack @mmorearty
When --remote_upload_local_results is enabled, I propose printing a single warning if one or more uploads fail, instead of failing the build (see also #3368).

There's another problem, however. Currently, if the remote cache lookup fails we retry with exponential backoff. If the remote cache is down, this adds roughly a 6-second retry period per action before we even attempt to build locally, which will slow down any build significantly. It's possible to disable the retry mechanism via a command-line flag, but at some point combining all these flags gets too complicated, and I'm also not sure we want to eventually stabilize this flag (it's experimental right now).

I think we should either make this error non-retryable or introduce a mechanism to stop retrying after N attempts have failed with such an error.

@benjaminp has prepared a change [1] that, once merged, will not attempt to upload to the remote cache if the lookup failed. That solves half the problem, but there will still be retries, making the build really slow.

However, you can set --experimental_remote_retry=false and then the build should be quick even if the remote cache is down once in a while.

I already have an idea of how to make this happen with retries, but it will take a while until I have time to implement it.

[1] https://bazel-review.googlesource.com/c/15070

I have a prototype implementation of a circuit breaker in our retry logic [1]. Will share soon.

[1] https://martinfowler.com/bliki/CircuitBreaker.html
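
For reference, a minimal illustration of the pattern (not the prototype itself; names and thresholds here are made up): after enough consecutive failures the breaker opens and cache calls are skipped outright; after a cool-down it half-opens to let a single probe through.

// Minimal circuit breaker in the spirit of [1]. Illustrative only.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before opening
    private final long openMillis;       // cool-down before half-opening
    private int consecutiveFailures;
    private long openedAtMillis;
    private State state = State.CLOSED;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean allowRequest() {
        if (state == State.OPEN
            && System.currentTimeMillis() - openedAtMillis >= openMillis) {
            state = State.HALF_OPEN;     // let one probe call through
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMillis = System.currentTimeMillis();
        }
    }
}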

Any updates on this? Should --remote_local_fallback work yet if the remote REST cache is unreachable?

@jgavris unfortunately this is still under review (mostly because I haven't had time to address comments).

👋

Any update on this? We're using --experimental_external_repositories=true, but the timeout always seems to be 30 seconds, regardless of --remote_local_fallback or setting --remote_timeout. This is on 0.8.0, though a local fallback does seem to be happening.

@mkilpatrick Heya! I have recently fixed the --remote_timeout part. That is, --remote_rest_cache now respects --remote_timeout.

What release did that happen in?

@softprops 0.9.0

Any update?

This has been fixed a while back.
