When using --experimental_remote_download_outputs=toplevel, and having a cache return a 500, bazel crashes. I would expect it to recover in this case and fall back to retrying the request, or worst case doing the build locally. Without this flag it does seem to be resilient to this.
ERROR: /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/com_google_protobuf/BUILD:388:1: Linking of rule '@com_google_protobuf//:protoc' failed due to unexpected I/O exception: 504 Gateway Time-out
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.15.3</center>
</body>
</html>
com.google.devtools.build.lib.remote.blobstore.http.HttpException: 504 Gateway Time-out
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.15.3</center>
</body>
</html>
at com.google.devtools.build.lib.remote.blobstore.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:116)
at com.google.devtools.build.lib.remote.blobstore.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:41)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1476)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1225)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1272)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
bazel run TARGET when pointing at an unhealthy http cache and passing --experimental_remote_download_outputs=toplevel
macOS
bazel info release?release 0.26.0
cc @buchgr
@keith there's no retries for the http protocol yet. I believe this error happens independent of --experimental_remote_download_outputs=toplevel no?
Regardless of retries what's the expected behavior in this case?
--
Keith Smiley
On Jun 3, 2019, at 08:07, Jakob Buchgraber notifications@github.com wrote:
@keith there's no retries for the http protocol yet. I believe this error happens independent of --experimental_remote_download_outputs=toplevel no?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
@keith to fallback to local execution. it's a bug either way.
I can't reproduce this with 0.27.1 ... can you?
buchgr@buchgr0:~/code/bazel2$ bazel build src:bazel --remote_http_cache=http://localhost:9090 --experimental_remote_download_outputs=toplevel --experimental_inmemory_jdeps_files --experimental_inmemory_dotd_files
INFO: Invocation ID: 0ee5a3cd-bdeb-42ad-b149-5b7dc94cbe01
INFO: Analyzed target //src:bazel (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
WARNING: Reading from Remote Cache:
500 Internal Server Error
timed out
[54 / 89] 22 actions running
Building src/main/java/com/google/devtools/build/lib/remote/libremote.jar (25 source files) and running annotation processors (OptionProcessor); 3s remote-cache, worker
Compiling Java headers src/main/java/com/google/devtools/build/lib/exec/local/liblocal-hjar.jar (5 files); 2s remote-cache, linux-sandbox
Do you have a specific fix in mind that would have addressed this? I'll have to merge this again on our side to test well enough to see if it breaks again
It looks like I can't test this in our case because we hit this:
ERROR: /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/com_google_protobuf/BUILD:130:1: C++ compilation of rule '@com_google_protobuf//:protobuf' failed due to unexpected I/O exception: Failed to fetch file with hash '1ef4741cf10441feeca5c8dde2b6c754f0cbd96fcae697d54a604757289a5e9b' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '1ef4741cf10441feeca5c8dde2b6c754f0cbd96fcae697d54a604757289a5e9b' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
So if you think this is fixed on your side I guess we can close this
@keith in order to remove this error you need to clear your remote cache. which remote cache are you using?
This is from an internal cache implementation. But in theory anything could be evicted from other caches during a build as well right?
@buchgr do you know how bazel would handle errors from the cache, even if the key itself hasn't been evicted, it's just temporarily inaccessible? Specifically if we returned a 500 in this case would this be better than returning a 404?
@buchgr I enabled this flag set on envoy and we've started seeing this crash as well https://circleci.com/gh/envoyproxy/envoy/249586 we're pointing to the GCS remote cache. Any ideas on how we can mitigate this?
@keith disable garbage collection on GCS. The issue is that GCS does not understand the action graph of what it's caching and evicts entries purely based on time. So it might have a cache entry for an action A1 but deleted all the outputs of this action. If an action A2 now needs the outputs of A1, then Bazel can't download them and also can't re-run action A1 and thus has to print this error.
I ll have to write a blog post about this and discussing potential ways for mitigation. For GCS specifically the only currently viable strategy is to disable garbage collection and wipe the whole cache from time to time. I imagine it would be easy enough to write a cloud function (or something) that reads in the action graph from GCS and properly evicts items from time to time.
Pretty sure this is still a problem FWIW
Thanks Keith, we should probably document what Jakob described.
Is there any active work on bits without the bytes to make it gracefully handle download errors like the normal build does?
I just uprev'd to v3.4.1 and tried experimenting with this feature (--remote_download_toplevel and --remote_download_minimal). I am able to produce the below crash 100% of the time from fully deleted ~/.cache/bazel directory. However my stacktrace is quite different, does this look like the same issue?
It is not a garbage collection issue in my case because we only cleanup on a cron once/month. Although I suspect it is a similar problem, for example:
ERROR: <snip>/BUILD.bazel:499:10: Couldn't build file <snip>/binary: Linking of rule '//<snip>:binary' failed due to Exec failed due to IOException: null
com.google.devtools.build.lib.remote.BulkTransferException
at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:229)
at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:225)
at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:228)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:127)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
at com.google.devtools.build.lib.rules.cpp.CppLinkAction.beginExecution(CppLinkAction.java:332)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:785)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:948)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:907)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:425)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:926)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:310)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Suppressed: io.netty.channel.ExtendedClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
I'm collecting all issues for build-without-the-bytes in the tracking issue (https://github.com/bazelbuild/bazel/issues/6862). I'm making slow progress getting through the backlog. If you have issues that are easy to repro, I can take a look.
I am unable to reproduce this issue.
FYI I still see a similar stack to one above here:
WARNING: Reading from Remote Cache:
UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
load_shed
ERROR: /Users/iosci/buildkite/src/iosci84/lyft/ios/Modules/AppShortcuts/BUILD:25:10: Bundling, processing and signing AppShortcutsTests.__internal__.__test_bundle failed due to Exec failed due to IOException: null
com.google.devtools.build.lib.remote.BulkTransferException
at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:229)
at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:225)
at com.google.devtools.build.lib.exec.local.LocalSpawnRunner$SubprocessHandler.start(LocalSpawnRunner.java:315)
at com.google.devtools.build.lib.exec.local.LocalSpawnRunner$SubprocessHandler.runOnce(LocalSpawnRunner.java:217)
at com.google.devtools.build.lib.exec.local.LocalSpawnRunner$SubprocessHandler.run(LocalSpawnRunner.java:191)
at com.google.devtools.build.lib.exec.local.LocalSpawnRunner.exec(LocalSpawnRunner.java:141)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:241)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
but I can file a new issue if you think it's unrelated
Is this with the gRPC remote cache? AFAIK, the HTTP remote cache only supports HTTP 1.1. Also, can you provide the full stack trace? I think the interesting part is below where you cut it off.
This is with GCP RBE via gRPC. Super long log: https://gist.github.com/keith/53adb5908b9419443d682398d7dcb832
This seems to be working as expected from Bazel's point of view. You should bring this up with, err, RBE team? @EricBurnett @bergsieker
So, I created a remote execution service that artificially causes GOAWAY like the one you are seeing. Bazel retries the error 5 times (for a total of 6 calls) with exponentially increasing delays between calls. The default exponential backoff configuration is still fairly aggressive, with delays of 100ms, 200ms, 400ms, 800ms, and 1600ms. If the remote system is overloaded for a period of time, then it is not entirely surprising if it returns GOAWAY 6 times in ~3.1 seconds.
If this is a common issue, then it would look like a service misconfiguration to me. For example, the service could reduce the number of concurrent streams the client is allowed to open. Or it could use flow control to reduce the rate of incoming packets.
We could also increase the backoff times in Bazel. They look pretty aggressive to me.
@ulfjack I have a reproducible example, although I haven't been able to narrow it down to a small example repro I can share here. In order to hopefully collect more information I built bazel from source and replaced the netty-all-4.1.34.jar with the latest netty-all-4.1.52.jar and the new netty, however it provides a similar stacktrace without much info.
https://mvnrepository.com/artifact/io.netty/netty-all/4.1.52.Final
200911 22:33:11.745:WT 280 [com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run] Aborting evaluation while evaluating ActionLookupData{actionLookupKey=<snip> BuildConfigurationValue.Key[fef9552d036c94cb695ac8e87cfd860d9e76bea766e0fdd0a035c9050fb3dad6], actionIndex=598}
com.google.devtools.build.lib.skyframe.ActionExecutionFunction$ActionExecutionFunctionException: com.google.devtools.build.lib.actions.AlreadyReportedActionExecutionException: Linking of rule '<snip>' failed: I/O exception during sandboxed execution: null
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:329)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.google.devtools.build.lib.actions.AlreadyReportedActionExecutionException: Linking of rule '<snip>' failed: I/O exception during sandboxed execution: null
... 6 more
Caused by: com.google.devtools.build.lib.actions.UserExecException: I/O exception during sandboxed execution: null
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:96)
at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:469)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:241)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
at com.google.devtools.build.lib.rules.cpp.CppLinkAction.beginExecution(CppLinkAction.java:332)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:785)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:948)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:907)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:425)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:926)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:310)
... 5 more
Caused by: com.google.devtools.build.lib.remote.BulkTransferException
at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:229)
at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:225)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:122)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:91)
... 21 more
Suppressed: io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
Suppressed: io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
I've been able to confirm that HttpCacheClient in its get() method is trying to writeAndFlush() to a channel that is already closed after a succeedAndReset() call on that channel has closed it. I would expect that it shouldn't be possible to acquire a closed channel, but it seems like there is a race condition between a channel being released/closed and re-acquired/used.
@ulfjack -- I have created a small repo that reproduces the issue by building an open source project (protobufs): https://github.com/bjacklyn/bazel-remote-download-failure-repro
I can file a separate github issue for this. It would be great if you can confirm that you are able to reproduce what I am seeing ^ though. There seem to be a lot of dials and knobs in order to hit the issue, but if my understanding is correct the root cause seems straight forward -- the channel is closed after it has been released which causes the next guy who acquires it to fail.
@bjacklyn, is that related to this bug? I fixed a bug in the HTTP cache uploader / downloader that was due to a race condition where it tried to acquire a closed channel. That fix is here: https://github.com/bazelbuild/bazel/commit/e6089c9630daf60ea57b930d1d66d9b92980c0d6
Can you still repro the issue after that change?
Bazel should run the local action to rebuild the missing entry. I will work on this.
Most helpful comment
@keith to fallback to local execution. it's a bug either way.