Sometimes builds with HTTP remote caching enabled fail with `Failed to delete output files after incomplete download. Cannot continue with local execution.`
See e.g. the logs here: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/
I've had some difficulty making this bug reproducible, unfortunately: you need to have remote caching enabled and then fail to download an entry, which often still succeeds and falls back to local building.
A Debian Jessie based Docker container.

```
$ bazel info release
release 0.11.0
```
cc @buchgr @ixdy @rmmh
I've spent a fair bit of time trying to make this reproducible, unsuccessfully. Even digging through our logs, most runs are either successful or actual failures, but this happens often enough to be a bad experience.
Thanks for reporting. I'll take a look!
Still happening occasionally in Bazel 0.13.0. Failure from this morning: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/54372/pull-kubernetes-e2e-gce-device-plugin-gpu/31984/
```
W0515 15:50:29.365] INFO: Analysed target //build/release-tars:release-tars (1997 packages loaded).
W0515 15:50:29.370] INFO: Found 1 target...
W0515 15:50:29.510] [0 / 15] [-----] Writing file build/release-tars/kubernetes-manifests.args
W0515 15:50:34.291] [49 / 2,338] [-----] Writing file vendor/github.com/google/certificate-transparency-go/x509/go_default_library._cgo_.o-2.params
W0515 15:50:37.467] ERROR: /bazel-scratch/.cache/bazel/_bazel_prow/48d5366022b4e3197674c8d6e2bee219/external/io_bazel_rules_go/BUILD.bazel:9:1: GoStdlib external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /bazel-scratch/.cache/bazel/_bazel_prow/48d5366022b4e3197674c8d6e2bee219/execroot/__main__/bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg (Directory not empty)
```
Thanks @buchgr!
we're seeing this again with 0.15.0 :(
Test runs a and b both hit an error like:
```
ERROR: /bazel-scratch/.cache/bazel/_bazel_root/27a95fe2fc68aa1195ddf1a4bbaaea62/external/io_bazel_rules_go/BUILD.bazel:9:1: GoStdlib external/io_bazel_rules_go/linux_amd64_race_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /bazel-scratch/.cache/bazel/_bazel_root/27a95fe2fc68aa1195ddf1a4bbaaea62/execroot/__main__/bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_race_stripped/stdlib~/pkg/linux_amd64_race (Directory not empty)
```
Per the logs these tests ran in the image `gcr.io/k8s-testimages/kubekins-e2e:v20180627-0ed738715-experimental`, which is definitely 0.15.0:
```
$ docker run --rm -it --entrypoint=/bin/sh gcr.io/k8s-testimages/kubekins-e2e:v20180627-0ed738715-experimental
# bazel version
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Build label: 0.15.0
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jun 26 12:10:19 2018 (1530015019)
Build timestamp: 1530015019
Build timestamp as int: 1530015019
```
I've personally hit this 3x today; I think we will downgrade to 0.14.0 in the short term.
Edit: I can't be completely certain yet, but AFAICT this regressed somewhere between 0.14.0 and 0.15.0
@BenTheElder sorry about that :(. I don't understand what's happening there yet. Can you try running with `--remote_max_connections=200`? I think this will avoid the failures in the first place.
I will try to reproduce the error.
cc: @laurentlb might need a patch release once we understand what's going on there :(.
Hey thanks for the response!
> Can you try running with `--remote_max_connections=200`. I think this will avoid the failures in the first place.
Sure, I'll set up a test job with 0.15.0 and this flag and see if we can still repro. Am I correct that this flag is not available before 0.15?
Also FWIW, we're not in a huge rush to upgrade to any 0.15 features for Kubernetes, and we staged the rollout to just some smaller repo(s), not the main one, so we're fine for the moment after the rollback. We'll of course want to help hunt this down and fix it so we can keep up with stable, though.
Hey Ben,
so in 0.15.0 we made a change for all action outputs to be downloaded in parallel. There is no concurrency limit applied by default, but I figured the number of action outputs would probably be somewhat reasonable and thus there would be a natural limit. After the 0.15.0 release (after my vacation) I found that rules_go has actions with a very high number of outputs, and that we then run into open file descriptor limits and that kind of thing. I plan to add a ~200 connection limit in the 0.16.0 release, or in 0.15.1 if there is one.
I think with `--remote_max_connections=200` you will not encounter the error, but the error itself is still unexplained. It almost seems as if Bazel is interpreting a directory as a file.
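To illustrate what the limit does, here is a minimal Java sketch with hypothetical names (not Bazel's actual implementation): one download is scheduled per action output, and a semaphore caps how many hold a connection at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: schedule one download per action output, but let at
// most `maxConnections` of them hold a cache connection at any given time.
class BoundedDownloads {
  private final ExecutorService executor = Executors.newCachedThreadPool();
  private final Semaphore connections;

  BoundedDownloads(int maxConnections) {
    connections = new Semaphore(maxConnections);
  }

  List<Future<Void>> downloadAll(List<String> outputs) {
    List<Future<Void>> futures = new ArrayList<>();
    for (String output : outputs) {
      futures.add(executor.submit(() -> {
        connections.acquire(); // blocks once the limit is reached
        try {
          fetchFromCache(output); // hypothetical HTTP GET against the cache
        } finally {
          connections.release();
        }
        return null;
      }));
    }
    return futures;
  }

  private void fetchFromCache(String output) {
    // ... open a connection and stream the blob to disk (elided) ...
  }
}
```

Without such a cap, every pending download can hold an open connection and file descriptor at once, which is how an action with thousands of outputs runs into the descriptor limits mentioned above.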
Correct, the flag was introduced in 0.15.0
Got it, thanks! I'll start testing this after our team standup (~nowish).
I've not been able to reproduce with `--remote_max_connections=200` enabled for bazel 0.15 based jobs. With the flag accidentally disabled in the initial setup I did see this again, though. I'm going to keep kicking off some more runs, but I'm pretty confident that this fixes things for us.
> I think with `--remote_max_connections=200` you will not encounter the error, but the error itself is still unexplained. It almost seems as if Bazel is interpreting a directory as a file.
This sounds somewhat familiar with rules_go; any thoughts, @jayconrod?
Hmmm perhaps not... spotted one failure now.
```
W0701 01:48:08.092] WARNING: Error reading from the remote cache:
W0701 01:48:08.093] Exhaused retry attempts (0)
W0701 01:48:08.442] ERROR: /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/external/io_bazel_rules_go/BUILD.bazel:8:1: GoStdlib external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/execroot/__main__/bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg/linux_amd64 (Directory not empty)
```
As far as I can tell this run should have had bazel 0.15 with `--remote_max_connections=200` in the same bazelrc as the other caching flags.
@BenTheElder We have a rule that builds the Go standard library for the target platform and mode. It's used as an input for most other actions. It's a directory output, but I guess that's expressed as individual files over the wire.
I'm planning to change the implementation of that rule to split the standard library into pieces that can be used as separate inputs (tools, archives, sources, headers). We'd also stop including packages that can't be imported from other Go code (e.g., vendored and internal packages in the standard library). That should improve performance quite a bit, both locally and with remote execution.
I've added more debugging, since you couldn't actually tell for certain from the logs whether that flag was enabled; doing some more testing.
FYI: The error seems to be happening at this line: https://source.bazel.build/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/AbstractRemoteActionCache.java;l=239?q=AbstractRemoteAction
The remote caching protocol expects a file but gets a directory.
@buchgr @jayconrod sounds like a rules_go issue? Per the logs this appears to happen while trying to download `GoStdlib`.
@buchgr `GoStdlib` produces two directory outputs. Those directories can be pretty large and may contain a mix of files and subdirectories. That should be okay, right? I haven't heard of any limitations on the contents of directory outputs.
@BenTheElder `GoStdlib` is a particularly heavy directory output. It wouldn't surprise me if it triggered a bug in the interaction between directory outputs and remote caching. If you can think of a possible root cause in rules_go, please let me know, though.
Yeah, still seeing this with `--remote_max_connections=200` on 0.15.
Pod logs including the bazelrc settings auto-configured for caching: bazel-test-canary-podlog.txt
Ben, can you point me to instructions for reproducing this? That is, which project to check out and which targets to run. I'll try to understand what's happening tomorrow.
Sure, one target is `make bazel-test` against https://github.com/kubernetes/kubernetes master, with
```
startup --host_jvm_args=-Dbazel.DigestFunction=sha256
build --experimental_remote_spawn_cache
build --remote_local_fallback
build --remote_http_cache=http://bazel-cache.default:8080/kubernetes/kubernetes,d66682966f1853c098fdbb7294330b90
build --remote_max_connections=200
```
as extra options. `http://bazel-cache.default:8080` is an instance of the greenhouse cache server running within the CI cluster.
I'm not sure how trivial it is to accurately reproduce this locally; I think there needs to be some amount of network flakiness involved to trigger it.
That particular run used this podspec, which included `gcr.io/k8s-testimages/kubekins-e2e@sha256:02fe7ded2e8e477e52f3c82eab5fb6632ad7cd09612e2d7c5944d143974f3685` as the exact image it ran in (which includes the bazel binaries, gcc, rpmbuild, etc.).
I think this will also trigger with something like `bazel build //...` against https://github.com/kubernetes/test-infra with the same setup (though with a different cache path FWIW, something like `build --remote_http_cache=http://bazel-cache.default:8080/kubernetes/test-infra,d66682966f1853c098fdbb7294330b90`).
Those options come from this script running inside an image like the one above.
We are still seeing this as well with Bazel 0.15.2 and max connections set to 64.
```
(14:51:28) ERROR: /root/.cache/bazel/_bazel_root/7b7747ec045ae606eb720a1222f56098/external/io_bazel_rules_go/BUILD.bazel:8:1: Couldn't build file external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg: GoStdlib external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /root/.cache/bazel/_bazel_root/7b7747ec045ae606eb720a1222f56098/execroot/__main__/bazel-out/host/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/src (Directory not empty)
```
We're using the following options:

```
build:remote_cache --experimental_remote_spawn_cache
build:remote_cache --remote_local_fallback
build:remote_cache --remote_max_connections=64
build:remote_cache --remote_timeout=10
```
We are also using https://github.com/buchgr/bazel-remote as the remote cache (GCS is too slow :()
Maybe I'll be able to repro it by also using https://github.com/buchgr/bazel-remote as the backend.
I still can't repro, but I could verify that the go stdlib output is in fact a huge directory (22 KiB of metadata) by running this command inside the `ac` folder of the remote cache:

```
ls -I remote_execution.proto | xargs cat | protoc --decode=google.devtools.remoteexecution.v1test.ActionResult remote_execution.proto | grep -A10 output_directories
```
```
output_directories {
  path: "bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_static_pure_stripped/stdlib~/pkg"
  tree_digest {
    hash: "aa90fbe195e15cce6cef95268851cfe18ba240dba1a1671285cef57c1e990d76"
    size_bytes: 22019
  }
}
output_directories {
  path: "bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg"
  tree_digest {
    hash: "ea99dcc1a04a7d2ca5582522867b73ede975df0ea4ff6ea3282313d7acb44834"
    size_bytes: 22268
  }
}
```
Ok that took a while, but I believe I understand the error now.
So effectively we asynchronously trigger concurrent downloads for all output files, which gives us a list of futures, and at the end we wait for all downloads to finish. If one download fails, we immediately trigger a routine to delete all output files (which fails), so that we can continue with local execution instead.
However, when triggering this routine to delete all output files, we don't wait for all downloads to have finished; they continue downloading in the background. The deletion routine recursively deletes all files in a directory and then tries to delete the directory itself. That races with the async downloads that are still running in the background T_T.
Async programming is hard. I'll send out a fix and make sure it gets cherry-picked into 0.16.0, as this is a regression.
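To make the race concrete, here is a minimal sketch of the two orderings using plain `CompletableFuture`s and hypothetical names (the real logic lives in `AbstractRemoteActionCache`, and the actual fix may differ):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

class DownloadCleanupSketch {

  // Buggy ordering: join() throws on the first failed download, but sibling
  // downloads keep writing into outDir while we recursively delete it, so
  // deleting a directory can fail with "Directory not empty".
  static void buggy(List<CompletableFuture<Void>> downloads, Path outDir) {
    try {
      for (CompletableFuture<Void> d : downloads) {
        d.join();
      }
    } catch (RuntimeException e) {
      deleteRecursively(outDir); // races with still-running downloads
      // ... fall back to local execution ...
    }
  }

  // Fixed ordering: allOf() only completes once *every* download has settled
  // (successfully or not), so the output tree is quiescent before deletion.
  static void fixed(List<CompletableFuture<Void>> downloads, Path outDir) {
    CompletableFuture<Void> all =
        CompletableFuture.allOf(downloads.toArray(new CompletableFuture[0]));
    try {
      all.join();
    } catch (RuntimeException e) {
      deleteRecursively(outDir); // safe: no download is writing anymore
      // ... fall back to local execution ...
    }
  }

  // Deletes children before parents; the directory delete at the end is the
  // step that blows up if a racing download has re-created a file inside it.
  private static void deleteRecursively(Path root) {
    if (!Files.exists(root)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(root)) {
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The `(Directory not empty)` errors in the logs above are exactly what the buggy ordering produces: a still-running download re-creates a file inside a directory after its children were deleted but before the directory itself is removed.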
Thanks @buchgr!
This issue has been fixed in master (I hope :D). It will be cherry-picked into the next release candidate for 0.16.0. Would be great if you could give it a spin!
Will do! I'll monitor the rcs
It will be release candidate 3; it should be out later today.
Thank you! We'll start testing once the RC is out :smile:
I'm still seeing this issue using Bazel 0.17.1 and a medium-sized iOS application (using rules_apple 0.7.0 and rules_swift as of 7bbcf9584613169cda709b9c217f5ac29cc5a089). Unfortunately I can't share the app, but I'd be happy to help debug the issue.
My config:

```
build --spawn_strategy=standalone
build --genrule_strategy=standalone
build --experimental_objc_enable_module_maps
build --features swift.no_generated_module_map
build --symlink_prefix=build/
build --xcode_version 9.4.1
build --remote_http_cache=$REMOTE_CACHE_URL
```
And I'm building with the command:

```
bazel build --jobs 128 //:Learning
```

where `Learning` is an `ios_application`. The error message is:
```
ERROR: /Users/obonilla/r/learning-ios_trunk/BUILD.bazel:317:1: SwiftCompile Learning.swiftmodule failed: Unexpected IO error.: Exhaused retry attempts (0)
Target //:Learning failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /Users/obonilla/r/learning-ios_trunk/BUILD.bazel:1259:1 SwiftCompile Learning.swiftmodule failed: Unexpected IO error.: Exhaused retry attempts (0)
```
Can you run with `--verbose_failures` and see if you get the same error message?
Same error message.
```
ERROR: /Users/obonilla/r/learning-ios_trunk/LearningDataModel/BUILD.bazel:2:1: C++ compilation of rule '//LearningDataModel:LearningDataModel_ObjC' failed: Unexpected IO error.: Exhaused retry attempts (0)
```
This time, though, I noticed that buried in the compilation output there is this:

```
WARNING: Error reading from the remote cache:
ClosedChannelException
```
However, the output seems wrong, since I can't tell if it retried, and if so how many times. I also ran it with the flag `--remote_local_fallback`, which I expected would fall back to a local build if there was a problem with the remote cache, but it didn't appear to do anything.
Maybe this is a different issue... should I open a new Issue?
That's actually a different error, as the output does not contain `Failed to delete output files after incomplete download. Cannot continue with local execution.` It's possible that local fallback is broken :(. Please open a new issue.
I'm seeing this one with Bazel 0.20.0. I think a failure in the remote cache:
```
[538 / 5,501] CompileStrings scripts/bazel/deviceBuilds/appStore/Learning.resources/LIAuthLibraryBundle.bundle/nb.lproj/LIAuthLibrary.strings; 3s remote-cache ... (1426 actions, 1423 running)
WARNING: Error reading from the remote cache:
502 Bad Gateway
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
....
WARNING: Error reading from the remote cache:
Exhausted retry attempts (0)
```
triggered this again:
```
ERROR:... XibCompile .../CodeBlockView.ibtool-outputs failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: ...CodeBlockView.ibtool-outputs/CodeBlockView.nib (Directory not empty)
INFO: Elapsed time: 45.007s, Critical Path: 21.92s
```
This was with `--jobs 2000`, so it might be that same race?
> That's actually a different error, as the output does not contain `Failed to delete output files after incomplete download. Cannot continue with local execution.` It's possible that local fallback is broken :(. Please open a new issue.
I am getting a `ClosedChannelException` error when trying to use a bazel remote cache deployed on k8s. Was a separate issue opened for this?