This issue tracks the progress towards Remote Builds without the Bytes.
Enable it by adding `--experimental_remote_download_outputs=minimal` to the remote caching / execution invocation. The feature mostly works, but comes with the limitations stated in the subsequent tasks:
* `SymlinkAction`
* `--experimental_remote_download_outputs=toplevel` as described in the design document
* `SymlinkTreeStrategy`
* https://github.com/bazelbuild/bazel/issues/1922, https://github.com/bazelbuild/bazel/issues/5505
Please assign a priority (see the maintainers guide).
@buchgr Do I understand correctly that "Fix local actions that need to fetch remote outputs" is a prerequisite for developer machines to have cache hits from CI?
Also, can you elaborate a bit on "Work seamlessly with IDEs"? What won't work off the bat?
Alternatively, will we be able to run CI with `--experimental_remote_fetch_outputs=minimal` and developer machines without it, and still have cache hits?
@buchgr does this mean the change is only one commit? (GitHub says "This branch is 1 commit ahead, 770 commits behind bazelbuild:master.")
I'd like to try it out and since we're using 0.22.0 I want to rebase your changes (if easy and applicable) on top of 0.22.0
Edit: sorry, disregard, my bad, it does compile with 0.22.0!
Which Bazel version should we use to compile this? For me, `bazel build //src:bazel` on this repository didn't work with any Bazel release in {0.16.1, 0.17.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0}. I tried bootstrapping it, but that didn't work either. Thank you!
Hey, I was trying this branch on a very simple project today and it seems to fail to fetch binaries in some cases. I have a very simple `BUILD.bazel` file:

cc_binary(
    name = "cbin",
    srcs = ["main.c"],
)

where `main.c` is just a hello-world program.
I'm working in a clean repository against a fresh cache started with `docker run -v /tmp/bazel_empty_cache:/data -p 9090:8080 buchgr/bazel-remote-cache`.
I'm now running the following commands:
~/code/bazel/bazel-bin/src/bazel run //:cbin --experimental_remote_fetch_outputs=minimal
~/code/bazel/bazel-bin/src/bazel clean --expunge
~/code/bazel/bazel-bin/src/bazel run //:cbin --experimental_remote_fetch_outputs=minimal
The first invocation works completely fine and outputs "Hello World". However, after the clean this fails and I get the following output:
Starting local Bazel server and connecting to it...
INFO: Invocation ID: c2199026-669c-4271-9412-fb8abce40f1b
INFO: Analysed target //:cbin (8 packages loaded, 58 targets configured).
INFO: Found 1 target...
Target //:cbin up-to-date:
bazel-bin/cbin
INFO: Elapsed time: 3.058s, Critical Path: 0.56s
INFO: 2 processes: 2 remote cache hit.
INFO: Build completed successfully, 6 total actions
ERROR: Non-existent or non-executable /home/moritz/.cache/bazel/_bazel_moritz/9461a41f957a379ce0aaa8755f8e70f9/execroot/__main__/bazel-ou
INFO: Build completed successfully, 6 total actions
So the binary was never downloaded. The problem appears to be in `RemoteSpawnCache`: if I change the `MINIMAL` case to behave like `ALL`, the binary is downloaded again, but of course that is not the solution, as it goes against the whole purpose of this patch. I am not entirely sure how this is supposed to work. If I change `CppLinkAction` to add the output to `requiredLocalOutputs`, then it also works, but if I understand the patch correctly this shouldn't be necessary.
/subscribe
+1. For me, it sometimes fetches the binary, and sometimes doesn't. I'm building `protoc_test`, and it always says:
Target //:protoc_test up-to-date:
bazel-bin/protoc_test
but sometimes the binary won't be there. I haven't figured out how to reproduce reliably, but it happens often enough that it shouldn't be difficult.
This is pure awesomeness! What is the current status?
We've done some testing today and were able to run some builds, and get some performance numbers, which looked good. (It wasn't me personally, so I can't say if the referenced branch works as-is. YMMV.)
@buchgr is in the process of splitting the commit up into smaller pieces that can be independently submitted.
Re IDE support: depending on how the Bazel/IDE integration works, the IDE may need some intermediate files. For example, IntelliJ may want to read intermediate .jar files with generated code to support code completion and incremental testing. We've seen such problems in the past, so it's good to be aware of that. However, it's probably not a blocker for CI.
Re cache hits: Cache hits are independent of whether you download the files or not. You can even say `=none` (if such an option exists) on CI and still get cache hits on dev machines, and vice versa.
@ittaiz yes to all of what @ulfjack wrote. I'll update the branch tomorrow morning with a more up-to-date version that fixes some issues. I am also talking to Ola about the issue she found.
@buchgr
Any chance you can ping this issue when you think it's relevant to "take it out for a spin"? Just to get a feel for the gain we can make.
@ittaiz will do!
@buchgr any chance this will make 0.25?
@ittaiz yep, if nothing goes bad it will!
🤞
Btw, what does "Be compatible with the Build Event Service local file uploader: This change currently doesn't interact well with the BEP and BES." mean?
Will we be able to use it with ResultStore?
@ittaiz the last change has just been merged and will be in Bazel 0.25 :-) Yay!
Btw, what does "Be compatible with the Build Event Service local file uploader: This change currently doesn't interact well with the BEP and BES." mean?
Will we be able to use it with ResultStore?
Yes, but you will not be able to see artifacts produced by Bazel (test.log, etc.). The plan is to fix interactions with ResultStore in Bazel 0.26.
Awesome! Fantastic news! Thank you for your hard work.
I'm not sure if we'll be able to switch our CI without ResultStore support but I'll see if I can add it to some early adopters somehow
@buchgr we weren't able to find internal users willing to use it without ResultStore :(
Are you still aiming to land support for ResultStore in 0.26?
@ittaiz yes
This is awesome, thanks for working on this! We've been testing it out over the past few days and it's working great. In particular, we've seen a fairly significant speedup in our CI runs.
For now I have just one bit of feedback: the behavior for `toplevel` with `build` was a bit surprising. Perhaps I'm doing something wrong, but it appears that it does not fetch the output runfiles, right? My expectation was that I would be able to `build` a target with `toplevel` and then execute the downloaded output, even if it had outputs from upstream actions in its runfiles. Any chance we can have that behavior changed? It also seems a bit inconsistent with how `test` and `coverage` work; in those cases the runfiles are downloaded in order to execute the tests.
FWIW, this is not blocking and fairly easy to work around: we just added an additional output group with all of the files (output + runfiles) and specify building that group on the command line. A sketch of that workaround follows below.
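For anyone wanting to replicate that workaround, here is a minimal Starlark sketch; the rule and output-group names (`with_runfiles`, `runnable`) are hypothetical illustrations, not the commenter's actual code:

```python
# Hypothetical wrapper rule: re-export a target's default outputs *and* its
# runfiles as a single output group, so that building that group forces
# everything needed to execute the target to be downloaded.
def _with_runfiles_impl(ctx):
    info = ctx.attr.target[DefaultInfo]
    return [OutputGroupInfo(runnable = depset(transitive = [
        info.files,                   # the declared outputs (e.g. the binary)
        info.default_runfiles.files,  # everything in the runfiles tree
    ]))]

with_runfiles = rule(
    implementation = _with_runfiles_impl,
    attrs = {"target": attr.label(mandatory = True)},
)
```

Building with something like `--output_groups=runnable` should then fetch the binary together with its runfiles, assuming the download mode respects `--output_groups` (which is discussed further below).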
Encountered this error on the 0.25 rc when using `=minimal`. The problem refused to correct itself, even with `clean`, even with `shutdown`; I had to remove the download_outputs directive to proceed.
Note that this failed when trying to read an input of the action and that the input in question was a generated protobuf C++ source under bazel-bin.
```
ERROR: /path/to/BUILD:XXX:1: C++ compilation of rule '
at com.google.devtools.build.lib.remote.ByteStreamUploader.uploadBlobs(ByteStreamUploader.java:172)
at com.google.devtools.build.lib.remote.GrpcRemoteCache.ensureInputsPresent(GrpcRemoteCache.java:257)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$0(RemoteSpawnRunner.java:243)
at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:104)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:234)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:224)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:113)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:78)
at com.google.devtools.build.lib.exec.ProxySpawnActionContext.exec(ProxySpawnActionContext.java:52)
at com.google.devtools.build.lib.rules.cpp.CppCompileAction.execute(CppCompileAction.java:1340)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:830)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:965)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:937)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:114)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:78)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:563)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:707)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:253)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:453)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:387)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Failed to read next chunk.
at io.grpc.Status.asRuntimeException(Status.java:517)
at com.google.devtools.build.lib.remote.ByteStreamUploader$AsyncUpload$1.onClose(ByteStreamUploader.java:356)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:41)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:684)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:41)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:391)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:475)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:557)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:478)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:590)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
... 3 more
Caused by: java.io.FileNotFoundException:
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(Unknown Source)
at java.base/java.io.FileInputStream.
at java.base/java.io.FileInputStream.
at com.google.devtools.build.lib.vfs.AbstractFileSystem$ProfiledFileInputStream.
at com.google.devtools.build.lib.vfs.AbstractFileSystem.createFileInputStream(AbstractFileSystem.java:67)
at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:47)
at com.google.devtools.build.lib.vfs.Path.getInputStream(Path.java:858)
at com.google.devtools.build.lib.remote.Chunker$Builder.lambda$setInput$4(Chunker.java:288)
at com.google.devtools.build.lib.remote.Chunker.maybeInitialize(Chunker.java:221)
at com.google.devtools.build.lib.remote.Chunker.next(Chunker.java:166)
at com.google.devtools.build.lib.remote.ByteStreamUploader$AsyncUpload$1.onReady(ByteStreamUploader.java:377)
at io.grpc.ForwardingClientCallListener.onReady(ForwardingClientCallListener.java:46)
at io.grpc.ForwardingClientCallListener.onReady(ForwardingClientCallListener.java:46)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamOnReady.runInContext(ClientCallImpl.java:607)
... 5 more
```
@dicarlo2 thanks for the feedback!
For now I have just one bit of feedback: the behavior for `toplevel` with `build` was a bit surprising. Perhaps I'm doing something wrong, but it appears that it does not fetch the output runfiles, right? My expectation was that I would be able to `build` a target with `toplevel` and then execute the downloaded output, even if it had outputs from upstream actions in its runfiles. Any chance we can have that behavior changed? It also seems a bit inconsistent with how `test` and `coverage` work; in those cases the runfiles are downloaded in order to execute the tests.
It's complicated. Technically, runfiles aren't outputs, and that's why it doesn't work (yet). I don't have a good idea of how to make it work yet, but I'll make it work eventually.
Are you using `bazel run`, or are you manually invoking the binary?
@werkt It looks like the file Bazel is trying to upload is an output file. So it seems the file fell out of the cache during the build, which means that no copy of it exists anywhere anymore. There is currently no way to recover from this. I'll send a patch to provide a better error message.
It's complicated. Technically, runfiles aren't outputs, and that's why it doesn't work (yet). I don't have a good idea of how to make it work yet, but I'll make it work eventually.
Ah, right, makes sense.
Are you using `bazel run`, or are you manually invoking the binary?
IIRC we do a bit of both.
@werkt It looks like the file Bazel is trying to upload is an output file. So it seems the file fell out of the cache during the build, which means that no copy of it exists anywhere anymore. There is currently no way to recover from this. I'll send a patch to provide a better error message.
This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.
This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.
Do I understand correctly that your backend regularly returns action cache entries that reference output files that don't exist? We could introduce a second RPC to call FindMissingBlobs(...), but that would require two round trips for every cache hit, which might be acceptable.
However, I am not convinced that it would materially improve things, as it would only catch cases as mentioned above but not outputs being evicted at a later stage.
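To make the proposed extra round trip concrete, here is a hedged Python sketch of such a check, assuming client stubs generated with protoc from the Remote Execution API's remote_execution.proto; the channel and instance name are placeholders, and this is not Bazel's actual code:

```python
import grpc

from build.bazel.remote.execution.v2 import remote_execution_pb2
from build.bazel.remote.execution.v2 import remote_execution_pb2_grpc


def missing_outputs(channel, instance_name, action_result):
    """Returns digests referenced by an ActionResult that the CAS no longer has."""
    cas = remote_execution_pb2_grpc.ContentAddressableStorageStub(channel)
    request = remote_execution_pb2.FindMissingBlobsRequest(
        instance_name=instance_name,
        # Output directories would additionally need their tree digests checked.
        blob_digests=[f.digest for f in action_result.output_files],
    )
    return cas.FindMissingBlobs(request).missing_blob_digests


# A client could treat a non-empty result as a cache miss and fall back to
# re-execution, instead of failing later when a download comes up empty.
```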
This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.
Do I understand correctly that your backend regularly returns action cache entries that reference output files that don't exist? We could introduce a second RPC to call FindMissingBlobs(...), but that would require _two_ round trips for every cache hit, which might be acceptable.
You are correct. I removed all validation from AC fetches long ago. I suppose there are some heuristics to be applied here with metadata and misses, but that will be unable to correct an individual build with its client behavior, just future fetches.
However, I am not convinced that it would materially improve things, as it would only catch cases as mentioned above but not outputs being evicted at a later stage.
Yeah, transiency is going to be an issue here in general, but this is certainly not meant to combat it directly
I'm not sure how I feel about it, but Mark added code to Bazel to support action rewinding, which might cover this use case.
@ulfjack thanks for pointing this out.
At this point I'd prefer not to use action rewinding for missing build outputs in Bazel. My main argument against it is that a prerequisite for rewinding is that all actions be bit-for-bit reproducible (we could of course detect non-reproducible actions, but some actions might legitimately not be bit-for-bit reproducible). I think I'd prefer a solution built around attaching TTLs to build outputs.
Normally, when Bazel first fetches an ActionResult and then its referenced output files, an HTTP remote cache could calculate LRU simply based on file system access time, since the output files have a later access time.
"Remote Builds without the Bytes" seems very promising, but without action rewinding, the remote caches might have to understand how entries in the CAS relate to each other in order to garbage collect entries based on LRU without causing build failures.
Would TTLs be attached to all entries in the remote CAS? Would the remote caches then be FIFO instead of LRU? Or do you imagine the TTL being updated somehow?
I don't really see how action rewinding precludes reproducibility. With the required re-execution of an action in order to produce an output file comes a new digest that should satisfy the downstream actions; just remap/recalculate the actions downstream of a rewound action based on the newly mapped digest result (and give me an option to make that a build ERROR, because I don't want non-reproducible actions, ever).
I don't think it precludes reproducibility, and we don't guarantee bit-for-bit reproducibility either.
Does rewinding work with open-source Skyframe? I haven't seen a single public test for it.
It does not work for the mode of Skyframe where it keeps reverse edges. I think that code is all in Skyframe, however, I'm not sure if it can be enabled or if it works. It can probably be made to work with reverse edges, but that has not been a priority.
(Mark here.) What @werkt and @ulfjack say about rewinding sounds right to me too.
The current implementation of rewinding deals with nondeterministic rewound actions by invalidating the intersection of the rewound action's transitive reverse deps and the failed (due to the lost input) action's transitive deps.
That subgraph is usually small: failed action -> lost output artifact -> generating action. It can sometimes be larger, e.g. when the lost output artifact is in a runfiles collection.
This could be extended to invalidating _all_ of the rewound action's transitive reverse deps. That would have to happen after it's made compatible with Skyframe modes that track reverse edges, and would raise questions about how to handle reporting for actions run twice in a single build.
Agreed that failing the build in the event of an unequal rewind is a feasible extension.
We could make some existing rewind tests public after making it compatible with `--notrack_incremental_state`.
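To make the invalidation rule above concrete, here is a small illustrative sketch; the graph representation and function names are hypothetical, not Bazel's actual Skyframe types:

```python
# Invalidate the intersection of the rewound action's transitive reverse
# deps and the failed action's transitive deps, as described above.
def reachable(start, edges):
    """Nodes reachable from `start` (inclusive) via `edges`, a dict: node -> iterable of nodes."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nodes_to_invalidate(rewound, failed, deps, reverse_deps):
    # Usually a small subgraph: failed action -> lost artifact -> generating action.
    return reachable(rewound, reverse_deps) & reachable(failed, deps)
```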
@buchgr we weren't able to find internal users willing to use it without ResultStore :(
Are you still aiming to land support for ResultStore in 0.26?
I'm guessing this didn't make it in, right? Any chance it will be a simple cherry-pick on top of 0.26, so we can run with a custom Bazel version until 0.27 lands?
I can provide you a cherry-pick for 0.26. I hope to merge it next week in time for 0.27.
I'd love that, thanks!
@buchgr can I get the cherry-pick? are you waiting for the release itself?
@buchgr is out of office this week.
@buchgr hope you're back in the office and all is well.
Any chance for that cherry pick? Really eager to test this feature on our CI
@buchgr hope I'm not harassing you too much, but we'd really like to try out the BES support.
Can we get the cherry-pick?
Also, did it make it into the 0.27 rc?
Is anyone using this feature on CI? I'd love to hear how you're getting around the lack of BES/BEP together with the lack of artifacts.
We really need this perf win (many minutes of ridiculously long build times for a fully cached build :( ).
Tested this out on 0.27.0 and some initial feedback
@buchgr @dslomov is there any chance this will make 0.28 rc?
@phb can you share a reproducer please? Runfiles should be available...
@ittaiz BES support has been merged in master and will be in 0.28.
OMG! I'm literally dancing right now. Thanks! Any chance I can cherry-pick a commit or two onto 0.27 to try it out?
@ittaiz https://github.com/buchgr/bazel/commits/0.27.0-with-bes ... I believe this should work.
@buchgr That's great to know. I'll dig into it and see what's happening.
Repro here: https://github.com/phb/bazel_minimal_missingrunfiles, see https://github.com/phb/bazel_minimal_missingrunfiles/blob/master/EXAMPLE_OUTPUT for output when I run it locally with a cache.
It seems to happen on any targets using the docker_rules.
--- update ---
Spent a long time trying to figure out what's going on here. I feel confident that rules_docker is doing the right thing; all runfiles seem to be listed correctly.
I couldn't actually find any code that downloads the runfiles before executing a test or running a command. I imagine it should work similarly to how prefetching is used to fetch build dependencies before starting a local build process when the deps are remote, but I wasn't able to find any code for that.
@buchgr any pointers would be very appreciated and I will be happy to continue debugging this.
https://github.com/bazelbuild/bazel/issues/8508 is a blocker for this feature to us, is there any chance that could be prioritized?
cc @ishikhman
@buchgr were you able to run the reproducer?
@phb apologies, I lost track of your comment, probably because I was on vacation in between. I haven't tested out your example. Can you please open a separate issue about this, so that it's easier to track? The code to fetch remote inputs for local actions is in "RemoteActionInputFetcher.java".
@buchgr no worries. Filed #8899. I was looking at that file and I can see it being invoked correctly when there is a build action that depends on remote artifacts (including runfiles) but it doesn't seem to be invoked for a runnable target, if that's even a real distinction. Hope you can find some time to take a look, very appreciated.
_Remote Builds without the Bytes_ has huge potential, but there are still issues and rough edges:
Are there plans to finish _Remote builds without the bytes_? Or is it abandoned?
Certainly not abandoned - BwtB is useful for, and used in production on, a decent variety of use-cases at this point; if you encounter any regressions you can expect them to be fixed. I'd mostly classify it as "if it works, it works" from a completeness standpoint though, in that I'm not aware of _flake_ issues - beyond trying to use it with a lossy CAS, anyways (#10880) - but there are still certainly a number of feature gaps and polish issues to resolve, as you've highlighted.
To your underlying question, I'm not aware of anyone actively working on it on the bazel side, though I expect that's a temporary state. @philwo , anything to add?
I'm interested in working on this myself, or funding work on it. Note that I no longer work at Google, and @buchgr has also left the team.
Should `--remote_download_outputs=toplevel` respect `--output_groups`? I'm trying to find a way to get the top-level target, but also some auxiliary files, downloaded.
It should, in the sense that it should give you the named outputs of the top-level target(s). Not sure if that's what you're asking for. I don't think a reasonable interpretation would be for it to give you auxiliary files from all intermediate targets, though.
I also cannot say whether it does - my suggestion there would simply be to try it, and if you find that extra outputs from a named target are not being downloaded as expected, file a bug for that.
I'm not aware of flake issues - beyond trying to use it with a lossy CAS
According to https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md, the requirement that build-without-bytes places on the CAS is not only that it must be lossless, but that it must also provide consistent read-after-write, which I think is not required when not using build-without-bytes? This seems to imply, for example, that S3 or Redis backed remote caches cannot be safely used together with build-without-bytes, even if we never explicitly delete blobs from the cache.
Is this something that you think should be documented in https://docs.bazel.build/versions/master/remote-caching.html ? For example:
When choosing a server, consider:
* Networking speed. For example, if your team is in the same office, you may want to run your own local server.
* Security. The remote cache will have your binaries and so needs to be secure.
* Ease of management. For example, Google Cloud Storage is a fully managed service.
+ When using the remote build without the bytes feature (either with remote caching or
+ remote execution), the server must satisfy additional requirements:
+
+ * Consistency model. Build-without-bytes requires read-after-write consistency.
+ For example, S3 only guarantees eventual consistency.
+ * Eviction. Build-without-bytes requires that CAS blobs are not evicted.
it must also provide consistent read-after-write, which I think is not required when not using build-without-bytes.
It's maybe not strictly required to have read-after-write when not using build-without-the-bytes, in that bazel can paper over consistency gaps to a large degree, but if you don't have it it will be significantly detrimental to your implementation in practice. The API has many cases where read-after-write is expected - sending an action for execution involves writing inputs and a description of that action, calling Execute, and then the API server immediately reading that description of the action followed by dispatching to workers who will read the inputs themselves.
Remote caches used without execution are maybe safest there, in that bazel will only try at most once to download the inputs to an action it gets from the cache before falling back to local execution, so it can never fail the build entirely. But also, if you put heavy load on a Remote Cache without read-after-write you're going to get a lot of false starts (get a hit on an action, download some outputs, then hit one that's unexpectedly missing)...I don't know how much it will impact performance in aggregate though.
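For context, the read-after-write expectation in that flow can be sketched with the same assumed generated REAPI stubs as in the earlier sketch (again a hedged illustration, not a real client):

```python
from build.bazel.remote.execution.v2 import remote_execution_pb2
from build.bazel.remote.execution.v2 import remote_execution_pb2_grpc


def run_remotely(channel, instance_name, action_digest):
    # The Action and its inputs must already have been written to the CAS;
    # Execute asks the server to immediately read them back. A CAS without
    # read-after-write consistency can spuriously miss right here.
    execution = remote_execution_pb2_grpc.ExecutionStub(channel)
    request = remote_execution_pb2.ExecuteRequest(
        instance_name=instance_name,
        action_digest=action_digest,
    )
    return execution.Execute(request)  # a stream of Operation updates
```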
I'd personally be happiest to just document that read-after-write is expected from _all_ CAS implementations, with Eviction Policy an additional requirement for build-without-the-bytes. WDYT?
An implementation that doesn't guarantee read-after-write may still return the expected result in most cases, say 99.999%, just not when there's a failure in a lower-level system. Being unable to use S3 or Redis for storage seems like a significant drawback, and maybe we can make the protocol in general and Bazel specifically just a little more resilient to lost inputs on the server?
Being resilient to spurious delays - read-after-write failing, but retrying XXms later succeeding - sure, seems doable, just need some amount of backoff+retries for NOT_FOUND errors.
Resilience to inputs permanently lost immediately after write - sure, I guess, though I don't know why a storage system would ack writes but lose them immediately, except by cutting corners and ack'ing writes it hasn't actually persisted yet.
Resilience to inputs lost arbitrarily later (e.g. due to failure in lower-level system) - I don't care to support this use-case, personally; on the tiny subset of builds affected by it better to simply retry the whole build. (Maybe bazel should offer internal whole-build retries for that?) But for people who _do_ want this, not using build-without-the-bytes should suffice here.
But if you're alluding to "rewinding", I remain unconvinced that bazel rewinding arbitrarily far back is worthwhile complexity, and I think "build without the bytes" is too important a feature to be willing to turn it off in most cases where it could otherwise be used. (But as I said, others are free to make a different judgement there, and bazel _without_ "build without the bytes" should continue to support lossy-CAS use-cases).
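A minimal sketch of that first mitigation, retrying NOT_FOUND with exponential backoff; `fetch` is a stand-in for any CAS read, and the retry parameters are arbitrary:

```python
import time

import grpc


def fetch_with_backoff(fetch, attempts=5, initial_delay=0.05):
    """Calls `fetch()`, retrying NOT_FOUND errors with exponential backoff."""
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return fetch()
        except grpc.RpcError as e:
            # Only NOT_FOUND plausibly indicates a read-after-write gap;
            # other errors (and the final attempt) surface to the caller.
            if e.code() != grpc.StatusCode.NOT_FOUND or attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2
```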
I was thinking of even more basic functionality - as far as I can tell, there's no way for the execution system to tell the client that a file went missing in an "execute" call; the only way out is a hard fail.
I agree that rewinding comes with a lot of complexity, which I'd rather not have in Bazel. At this point, it seems safe to say that that ship has sailed. The code is there except it's not currently usable in Bazel, which seems like the worst of both worlds.
As long as the remote execution protocol has a way of telling the client on any call - "sorry, some digests got lost". At the very least, then the client has a choice of what to do. Does that make sense?
We don't expect the execution system to know _why_ files are missing (i.e. that they existed and got lost, vs having never existed at all), but it can indeed return that some expected inputs are missing: ref. I think that suffices for allowing the client to decide what to do?
Thanks for pointing that out - I didn't see that bit. Bazel currently doesn't use it, but it should be feasible to fix.
I try to teach our users to trust Bazel and avoid their old habits of "make clean". Having situations (https://github.com/bazelbuild/bazel/issues/10880) requiring a manual "bazel shutdown" or "bazel clean" would ruin that trust.
Rewinding arbitrarily far back might not be necessary. Retrying the whole build could be good enough, as long as it happens automatically, so that the user can still trust Bazel to take care of it.
@ulrfa: big +1 to that; `bazel shutdown` or `bazel clean` should never be necessary.
I think for #10880 specifically Bazel should be calling GetMissingBlobs rather than blindly assuming that blobs referenced from an arbitrarily stale cache will still exist. More generally, within the context of every build I'd suggest Bazel should validate that all remotely-referenced blobs exist, or alternatively should take any unexpectedly missing blobs as a signal to aggressively re-validate its caches, so that at worst, for a blob lost mid-build, that _single_ build will fail, but retries will succeed. (In-Bazel retries could sort of mask that too, but I haven't thought them through sufficiently to have a strong opinion on whether they're actually a good idea on balance.)
Revalidating all files is almost certainly not an option from a performance point of view - that would be the same cost as running a locally clean build every time. The only performant option is to use rewinding. :-p
Technically, we could batch the stats, and bypass the entire execution machinery, but that is also a lot of complexity, and would also be incompatible with BuildFarm, IIUC what @werkt said the other day.
@ulfjack I meant issuing a `FindMissingBlobs` call for all the blobs you expect to still exist. That should be far cheaper than running a clean local build, no? I know that Bazel does a whole lot of (relatively) heavy work on a clean build, including building and traversing the action graph, hashing local files, building Merkle trees, etc. What I'm suggesting is limited to enumerating the blob references in the virtual filesystem and issuing batch calls for those. Note that Bazel is already extremely prolific in terms of the FindMissingBlobs calls it makes; it already checks hundreds of blobs' presence per action lookup.
But I'm not an expert on bazel - I'll happily defer to you on the performant way to make bazel gracefully handle blobs that go missing.
(In what way is this incompatible with BuildFarm? The fact that blobs could still go missing mid-build?)
Not sure what I said the other day, but there are definitely scenarios that BuildFarm does not currently guard against that violate promises that RBw/oB today relies upon.
Scheduled removals are obviously manipulable, though determining a promise TTL is problematic. Unscheduled removals are extremely common when scaling fleets for cost effectiveness, and represent a difficult-to-constrain risk for content that was, in the original implementation, expected to be ephemeral. I certainly expected clients to always behave as though content in the CAS was unreliable, and to take all recovery steps necessary to reestablish prerequisites and converge on successful operations and outputs.
The Buildbarn ADR that @scele linked was written by me when trying to get Builds without the Bytes working. All of that work has landed in the meantime, meaning that Buildbarn can be configured to do Builds without the Bytes properly. So with that in mind I don't have any strong opinions on this matter -- as long as the requirements aren't made any stronger.
Buildbarn's storage is now LRU-ish. Both FindMissingBlobs() and GetActionResult() now touch all relevant objects, so both on the input and output side of build actions there shouldn't be any loss, except if too much of the infra were to fail at once. There is a Prometheus metric that gets exposed that provides the worst-case retention duration. As long as you put an alert on that and make sure it doesn't get too small, you're good.
Still, the main reason I'm chiming in here is to respond to @werkt:
- Since the lifetime of a bazel daemon is only maintained by an idle timeout, without specific intervention or client manipulation, this uptime is effectively indefinite, if not practically.
Yeah, this is something I noticed as well. I'm also of the suspicion that this cache/knowledge is currently retained when switching between build clusters, meaning that if you ^C a build half way and change the gRPC endpoint, resuming the build causes failures.
Related issues:
* `--remote_download_toplevel` doesn't download symlinked files
* `--remote_download_minimal` (this was fixed in 03246077f948f2790a83520e7dccc2625650e6df)
* `--remote_download_toplevel` if remote execution is disabled
if remote execution is disabled@ulfjack #11019. while not strictly RBwtB, it impacted us while using it.
@ulfjack I opened an issue on BEP failing with `--remote_download_minimal`: https://github.com/bazelbuild/bazel/issues/11942
Maybe this could be added to your list?
Thanks, @exoson, added to the list.
@brentleyjones, unfortunately, I haven't had time to take a look yet.
`bazel run` seems to not work with `--remote_download_minimal`: https://github.com/bazelbuild/bazel/issues/12015