This issue tracks the progress towards Remote Builds without the Bytes.
Enable it by adding `--experimental_remote_download_outputs=minimal` to the remote caching / execution invocation. The feature mostly works, but comes with the limitations stated in the subsequent tasks:
* `SymlinkAction`
* `--experimental_remote_download_outputs=toplevel` as described in the design document
* `SymlinkTreeStrategy`
* https://github.com/bazelbuild/bazel/issues/1922, https://github.com/bazelbuild/bazel/issues/5505
Please assign a priority (see the maintainers guide).
@buchgr Do I understand correctly that "Fix local actions that need to fetch remote outputs" is a prerequisite for developer machines to have cache hits from CI?
Also, can you elaborate a bit on "Work seamlessly with IDEs"? What won't work off the bat?
Alternatively, will we be able to run CI with `--experimental_remote_fetch_outputs=minimal` and developer machines without it, and still have cache hits?
@buchgr does this mean the change is only one commit? (GitHub says "This branch is 1 commit ahead, 770 commits behind bazelbuild:master.")
I'd like to try it out and since we're using 0.22.0 I want to rebase your changes (if easy and applicable) on top of 0.22.0
Edit: sorry, disregard, my bad, it does compile with 0.22.0!
Which Bazel version should we use to compile this? For me, `bazel build //src:bazel` on this repository didn't work with any Bazel release in {0.16.1, 0.17.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0}. I tried bootstrapping it, but that didn't work either. Thank you!
Hey, I was trying this branch on a very simple project today and it seems to fail to fetch binaries in some cases. I have a very simple `BUILD.bazel` file:

cc_binary(
    name = "cbin",
    srcs = ["main.c"],
)

where `main.c` is just a hello-world program.
I'm working in a clean repository against a fresh cache started with `docker run -v /tmp/bazel_empty_cache:/data -p 9090:8080 buchgr/bazel-remote-cache`.
I'm now running the following commands:
~/code/bazel/bazel-bin/src/bazel run //:cbin --experimental_remote_fetch_outputs=minimal
~/code/bazel/bazel-bin/src/bazel clean --expunge
~/code/bazel/bazel-bin/src/bazel run //:cbin --experimental_remote_fetch_outputs=minimal
The first invocation works completely fine and outputs "Hello World". However, after the clean this fails and I get the following output:
Starting local Bazel server and connecting to it...
INFO: Invocation ID: c2199026-669c-4271-9412-fb8abce40f1b
INFO: Analysed target //:cbin (8 packages loaded, 58 targets configured).
INFO: Found 1 target...
Target //:cbin up-to-date:
bazel-bin/cbin
INFO: Elapsed time: 3.058s, Critical Path: 0.56s
INFO: 2 processes: 2 remote cache hit.
INFO: Build completed successfully, 6 total actions
ERROR: Non-existent or non-executable /home/moritz/.cache/bazel/_bazel_moritz/9461a41f957a379ce0aaa8755f8e70f9/execroot/__main__/bazel-ou
INFO: Build completed successfully, 6 total actions
So the binary was never downloaded. The problem appears to be in `RemoteSpawnCache`: if I change the `MINIMAL` case to behave like `ALL`, the binary is downloaded again, but of course that is not the solution, as it goes against the whole purpose of this patch. I am not entirely sure how this is supposed to work. If I change `CppLinkAction` to add the output to `requiredLocalOutputs`, then it also works, but if I understand the patch correctly this shouldn't be necessary.
/subscribe
+1. For me, it sometimes fetches the binary, and sometimes doesn't. I'm building `protoc_test`, and it always says:
Target //:protoc_test up-to-date:
bazel-bin/protoc_test
but sometimes the binary won't be there. I haven't figured out how to reproduce reliably, but it happens often enough that it shouldn't be difficult.
This is pure awesomeness! What is the current status?
We've done some testing today and were able to run some builds, and get some performance numbers, which looked good. (It wasn't me personally, so I can't say if the referenced branch works as-is. YMMV.)
@buchgr is in the process of splitting the commit up into smaller pieces that can be independently submitted.
Re IDE support: depending on how the Bazel/IDE integration works, the IDE may need some intermediate files. For example, IntelliJ may want to read intermediate .jar files with generated code to support code completion and incremental testing. We've seen such problems in the past, so it's good to be aware of that. However, it's probably not a blocker for CI.
Re cache hits: Cache hits are independent of whether you download the files or not. You can even say `=none` (if such an option exists) on CI and still get cache hits on dev machines, and vice versa.
@ittaiz yes to all of what @ulfjack wrote. I'll update the branch tomorrow morning with a more up-to-date version that fixes some issues. I am also talking to Ola about the issue she found.
@buchgr
Any chance you can ping this issue when you think it's relevant to "take it out for a spin"? Just to get a feel for the gain we can make.
@ittaiz will do!
@buchgr any chance this will make 0.25?
@ittaiz yep, if nothing goes bad it will!
🤞
Btw, what does "Be compatible with the Build Event Service local file uploader: This change currently doesn't interact well with the BEP and BES." mean?
Will we be able to use it with ResultStore?
@ittaiz the last change has just been merged and will be in Bazel 0.25 :-) Yay!
Btw, what does "Be compatible with the Build Event Service local file uploader: This change currently doesn't interact well with the BEP and BES." mean?
Will we be able to use it with ResultStore?
Yes, but you will not be able to see artifacts produced by Bazel (test.log, etc.). The plan is to fix interactions with ResultStore in Bazel 0.26.
Awesome! Fantastic news! Thank you for your hard work.
I'm not sure if we'll be able to switch our CI without ResultStore support but I'll see if I can add it to some early adopters somehow
@buchgr we weren't able to find internal users willing to use it without ResultStore :(
Are you still aiming to land support for ResultStore in 0.26?
@ittaiz yes
This is awesome, thanks for working on this! We've been testing it out over the past few days and it's working great. In particular, we've seen a fairly significant speedup in our CI runs.
For now I have just one bit of feedback: the behavior for `toplevel` with `build` was a bit surprising. Perhaps I'm doing something wrong, but it appears that it does not fetch the output runfiles, right? My expectation was that I would be able to `build` a target with `toplevel` and then execute the downloaded output, even if it had outputs from upstream actions in its runfiles. Any chance we can have that behavior changed? It also seems a bit inconsistent with how `test` and `coverage` work; in those cases the runfiles are downloaded in order to execute the tests.
FWIW, this is not blocking and fairly easy to work around: we just added an additional output group with all of the files (output + runfiles) and specify building that group on the command line. A sketch of that workaround follows below.
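For anyone wanting to replicate that workaround, here is a minimal Starlark sketch; the rule and output-group names (`with_runfiles`, `runnable`) are hypothetical illustrations, not the commenter's actual code:

```python
# Hypothetical wrapper rule: re-export a target's default outputs *and* its
# runfiles as a single output group, so that building that group forces
# everything needed to execute the target to be downloaded.
def _with_runfiles_impl(ctx):
    info = ctx.attr.target[DefaultInfo]
    return [OutputGroupInfo(runnable = depset(transitive = [
        info.files,                   # the declared outputs (e.g. the binary)
        info.default_runfiles.files,  # everything in the runfiles tree
    ]))]

with_runfiles = rule(
    implementation = _with_runfiles_impl,
    attrs = {"target": attr.label(mandatory = True)},
)
```

Building with something like `--output_groups=runnable` should then fetch the binary together with its runfiles, assuming the download mode respects `--output_groups` (which is discussed further below).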
Encountered this error on the 0.25 rc when using `=minimal`. The problem refused to correct itself, even with `clean`, even with `shutdown`; I had to remove the download_outputs directive to proceed.
Note that this failed when trying to read an input of the action and that the input in question was a generated protobuf C++ source under bazel-bin.
```
ERROR: /path/to/BUILD:XXX:1: C++ compilation of rule '
at com.google.devtools.build.lib.remote.ByteStreamUploader.uploadBlobs(ByteStreamUploader.java:172)
at com.google.devtools.build.lib.remote.GrpcRemoteCache.ensureInputsPresent(GrpcRemoteCache.java:257)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$0(RemoteSpawnRunner.java:243)
at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:104)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:234)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:224)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:113)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:78)
at com.google.devtools.build.lib.exec.ProxySpawnActionContext.exec(ProxySpawnActionContext.java:52)
at com.google.devtools.build.lib.rules.cpp.CppCompileAction.execute(CppCompileAction.java:1340)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:830)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:965)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:937)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:114)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:78)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:563)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:707)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:253)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:453)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:387)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Failed to read next chunk.
at io.grpc.Status.asRuntimeException(Status.java:517)
at com.google.devtools.build.lib.remote.ByteStreamUploader$AsyncUpload$1.onClose(ByteStreamUploader.java:356)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:41)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:684)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:41)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:391)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:475)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:557)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:478)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:590)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
... 3 more
Caused by: java.io.FileNotFoundException:
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(Unknown Source)
at java.base/java.io.FileInputStream.
at java.base/java.io.FileInputStream.
at com.google.devtools.build.lib.vfs.AbstractFileSystem$ProfiledFileInputStream.
at com.google.devtools.build.lib.vfs.AbstractFileSystem.createFileInputStream(AbstractFileSystem.java:67)
at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:47)
at com.google.devtools.build.lib.vfs.Path.getInputStream(Path.java:858)
at com.google.devtools.build.lib.remote.Chunker$Builder.lambda$setInput$4(Chunker.java:288)
at com.google.devtools.build.lib.remote.Chunker.maybeInitialize(Chunker.java:221)
at com.google.devtools.build.lib.remote.Chunker.next(Chunker.java:166)
at com.google.devtools.build.lib.remote.ByteStreamUploader$AsyncUpload$1.onReady(ByteStreamUploader.java:377)
at io.grpc.ForwardingClientCallListener.onReady(ForwardingClientCallListener.java:46)
at io.grpc.ForwardingClientCallListener.onReady(ForwardingClientCallListener.java:46)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamOnReady.runInContext(ClientCallImpl.java:607)
... 5 more
```
@dicarlo2 thanks for the feedback!
For now I have just one bit of feedback: the behavior for `toplevel` with `build` was a bit surprising. Perhaps I'm doing something wrong, but it appears that it does not fetch the output runfiles, right? My expectation was that I would be able to `build` a target with `toplevel` and then execute the downloaded output, even if it had outputs from upstream actions in its runfiles. Any chance we can have that behavior changed? It also seems a bit inconsistent with how `test` and `coverage` work; in those cases the runfiles are downloaded in order to execute the tests.
It's complicated. Technically, runfiles aren't outputs, and that's why it doesn't work (yet). I don't have a good idea of how to make it work yet, but I'll make it work eventually.
Are you using `bazel run`, or are you manually invoking the binary?
@werkt It looks like the file Bazel is trying to upload is an output file. So it seems the file fell out of the cache during the build, which means that no copy of it exists anywhere anymore. There is currently no way to recover from this. I'll send a patch to provide a better error message.
It's complicated. Technically, runfiles aren't outputs, and that's why it doesn't work (yet). I don't have a good idea of how to make it work yet, but I'll make it work eventually.
Ah, right, makes sense.
Are you using `bazel run`, or are you manually invoking the binary?
IIRC we do a bit of both.
@werkt It looks like the file Bazel is trying to upload is an output file. So it seems the file fell out of the cache during the build, which means that no copy of it exists anywhere anymore. There is currently no way to recover from this. I'll send a patch to provide a better error message.
This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.
This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.
Do I understand correctly that your backend regularly returns action cache entries that reference output files that don't exist? We could introduce a second RPC to call FindMissingBlobs(...), but that would require two round trips for every cache hit, which might be acceptable.
However, I am not convinced that it would materially improve things, as it would only catch cases as mentioned above but not outputs being evicted at a later stage.
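To make the proposed extra round trip concrete, here is a hedged Python sketch of such a check, assuming client stubs generated with protoc from the Remote Execution API's remote_execution.proto; the channel and instance name are placeholders, and this is not Bazel's actual code:

```python
import grpc

from build.bazel.remote.execution.v2 import remote_execution_pb2
from build.bazel.remote.execution.v2 import remote_execution_pb2_grpc


def missing_outputs(channel, instance_name, action_result):
    """Returns digests referenced by an ActionResult that the CAS no longer has."""
    cas = remote_execution_pb2_grpc.ContentAddressableStorageStub(channel)
    request = remote_execution_pb2.FindMissingBlobsRequest(
        instance_name=instance_name,
        # Output directories would additionally need their tree digests checked.
        blob_digests=[f.digest for f in action_result.output_files],
    )
    return cas.FindMissingBlobs(request).missing_blob_digests


# A client could treat a non-empty result as a cache miss and fall back to
# re-execution, instead of failing later when a download comes up empty.
```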
This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.
Do I understand correctly that your backend regularly returns action cache entries that reference output files that don't exist? We could introduce a second RPC to call FindMissingBlobs(...), but that would require _two_ round trips for every cache hit, which might be acceptable.
You are correct. I removed all validation from AC fetches long ago. I suppose there are some heuristics to be applied here with metadata and misses, but that will be unable to correct an individual build with its client behavior, just future fetches.
However, I am not convinced that it would materially improve things, as it would only catch cases as mentioned above but not outputs being evicted at a later stage.
Yeah, transiency is going to be an issue here in general, but this is certainly not meant to combat it directly
I'm not sure how I feel about it, but Mark added code to Bazel to support action rewinding, which might cover this use case.
@ulfjack thanks for pointing this out.
At this point I'd prefer not to use action rewinding for missing build outputs in Bazel. My main argument against it is that a prerequisite for rewinding is that all actions be bit-for-bit reproducible (we could of course detect non-reproducible actions, but some actions might legitimately not be bit-for-bit reproducible). I think I'd prefer a solution built around attaching TTLs to build outputs.
Normally, when Bazel first fetches an ActionResult and then its referenced output files, an HTTP remote cache could calculate LRU simply based on file system access time, since the output files have a later access time.
"Remote Builds without the Bytes" seems very promising, but without action rewinding, the remote caches might have to understand how entries in the CAS relate to each other in order to garbage collect entries based on LRU without causing build failures.
Would TTLs be attached to all entries in the remote CAS? Would the remote caches then be FIFO instead of LRU? Or do you imagine the TTL being updated somehow?
I don't really see how action rewinding precludes reproducibility. With the required re-execution of an action in order to produce an output file comes a new digest that should satisfy the downstream actions; just remap/recalculate the actions downstream of a rewound action based on the newly mapped digest result (and give me an option to make that a build ERROR, because I don't want non-reproducible actions, ever).
I don't think it precludes reproducibility, and we don't guarantee bit-for-bit reproducibility either.
Does rewinding work with open-source Skyframe? I haven't seen a single public test for it.
It does not work for the mode of Skyframe where it keeps reverse edges. I think that code is all in Skyframe, however, I'm not sure if it can be enabled or if it works. It can probably be made to work with reverse edges, but that has not been a priority.
(Mark here.) What @werkt and @ulfjack say about rewinding sounds right to me too.
The current implementation of rewinding deals with nondeterministic rewound actions by invalidating the intersection of the rewound action's transitive reverse deps and the failed (due to the lost input) action's transitive deps.
That subgraph is usually small: failed action -> lost output artifact -> generating action. It can sometimes be larger, e.g. when the lost output artifact is in a runfiles collection.
This could be extended to invalidating _all_ of the rewound action's transitive reverse deps. That would have to happen after it's made compatible with Skyframe modes that track reverse edges, and would raise questions about how to handle reporting for actions run twice in a single build.
Agreed that failing the build in the event of an unequal rewind is a feasible extension.
We could make some existing rewind tests public after making it compatible with `--notrack_incremental_state`.
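To make the invalidation rule above concrete, here is a small illustrative sketch; the graph representation and function names are hypothetical, not Bazel's actual Skyframe types:

```python
# Invalidate the intersection of the rewound action's transitive reverse
# deps and the failed action's transitive deps, as described above.
def reachable(start, edges):
    """Nodes reachable from `start` (inclusive) via `edges`, a dict: node -> iterable of nodes."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nodes_to_invalidate(rewound, failed, deps, reverse_deps):
    # Usually a small subgraph: failed action -> lost artifact -> generating action.
    return reachable(rewound, reverse_deps) & reachable(failed, deps)
```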
@buchgr we weren't able to find internal users willing to use it without ResultStore :(
Are you still aiming to land support for ResultStore in 0.26?
I'm guessing this didn't make it in, right? Any chance it will be a simple cherry-pick on top of 0.26, so we can run with a custom Bazel version until 0.27 lands?
I can provide you a cherry-pick for 0.26. I hope to merge it next week in time for 0.27.
I'd love that, thanks!
@buchgr can I get the cherry-pick? are you waiting for the release itself?
@buchgr is out of office this week.
@buchgr hope you're back in the office and all is well.
Any chance for that cherry pick? Really eager to test this feature on our CI
@buchgr hope I'm not harassing you too much, but we'd really like to try out the BES support.
Can we get the cherry-pick?
Also, did it make it into the 0.27 rc?
Is anyone using this feature on CI? I'd love to hear how you're getting around the lack of BES/BEP together with the lack of artifacts.
We really need this perf win (many minutes of ridiculously long build times for a fully cached build :( ).
Tested this out on 0.27.0 and some initial feedback
@buchgr @dslomov is there any chance this will make 0.28 rc?
@phb can you share a reproducer please? Runfiles should be available...
@ittaiz BES support has been merged in master and will be in 0.28.
OMG! I'm literally dancing right now. Thanks! Any chance I can cherry-pick a commit or two onto 0.27 to try it out?
@ittaiz https://github.com/buchgr/bazel/commits/0.27.0-with-bes ... I believe this should work.
@buchgr That's great to know. I'll dig into it and see what's happening.
Repro here: https://github.com/phb/bazel_minimal_missingrunfiles, see https://github.com/phb/bazel_minimal_missingrunfiles/blob/master/EXAMPLE_OUTPUT for output when I run it locally with a cache.
It seems to happen on any targets using the docker_rules.
--- update ---
Spent a long time trying to figure out what's going on here. I feel confident that rules_docker is doing the right thing; all runfiles seem to be listed correctly.
I couldn't actually find any code that downloads the runfiles before executing a test or running a command. I imagine it should work similarly to how prefetching is used to fetch build dependencies before starting a local build process when the deps are remote, but I wasn't able to find any code for that.
@buchgr any pointers would be very appreciated and I will be happy to continue debugging this.
https://github.com/bazelbuild/bazel/issues/8508 is a blocker for this feature to us, is there any chance that could be prioritized?
cc @ishikhman
@buchgr were you able to run the reproducer?
@phb apologies, I lost track of your comment, probably because I was on vacation in between. I haven't tested out your example. Can you please open a separate issue about this, so that it's easier to track? The code to fetch remote inputs for local actions is in "RemoteActionInputFetcher.java".
@buchgr no worries. Filed #8899. I was looking at that file and I can see it being invoked correctly when there is a build action that depends on remote artifacts (including runfiles) but it doesn't seem to be invoked for a runnable target, if that's even a real distinction. Hope you can find some time to take a look, very appreciated.
_Remote Builds without the Bytes_ has huge potential, but there are still issues and rough edges:
Are there plans to finish _Remote builds without the bytes_? Or is it abandoned?
Certainly not abandoned - BwtB is useful for, and used in production on, a decent variety of use-cases at this point; if you encounter any regressions you can expect them to be fixed. I'd mostly classify it as "if it works, it works" from a completeness standpoint though, in that I'm not aware of _flake_ issues - beyond trying to use it with a lossy CAS, anyways (#10880) - but there are still certainly a number of feature gaps and polish issues to resolve, as you've highlighted.
To your underlying question, I'm not aware of anyone actively working on it on the bazel side, though I expect that's a temporary state. @philwo , anything to add?
I'm interested in working on this myself, or funding work on it. Note that I no longer work at Google, and @buchgr has also left the team.
Should `--remote_download_outputs=toplevel` respect `--output_groups`? I'm trying to find a way to get the top-level target, but also some auxiliary files, downloaded.
It should, in the sense that it should give you the named outputs of the top-level target(s). Not sure if that's what you're asking for. I don't think a reasonable interpretation would be for it to give you auxiliary files from all intermediate targets, though.
I also cannot say whether it does - my suggestion there would simply be to try it, and if you find that extra outputs from a named target are not being downloaded as expected, file a bug for that.
I'm not aware of flake issues - beyond trying to use it with a lossy CAS
According to https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md, the requirement that build-without-bytes places on the CAS is not only that it must be lossless, but that it must also provide consistent read-after-write, which I think is not required when not using build-without-bytes? This seems to imply, for example, that S3 or Redis backed remote caches cannot be safely used together with build-without-bytes, even if we never explicitly delete blobs from the cache.
Is this something that you think should be documented in https://docs.bazel.build/versions/master/remote-caching.html ? For example:
When choosing a server, consider:
* Networking speed. For example, if your team is in the same office, you may want to run your own local server.
* Security. The remote cache will have your binaries and so needs to be secure.
* Ease of management. For example, Google Cloud Storage is a fully managed service.
+ When using the remote build without the bytes feature (either with remote caching or
+ remote execution), the server must satisfy additional requirements:
+
+ * Consistency model. Build-without-bytes requires read-after-write consistency.
+ For example, S3 only guarantees eventual consistency.
+ * Eviction. Build-without-bytes requires that CAS blobs are not evicted.
it must also provide consistent read-after-write, which I think is not required when not using build-without-bytes.
It's maybe not strictly required to have read-after-write when not using build-without-the-bytes, in that bazel can paper over consistency gaps to a large degree, but if you don't have it it will be significantly detrimental to your implementation in practice. The API has many cases where read-after-write is expected - sending an action for execution involves writing inputs and a description of that action, calling Execute, and then the API server immediately reading that description of the action followed by dispatching to workers who will read the inputs themselves.
Remote caches used without execution are maybe safest there, in that bazel will only try at most once to download the inputs to an action it gets from the cache before falling back to local execution, so it can never fail the build entirely. But also, if you put heavy load on a Remote Cache without read-after-write you're going to get a lot of false starts (get a hit on an action, download some outputs, then hit one that's unexpectedly missing)...I don't know how much it will impact performance in aggregate though.
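For context, the read-after-write expectation in that flow can be sketched with the same assumed generated REAPI stubs as in the earlier sketch (again a hedged illustration, not a real client):

```python
from build.bazel.remote.execution.v2 import remote_execution_pb2
from build.bazel.remote.execution.v2 import remote_execution_pb2_grpc


def run_remotely(channel, instance_name, action_digest):
    # The Action and its inputs must already have been written to the CAS;
    # Execute asks the server to immediately read them back. A CAS without
    # read-after-write consistency can spuriously miss right here.
    execution = remote_execution_pb2_grpc.ExecutionStub(channel)
    request = remote_execution_pb2.ExecuteRequest(
        instance_name=instance_name,
        action_digest=action_digest,
    )
    return execution.Execute(request)  # a stream of Operation updates
```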
I'd personally be happiest to just document that read-after-write is expected from _all_ CAS implementations, with Eviction Policy an additional requirement for build-without-the-bytes. WDYT?
An implementation that doesn't guarantee read-after-write may still return the expected result in most cases, say 99.999%, just not when there's a failure in a lower-level system. Being unable to use S3 or Redis for storage seems like a significant drawback, and maybe we can make the protocol in general and Bazel specifically just a little more resilient to lost inputs on the server?
Being resilient to spurious delays - read-after-write failing, but retrying XXms later succeeding - sure, seems doable, just need some amount of backoff+retries for NOT_FOUND errors.
Resilience to inputs permanently lost immediately after write - sure, I guess, though I don't know why a storage system would ack writes but lose them immediately, except by cutting corners and ack'ing writes it hasn't actually persisted yet.
Resilience to inputs lost arbitrarily later (e.g. due to failure in lower-level system) - I don't care to support this use-case, personally; on the tiny subset of builds affected by it better to simply retry the whole build. (Maybe bazel should offer internal whole-build retries for that?) But for people who _do_ want this, not using build-without-the-bytes should suffice here.
But if you're alluding to "rewinding", I remain unconvinced that bazel rewinding arbitrarily far back is worthwhile complexity, and I think "build without the bytes" is too important a feature to be willing to turn it off in most cases where it could otherwise be used. (But as I said, others are free to make a different judgement there, and bazel _without_ "build without the bytes" should continue to support lossy-CAS use-cases).
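A minimal sketch of that first mitigation, retrying NOT_FOUND with exponential backoff; `fetch` is a stand-in for any CAS read, and the retry parameters are arbitrary:

```python
import time

import grpc


def fetch_with_backoff(fetch, attempts=5, initial_delay=0.05):
    """Calls `fetch()`, retrying NOT_FOUND errors with exponential backoff."""
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return fetch()
        except grpc.RpcError as e:
            # Only NOT_FOUND plausibly indicates a read-after-write gap;
            # other errors (and the final attempt) surface to the caller.
            if e.code() != grpc.StatusCode.NOT_FOUND or attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2
```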
I was thinking of even more basic functionality - as far as I can tell, there's no way for the execution system to tell the client that a file went missing in an "execute" call; the only way out is a hard fail.
I agree that rewinding comes with a lot of complexity, which I'd rather not have in Bazel. At this point, it seems safe to say that that ship has sailed. The code is there except it's not currently usable in Bazel, which seems like the worst of both worlds.
As long as the remote execution protocol has a way of telling the client on any call - "sorry, some digests got lost". At the very least, then the client has a choice of what to do. Does that make sense?
We don't expect the execution system to know _why_ files are missing (i.e. that they existed and got lost, vs having never existed at all), but it can indeed return that some expected inputs are missing: ref. I think that suffices for allowing the client to decide what to do?
Thanks for pointing that out - I didn't see that bit. Bazel currently doesn't use it, but it should be feasible to fix.
I try to teach our users to trust Bazel and avoid their old habits of "make clean". Having situations (https://github.com/bazelbuild/bazel/issues/10880) requiring a manual "bazel shutdown" or "bazel clean" would ruin that trust.
Rewinding arbitrarily far back might not be necessary. Retrying the whole build could be good enough, as long as it happens automatically, so that the user can still trust Bazel to take care of it.
@ulrfa: big +1 to that; `bazel shutdown` or `bazel clean` should never be necessary.
I think for #10880 specifically Bazel should be calling GetMissingBlobs rather than blindly assuming that blobs referenced from an arbitrarily stale cache will still exist. More generally, within the context of every build I'd suggest Bazel should validate that all remotely-referenced blobs exist, or alternatively should take any unexpectedly missing blobs as a signal to aggressively re-validate its caches, so that at worst, for a blob lost mid-build, that _single_ build will fail, but retries will succeed. (In-Bazel retries could sort of mask that too, but I haven't thought them through sufficiently to have a strong opinion on whether they're actually a good idea on balance.)
Revalidating all files is almost certainly not an option from a performance point of view - that would be the same cost as running a locally clean build every time. The only performant option is to use rewinding. :-p
Technically, we could batch the stats, and bypass the entire execution machinery, but that is also a lot of complexity, and would also be incompatible with BuildFarm, IIUC what @werkt said the other day.
@ulfjack I meant issuing a `FindMissingBlobs` call for all the blobs you expect to still exist. That should be far cheaper than running a clean local build, no? I know that Bazel does a whole lot of (relatively) heavy work on a clean build, including building and traversing the action graph, hashing local files, building Merkle trees, etc. What I'm suggesting is limited to enumerating the blob references in the virtual filesystem and issuing batch calls for those. Note that Bazel is already extremely prolific in terms of the FindMissingBlobs calls it makes; it already checks hundreds of blobs' presence per action lookup.
But I'm not an expert on bazel - I'll happily defer to you on the performant way to make bazel gracefully handle blobs that go missing.
(In what way is this incompatible with BuildFarm? The fact that blobs could still go missing mid-build?)
Not sure what I said the other day, but there are definitely scenarios that BuildFarm does not currently guard against that violate promises that RBw/oB today relies upon.
Scheduled removals are obviously manipulable, though determining a promise TTL is problematic. Unscheduled removals are extremely common when scaling fleets for cost effectiveness, and represent a difficult-to-constrain risk for content that was, in the original implementation, expected to be ephemeral. I certainly expected clients to always behave as though content in the CAS was unreliable, and to take all recovery steps necessary to reestablish prerequisites and converge on successful operations and outputs.
The Buildbarn ADR that @scele linked was written by me when trying to get Builds without the Bytes working. All of that work has landed in the meantime, meaning that Buildbarn can be configured to do Builds without the Bytes properly. So with that in mind I don't have any strong opinions on this matter -- as long as the requirements aren't made any stronger.
Buildbarn's storage is now LRU-ish. Both FindMissingBlobs() and GetActionResult() now touch all relevant objects, so both on the input and output side of build actions there shouldn't be any loss, except if too much of the infra were to fail at once. There is a Prometheus metric that gets exposed that provides the worst-case retention duration. As long as you put an alert on that and make sure it doesn't get too small, you're good.
Still, the main reason I'm chiming in here is to respond to @werkt:
- Since the lifetime of a bazel daemon is only maintained by an idle timeout, without specific intervention or client manipulation, this uptime is effectively indefinite, if not practically.
Yeah, this is something I noticed as well. I'm also of the suspicion that this cache/knowledge is currently retained when switching between build clusters, meaning that if you ^C a build half way and change the gRPC endpoint, resuming the build causes failures.
Related issues:
* `--remote_download_toplevel` doesn't download symlinked files
* `--remote_download_minimal` (this was fixed in 03246077f948f2790a83520e7dccc2625650e6df)
* `--remote_download_toplevel` if remote execution is disabled
if remote execution is disabled@ulfjack #11019. while not strictly RBwtB, it impacted us while using it.
@ulfjack I opened an issue on BEP failing with `--remote_download_minimal`: https://github.com/bazelbuild/bazel/issues/11942
Maybe this could be added to your list?
Thanks, @exoson, added to the list.
@brentleyjones, unfortunately, I haven't had time to take a look yet.
`bazel run` seems to not work with `--remote_download_minimal`: https://github.com/bazelbuild/bazel/issues/12015