We have had some luck using Bazel's dynamic execution (e.g. --experimental_spawn_scheduler) with buildfarm to compile C++ code, with Bazel 3.1.0 and earlier. With later Bazel versions, we have started to see builds fail when using it. The failures are more difficult to reproduce with the "old" dynamic execution mechanism (--internal_spawn_scheduler --spawn_strategy=dynamic), but they are easy to reproduce with the "new" dynamic execution mechanism (--internal_spawn_scheduler --spawn_strategy=dynamic --legacy_spawn_scheduler=false).
Please see the attached example.tar.gz.
The enclosed README file has instructions, which I will repeat here:
Here are directions for reproducing issues with the new version of the
experimental spawn scheduler in conjunction with (at least) buildfarm.
For best results, you will likely need two systems: One to run the
buildfarm processes, and one to run the bazel client.
I happened to use the following: a 36-cpu desktop system with 128 GB RAM
to run the buildfarm processes, and an 8-cpu AWS instance with 64 GB RAM
to run the bazel client. Both were running similar configurations of
Ubuntu 18.04 (similar compilers, etc).
The scripts herein default to running bazel as simply "bazel". You can
set the environment variable BAZEL to override this.
On the system where you will be running buildfarm, clone the buildfarm
repo and run the enclosed "bf-mem-example" script. Note that this
script will run bazel, so you might need to set BAZEL to help it to find
the bazel that you want it to use.
git clone git@github.com:bazelbuild/bazel-buildfarm.git bazel-buildfarm0
cd bazel-buildfarm0
# run the bf-mem-example script enclosed herein
On the system where you will be running the bazel client, you will need
all the rest of the files enclosed here. As above, you might need to set
BAZEL to help the scripts find the bazel that you want them to use. In
addition, you will need to set BUILDFARM to the DNS name or IP address
of the system where you are running buildfarm.
First, build the tree with remote execution:
./try.baseline
This should complete without any problems.
The try.{0,1,2,3} scripts will do the same build, but with various spawn
scheduler configurations. To use the simplest configuration with the
new version of the spawn scheduler, use try.1:
./try.1
It does not _always_ fail on the first try, but it nearly always fails
without too many repetitions.
Operating system: Ubuntu 18.04
Output of bazel info release: release 3.7.0
Related discussion: the following bazel-discuss thread: https://groups.google.com/g/bazel-discuss/c/xEWci2lcTzw/m/hZJJ1LPiBgAJ
Here is an example failure produced with bazel-3.7.0 using the try.1 script in the attached example:
+ . try.rc
++ bazelrc=bazel.rc
++ cat bazel.rc
build --curses=no
build --color=no
# Dynamic execution
#
# --config=dynamic-execution0 --experimental_spawn_scheduler with --local_cpu_resources=HOST_CPUS*0.75 --local_ram_resources=HOST_RAM*0.75
# --config=dynamic-execution1 Use the new dynamic scheduler
# --config=dynamic-execution2 Use the new dynamic scheduler and --experimental_local_lockfree_output
# --config=dynamic-execution3 Use the new dynamic scheduler and --experimental_local_lockfree_output and --experimental_local_execution_delay=1000
#
# Note that units of --experimental_local_execution_delay are milliseconds.
build:dynamic-execution0 --local_cpu_resources=HOST_CPUS*0.75
build:dynamic-execution0 --local_ram_resources=HOST_RAM*0.75
build:dynamic-execution0 --internal_spawn_scheduler
build:dynamic-execution0 --spawn_strategy=dynamic
build:dynamic-execution1 --config=dynamic-execution0
build:dynamic-execution1 --legacy_spawn_scheduler=false
build:dynamic-execution2 --config=dynamic-execution1
build:dynamic-execution2 --experimental_local_lockfree_output
build:dynamic-execution3 --config=dynamic-execution2
build:dynamic-execution3 --experimental_local_execution_delay=1000
++ startup='--nosystem_rc --nohome_rc --noworkspace_rc --bazelrc=bazel.rc'
++ target=//:hello_world
++ jobs=--jobs=128
++ : dws-7910.corp.uber.com
++ remote=--remote_executor=grpc://dws-7910.corp.uber.com:8980
++ : ./bazel-3.7.0
+ ./bazel-3.7.0 --nosystem_rc --nohome_rc --noworkspace_rc --bazelrc=bazel.rc clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
+ ./bazel-3.7.0 --nosystem_rc --nohome_rc --noworkspace_rc --bazelrc=bazel.rc build --verbose_failures --jobs=128 --remote_executor=grpc://dws-7910.corp.uber.com:8980 --config=dynamic-execution1 //:hello_world
INFO: Invocation ID: d8ecc266-5f7d-492b-97be-51fb9bfcea40
Loading:
Loading: 0 packages loaded
Analyzing: target //:hello_world (1 packages loaded, 0 targets configured)
INFO: Analyzed target //:hello_world (15 packages loaded, 51 targets configured).
INFO: Found 1 target...
[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt
WARNING: Reading from Remote Cache:
java.io.FileNotFoundException: /home/dws/.cache/bazel/_bazel_dws/7c75671892e56d02ca221ec93312907e/execroot/__main__/bazel-out/k8-fastbuild/bin/_objs/hello_world/hello_world.pic.d.tmp (No such file or directory)
at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:185)
at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:175)
at com.google.devtools.build.lib.vfs.Path.stat(Path.java:418)
at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:454)
at com.google.devtools.build.lib.remote.RemoteCache.moveOutputsToFinalLocation(RemoteCache.java:415)
at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:374)
at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:183)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:129)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Internal error thrown during build. Printing stack trace: java.lang.AssertionError: stopBranch called more than once by local
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.stopBranch(DynamicSpawnStrategy.java:135)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$300(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.lambda$callImpl$0(DynamicSpawnStrategy.java:314)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.lockOutputFiles(AbstractSpawnStrategy.java:258)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:127)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:88)
at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:473)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
INFO: Elapsed time: 0.543s, Critical Path: 0.01s
INFO: 3 processes: 3 internal.
FAILED: Build did NOT complete successfully
Internal error thrown during build. Printing stack trace: java.lang.AssertionError: stopBranch called more than once by local
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.stopBranch(DynamicSpawnStrategy.java:135)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$300(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.lambda$callImpl$0(DynamicSpawnStrategy.java:314)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.lockOutputFiles(AbstractSpawnStrategy.java:258)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:127)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:88)
at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:473)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
java.lang.AssertionError: stopBranch called more than once by local
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.stopBranch(DynamicSpawnStrategy.java:135)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$300(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.lambda$callImpl$0(DynamicSpawnStrategy.java:314)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.lockOutputFiles(AbstractSpawnStrategy.java:258)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:127)
at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:88)
at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:473)
at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
FAILED: Build did NOT complete successfully
@jmmv for visibility
I can reproduce with EngFlow RE.
The problem is the remote cache. The RemoteModule implicitly enables the remote cache when --remote_executor is set, which doesn't make any sense for dynamic execution:
if (enableRemoteExecution && Strings.isNullOrEmpty(remoteOptions.remoteCache)) {
  remoteOptions.remoteCache = remoteOptions.remoteExecutor;
}
I don't see a way to disable the remote cache with the existing flags. :-(
The longer explanation is that enabling dynamic execution like this attempts to run remote execution and local execution w/ remote cache in parallel, and that doesn't make sense.
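To make the defaulting rule quoted above concrete, here is a minimal, self-contained sketch. RemoteOptionsSketch and applyImplicitCache are hypothetical names invented for illustration, not Bazel classes, and the way enableRemoteExecution is derived here (a non-empty executor) is an assumption. Only the if-statement mirrors the quoted snippet.

```java
// Hypothetical stand-in for Bazel's remote options; it models only the two
// fields that appear in the snippet quoted above.
final class RemoteOptionsSketch {
  String remoteExecutor; // e.g. the value passed to --remote_executor
  String remoteCache;    // e.g. the value passed to --remote_cache

  /** Applies the defaulting rule: an unset cache falls back to the executor endpoint. */
  static RemoteOptionsSketch applyImplicitCache(RemoteOptionsSketch options) {
    // Assumption: remote execution counts as enabled when an executor is set.
    boolean enableRemoteExecution =
        options.remoteExecutor != null && !options.remoteExecutor.isEmpty();
    boolean cacheUnset = options.remoteCache == null || options.remoteCache.isEmpty();
    if (enableRemoteExecution && cacheUnset) {
      options.remoteCache = options.remoteExecutor;
    }
    return options;
  }

  public static void main(String[] args) {
    RemoteOptionsSketch options = new RemoteOptionsSketch();
    options.remoteExecutor = "grpc://buildfarm.example:8980"; // placeholder host
    applyImplicitCache(options);
    System.out.println("remoteCache = " + options.remoteCache);
  }
}
```

With only an executor configured, the cache endpoint ends up equal to the executor endpoint, which is why the local branch of dynamic execution also consults the remote cache.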
I think this was broken in 25e58ffc529b9cf4e1d0dccbbad27f40f092f1ec. @coeuvre
I can change AbstractSpawnStrategy:117 like this to avoid the error:
SpawnCache cache = stopConcurrentSpawns != null ? null : actionExecutionContext.getContext(SpawnCache.class);
The question is whether that's correct.
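The one-line change above can be read as a selection rule: when a stop callback is present (that is, the spawn runs as one branch of dynamic execution), skip the spawn cache. The sketch below isolates that rule. SpawnCacheSelectionSketch and its nested interfaces are hypothetical stand-ins, not the real Bazel types, and the assumption that a non-null callback means "dynamic execution is active" is taken from the proposed line, not verified against Bazel.

```java
// Hypothetical, simplified model of the cache selection in the proposed change.
final class SpawnCacheSelectionSketch {
  /** Stand-in for the spawn cache looked up from the action context. */
  interface SpawnCache {}

  /** Stand-in for the callback a dynamic branch uses to stop its sibling. */
  interface StopConcurrentSpawns {
    void stop();
  }

  /**
   * Returns null (no cache) when a stop callback is present, otherwise the
   * cache from the context. Mirrors the ternary in the proposed one-liner.
   */
  static SpawnCache selectCache(StopConcurrentSpawns stopConcurrentSpawns, SpawnCache fromContext) {
    return stopConcurrentSpawns != null ? null : fromContext;
  }

  public static void main(String[] args) {
    SpawnCache cache = new SpawnCache() {};
    StopConcurrentSpawns stop = () -> {};
    System.out.println("dynamic branch uses cache: " + (selectCache(stop, cache) != null));
    System.out.println("plain spawn uses cache: " + (selectCache(null, cache) != null));
  }
}
```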
It is correct, but more by accident than by design. I was wondering about the combination of dynamic execution with local w/ disk cache and remote execution. However, this is an unsupported combination at this time and results in an error in RemoteModule (which also handles the disk cache).
I have a potential fix here: https://github.com/ulfjack/bazel/commit/65d308dc14b1cfdc12727fd82df5daf977fa4d66
Btw. I diagnosed this by changing DynamicSpawnStrategy.stopBranch to print a stack trace in the successful case, which immediately implicated the RemoteCache.
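The diagnostic trick described above can be sketched as follows: record and print a stack trace on the first, successful stop call, so that when the assertion fires on the second call you can see who made the first one. This is a simplified model under stated assumptions. StopBranchSketch, Branch, and firstCallSite are hypothetical names, and the real DynamicSpawnStrategy.stopBranch does more (such as cancelling the other branch), which is omitted here.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of a "stop the other branch" guard with
// extra diagnostics on the successful path.
final class StopBranchSketch {
  enum Branch { LOCAL, REMOTE }

  private final AtomicReference<Branch> stoppedBy = new AtomicReference<>();
  Throwable firstCallSite; // captured on the first successful call

  void stopBranch(Branch caller) {
    String name = caller.name().toLowerCase(Locale.ROOT);
    if (stoppedBy.compareAndSet(null, caller)) {
      // Diagnostic: remember and print where the first call came from.
      firstCallSite = new Throwable("first stopBranch call by " + name);
      firstCallSite.printStackTrace();
      return;
    }
    if (stoppedBy.get() == caller) {
      throw new AssertionError("stopBranch called more than once by " + name);
    }
    // Assumption: the losing branch is simply rejected in this sketch.
    throw new IllegalStateException(name + " lost the race");
  }

  public static void main(String[] args) {
    StopBranchSketch guard = new StopBranchSketch();
    guard.stopBranch(Branch.LOCAL); // e.g. reached via a cache lookup
    try {
      guard.stopBranch(Branch.LOCAL); // e.g. reached again via local execution
    } catch (AssertionError e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In the failing build, the trace printed on the first call would be the interesting one, since the assertion's own trace only shows the second caller.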
An alternative fix is to disable the spawn cache in the DynamicExecutionModule:
https://github.com/ulfjack/bazel/commit/aaa8f52cf70fa529434f224289f6c1e3edeae2e5
Neither of these fixes is ideal. The entire class hierarchy isn't ideal.
I think this was broken in 25e58ff.
Yes, remote execution implicitly enables the remote cache. IMHO, dynamic execution should disable the remote cache for local execution in this case.
Lars has prepared a fix in https://bazel-review.googlesource.com/c/bazel/+/145250
Thanks, @larsrc-google!