Bazel: Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.

Created on 26 Nov 2020  路  6Comments  路  Source: bazelbuild/bazel

Description of the problem:

With a gRPC remote cache and BES backend enabled, Bazel intermittently fails to write build event file with the following error:

java.lang.RuntimeException: Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.
    at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:503)
    at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:581)
    at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:599)
    at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:626)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:604)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:231)
    at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:543)
    at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:606)
    at io.grpc.Context$1.run(Context.java:579)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Must not call uploadBlobs after shutdown.
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545)
    at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:102)
    at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:237)
    at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:487)
    ... 11 more

I've traced this down FileTransport.java and BuildEventServiceUploader.java both calling shutdown on the same ByteStreamBuildEventArtifactUploader. Normally this is fine, because they shutdown around the same time, after all uploads are complete.

However, if findMissingDigests takes a while, FileTransport's shutdown is called which shuts down the ByteStreamBuildEventArtifactUploader. Then when findMissingDigests returns - upload is attempted on an uploader that has already been shut down.

This happens in practice on builds with thousands of outputs that need to be uploaded. It can also be triggered artificially by adding an intermittent sleep into Bazel in GrpcCacheClient here before returning the missing digests.

if (Math.random() < .1) {
    try { Thread.sleep((long)(Math.random() * 3000)); } catch(InterruptedException e) {}
}

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Create the following BUILD file that generates 10 dummy outputs to be downloaded.

$ cat BUILD
[genrule(
    name = "target-{}".format(i),
    outs = ["output-{}.txt".format(i)],
    cmd = "echo {} > $@".format(i),
    visibility = ["//visibility:public"],
) for i in range(10)]

Add intermittent sleep to Bazel in GrpcCacheClient here before returning the missing digests.

if (Math.random() < .1) {
    try { Thread.sleep((long)(Math.random() * 3000)); } catch(InterruptedException e) {}
}

Run a build with --remote_cache, --bes_backend, --build_event_json_file set. Setting a random --remote_instance_name makes sure that the outputs will be freshly uploaded on each run. This fails >50% of the time.

bazel build //... --remote_cache=cloud.buildbuddy.dev --bes_backend=cloud.buildbuddy.dev --build_event_json_file=/tmp/bep.json --remote_instance_name=$(date +%s)

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

release 3.7.1

I've reproduced on 3.1.0 and 3.7.1, but haven't tried outside of that range. It's easiest to reproduce with a custom Bazel version with the sleep.

Any other information, logs, or outputs that you want to share?

Some less than ideal fixes I've found for this include removing the shutdown call from FileTransport, or wrapping uploadLocalFiles in BuildEventArtifactUploader with uploader.retain() and uploader.release(). Don't love either of these.

Would love any advice from someone who is more familiar with this code on what a good fix would look like - would be happy to send a pull request. I've seen this error with 4 different companies we've been working with.

Not sure if this this is related, but I'm hoping that the fix here might also fix another bug which we see much more frequently and doesn't seem to require setting build_event_json_file, but I've been having a harder time reproducing reliably enough to file a detailed bug report.

FAILED: Build did NOT complete successfully
WARNING: BES was not properly closed
Internal error thrown during build. 
Printing stack trace: java.util.concurrent.RejectedExecutionException: 
Task com.google.common.util.concurrent.TrustedListenableFutureTask@12febc19
[status=PENDING, info=[task=[running=[NOT STARTED YET], 
com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda$1058/0x00000008008b5c40@633e8b27]]] 
rejected from java.util.concurrent.ThreadPoolExecutor@b0ef88a
[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 172]

My hunch for that one is that after commit https://github.com/bazelbuild/bazel/commit/d82341d508829c43ab6408910f0b03cacf5b03b2 that ThreadPool is shut down and some race condition is causing uploads to be added to the ThreadPool after it has been shutdown.

Really appreciate any and all help - would be happy to send pull requests for fixes. Just looking for some guidance on what the right fix looks like.

P2 team-Core

Most helpful comment

The fix is currently in review; the patch may be accessed early at https://bazel-review.googlesource.com/c/bazel/+/149490.

All 6 comments

CC @michaeledgar who seems to know a thing or two about BES artifact uploading

I've successfully reproduced this issue using the steps provided. I'm looking into a fix now. Thanks for reporting this!

Thanks @michaeledgar ! Really appreciate you taking the time to look into this. Let me know if I can help in any way.

If you have any thoughts on the ThreadPool shutdown stacktrace at the bottom (whether the fix for this might address that or not) I'd be super appreciative. Happy to open a second issue if not - working furiously on finding a reliable repro case for that one as we speak.

I have a fix that eliminates the crash similar to the ideas you proposed: the BuildEventArtifactUploader should be ReferenceCounted. In the scenario you describe - with both the BES transport and the File transport enabled, a single BuildEventArtifactUploader is shared between two instances that are currently both calling shutdown.

Ah that's great news, huge thank you for looking into this @michaeledgar!

Hopefully the reference counting will address the other issue as well, since it seems like in that scenario shutdown is called on the uploadExecutor too early, which is triggered by the same BuildEventArtifactUploader shutdown call (still haven't managed to figure out how or when).

If it doesn't, I'll open a separate issue - which we should be able to tackle with a little more reference counting.

Thanks again for your help!

The fix is currently in review; the patch may be accessed early at https://bazel-review.googlesource.com/c/bazel/+/149490.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

johnynek picture johnynek  路  105Comments

dslomov picture dslomov  路  61Comments

dfabulich picture dfabulich  路  67Comments

lacartelacarte picture lacartelacarte  路  87Comments

laurentlb picture laurentlb  路  76Comments